DG遇到ORA-04021报错后，LGWR被kill实例down

1.1问题现象

单实例DG环境，凌晨时刻，实例DOWN了，需要进行分析。

观察DB Alert日志，可以明确看到ORA-04021报错，以及打印LGWR TRACE日志。

1.2 报错ORA-04021???

How to Analyze Library Cache Timeout with Associated: ORA-04021 'timeout occurred while waiting to lock object %s%s%s%s%s.' Errors (Doc ID 1486712.1)

Why does ORA-4021 error occur?
Oracle instance contains a cache that holds the definitions of each different type of objects e.g. cursors, indexes, tables, views, procedures,
etc. This cache is called the library cache and contains a 'library' of object definitions. While these objects are in use, these definitions
cannot be changed. They are protected by a series of locks and pins called library cache locks and pins.

A session that needs to use an object will first attempt to acquire a library lock in a certain mode on the object (null, shared or exclusive).
If it gets the library cache lock, this will prevent other sessions from modifying it (e.g. an exclusive lock will be taken when recompiling a
package or view). If the lock is not available, it will wait until it can get it (i.e. when another session has stopped using it). Locking an
object is the first step and is essentially the task of locating the object in the library cache and then lock it in a certain mode.
If the session wants to modify or examine the object, it must also acquire a pin in a certain mode (again null, shared or exclusive) after
the lock has been taken.

Each SQL statement that wants to use/modify objects that are already locked or pinned (and whose lock/pin mode is incompatible with the requested
mode), will wait on events like 'library cache pin' or 'library cache lock' until it is freed, and it can get the lock. If it waits longer than
a pre-defined threshold, a timeout occurs. This is done to free up resources in the waiting session that might be blocking other sessions rather
than having a series of sessions that cannot move.

If it times out, then a ORA-4021 error is raised:

ORA-04021 timeout occurred while waiting to lock object %s%s%s%s%s.
Cause: While trying to lock a library object, a time-out occurred.
Action: Retry the operation later.
The timeout normally occurs after 5 minutes.

说了一堆，就是内存栓锁的争用，你需要修改一个对象的时候，如果一个对象正在执行，你将处于library cache lock等待。也就是说LGWR进程需要获取某些资源的时候，无法获取到，然后报错了！！！

1.3 问题分析

What type of objects can be locked in a library cache?
Following list give all objects that can be locked in the library cache:

- CURSOR 
- INDEX
- TABLE
- CLUSTER
- VIEW
- SYNONYM
- SEQUENCE
- PROCEDURE
- FUNCTION
- PACKAGE
- PACKAGE BODY
- TRIGGER
- TYPE
- TYPE BODY
- OBJECT
- USER
- DBLINK
- PIPE
- TABLE PARTITION
- INDEX PARTITION
- LOB
- LIBRARY
- DIRECTORY
- QUEUE
- INDEX-ORGANIZED TABLE
- REPLICATION OBJECT GROUP
- REPLICATION PROPAGATOR

ORA-04021: timeout occurred while waiting to lock object : DR Instance terminated by LGWR (Doc ID 2183882.1)

根据读取ORA-04021的报错解释说明，我们可以理解为LGWR需要获取某些内存栓锁，无法获取到，因此报错。但是我们除了分析问题，还要解决问题，这个问题如何处理？

继续查询MOS,检索得到上述MOS文档，观察函数堆栈，还是匹配度很高的。

opirip ==> oracle program interface run independent process. initializes fixed-PGA in RAM and SQL layers. 
opidrv ==> oracle program interface route current request driver, entry side into two task interface 
ksbrdp ==> kernel service background processes run a detached background process 
ksuitm ==> kernel service user ?? 
kjzdicrshnfy ==> kernel lock management DIAG main layer ?? 
kjzduptcctx ==> kernel lock management DIAG main layer ?? 
ssthrdmain|main for every thread in a threaded oracle
main ==> general c starting function

还是缺少部分资料，上述函数的说明只能知道申请PGA一块区域后，执行操作，无法获取后续进入打印TRACE环节。

CAUSE
Bug 16717701 - ADG SHOULD GET THE INSTANCE PARSE LOCK WITH A TIMEOUT  ------> Superseded By Bug fix Bug 17018214

Bug 11712267 - ACTIVE DATA GUARD DATABASE HUNG ON 'LIBRARY CACHE: MUTEX X' WAIT EVENT

1.4 问题处理

Issue matches with bug 11712267 and bug 16717701

Since two bugs are matching with the case,

You can try with option (1) . As per Bug 11712267

change the cursor_sharing to force on Active dataguard (ADG).

Monitor your environment for sometime.

If it crashes again then follow with the option (2)
Option (2):

As per bug description

LGWR can request DBINSTANCE lock in X mode without any timeout which can lead to a hang / deadlock.

Both fixes are already included in 11.2.0.4 but the fix is DISABLED by default.
== > To ENABLE the fix one has to set == > "_adg_parselock_timeout" > to the number of centi-seconds == > LGWR should wait
 before backing off and retrying the request.

Value should be in centi seconds. == > I Don't think there is really any hard fast rule for a value - at default (0) it will not timeout.
A value representing a few seconds seems reasonable - if LGWR has been stuck for say 5 seconds waiting it seems reasonable 
guess it is not going to get the lock.

The param just causes it to abort the current attempt and retry If you want to play safe can start with a higher value then decrease later.
A higher value will just mean more sessions blocked for longer in case of the deadlock situation.
500 Seems reasonable , but I have no data to base it on.

There should be a statistic "ADG parselock X get attempts" If it gets set too small that value would likely
 increase a lot due to keep timing out and retrying.

This is a dynamic parameter

Follow option (1) . 

change the cursor_sharing to force on ADG


If issue re-appears then follow option (2) as below

 

Please set "_adg_parselock_timeout" to 500 == >

SQL > alter system set "_adg_parselock_timeout"=500 scope=both sid='*';

最终是参考上述MOS

第一修改 cursor_sharing 参数为Force，目的是什么，是为了让硬解析的SQL大部分转为软解析，降低shared pool内存资源，以及内存栓锁及CPU的争用；

第二修改_adg_parselock_timeout 为500，及LGWR获取内存栓锁默认是0，不允许延迟，设置为5s,允许性能极差或资源争用的情况下，延长LGWR的申请时间。

虽然处理后，后续DG没有down. 但是我们仍然需要分析DG的问题！！！

通过配置Crontab的定时任务，调用查询session event的信息及SQL进行捕获。我们看到了关于内存及IO资源的EVENT事件。

free buffer wait/ standby logfile file asynch 因此，后续我们的目标时对这个单实例DG，进行操作系统内存扩容，IO这块将挂载一块SSD存储至操作系统，将归档及日志路径切换至SSD磁盘对应操作系统挂载点。

从原理去理解，如果内存不够用的情况下，SGA的内存中，Buffer cache与Shared pool进行争用，会让Shared pool变小，从而进一步加剧shared pool=Library cache+dictionay cache直接的争用。由于无法掌握更具体的信息，基本上将按照上述方法进一步优化调整DG情况。