12.2: Disabling automatic DLM statistics collection (SCM0) to resolve ORA-00600 [ksliwat: bad wait time]

I. Error Logs

db_alert

ORA-00600: internal error code, arguments: [ksliwat: bad wait time], [18446744073709471616], [], [], [], [], [], [], [], [], [], []

Use ADRCI or Support Workbench to package the incident.

See Note 411.1 at My Oracle Support for error and packaging details.

Dumping diagnostic data in directory=[cdmp_20190223052024], requested by (instance=1, osid=63791 (SCM0)), summary=[incident=524425].

trace

RDBMS_12.2.0.1.0_LINUX.X64_170125

Unix process pid: 63791, image: oracle@pelpsrdb01 (SCM0)

ORA-00600: internal error code, arguments: [ksliwat: bad wait time], [18446744073709471616], [], [], [], [], [], [], [], [], [], []

2019-02-23 05:20:23.965 :kjsc_main(): error SCM0

OPIRIP: Uncaught error 447. Error stack:

ORA-00447: fatal error in background process

ORA-00600: internal error code, arguments: [ksliwat: bad wait time], [18446744073709471616], [], [], [], [], [], [], [], [], [], []

Call stack

ksedst()+119 call kgdsdst()

dbkedDefDump()+1200 call ksedst()

ksedmp()+259 call dbkedDefDump()

dbgexPhaseII()+2130 call ksedmp()

dbgexProcessError()+2531 call dbgexPhaseII()

dbgePostErrorKGE()+1767 call dbgexProcessError()

dbkePostKGE_kgsf()+90 call dbgePostErrorKGE()

kgeadse()+477 call dbkePostKGE_kgsf()

kgerinv_internal()+49 call kgeadse()

kgerinv()+40 call kgerinv_internal()

kgeasnmierr()+150 call kgerinv()

ksliwat()+15035 call kgeasnmierr()

kslwaitctx()+197 call ksliwat()

kjsc_main()+1431 call kslwaitctx()

ksvrdp_int()+2010 call kjsc_main()

······
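
The call stack above lists the innermost frame first, so most of the top is Oracle's own error and dump handling. As a rough sketch, the snippet below parses the frames exactly as they appear in the trace and skips callers whose names start with a few prefixes that I am assuming belong to the error-handling layer (`ksed`, `dbke`, `dbge`, `kge`). That prefix list is a heuristic for illustration, not an official Oracle classification.

```python
import re

# Frames copied from the trace above, innermost first.
STACK = """\
ksedst()+119 call kgdsdst()
dbkedDefDump()+1200 call ksedst()
ksedmp()+259 call dbkedDefDump()
dbgexPhaseII()+2130 call ksedmp()
dbgexProcessError()+2531 call dbgexPhaseII()
dbgePostErrorKGE()+1767 call dbgexProcessError()
dbkePostKGE_kgsf()+90 call dbgePostErrorKGE()
kgeadse()+477 call dbkePostKGE_kgsf()
kgerinv_internal()+49 call kgeadse()
kgerinv()+40 call kgerinv_internal()
kgeasnmierr()+150 call kgerinv()
ksliwat()+15035 call kgeasnmierr()
kslwaitctx()+197 call ksliwat()
kjsc_main()+1431 call kslwaitctx()
ksvrdp_int()+2010 call kjsc_main()
"""

# One frame looks like: caller()+offset call callee()
FRAME_RE = re.compile(r"^(\w+)\(\)\+(\d+)\s+call\s+(\w+)\(\)$")

# Assumption: callers with these prefixes are error/dump machinery.
ERROR_PREFIXES = ("ksed", "dbke", "dbge", "kge")


def parse_frames(text):
    """Return a list of (caller, offset, callee) tuples, innermost first."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append((match.group(1), int(match.group(2)), match.group(3)))
    return frames


def callers_below_error_handling(frames):
    """Drop the leading error-handling frames and return the remaining callers."""
    callers = [caller for caller, _offset, _callee in frames]
    for index, caller in enumerate(callers):
        if not caller.startswith(ERROR_PREFIXES):
            return callers[index:]
    return []


frames = parse_frames(STACK)
chain = callers_below_error_handling(frames)
print(" <- ".join(chain))
```

With this trace, the first caller that survives the filter is `ksliwat`, which is called by `kslwaitctx`, `kjsc_main`, and `ksvrdp_int` in turn. Those are the four functions examined in the analysis.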

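II. Problem Analysis

1. Summary of findings
1) Database version: 12.2.0.1
2) Error: ORA-600 [ksliwat: bad wait time] [18446744073709471616]
3) Process involved: kjsc_main(): error SCM0
ORA-00447: fatal error in background process
4) Function names: ksliwat, kslwaitctx, kjsc_main, ksvrdp_int
5) Goal: assess the impact of the ORA-600 error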
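
The second argument of the ORA-600, 18446744073709471616, sits just below 2^64. One plausible reading, and this is my assumption since Oracle does not document the argument, is that it is a negative value stored in an unsigned 64-bit field. That would fit the "bad wait time" wording. The sketch below only does the two's-complement arithmetic. It does not tell us the unit of the value.

```python
# Second ORA-600 argument, copied from the trace.
RAW_ARGUMENT = 18446744073709471616


def as_signed_64(value):
    """Reinterpret an unsigned 64-bit integer as a two's-complement signed one."""
    if not 0 <= value < 2**64:
        raise ValueError("value does not fit in 64 bits")
    return value - 2**64 if value >= 2**63 else value


signed_value = as_signed_64(RAW_ARGUMENT)
print(signed_value)  # -80000
```

If this reading is right, the wait routine was handed a timeout of -80000 in some internal unit, which it rejected as invalid.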


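2. Information lookup
Function notes
Function    Description
ksliwat    kernel service lock management inner wait function; setup a wait that times out
kslwaitctx    wait context; wait until timeout
kjsc_main    kernel lock management RAC global stats
ksvrdp_int    kernel service (VOS) slave management run generic detached slave process
Reading the stack from the bottom up: the global lock statistics routine (kjsc_main) entered a timed wait through kslwaitctx and ksliwat, and the error was raised inside that wait.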

3. Description of the failing SCM0 process

Name: SCM0
Expanded Name: DLM Statistics Collection and Management Slave
Short Description: Collects and manages statistics related to global enqueue service (GES) and global cache service (GCS)
Long Description: The DLM Statistics Collection and Management slave (SCM0) is responsible for collecting and managing the statistics related to global enqueue service (GES) and global cache service (GCS). This slave exists only if DLM statistics collection is enabled.
External Properties: Database instances

In other words, SCM0 only collects and manages GES and GCS statistics, and the process exists only while DLM statistics collection is enabled.

Query the DLM statistics collection parameters:

SYS@ora122>select a.ksppinm,b.ksppstvl,a.ksppdesc from x$ksppi a,x$ksppcv b where (a.indx=b.indx) and a.ksppinm like '%_dlm_stats_collect%';

KSPPINM KSPPSTVL KSPPDESC

--------- -------------------- ------------------------------

_dlm_stats_collect 1 DLM statistics collection(0 = disable (default), 1 = enable)

_dlm_stats_collect_mode 0 DLM statistics collection mode

_dlm_stats_collect_slot_interval 60 DLM statistics collection slot interval (in seconds)

_dlm_stats_collect_du_limit 3000 DLM statistics collection disk updates per slot

III. Resolution

  Excerpt from the MOS document
12.2 RAC DB Background process SCM0 consuming excessive CPU (Doc ID 2373451.1)
The DLM Statistics Collection and Management slave (SCM0) is responsible for collecting and managing the 
statistics related to global enqueue service (GES) and global cache service (GCS). This slave exists only if
 DLM statistics collection is enabled.

The value is set to 1. Please go ahead and run the following command to change the value of _dlm to 0:
alter system set "_dlm_stats_collect" = 0 scope = spfile sid = '*';
This does require a reboot for changes to take effect. If a reboot is not an option, as a workaround 
you may kill the SCM0 process at OS level, it will respawn a new process soon.
kill -9 <os pid of SCM0>

Note: Disabling dlm_stats_collect (ie setting to 0) has no negative effect in 12.2. 
 This is because the stats are not yet used in 12.2 
(the features that would use these stats service based affinity and cache warmup are also disabled in 12.2 by default). 
Versions 18 and 19 may have them enabled, so re-evaluate at that time.
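
For the kill workaround mentioned in the excerpt, you first need the OS pid of SCM0. On Linux, Oracle background processes appear in `ps` as `ora_<process>_<SID>`, so the pid can be picked out of the process list. The sketch below works on a pasted sample of `ps -eo pid,args` output. The sample text and the SID `ora1221` are made up for illustration, and only the pid 63791 comes from the trace in this post. It deliberately does not run `kill` itself.

```python
import re

# Hypothetical `ps -eo pid,args` output; the SID "ora1221" is invented.
SAMPLE_PS = """\
  PID COMMAND
63702 ora_lmon_ora1221
63791 ora_scm0_ora1221
64010 oracleora1221 (LOCAL=NO)
"""


def find_background_pids(ps_output, process_name):
    """Return (pid, sid) pairs for Oracle background processes with the given name."""
    pattern = re.compile(
        r"^\s*(\d+)\s+ora_%s_(\S+)\s*$" % re.escape(process_name.lower())
    )
    hits = []
    for line in ps_output.splitlines():
        match = pattern.match(line)
        if match:
            hits.append((int(match.group(1)), match.group(2)))
    return hits


scm0_processes = find_background_pids(SAMPLE_PS, "SCM0")
print(scm0_processes)  # [(63791, 'ora1221')]
```

On a host running several instances the list can contain more than one entry, so check the SID before killing anything.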

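Based on the analysis, this ORA-600 was raised by the SCM0 process after its wait ran abnormally long while it was collecting GES and GCS statistics. According to MOS Doc ID 2373451.1, the statistics SCM0 collects are not used in 12.2, so this error can either be ignored or avoided by disabling the DLM statistics collection parameter.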

Additional notes (thanks to everyone who shared their knowledge online; without them I would not have known what any of this does):

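About DLM
DLM stands for distributed lock manager. Put simply, in a RAC environment every data modification first requires the node to ask the DLM for permission to modify the block. The DLM coordinates modifications to the same block resource across multiple nodes.
SCM0 is the process that collects and manages DLM statistics.
SCM0 short description
Collects and manages statistics related to the global enqueue service (GES) and global cache service (GCS).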

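What are GES and GCS?
Global Enqueue Service (GES): mainly responsible for keeping the dictionary cache and library cache consistent. The dictionary cache is a cache of data dictionary information held in an instance's SGA for fast access. Because this information is held in memory, any change made to the dictionary on one node (such as DDL) must be propagated immediately to the dictionary caches on all nodes. GES handles this and removes discrepancies between instances. For the same reason, library cache locks are taken on database objects in order to parse the SQL statements that affect those objects. These locks must be maintained across instances, and GES must ensure that no deadlocks occur between multiple instances requesting access to the same objects. The LMON, LCK, and LMD processes work together to implement GES. Apart from the maintenance and management of the data blocks themselves, which GCS handles, GES is the key service that coordinates other resources between nodes in a RAC environment.
GES controls all library cache locks and dictionary cache locks in the database. These resources are local in a single-instance database but become global resources in a RAC cluster. Global locks are also used to protect data structures and to manage transactions. In general, transaction locks and table locks behave the same way in a RAC environment as in a single-instance environment.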

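GCS is the mechanism Oracle uses to implement Cache Fusion. The blocks and locks managed by GCS and GES are called resources. Access to these resources must be coordinated across the instances of the cluster, and this coordination happens at both the instance level and the database level. Coordination at the instance level is called local resource coordination; coordination at the database level is called global resource coordination.
Local resource coordination works much like resource coordination in single-instance Oracle and covers block-level access, space management, dictionary cache and library cache management, row-level locks, and SCN generation. Global resource coordination is specific to RAC and uses additional memory components in the SGA, additional algorithms, and additional background processes.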
