[ORA_NPI_ERROR][ORA-00600 [kffilCreate01], [vipnew3:vipnew] 问题分析

一、问题现象

某客户反馈，一套系统晚上数据库报ORA-600后,DB down挂了,重启后业务正常，需要分析一下问题。

参考

http://www.killdb.com/2015/10/07/mapid%E6%BA%A2%E5%87%BA%E5%AF%BC%E8%87%B4oracle-rac-%E8%8A%82%E7%82%B9%E9%87%8D%E5%90%AF/
Bug 16357438 : ORA-600 [KFFILCREATE01] IN ASM TRIGGERS DB CRASH WHEN MAPID REACHES MAX VALUE

二、问题分析

吐槽一下，这个问题和上面参考文档大佬roger的文档几乎一模一样，感谢前面有大佬们的资料进行学习，但是本着技术积累的原则，博客还是梳理一下！

2.1 Db alert日志

说明这个时间点DB正常，还是正常往DG发送日志。
Thu Mar 25 22:15:25 2021
ARC5: Standby redo logfile selected for thread 3 sequence 57833 for destination LOG_ARCHIVE_DEST_2
Thu Mar 25 22:22:27 2021
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
NOTE: ASMB terminating
Errors in file /opt/app/oracle/diag/rdbms/vipnew/vipnew3/trace/vipnew3_asmb_17197.trc:
ORA-15064: communication failure with ASM instance
ORA-00600: internal error code, arguments: [ORA_NPI_ERROR], [600],
 [ORA-00600: internal error code, arguments: [kffilCreate01], [vipnew3:vipnew], [], [], [], [], [], [], [], [], [], []
], [], [], [], [], [], [], [], [], []
Errors in file /opt/app/oracle/diag/rdbms/vipnew/vipnew3/trace/vipnew3_asmb_17197.trc:
ORA-15064: communication failure with ASM instance
ORA-00600: internal error code, arguments: [ORA_NPI_ERROR], [600],
 [ORA-00600: internal error code, arguments: [kffilCreate01], [vipnew3:vipnew], [], [], [], [], [], [], [], [], [], []
], [], [], [], [], [], [], [], [], []
ASMB (ospid: 17197): terminating the instance due to error 15064
Thu Mar 25 22:22:27 2021
System state dump requested by (instance=3, osid=17197 (ASMB)), summary=[abnormal instance termination].
System State dumped to trace file /opt/app/oracle/diag/rdbms/vipnew/vipnew3/trace/vipnew3_diag_17137.trc
Thu Mar 25 22:22:28 2021
opiodr aborting process unknown ospid (4595) as a result of ORA-1092
Thu Mar 25 22:22:28 2021
ORA-1092 : opitsk aborting process
Thu Mar 25 22:22:28 2021
······
关注黑体，ASMB!  由于错误15064而终止实例
ora-15064 错误  communication failure with ASM instance  与ASM实例通信失败

梳理一下，Oracle实例与ASM实例通讯失败，然后Oracle DB实例down了！

疑问？ ASMB进程干啥的？它咋那么nb，通讯失败就把Oracle实例搞down了？

http://blog.itpub.net/8494287/viewspace-1444313/

ASM实例与数据库实例进行通信的桥梁是ASMB进程，此进程运行在每个数据库实例上，是两个实例间信息交换的通道。ASMB进程先利用磁盘组名称通过CSS获得管理该磁盘组的ASM实例连接串，然后建立一个到ASM的持久连接，这样两个实例之间就可以通过这条连接定期交换信息，同时这也是一种心跳监控机制。

Every database instance that uses ASM for file storage will also need two new processes. The Rebalancer background process (RBAL) handles global opens of all ASM disks in the ASM Disk Groups, while the ASM Bridge process (ASMB) connects as a foreground process into the ASM instance when the regular database instance starts. ASMB facilitates communication between the ASM instance and the regular database, including handling physical file changes like data file creation and deletion.

ASMB exchanges messages between both servers for statistics update and instance health validation. These two processes are automatically started by the database instance when a new Oracle file type - for example, a tablespace's datafile -- is created on an ASM disk group. When an ASM instance mounts a disk group, it registers the disk group and connect string with Group Services. The database instance knows the name of the disk group, and can therefore use it to locate connect information for the correct ASM instance.

https://blog.csdn.net/liqfyiyi/article/details/8036009

每个使用ASM进行文件存储的数据库实例也需要两个新进程。Rebalancer后台进程（RBAL）处理ASM磁盘组中所有ASM磁盘的全局打开，而ASM桥接进程（ASMB）在常规数据库实例启动时作为前台进程连接到ASM实例。ASMB促进ASM实例和常规数据库之间的通信，包括处理物理文件更改，如数据文件创建和删除。

ASMB在两台服务器之间交换消息以进行统计更新和实例运行状况验证。当在ASM磁盘组上创建新的Oracle文件类型（例如表空间的数据文件）时，数据库实例会自动启动这两个进程。ASM实例装载磁盘组时，它会注册磁盘组并将字符串与组服务连接。数据库实例知道磁盘组的名称，因此可以使用它来查找正确ASM实例的连接信息。

懂了，ASMB进程是Oracle数据库启动，Oracle实例启动ASMB进程，并且作为ASM实例的前台进程(一个会话)，进行实例之间的交互！

2.2 ASM Alert

2020-05-19 23:18:06.062
[/opt/app/grid/product/11.2.0/crs_1/bin/oraagent.bin(6691)]CRS-5013:Agent "/opt/app/grid/product/11.2.0/crs_1/bin/oraagent.bin" 
failed to start process "/opt/app/grid/product/11.2.0/crs_1/bin/lsnrctl" for action "check": details at "(:CLSN00008:)"
 in "/opt/app/grid/product/11.2.0/crs_1/log/zyk-rac3/agent/crsd/oraagent_grid/oraagent_grid.log"
[client(3523)]CRS-10001:12-Nov-20 20:39 ACFS-9203: true
2021-03-25 22:22:28.499
[/opt/app/grid/product/11.2.0/crs_1/bin/oraagent.bin(17001)]CRS-5011:Check of resource "vipnew" failed: details at "(:CLSN00007:)"
 in "/opt/app/grid/product/11.2.0/crs_1/log/zyk-rac3/agent/crsd/oraagent_oracle/oraagent_oracle.log"
2021-03-25 22:22:28.650
[crsd(16606)]CRS-2765:Resource 'ora.vipnew.db' has failed on server 'zyk-rac3'.
2021-03-25 22:22:34.409
[crsd(16606)]CRS-2765:Resource 'ora.vipnew.vipnew03.svc' has failed on server 'zyk-rac3'.
2021-03-25 22:22:34.409
[crsd(16606)]CRS-2771:Maximum restart attempts reached for resource 'ora.vipnew.vipnew03.svc'; will not restart.
没啥信息，这部分比较抱歉，没有具体的trace相关信息。

2.3 观察Trace日志

Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
System name:    Linux
Instance name: +ASM3
Machine:    x86_64

Incident 533424 created, dump file: /opt/app/oracle/diag/asm/+asm/+ASM3/incident/incdir_533424/+ASM3_ora_17201_i533424.trc
ORA-00600: internal error code, arguments: [kffilCreate01], [vipnew3:vipnew], [], [], [], [], [], [], [], [], [], []


----- Incident Context Dump -----
Address: 0x7fffb73a1010
Incident ID: 533424
Problem Key: ORA 600 [kffilCreate01]
Error: ORA-600 [kffilCreate01] [vipnew3:vipnew] [] [] [] [] [] [] [] [] [] []
[00]: dbgexProcessError [diag_dde]
[01]: dbgeExecuteForError [diag_dde]
[02]: dbgePostErrorKGE [diag_dde]
[03]: dbkePostKGE_kgsf [rdbms_dde]
[04]: kgeadse []
[05]: kgerinv_internal []
[06]: kgerinv []
[07]: kgeasnmierr []
[08]: kffilCreate [ASM]<-- Signaling
[09]: kfnsUFG [KFNU]
[10]: kfnsBackground [KFNU]
[11]: kfnDispatch [KFN]
[12]: opiodr []
[13]: ttcpip []
[14]: opitsk []
[15]: opiino []
[16]: opiodr []
[17]: opidrv []
[18]: sou2o []
[19]: opimai_real []
[20]: ssthrdmain []
[21]: main []
[22]: __libc_start_main []
[23]: _start []



(kff)ilCreate - kernel automatic storage management KFF ??  这个不一定准确！

kgeasnmierr ==> kernel generic error assert named internal error 
  内核通用错误断言名为internal error

kgerinv ==> kernel generic error record internal named error with va_list

内核通用错误记录内部命名错误和列表

但是观察函数调用kgeasnmierr 基本上已经是错误堆栈的函数调用了，那么它的前一个函数就是最后调用的！

意味着Oracle进程调用函数跑的好好地，执行调用kffilCreate发送了问题异常，随后进入异常处理部分！

# cat +ASM3_ora_17201_i533424.trc|grep mapid
(kffil) netnm: <null>, mapid: <null>
(kffil) netnm: vipnew3:vipnew, mapid: 4294967294
(kffil) netnm: vipnew3:vipnew, mapid: 4294967293
(kffil) netnm: vipnew3:vipnew, mapid: 4294967286
(kffil) netnm: vipnew3:vipnew, mapid: 4294967284
(kffil) netnm: vipnew3:vipnew, mapid: 4294967268
(kffil) netnm: vipnew3:vipnew, mapid: 4294967267
(kffil) netnm: vipnew3:vipnew, mapid: 4294967264
(kffil) netnm: vipnew3:vipnew, mapid: 4294909598
(kffil) netnm: vipnew3:vipnew, mapid: 4294909558

······

(kffil) netnm: vipnew3:vipnew, mapid: 95
(kffil) netnm: vipnew3:vipnew, mapid: 3

······

(kffil) netnm: vipnew3:vipnew, mapid: 5
(kffil) netnm: vipnew3:vipnew, mapid: 4
(kffil) netnm: vipnew3:vipnew, mapid: 2
2021-03-25 22:22:24.927789 :BD6AA7DB:db_trace:kfns.c@5381:kfnsFillMapMsg(): [10491:32:1985] MAP_LOAD: gn=3 fn=695, mapid=-6 start=60 cnt=441
2021-03-25 22:22:24.928051 :BD6AA81A:db_trace:kfns.c@5381:kfnsFillMapMsg(): [10491:32:1985] MAP_LOAD: gn=3 fn=291, mapid=-5 start=60 cnt=441
2021-03-25 22:22:24.929067 :BD6AA8CA:db_trace:kfns.c@5341:kfnsFillMapMsg(): [10491:32:1985] MAP_FREE: gn=3 fn=695, mapid=-8
2021-03-25 22:22:24.929662 :BD6AA914:db_trace:kfns.c@5341:kfnsFillMapMsg(): [10491:32:1985] MAP_FREE: gn=3 fn=291, mapid=-5

存在负数的 mapid=-6，时间点是2021-03-25 22:22:24， Oracle DB 报通讯失败的时间点是Thu Mar 25 22:22:27 2021

三、如何解决

1.出现这个问题，基本上DB已经down了，手工启动ASM实例，DB实例。或者重启故障的节点集群就行；

2.后续避免出现这个问题：

Bug 16357438  ORA-600 [kffilCreate01] in ASM
Fixed: >=11.2.0.4 (Server Patch Set)      升级到11.2.0.4 直接fixed，不存在这个问题了，Oralce内部解决了这个Bug异常。

11.2.0.3 版本如何处理？
Oracle提供bug对应的补丁,但是需要 Oracle db的psu版本为11.2.0.3.12 ！！！ 
https://updates.oracle.com/Orion/Services/download?type=readme&aru=18073013

Oracle Database 11g Release 11.2.0.3.12

11g Patch for 16357438 for Linux-x86-64 Platforms

1.	Ensure that the Oracle home on which you are installing the patch or from which you are rolling back the patch is
 Oracle Database 11g Release 11.2.0.3.12.
 
2.	Ensure that 11gg Release 11.2.0.3.12 Patch Set Update (PSU) 19121548 is already applied on the Oracle Database.