超融合硬件损坏导致Oracle RAC异常恢复实录

墨墨导读:一套Oracle RAC环境运行在HW超融合环境中,由于硬件问题导致数据库crash,期间出现了不少数据坏块,本文详述整个恢复过程,希望对大家有帮助。

前几天某客户遇到一个棘手问题:其一套Oracle RAC环境运行在HW超融合环境中,由于硬件问题导致数据库crash,期间出现了不少数据坏块,不过还好客户有RMAN物理备份,因此客户提前进行了全库Restore。

首先我们来看下相关日志信息:

Trying mirror side DATADG_0002.
Reread of blocknum=145005, file=+DATADG/xxxxx/datafile/supervise. found same corrupt data
Reread of blocknum=145005, file=+DATADG/xxxxx/datafile/supervise. found valid data
Sun Apr 26 01:22:35 2020
Hex dump of (file 11, block 230454) in trace file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_ora_13254.trc
Corrupt block relative dba: 0x02c38436 (file 11, block 230454)
Bad header found during backing up datafile
Data in bad block:
type: 6 format: 2 rdba: 0x0082f736
last change scn: 0x0000.4b5a95d7 seq: 0x1 flg: 0x06
spare1: 0x0 spare2: 0x0 spare3: 0x0
consistency value in tail: 0x95d70601
check value in block header: 0xa3d1
computed block checksum: 0x0
Trying mirror side DATADG_0002.
Reread of blocknum=230454, file=+DATADG/xxxxx/datafile/supdb_etyy.dbf. found same corrupt data
Reread of blocknum=230454, file=+DATADG/xxxxx/datafile/supdb_etyy.dbf. found valid data
Hex dump of (file 11, block 230458) in trace file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_ora_13254.trc
Corrupt block relative dba: 0x02c3843a (file 11, block 230458)

上述类似的大量坏块信息,最终导致数据库挂掉。

我们可以看到客户这里asm diskgroup为normal冗余;当primary extent数据有问题时,Oracle会尝试从mirror extent去获取;如果mirror extent是正常的,那么Oracle会自动进行修复,否则会导致数据丢失,严重的话会导致数据库宕机。

Mon Apr 27 02:01:15 2020
Read of datafile '+DATADG/orclsk/datafile/sysaux.260.956077231' (fno 2) header failed with ORA-01207
Rereading datafile 2 header from mirror side 'DATADG_0000' failed with ORA-01207
Rereading datafile 2 header from mirror side 'DATADG_0006' failed with ORA-01207
Errors in file /u01/app/oracle/diag/rdbms/orclsk/orclsk1/trace/orclsk1_ckpt_4893.trc:
ORA-63999: data file suffered media failure
ORA-01122: database file 2 failed verification check
ORA-01110: data file 2: '+DATADG/orclsk/datafile/sysaux.260.956077231'
ORA-01207: file is more recent than control file - old control file
Errors in file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_ckpt_4893.trc:
ORA-63999: data file suffered media failure
ORA-01122: database file 2 failed verification check
ORA-01110: data file 2: '+DATADG/xxxxx/datafile/sysaux.260.956077231'
ORA-01207: file is more recent than control file - old control file
CKPT (ospid: 4893): terminating the instance due to error 63999

因为部分文件需要介质恢复(因为primary和mirror 数据都异常),而数据库强行终止了实例。

数据库实例重启后则开始出现大量错误:

Mon Apr 27 02:02:03 2020
Read of datafile '+DATADG/xxxxx/datafile/mscsz.dbf' (fno 40) header failed with ORA-01207
Rereading datafile 40 header from mirror side 'DATADG_0000' failed with ORA-01207
Rereading datafile 40 header from mirror side 'DATADG_0005' failed with ORA-01207
Errors in file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_dbw0_115709.trc:
ORA-01186: file 40 failed verification tests
ORA-01122: database file 40 failed verification check
ORA-01110: data file 40: '+DATADG/xxxxx/datafile/mscsz.dbf'
ORA-01207: file is more recent than control file - old control file
File 40 not verified due to error ORA-01122
.......
Mon Apr 27 02:02:07 2020
QMNC started with pid=46, OS id=115887 
Errors in file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_ora_115771.trc  (incident=299881):
ORA-01578: ORACLE data block corrupted (file # 1, block # 3620)
ORA-01110: data file 1: '+DATADG/xxxxx/datafile/system.259.956077225'
Incident details in: /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/incident/incdir_299881/xxxxx1_ora_115771_i299881.trc
Mon Apr 27 02:02:16 2020
Dumping diagnostic data in directory=[cdmp_20200427020216], requested by (instance=1, osid=115771), summary=[incident=299881].
Mon Apr 27 02:02:17 2020
Sweep [inc][299881]: completed
Mon Apr 27 02:02:18 2020
Completed: ALTER DATABASE OPEN /* db agent *//* {0:13:1899} */
Mon Apr 27 02:02:18 2020
ORA-21700 encountered when generating server alert SMG-4120
ORA-21700 encountered when generating server alert SMG-4121
ORA-21700 encountered when generating server alert SMG-4121
Mon Apr 27 02:02:20 2020
Starting background process CJQ0
Mon Apr 27 02:02:20 2020
CJQ0 started with pid=54, OS id=115990 
Mon Apr 27 02:02:20 2020
Errors in file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_m004_115984.trc  (incident=300025):
ORA-01578: ORACLE data block corrupted (file # 2, block # 101561)
ORA-01110: data file 2: '+DATADG/xxxxx/datafile/sysaux.260.956077231'
Incident details in: /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/incident/incdir_300025/xxxxx1_m004_115984_i300025.trc
Dumping diagnostic data in directory=[cdmp_20200427020222], requested by (instance=1, osid=115984 (M004)), summary=[incident=300025].
Mon Apr 27 02:02:22 2020
Sweep [inc][300025]: completed
Errors in file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_m004_115984.trc  (incident=300026):
ORA-01578: ORACLE data block corrupted (file # , block # )
Incident details in: /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/incident/incdir_300026/xxxxx1_m004_115984_i300026.trc
Errors in file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_m004_115984.trc  (incident=300027):
ORA-01578: ORACLE data block corrupted (file # , block # )
ORA-06512: at "SYS.PRVT_ADVISOR", line 6144
ORA-06512: at line 1
Incident details in: /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/incident/incdir_300027/xxxxx1_m004_115984_i300027.trc
Dumping diagnostic data in directory=[cdmp_20200427020224], requested by (instance=1, osid=115984 (M004)), summary=[incident=300026].
Dumping diagnostic data in directory=[cdmp_20200427020224], requested by (instance=1, osid=115984 (M004)), summary=[incident=300027].
Mon Apr 27 02:02:46 2020
Hex dump of (file 1, block 24895) in trace file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_ora_116134.trc
Corrupt block relative dba: 0x0040613f (file 1, block 24895)
Bad header found during buffer read
Data in bad block:
type: 6 format: 2 rdba: 0x00405c3f
last change scn: 0x0000.0002a412 seq: 0x1 flg: 0x06
spare1: 0x0 spare2: 0x0 spare3: 0x0
consistency value in tail: 0xa4120601
check value in block header: 0xa486
computed block checksum: 0x0
Reading datafile '+DATADG/xxxxx/datafile/system.259.956077225' for corruption at rdba: 0x0040613f (file 1, block 24895)
Read datafile mirror 'DATADG_0001' (file 1, block 24895) found same corrupt data (no logical check)
Read datafile mirror 'DATADG_0003' (file 1, block 24895) found valid data
Hex dump of (file 1, block 24895) in trace file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_ora_116134.trc
Repaired corruption at (file 1, block 24895)

不难看出数据库控制文件和system都出现了异常,这也难怪数据库无法正常打开。

期间还出现了ora-00600 [4193]等常见错误:

Mon Apr 27 02:55:26 2020
QMNC started with pid=44, OS id=3072 
Mon Apr 27 02:55:26 2020
Errors in file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_mmon_2939.trc  (incident=323867):
ORA-00600: internal error code, arguments: [4193], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/incident/incdir_323867/xxxxx1_mmon_2939_i323867.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Completed: ALTER DATABASE OPEN /* db agent *//* {0:17:166} */
ARC3: Archival started
ARC0: STARTING ARCH PROCESSES COMPLETE
Mon Apr 27 02:55:29 2020
Dumping diagnostic data in directory=[cdmp_20200427025529], requested by (instance=1, osid=2939 (MMON)), summary=[incident=323867].
Block recovery from logseq 17833, block 46 to scn 1551059886
Recovery of Online Redo Log: Thread 1 Group 1 Seq 17833 Reading mem 0
Mem# 0: +DATADG/xxxxx/onlinelog/group_1.257.956077213
Mem# 1: +ARCHDG/xxxxx/onlinelog/group_1.257.956077217
Block recovery stopped at EOT rba 17833.54.16
Block recovery completed at rba 17833.54.16, scn 0.1551059805
Block recovery from logseq 17833, block 46 to scn 1551059575
Recovery of Online Redo Log: Thread 1 Group 1 Seq 17833 Reading mem 0
Mem# 0: +DATADG/xxxxx/onlinelog/group_1.257.956077213
Mem# 1: +ARCHDG/xxxxx/onlinelog/group_1.257.956077217
Block recovery completed at rba 17833.46.16, scn 0.1551059664
Errors in file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_mmon_2939.trc  (incident=323868):
ORA-00600: internal error code, arguments: [4193], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/incident/incdir_323868/xxxxx1_mmon_2939_i323868.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Dumping diagnostic data in directory=[cdmp_20200427025534], requested by (instance=1, osid=2939 (MMON)), summary=[incident=323868].
Block recovery from logseq 17833, block 46 to scn 1551059886
Recovery of Online Redo Log: Thread 1 Group 1 Seq 17833 Reading mem 0
Mem# 0: +DATADG/xxxxx/onlinelog/group_1.257.956077213
Mem# 1: +ARCHDG/xxxxx/onlinelog/group_1.257.956077217
Block recovery completed at rba 17833.54.16, scn 0.1551059888
Block recovery from logseq 17833, block 46 to scn 1551059888
Recovery of Online Redo Log: Thread 1 Group 1 Seq 17833 Reading mem 0
Mem# 0: +DATADG/xxxxx/onlinelog/group_1.257.956077213
Mem# 1: +ARCHDG/xxxxx/onlinelog/group_1.257.956077217
Block recovery completed at rba 17833.56.16, scn 0.1551059889
Mon Apr 27 02:55:36 2020
。。。。。。

对于此次恢复case总体来讲比较简单,这里提供一下处理思路:

  1. 首先利用客户的归档和Redo(Redo log客户已copy到了本地进行备份),进行正常的recover database using backup controlfile until cancel操作;

  2. 由于恢复过程中出现了ora-00600 [3020]错误,因此需要通过recover database allow xx corruption的方式进行;

  3. 完成恢复之后尝试打开数据库;

  4. 打开数据库时仍然提示ora-01113和ora-01110错误,即system文件还需要进行恢复;这种情况下只能先强制拉库;通过加入*._allow_resetlogs_corruption=TRUE *._allow_error_simulation=true 等隐含参数即可;

  5. 若上述参数后仍然提示undo 存在坏块,由于该数据库版本在报错时不会直接提示具体是哪个回滚段有问题,在alter database open resetlogs之前通过10046 event定位error cursor,找到回滚段名称即可;

  6. 使用_corrupted_rollback_segments参数屏蔽回滚段;

  7. open resetlogs数据库成功;

  8. 最后就是重建undo以及处理相关坏块等善后工作。

 

恢复完成之后,由于客户担心HW超融合环境再次出现故障因此进行了全库备份并进行数据迁移到新平台,到这里这个case告一段落。

 

再次叮嘱大家,注意数据库备份、注意数据容灾环境建设!

推荐阅读:144页!分享珍藏已久的数据库技术年刊

数据和云

ID:OraNews

如有收获,请划至底部,点击“在看”,谢谢!

点击下图查看更多 ↓

云和恩墨大讲堂 | 一个分享交流的地方

长按,识别二维码,加入万人交流社群

请备注:云和恩墨大讲堂

  点个“在看”

你的喜欢会被看到❤

原文地址:https://www.cnblogs.com/hzcya1995/p/13311680.html