Oracle: Backup and Recovery of the SYSTEM Tablespace

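This article describes how to back up and recover data files, on the assumption that neither the backups nor the redo logs (archived and online) have been lost.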

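Critical data files are the data files of the SYSTEM tablespace and the data files of the automatic undo tablespace named by the undo_tablespace parameter (called the undo_tablespace data files below).

Damage to them, whether total or partial, can cause SQL statements to fail, user sessions to be forcibly disconnected, SYS logins to fail, or even the whole instance to crash.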

SQL> select file_id,file_name from dba_data_files where tablespace_name in ('SYSTEM',(select value from v$parameter where name='undo_tablespace'));
   FILE_ID FILE_NAME
---------- --------------------------------------------------
         3 /u01/app/oracle/oradata/orcl/undotbs01.dbf
         1 /u01/app/oracle/oradata/orcl/system01.dbf

9.1 Consequences of damage to critical data files

9.1.1 Damage to SYSTEM tablespace data files

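The SYSTEM tablespace holds two kinds of essential data: the Oracle system tables (the data dictionary), which the database needs in order to run at all, and the undo segment named SYS.SYSTEM, which is the system rollback segment.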

Cases discussed: missing file, corrupted file header, corrupted data dictionary, and corrupted SYS.SYSTEM undo segment.

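--1 If system01.dbf is missing or inaccessible, startup stops at the mount state.
--2 If the header of system01.dbf is corrupted, startup stops at the mount state; if the corruption happens at run time, the instance crashes once a checkpoint is initiated.
--3 If the data dictionary is corrupted, object definitions, name resolution, user accounts, and privilege management all break down. If the corruption occurs while the instance is running, SQL statements usually fail with ORA-00604; if it occurs at instance startup, startup does not necessarily abort, but the alert log will contain ORA-01578 and ORA-01110.
--4 If the header of the SYS.SYSTEM undo segment in system01.dbf is corrupted, the instance is forcibly shut down during startup; startup mount must be used to reach the mount state.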

The following scenarios show different kinds of damage to system01.dbf.

Scenario 1: system01.dbf is found to be missing when the database starts, and startup is interrupted

SQL> startup
ORA-01157: cannot identify/lock data file 1 - see DBWR trace file
ORA-01110: data file 1: '/u01/app/oracle/oradata/orcl/system01.dbf'

Scenario 2: the header of system01.dbf is found to be corrupted when the database starts, and startup is interrupted

SQL> startup
ORACLE instance started.
Total System Global Area  784998400 bytes
Fixed Size            2257352 bytes
Variable Size          511708728 bytes
Database Buffers      264241152 bytes
Redo Buffers            6791168 bytes
Database mounted.
ORA-01122: database file 1 failed verification check
ORA-01110: data file 1: '/u01/app/oracle/oradata/orcl/system01.dbf'
ORA-01210: data file header is media corrupt

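Scenario 3: while the database is running, a data block of SYS.USER$, the data dictionary table in system01.dbf that stores user account information, is corrupted, and user logins fail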

$ sqlplus test/***
ERROR:
ORA-00604: error occurred at recursive SQL level 1
ORA-01578: ORACLE data block corrupted (file # 1,block # 213)
ORA-01110: data file 1: '/u01/app/oracle/oradata/orcl/system01.dbf'

Scenario 4: while the database is running, a data block of the data dictionary table SYS.TAB$ in system01.dbf is corrupted

SQL> select * from test.t1;
ERROR at line 1:
ORA-00604: error occurred at recursive SQL level 1
ORA-01578: ORACLE data block corrupted (file # 1,block # 83226)
ORA-01110: data file 1: '/u01/app/oracle/oradata/orcl/system01.dbf'

Scenario 5: a data block of the data dictionary table SYS.PROCEDURE$ is corrupted, and every create, drop, and rename fails

SQL> create table test.t2 (id number,name varchar2(20)) tablespace test;
ERROR at line 1:
ORA-00604: error occurred at recursive SQL level 1
ORA-01578: ORACLE data block corrupted (file # 1,block # 89226)
ORA-01110: data file 1: '/u01/app/oracle/oradata/orcl/system01.dbf'

Scenario 6: the header of the SYS.SYSTEM undo segment is corrupted, and the instance is forcibly terminated

SQL> startup
ORA-01092: ORACLE instance terminated, Disconnection forced
ORA-01578: ORACLE data block corrupted (file # 1,block # 128)
ORA-01110: data file 1: '/u01/app/oracle/oradata/orcl/system01.dbf'

Scenario 7: the SYS.SYSTEM undo segment and the undo segments of the undo_tablespace tablespace are both corrupted, and DDL fails

SQL> drop table test.t1;
ORA-00603: ORACLE server session terminated by fatal error

Scenario 8: at run time, the header of system01.dbf or of an undo_tablespace data file is corrupted and checkpoints cannot complete; the alert log shows

ORA-01243: system tablespace file suffered media failure
ORA-01122: database file 1 failed verification check
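
Some of the scenarios above only surface in the alert log, so it can help to filter the log for the error codes quoted in this section. The following is a minimal sketch under stated assumptions: the function name scan_alert_log is made up for illustration, and the log path is passed in by the caller because this article does not say where the alert log lives.

```shell
# Print (with line numbers) the alert-log lines that carry any of the
# ORA- codes quoted in the scenarios of this section.
# $1: path to the alert log (hypothetical; supply your own location).
scan_alert_log() {
    log_file=$1
    grep -n -E 'ORA-(00603|00604|01092|01110|01122|01157|01210|01243|01578)' "$log_file"
}
```

Called as scan_alert_log /path/to/alert.log, it prints only the matching lines; an empty result means none of these codes were logged, not that the files are healthy.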

9.1.2 Damage to undo_tablespace data files

The undo_tablespace data file here is undotbs01.dbf; it stores the undo data generated by all statements that change data (DDL and DML).

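--1 If undotbs01.dbf is missing or inaccessible, startup stops at the mount state.
--2 If the header of undotbs01.dbf is corrupted, startup stops at the mount state; if the corruption happens at run time, the instance crashes once a checkpoint is initiated.
--3 If some blocks in the tablespace are corrupted, DML may fail; if all of them are corrupted, every DML statement fails.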

SQL> select name from v$rollname where name<> 'SYSTEM';
NAME
------------------------------
_SYSSMU1_1925302723$
_SYSSMU2_2273571325$
_SYSSMU3_798971445$
_SYSSMU4_2493343136$
_SYSSMU5_44098047$
_SYSSMU6_4194993272$
_SYSSMU7_3978436573$
_SYSSMU8_3643869769$
_SYSSMU9_4157155965$
_SYSSMU10_1224346732$
10 rows selected.

Scenario 1: at run time, a header block in undotbs01.dbf is corrupted (file 3, block 160)

SQL> update test.t1 set name='yhq' where id=1;
ERROR at line 1:
ORA-00603: ORACLE server session terminated by fatal error
ORA-01578: ORACLE data block corrupted (file # 3,block # 160)
ORA-01110: data file 3: '/u01/app/oracle/oradata/orcl/undotbs01.dbf'

Scenario 2: at startup, the header of undotbs01.dbf is corrupted

SQL> startup
ORA-01092: ORACLE instance terminated, Disconnection forced
ORA-01578: ORACLE data block corrupted (file # 3,block # 128)
ORA-01110: data file 3: '/u01/app/oracle/oradata/orcl/undotbs01.dbf'
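
In every scenario above, the ORA-01578 line names the damaged file and block. As a small sketch (the function name corrupt_blocks is an illustrative assumption), the pair can be pulled out of captured error text with sed, which is handy when checking which file number, and therefore which critical data file, is affected:

```shell
# Read error text on stdin and print one "file block" pair for each
# ORA-01578 line, e.g. "3 160". Spaces after '#' and ',' are optional,
# so both "block # 160" and "block #857" are accepted.
corrupt_blocks() {
    sed -n 's/.*ORA-01578:.*(file # *\([0-9][0-9]*\), *block # *\([0-9][0-9]*\)).*/\1 \2/p'
}
```

For example, piping the error text from scenario 1 through corrupt_blocks yields file 3 and block 160, which the dba_data_files query at the top of this article maps to undotbs01.dbf.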

9.2 Backup

RMAN's backup database, backup datafile, and backup tablespace commands can all be used to back up data files.

RMAN> backup as backupset tablespace system,undotbs1;
Starting backup at 18-JUL-19
using channel ORA_DISK_1
channel ORA_DISK_1: starting full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
input datafile file number=00003 name=/u01/app/oracle/oradata/orcl/undotbs01.dbf
input datafile file number=00001 name=/u01/app/oracle/oradata/orcl/system01.dbf
channel ORA_DISK_1: starting piece 1 at 18-JUL-19
channel ORA_DISK_1: finished piece 1 at 18-JUL-19
piece handle=/u01/app/oracle/fra/ORCL/backupset/2019_07_18/o1_mf_nnndf_TAG20190718T111222_glzrwp9z_.bkp tag=TAG20190718T111222 comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:03
Finished backup at 18-JUL-19
Starting Control File and SPFILE Autobackup at 18-JUL-19
piece handle=/u01/app/oracle/fra/ORCL/autobackup/2019_07_18/o1_mf_s_1013944345_glzrwsd5_.bkp comment=NONE
Finished Control File and SPFILE Autobackup at 18-JUL-19

Creating image copies

RMAN> run {
allocate channel c1 device type disk;
allocate channel c2 device type disk;
backup as copy
(datafile '/u01/app/oracle/oradata/orcl/system01.dbf' channel c1)
(datafile '/u01/app/oracle/oradata/orcl/undotbs01.dbf' channel c2);
}

Backing up all data files

RMAN> run {
allocate channel c1 device type disk;
allocate channel c2 device type disk;
allocate channel c3 device type disk;
backup as copy database;
}

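When running the RMAN commands above, make sure the database is in the mount or open state; backing up in the open state additionally requires archivelog mode.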
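
The rule above can be written down as a small pre-flight check. This is only a sketch of the rule as stated in this article: the function name backup_allowed is hypothetical, and the two arguments are assumed to be the values the caller has already read from V$INSTANCE.STATUS and V$DATABASE.LOG_MODE.

```shell
# Return 0 if the RMAN backup commands above may be run, 1 otherwise.
# $1: V$INSTANCE.STATUS    (e.g. MOUNTED, OPEN, STARTED)
# $2: V$DATABASE.LOG_MODE  (e.g. ARCHIVELOG, NOARCHIVELOG)
backup_allowed() {
    status=$1
    log_mode=$2
    case "$status" in
        MOUNTED) return 0 ;;                        # mount state: allowed
        OPEN)    [ "$log_mode" = "ARCHIVELOG" ] ;;  # open: archivelog only
        *)       return 1 ;;                        # not mounted yet
    esac
}
```

A wrapper script could call backup_allowed "$status" "$log_mode" and refuse to start RMAN when it returns non-zero.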

9.3 Recovery

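The core steps for recovering critical data files are: bring the database to the mount state, restore from backup (restore or switch), recover using incremental backups or redo logs (recover), and open the database.

Throughout media recovery (restore and recover) the database stays in the mount state, not the open state, and it cannot accept application connections until recovery is complete.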

9.3.1 Preparing for recovery

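To recover data files, the instance must first reach the mount stage; if it cannot, the parameter file and control file have to be sorted out first.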

To start explicitly to the mount stage: startup mount

If the instance is still up when the problem is discovered, stop it with: shutdown abort

Data dictionary corruption can even prevent SYS from logging in through sqlplus or RMAN

$ sqlplus / as sysdba
ERROR:
ORA-01075: you are currently logged on

Or the login raises no error and reports a connection to an idle instance, even though the instance still exists

$ sqlplus / as sysdba
SQL> select * from v$database;
ERROR at line 1:
ORA-01012: not logged on
Process ID: 0
Session ID: 0 Serial number: 0

Logging in with RMAN fails as well

$ rman target /
RMAN-00554
RMAN-04005: error from target database
ORA-00604: error occurred at recursive SQL level 1
ORA-01578: ORACLE data block corrupted (file # 1,block # 857)
ORA-01110: data file 1: '/u01/app/oracle/oradata/orcl/system01.dbf'

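In this situation the instance has to be terminated before recovery can begin, for example by killing a mandatory background process such as SMON or PMON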

$ kill -9 `ps aux |grep ora_smon_orcl |grep -v grep | awk '{print $2}'`
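
One caveat with the grep pattern above is that ora_smon_orcl also matches an instance named orcl2 on the same host. A hedged alternative is to match the process name exactly; in this sketch the function name smon_pid is made up, and it assumes ps aux style output where the PID is the second column and the process name is the last one.

```shell
# Read `ps aux`-style output on stdin and print the PID of the SMON
# process whose name is exactly ora_smon_<sid>.
# $1: the instance SID, e.g. orcl
smon_pid() {
    sid=$1
    awk -v name="ora_smon_$sid" '$NF == name { print $2 }'
}
```

Usage would be pid=$(ps aux | smon_pid orcl), followed by kill -9 "$pid" only after checking that pid is not empty.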

9.3.2 Recovery procedure

Recovery must be carried out in the mount state. The steps are:

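--1 If the instance has not crashed yet, shut it down with shutdown abort or an operating system kill
--2 Run startup mount to bring the instance to the mount state
--3 Use RMAN restore or switch to restore the damaged critical data files
--4 Use RMAN recover database to recover the data files from the archived and online logs
--5 Run alter database open to open the database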

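Once step 1 has made sure the instance is stopped, the remaining steps can be done in a single RMAN run block, for example to recover the data files of the undotbs1 tablespace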

RMAN> run {
startup mount;
restore tablespace undotbs1;
recover database;
alter database open;
}

Another example: recovering data file 1

RMAN> run {
startup mount;
restore datafile 1;
recover database;
alter database open;
}
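
Because the run block differs only in the file number, it can be generated rather than typed during an outage. The following is a sketch, not a hardened tool: the function name gen_restore_script is hypothetical, and it simply prints the same run block shown above for whichever data file number it is given.

```shell
# Print an RMAN run block that mounts the database, restores one data
# file, recovers the database, and opens it.
# $1: data file number (1 is system01.dbf in the listing at the top).
gen_restore_script() {
    file_no=$1
    case "$file_no" in
        ''|*[!0-9]*) echo "file number must be numeric" >&2; return 1 ;;
    esac
    cat <<EOF
run {
startup mount;
restore datafile $file_no;
recover database;
alter database open;
}
EOF
}
```

The output can be reviewed first and then fed to RMAN, for instance with gen_restore_script 1 | rman target /, assuming the instance has already been stopped as described in step 1.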

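When an image copy of the data file is available on disk, the switch command can be used instead of restore: it immediately changes the data file name recorded in the control file to the name of the image copy. The larger the file, the more time this saves on the restore step.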

First start to the mount state

RMAN> startup mount;

List the image copies and when they were created, then switch to the chosen copy

RMAN> list datafilecopy all;
RMAN> run {
switch datafile 1 to datafilecopy
'/u01/app/oracle/fra/ORCL/autobackup/2019_07_18/o1_mf_s__glzrwsd5_.dbf';
recover database;
alter database open;
}

The path of data file 1 now points to the image copy

SQL> select name from v$datafile where file#=1;
NAME
/u01/app/oracle/fra/ORCL/autobackup/2019_07_18/o1_mf_s__glzrwsd5_.dbf

and the image copy listing now shows the original data file path

RMAN> list copy of tablespace system;
/u01/app/oracle/oradata/orcl/system01.dbf

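If the original data file was already damaged, this image copy (the former original) is of course damaged too, and the DBA should consider whether keeping such a copy makes sense. For that reason, run validate after using switch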

RMAN> validate datafilecopy all;

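If validation reports errors, the damaged copy can be deleted and a fresh image copy created at the original path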

RMAN> delete noprompt datafilecopy 44;
RMAN> backup as copy datafile 1 format '/u01/app/oracle/oradata/orcl/system01.dbf';

Later, whether by choice or by necessity, switch and recover can be used again to move back to the original path.
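
Switching back follows the same pattern as the earlier switch, with the original path as the target. The generator below is a sketch under assumptions: the function name gen_switch_script is hypothetical, the instance is assumed to be in the mount state already, and a valid image copy is assumed to exist at the given path (for example the one created by the backup as copy command above).

```shell
# Print an RMAN run block that switches a data file to an image copy,
# recovers the database, and opens it.
# $1: data file number
# $2: full path of the image copy to switch to
gen_switch_script() {
    file_no=$1
    copy_path=$2
    case "$file_no" in
        ''|*[!0-9]*) echo "file number must be numeric" >&2; return 1 ;;
    esac
    cat <<EOF
run {
switch datafile $file_no to datafilecopy '$copy_path';
recover database;
alter database open;
}
EOF
}
```

For data file 1 the call would be gen_switch_script 1 /u01/app/oracle/oradata/orcl/system01.dbf; the printed block should be checked against list datafilecopy all before it is run.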