UNIX Troubleshooting: Replacing a Failed Hard Disk on a Sun M4000 Server (Case Study)

I. Fault Diagnosis

The messages log shows the disk at sd@1,0 (target 1, i.e. c0t1d0) reporting errors continuously, at error level Retryable, as follows:

root@gdhx # more /var/adm/messages
Aug  5 16:43:03 gdhx scsi: [ID 107833 kern.warning] WARNING: /pci@0,600000/pci@0/pci@8/pci@0/scsi@1/sd@1,0 (sd0):
Aug  5 16:43:03 gdhx    Error for Command: write(10)               Error Level: Retryable
Aug  5 16:43:03 gdhx scsi: [ID 107833 kern.notice]      Requested Block: 30334832                  Error Block: 30334848
Aug  5 16:43:03 gdhx scsi: [ID 107833 kern.notice]      Vendor: FUJITSU                            Serial Number: 0816H01WMN
Aug  5 16:43:03 gdhx scsi: [ID 107833 kern.notice]      Sense Key: Hardware Error
Aug  5 16:43:03 gdhx scsi: [ID 107833 kern.notice]      ASC: 0x44 (<vendor unique code 0x44>), ASCQ: 0xa3, FRU: 0x0

iostat -En shows 20 hard errors on this disk, and the count is still growing.

Conclusion: disk c0t1d0 must be replaced, and the replacement can be done online.
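The hard-error check can be scripted instead of read by eye. The following is a minimal sketch, not part of the original procedure: the function name `hard_error_disks` is made up for illustration, and it assumes the English-locale `iostat -En` summary line of the form `c0t1d0 Soft Errors: 0 Hard Errors: 20 Transport Errors: 0`.

```shell
# hard_error_disks: read `iostat -En` text on stdin and print
# "<device> <hard error count>" for every device whose count is non-zero.
# Assumption: the summary line contains the literal words "Hard Errors:"
# followed by the count (English locale).
hard_error_disks() {
    awk '/Hard Errors:/ {
        for (i = 1; i < NF - 1; i++) {
            if ($i == "Hard" && $(i + 1) == "Errors:" && $(i + 2) + 0 > 0) {
                print $1, $(i + 2)
            }
        }
    }'
}
```

On a live system it would be fed as `iostat -En | hard_error_disks`. Running it twice a few minutes apart shows whether the count is still rising.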

II. Fault Handling Procedure

1. SVM (Solaris Volume Manager) information

root@gdhx # metastat
d4: Mirror
    Submirror 0: d14
      State: Okay
    Submirror 1: d24
      State: Okay
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 167781888 (80 GB)

d14: Submirror of d4
    State: Okay
    Size: 167781888 (80 GB)
     0:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c0t0d0s4          0     No            Okay   Yes

d24: Submirror of d4
    State: Okay
    Size: 167781888 (80 GB)
     0:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c0t1d0s4          0     No            Okay   Yes

d1: Mirror
    Submirror 0: d11
      State: Okay
    Submirror 1: d21
      State: Okay
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 16790400 (8.0 GB)

d11: Submirror of d1
    State: Okay
    Size: 16790400 (8.0 GB)
     0:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c0t0d0s1          0     No            Okay   Yes

d21: Submirror of d1
    State: Okay
    Size: 16790400 (8.0 GB)
     0:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c0t1d0s1          0     No            Okay   Yes

d0: Mirror
    Submirror 0: d10
      State: Okay
    Submirror 1: d20
      State: Okay
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 100355712 (47 GB)

d10: Submirror of d0
    State: Okay
    Size: 100355712 (47 GB)
     0:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c0t0d0s0          0     No            Okay   Yes

d20: Submirror of d0
    State: Okay
    Size: 100355712 (47 GB)
     0:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c0t1d0s0          0     No            Okay   Yes

Device Relocation Information:
Device   Reloc  Device ID
c0t1d0   Yes    id1,sd@n500000e01aff7320
c0t0d0   Yes    id1,sd@n5000c5001782f5b3

root@gdhx # df -k
Filesystem            kbytes    used   avail capacity  Mounted on
/dev/md/dsk/d0       49418200 33780228 15143790    70%    /
/devices                   0       0       0     0%    /devices
ctfs                       0       0       0     0%    /system/contract
proc                       0       0       0     0%    /proc
mnttab                     0       0       0     0%    /etc/mnttab
swap                 2203656    1720 2201936     1%    /etc/svc/volatile
objfs                      0       0       0     0%    /system/object
sharefs                    0       0       0     0%    /etc/dfs/sharetab
fd                         0       0       0     0%    /dev/fd
swap                 2203256    1320 2201936     1%    /tmp
swap                 2201984      48 2201936     1%    /var/run
/dev/md/dsk/d4       82620893 59432265 22362420    73%    /bea

root@gdhx # metadb
        flags           first blk       block count
     a m  p  luo        16              8192            /dev/dsk/c0t0d0s7
     a    p  luo        8208            8192            /dev/dsk/c0t0d0s7
     a    p  luo        16400           8192            /dev/dsk/c0t0d0s7
     a    p  luo        16              8192            /dev/dsk/c0t1d0s7
     a    p  luo        8208            8192            /dev/dsk/c0t1d0s7

root@gdhx # metastat -p
d4 -m d14 d24 1
d14 1 1 c0t0d0s4
d24 1 1 c0t1d0s4
d1 -m d11 d21 1
d11 1 1 c0t0d0s1
d21 1 1 c0t1d0s1
d0 -m d10 d20 1
d10 1 1 c0t0d0s0
d20 1 1 c0t1d0s0
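Before detaching anything, it helps to list exactly which submirrors sit on the failing disk and which mirror each belongs to. The sketch below derives that from `metastat -p` text. The function name `submirrors_on_disk` is hypothetical, and it only understands the simple one-slice layout shown above (`dN -m ...` mirror lines and `dN 1 1 slice` lines).

```shell
# submirrors_on_disk DISK: read `metastat -p` text on stdin and print
# "<mirror> <submirror> <slice>" for each submirror built on DISK.
# Assumption: simple one-way concatenations as in this case. Soft partitions,
# hot spare pools, and multi-slice stripes are not handled.
# Note: on Solaris, use nawk or /usr/xpg4/bin/awk; the legacy /usr/bin/awk
# does not support -v.
submirrors_on_disk() {
    awk -v disk="$1" '
        # Mirror definition: remember the parent of every listed submirror.
        $2 == "-m" {
            for (i = 3; i <= NF; i++)
                if ($i ~ /^d[0-9]+$/) parent[$i] = $1
            next
        }
        # Submirror definition whose slice lives on the requested disk.
        index($NF, disk "s") == 1 {
            slice[$1] = $NF
            order[++n] = $1
        }
        END {
            for (k = 1; k <= n; k++)
                print parent[order[k]], order[k], slice[order[k]]
        }'
}
```

Invoked as `metastat -p | submirrors_on_disk c0t1d0`, it yields one line per submirror for the configuration listed above: d24 under d4, d21 under d1, and d20 under d0.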

2. Detach and clear the submirrors on the failing disk

metadetach -f d0 d20
metadetach -f d1 d21
metadetach -f d4 d24

metaclear d20
metaclear d21
metaclear d24

 

3. Delete the state database replicas on the failing disk

metadb -d /dev/dsk/c0t1d0s7
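It is worth knowing how many replicas will be left on the surviving disk before deleting any. The helper below is a small sketch, not part of the original write-up, that counts replicas per slice from `metadb` output. The name `replicas_per_slice` is made up for illustration.

```shell
# replicas_per_slice: read `metadb` text on stdin and print
# "<slice> <replica count>", sorted by slice name.
# Assumption: each replica line ends with its /dev/dsk/... slice path,
# as in the metadb listing shown earlier.
replicas_per_slice() {
    awk '$NF ~ /^\/dev\/dsk\// { count[$NF]++ }
         END { for (s in count) printf "%s %d\n", s, count[s] }' | sort
}
```

Used as `metadb | replicas_per_slice`. In the listing above, c0t0d0s7 holds three replicas and c0t1d0s7 holds two, so three replicas remain after the deletion.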

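4. Physically replace the disk (pull it only after confirming that its blue LED is lit), then run format to confirm that the new disk is recognized correctly.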

 

5. Copy the partition table from the healthy disk to the new disk

prtvtoc /dev/rdsk/c0t0d0s2 | fmthard -s - /dev/rdsk/c0t1d0s2
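After copying the label, the two partition tables can be compared to confirm the copy took effect. This is a minimal sketch under stated assumptions: `same_vtoc` is a hypothetical helper, it works on two saved `prtvtoc` dumps, and it ignores the header lines that start with `*` because they contain the device name and would always differ.

```shell
# same_vtoc FILE_A FILE_B: succeed (exit 0) if two saved prtvtoc dumps
# describe the same slices. Lines beginning with "*" are prtvtoc comments
# and are skipped.
same_vtoc() {
    a=$(grep -v '^\*' "$1")
    b=$(grep -v '^\*' "$2")
    [ "$a" = "$b" ]
}

# Example use (the /tmp file names are arbitrary):
#   prtvtoc /dev/rdsk/c0t0d0s2 > /tmp/c0t0d0.vtoc
#   prtvtoc /dev/rdsk/c0t1d0s2 > /tmp/c0t1d0.vtoc
#   same_vtoc /tmp/c0t0d0.vtoc /tmp/c0t1d0.vtoc && echo "partition tables match"
```

A mismatch here usually means the new disk has a different geometry, in which case the slices should be checked by hand before any metadevices are built on them.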

 

6. Create the state database replicas on the new disk

metadb -a -f -c 2 c0t1d0s7

 

7. Recreate the submirrors and attach them to their mirrors

metainit d20 1 1 c0t1d0s0
metainit d21 1 1 c0t1d0s1
metainit d24 1 1 c0t1d0s4

metattach d0 d20
metattach d1 d21
metattach d4 d24

 

8. Check the mirror resync progress

metastat | grep %
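Rather than rerunning the check by hand, the same test can be wrapped in a polling loop. The sketch below relies on the same observation as the command above: a resyncing mirror prints a progress line containing `%`. The names `is_resyncing` and `wait_for_resync` are hypothetical, and the 60-second default interval is an arbitrary choice.

```shell
# is_resyncing: succeed if the metastat text on stdin contains a "%"
# progress line. Output goes to /dev/null instead of using grep -q,
# which older grep implementations may not support.
is_resyncing() {
    grep '%' >/dev/null
}

# wait_for_resync [SECONDS]: poll metastat until no progress line remains.
wait_for_resync() {
    while metastat | is_resyncing; do
        sleep "${1:-60}"
    done
    echo "no resync in progress"
}
```

A call such as `wait_for_resync 300` checks every five minutes and returns once all submirrors have finished syncing.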

 

III. Verify the Devices

Check the disk status, logs, and related information:

format
iostat -En
more /var/adm/messages

Original article: https://www.cnblogs.com/xweiqing/p/9075445.html