DS4700 Controller Reboot: Root Cause Analysis

 

 

Version history:

Version  Description    Date
1.0      Initial draft  2017/5/10

 

 

 

 

Note: The content of this document is drawn from official IBM documentation and can be used as a recommendation.


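Chapter 1  Environment

1.1 Controller firmware

The controller firmware and the drive firmware are both at the latest officially recommended versions.

1.2 Interface information

According to colleagues on site, the host channel ports of controller A and controller B are each directly attached to the hosts.

1.3 Host information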

host_alias  host_type                                           host_group     Controller  Logical_Drive_Name
windown1_a  Windows 2000/Server 2003/Server 2008 Non-Clustered  windows_group  B           4,5
windown2_a  Windows 2000/Server 2003/Server 2008 Non-Clustered  windows_group  B           4,5
app1_hba    Solaris (with or without MPXIO)                     app_group      A           3
app2_hba    Solaris (with or without MPXIO)                     app_group      A           1,2,6
dbhba1      Solaris (with or without MPXIO)                     db_group       A           1,2,6
db2_hba     Solaris (with or without MPXIO)                     db_group       A           1,2,6

 

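Chapter 2  Symptoms

On 5/10, controller A rebooted, which interrupted the app and db services attached to it. The affected hosts are:

10.1.121.129    M01-HQ-SV013-DB1   Database (primary)
10.1.121.130    M01-HQ-SV013-DB2   Database (standby, DG database)
10.1.121.132    M01-HQ-SV014-APP1  Application server (primary)
10.1.121.133    M01-HQ-SV014-APP2  Application server (standby)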

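Chapter 3  Root cause analysis

Each host has only a single link to the storage and no multipathing is configured on the host side, so the links went down when the controller rebooted.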
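The effect of the single-path design can be illustrated with a minimal Python sketch. It encodes the host mapping from section 1.3 and lists the hosts that lose their only path when a given controller resets. `HOST_CONTROLLERS` and `hosts_losing_access` are hypothetical names introduced for illustration; they are not part of any IBM tool.

```python
# Host-to-controller mapping transcribed from section 1.3.
# Every host reaches the storage through exactly one controller
# (single link, no multipathing), so each set has one element.
HOST_CONTROLLERS = {
    "windown1_a": {"B"},
    "windown2_a": {"B"},
    "app1_hba": {"A"},
    "app2_hba": {"A"},
    "dbhba1": {"A"},
    "db2_hba": {"A"},
}


def hosts_losing_access(host_controllers, resetting_controller):
    """Return the hosts whose only path goes through the resetting controller."""
    return sorted(
        host
        for host, controllers in host_controllers.items()
        if controllers == {resetting_controller}
    )


print(hosts_losing_access(HOST_CONTROLLERS, "A"))
```

For controller A this lists the four Solaris app and db hosts, which matches the services that were interrupted. A host with paths to both controllers would have the set `{"A", "B"}` and would not be listed.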

The storage event log shows the following reset events for controller A:

Date/Time: 15-2-3 8:38:19

Sequence number: 4698

Event type: 400F

Event category: Internal

Priority: Informational

Description: Controller reset by its alternate

Event specific codes: 0/0/0

Component type: Controller

Component location: Enclosure 85, Slot 1

 

Date/Time: 17-5-10 7:47:44

Sequence number: 6300

Event type: 400F

Event category: Internal

Priority: Informational

Description: Controller reset by its alternate

Event specific codes: 0/0/0

Component type: Controller

Component location: Enclosure 85, Slot 1

From 2015-2-3 to 2017-5-10, the interval is 827 days.
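The interval can be reproduced directly from the log records. The sketch below is a minimal example, assuming the records are available as text with one `Key: value` field per line and a blank line between records (the sample is abbreviated to the fields used here), and assuming the `Date/Time` field is in year-month-day order with a two-digit year. `parse_events` and `event_time` are hypothetical helper names.

```python
from datetime import datetime

# Two reset records from the event log, abbreviated to the fields used here.
LOG_TEXT = """\
Date/Time: 15-2-3 8:38:19
Sequence number: 4698
Event type: 400F
Description: Controller reset by its alternate

Date/Time: 17-5-10 7:47:44
Sequence number: 6300
Event type: 400F
Description: Controller reset by its alternate
"""


def parse_events(text):
    """Split blank-line-separated records into dicts of 'Key: value' fields."""
    events = []
    for block in text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, _, value = line.partition(": ")
            fields[key.strip()] = value.strip()
        events.append(fields)
    return events


def event_time(event):
    # Assumed format: two-digit year, unpadded fields, e.g. "15-2-3 8:38:19".
    return datetime.strptime(event["Date/Time"], "%y-%m-%d %H:%M:%S")


resets = [e for e in parse_events(LOG_TEXT) if e["Event type"] == "400F"]
first, last = event_time(resets[0]), event_time(resets[-1])
interval_days = (last.date() - first.date()).days
print(interval_days)
```

Comparing calendar dates rather than full timestamps keeps the result in whole days, which is how the interval is counted in this analysis.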

Next, look at the dates of the two most recent "Start-of-day routine begun" events on controller A:

Date/Time: 15-2-10 17:25:12

Date/Time: 17-5-10 7:47:04

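From 2015-2-10 to 2017-5-10, the interval is 820 days.

From this we can conclude that the reboot was caused by the storage subsystem's 820/825-day timer issue.

The storage firmware checks each controller's uptime against a limit of 820 days (controller A) or 825 days (controller B).

Controller A last ran this check, at its start-of-day routine, on 2015/2/10, which is exactly 820 days before 2017/5/10.

Controller A was previously reset on 2015/2/3, which is 827 days before 2017/5/10, so controller A rebooted.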
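The uptime comparison can be sketched in a few lines of Python. The limits are taken from the IBM tip quoted in the appendix; `uptime_days` and `is_timer_reboot` are hypothetical helpers, and treating the last "Start-of-day routine begun" date as the start of the uptime count is an assumption of this analysis, not something the log states explicitly.

```python
from datetime import date

# Reboot limits in days, per IBM RETAIN tip H193288 (see the appendix).
TIMER_LIMIT_DAYS = {"A": 820, "B": 825}


def uptime_days(last_start, event_day):
    """Whole days between the last start-of-day and the reboot event."""
    return (event_day - last_start).days


def is_timer_reboot(controller, last_start, event_day):
    """True when the uptime at the event equals the controller's timer limit."""
    return uptime_days(last_start, event_day) == TIMER_LIMIT_DAYS[controller]


last_start_of_day = date(2015, 2, 10)
reboot_day = date(2017, 5, 10)
print(uptime_days(last_start_of_day, reboot_day))
print(is_timer_reboot("A", last_start_of_day, reboot_day))
```

An exact match is a deliberately strict test. When checking other events, a tolerance of a day or so may be more practical, since log timestamps and the firmware's internal counter need not line up on a calendar boundary.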

 

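In addition, the historical logs show that controller B has rebooted 4 times in 6 years, and that two host-side FC ports on controller B are not connected to any host. If SFP modules are installed in those ports, we recommend fitting dust plugs or removing the SFP modules.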

 

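Chapter 4  Recommendations

1.     Redesign the storage links to provide highly available, redundant connections.

2.     Given the 820/825-day design of the DS4K series, perform a preventive reboot before the timer expires.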
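To plan the preventive reboot, the expected timer expiry can be projected from each controller's last boot date. The following is a minimal sketch: the 820/825-day limits come from the IBM tip in the appendix, while the helper names and the 30-day safety margin are illustrative assumptions that should be adapted to the site's change-management schedule.

```python
from datetime import date, timedelta

# Reboot limits in days, per IBM RETAIN tip H193288 (see the appendix).
TIMER_LIMIT_DAYS = {"A": 820, "B": 825}


def timer_reboot_date(controller, last_boot):
    """Date on which the firmware timer is expected to reboot the controller."""
    return last_boot + timedelta(days=TIMER_LIMIT_DAYS[controller])


def maintenance_deadline(controller, last_boot, margin_days=30):
    """Latest date for a planned reboot; margin_days is an illustrative choice."""
    return timer_reboot_date(controller, last_boot) - timedelta(days=margin_days)


# Controller A last booted on 2017-05-10 (the incident described above).
last_boot_a = date(2017, 5, 10)
print(timer_reboot_date("A", last_boot_a))
print(maintenance_deadline("A", last_boot_a))
```

Any reboot resets the timer, so the projection has to be recomputed from the new boot date after each planned or unplanned restart.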

 

Chapter 5  Appendix

Notes on the DS4K 820/825-day behavior

H193288: DS3000/DS4000/DS5000 controller will reboot every 820 or 825 days

5.1 Technote (troubleshooting)

5.2 Problem (Abstract)

RETAIN tip: H193288

5.3 Symptom

The IBM System Storage DS3000, DS4000, and DS5000 families of storage subsystem controllers will reboot every 820 days for controller A or 825 days for controller B, if the controller firmware is not upgraded or already rebooted within that time period.

Affected configurations

The system may be any of the following IBM servers:

·       DS4100 (FAStT100) Dual-Controller Storage Server, type 1724, any model

·       DS4100 (FAStT100) Single-Controller Storage Server, type 1724, any model

·       DS4200 Storage Server, type 1814, any model

·       DS4300 (FAStT600) Dual Controller and Turbo Storage Server, type 1722, any model

·       DS4300 (FAStT600) Single Controller Storage Server, type 1722, any model

·       DS4400 (FAStT700) Storage Server, type 1742, any model

·       DS4500 (FAStT900) Storage Server, type 1742, any model

·       DS4700 Storage Server, type 1814, any model

·       DS4700 Storage Server, type 1814 (DC power supplies), any model

·       DS4800 Storage Server, type 1815, any model

·       DS5020 Disk Controller (1814-20A), any model

·       DS5100 Storage Controller, type 1818, any model

·       DS5300 Storage Controller, type 1818, any model

·       FAStT 200 Storage Server, type 3542, any model

·       FAStT500 RAID Controller, type 3552, any model

·       FAStT500, type 3552, any model

·       IBM System Storage DS3200, type 1726, any model

·       IBM System Storage DS3300, type 1726, any model

·       IBM System Storage DS3400, type 1726, any model

·       IBM System Storage DS3512, type 1746, any model

·       IBM System Storage DS3524, type 1746, any model

·       IBM System Storage DS3950 Express, type 1814, any model


The system is configured with one or more of the following IBM Options:

·       BladeCenter Boot Disk System (1726-22B), any model


This tip is not software specific.

Solution

For the DS3500, DCS3700, and DCS3860, this issue is fixed in the 8.2x release. For all other products, this is a permanent restriction and there will be no solution.

Workaround

Whenever a controller is rebooted, the firmware will reset the timer mechanism, giving the controllers another 828.5 days on the timer. The next reboots will occur at 820 days for controller A or 825 days for controller B.

The way to avoid these unexpected reboots is with a controller firmware upgrade, since the process of upgrading controller firmware will reboot the controllers, thereby resetting the timer mechanism. This also allows for the reboots to be scheduled at a convenient time for the customer's environment.

Upgrading firmware to the levels below is also recommended to reduce the possibility of the controller reboots happening at the same time.
DS3000 - 07.35.41.00 or higher
DS4000 - 07.15.07.00 or higher
DS5000 - 07.30.21.00 or higher

IBM's best recommended practice is to maintain the environment with regular firmware upgrades, at least once per year, to leverage the enhancements implemented in firmware and provide the best possible quality, performance, and availability of the system.

If these recommended best practices are followed, then the reboot behavior will not be observed.

Regularly scheduled maintenance of controller firmware will reset the timer, since this process reboots the controller. A reboot, for any other reason, will also cause the timer to be reset.

Additional information

The current design of the DS3000, DS4000, and DS5000 controller operating system contains a separate timer for each controller. Each timer rolls over after 828.5 days. In order to keep the timer from rolling over, the controller is designed to reboot after 825.5 days to reset the timer. These timers are independent of each other, however, there is a possibility that the controllers could reboot at the same time. Firmware levels 07.35.41.00, 07.15.07.00, and 07.30.21.00 were changed to stagger the controller reboots - controller A will reboot at 820 days and controller B will reboot at 825 days. This eliminates the simultaneous controller reboot condition, and allows the two redundant controllers to protect each other using the normal failover/failback operations.
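The figures in this paragraph can be checked with simple arithmetic. The short sketch below uses only the numbers quoted in the tip (828.5-day rollover, reboots at 820 and 825 days); the variable names are illustrative.

```python
# Figures quoted in the tip above.
ROLLOVER_DAYS = 828.5                 # timer rollover point
REBOOT_DAYS = {"A": 820, "B": 825}    # staggered reboot points

# Safety margin between each forced reboot and the timer rollover.
margin_days = {ctrl: ROLLOVER_DAYS - limit for ctrl, limit in REBOOT_DAYS.items()}

# Gap between the two controllers' reboots, counted from the same boot date.
stagger_days = REBOOT_DAYS["B"] - REBOOT_DAYS["A"]

print(margin_days)
print(stagger_days)
```

Because the two timers are independent, the 5-day stagger only separates the reboots when both controllers last started at about the same time; controllers booted on different dates are offset by whatever gap their boot dates imply.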

A properly maintained DS3000, DS4000, and DS5000 system includes periodic firmware upgrades. These firmware upgrades should never allow the controllers to get to the point where the timer rolls over.

IBM highly recommends to periodically upgrade controller firmware. Firmware upgrades should be part of a yearly Change Management plan.

 

 

Cross reference information

Segment               Product                       Component  Platform  Version  Edition
Disk Storage Systems  DS3950
Disk Storage Systems  DS4200
Disk Storage Systems  DS4700
Disk Storage Systems  DS4800
Disk Storage Systems  DS5020
Disk Storage Systems  DS5100
Disk Storage Systems  DS3200
Disk Storage Systems  DS3300
Disk Storage Systems  DS3400
Disk Storage Systems  BladeCenter Boot Disk System
Disk Storage Systems  DS3500 (DS3512 - DS3524)
Disk Storage Systems  DS4100
Disk Storage Systems  DS4300
Disk Storage Systems  DS4400
Disk Storage Systems  DS4500
Disk Storage Systems  FAStT500 Storage Server
Disk Storage Systems  DCS3700
Disk Storage Systems  System Storage DCS3860

 

 

Original article: https://www.cnblogs.com/jonathanyue/p/9301155.html