Handling the OEM error "Failed to connect to ASM instance. The connection is closed: The connection is closed"


Preface

On the principle of getting to the bottom of every error that shows up: a recently deployed OEM 13c raised the following alert:

Host=xxxxx1 
Target type=Automatic Storage Management 
Target name=+ASM1_xxxxx1 
Categories=Availability 
Message=Failed to connect to ASM instance. The connection is closed: The connection is closed 
Severity=Fatal 
Event reported time=Aug 9, 2020 10:08:18 AM CST 
Operating System=Linux
Platform=x86_64
Associated Incident Id=88 
Associated Incident Status=New 
Associated Incident Owner= 
Associated Incident Acknowledged By Owner=No 
Associated Incident Priority=None 
Associated Incident Escalation Level=0 
Event Type=Target Availability 
Event name=Status 
Availability status=Down
Root Cause Analysis Status=Neither Cause Nor Symptom 
Causal analysis result=Neither a cause nor a symptom 
Rule Name=Incident management rule set for all targets,Incident creation rule for a Target Down availability status 
Rule Owner=System Generated 
Update Details:
Failed to connect to ASM instance. The connection is closed: The connection is closed
Incident created by rule (Name = Incident management rule set for all targets, Incident creation rule for a Target Down availability status [System generated rule]).

As usual, a Baidu search turned up nothing useful.

A search on MOS (My Oracle Support), however, did return a result:

EM 13c: Enterprise Manager 13.2 Cloud Control ASM Incident Reported with Message=Failed To Connect To ASM Instance. The Connection Is Closed: The Connection Is Closed (Doc ID 2251591.1)

The document states that this is a bug.

Verification

According to the document, gcagent.log will contain errors like the following (example):

[65336:GC.Executor.126 (osm_instance:+ASM__host.company.com:ofs_performance_metrics) (osm_instance:+ASM__host.company.com:ofs_performance_metrics:Instance_Volume_Performance)] ERROR - The connection is closed: The connection is closed
java.sql.SQLException: The connection is closed: The connection is closed
at oracle.ucp.util.UCPErrorHandler.newSQLException(UCPErrorHandler.java:464)
at oracle.ucp.util.UCPErrorHandler.newSQLException(UCPErrorHandler.java:448)
at oracle.ucp.jdbc.proxy.JDBCConnectionProxyFactory.invoke(JDBCConnectionProxyFactory.java:307)
at oracle.ucp.jdbc.proxy.ConnectionProxyFactory.invoke(ConnectionProxyFactory.java:50)
at com.sun.proxy.$Proxy27.prepareCall(Unknown Source)

On the agent host, the log is at the following location:

[oracle@xxxxx1 log]$ ll $AGENT_HOME/sysman/log/gcagent.log
-rw-r----- 1 oracle oinstall 960998 Aug 10 14:20 /u01/app/oem13c/agent/agent_inst/sysman/log/gcagent.log

The log does indeed contain similar entries:

2020-08-09 10:08:18,645 [99899:GC.Executor.23807 (osm_instance:+ASM1_xxxxx1:Response) (osm_instance:+ASM1_xxxxx1:Response:Response)] ERROR - The connection is closed: The connection is closed
java.sql.SQLException: The connection is closed: The connection is closed
        at oracle.ucp.util.UCPErrorHandler.newSQLException(UCPErrorHandler.java:464)
        at oracle.ucp.util.UCPErrorHandler.newSQLException(UCPErrorHandler.java:448)
        at oracle.ucp.jdbc.proxy.JDBCConnectionProxyFactory.invoke(JDBCConnectionProxyFactory.java:307)
        at oracle.ucp.jdbc.proxy.ConnectionProxyFactory.invoke(ConnectionProxyFactory.java:50)
        at com.sun.proxy.$Proxy31.prepareCall(Unknown Source)
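
To see which targets and metrics are hitting the error, the matching log lines can be counted per target. This is a minimal sketch that assumes the line layout shown in the samples above, where the first parenthesised group is type:target:metric. The function name count_closed_by_target is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count "connection is closed" errors per type:target:metric.
# Assumes gcagent.log lines look like the samples above, i.e.
#   <timestamp> [pid:GC.Executor.N (type:target:metric) (...)] ERROR - The connection is closed: ...
# $1 = path to gcagent.log
count_closed_by_target() {
    local logfile="$1"
    # Keep only the ERROR header lines (the java.sql.SQLException lines lack "ERROR - "),
    # extract the first parenthesised group, then count identical values.
    grep -F 'ERROR - The connection is closed' "$logfile" \
        | sed -n 's/^[^(]*(\([^)]*\)).*/\1/p' \
        | sort | uniq -c | sort -rn
}
```

For example: count_closed_by_target "$AGENT_HOME/sysman/log/gcagent.log"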

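The document also mentions:

In addition, if a thread dump is taken of the EM Agent process, a large number of "Timer-" threads will be observed (they increase over time and are never closed or ended). Example: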

jstack <Agent PID>|grep "Timer-"|wc -l
983

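Note: as a rule of thumb, the number of "Timer-" threads should stay constant over time and below 50, but this is an approximation, since it depends on the number of targets, the monitoring settings, the jobs executed and many other factors. The key point is that the number of such threads should stay constant as time (days) passes.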
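
Since what matters is the trend over days rather than a single reading, the count can be recorded periodically. This is a minimal sketch, not something from the MOS document: both function names are made up for illustration, and the jstack path, agent PID and output file are supplied by the caller.

```shell
#!/usr/bin/env bash
# Count "Timer-" lines in a thread dump read from stdin
# (same filter as the jstack | grep "Timer-" | wc -l example above).
count_timer_threads() {
    # grep -c prints 0 but exits non-zero when nothing matches; "|| true" hides that.
    grep -c 'Timer-' || true
}

# Hypothetical helper: append "<timestamp> <count>" to a file so the trend
# can be compared over several days (for instance when run from cron).
# $1 = path to jstack, $2 = agent PID, $3 = output file
log_timer_threads() {
    local jstack_bin="$1" pid="$2" outfile="$3" count
    count=$("$jstack_bin" "$pid" | count_timer_threads)
    printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$count" >> "$outfile"
}
```

For example: log_timer_threads /u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/jdk/bin/jstack 7687 /tmp/timer_threads.log (the output file name here is arbitrary).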

The same check on the problem node:

[oracle@xxxxx1 ~]# ps -ef | grep java
...(some output omitted)...
oracle    7687  7601  3 Aug05 ?        03:34:38 /u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/jdk/bin/java -Xmx128M -XX:MaxPermSize=128M -server -Djava.security.egd=file:///dev/./urandom -Dsun.lang.ClassLoader.allowArraySyntax=true -XX:-UseLargePages -XX:+UseLinuxPosixThreadCPUClocks -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:+UseCompressedOops -Dwatchdog.pid=7601 -cp /u01/app/oem13c/agent/agent_13.3.0.0.0/jdbc/lib/ojdbc7.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/ucp/lib/ucp.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/modules/jsch-0.1.53.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/modules/com.oracle.http_client.http_client_12.1.3.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/modules/oracle.xdk_12.1.3/xmlparserv2.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/modules/oracle.dms_12.1.3/dms.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/modules/oracle.odl_12.1.3/ojdl.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/modules/oracle.odl_12.1.3/ojdl2.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/lib/optic.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/sysman/jlib/log4j-core.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/jlib/gcagent_core.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/sysman/jlib/emagentSDK-intg.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/sysman/jlib/emagentSDK.jar oracle.sysman.gcagent.tmmain.TMMain   
[oracle@xxxxx1 ~]$ /u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/jdk/bin/jstack 7687 | grep "Timer-" | wc -l
83

For comparison, another node that raised no alert:

[oracle@xxxxx2 ~]$ ps -ef | grep 13.3
oracle   31845 31753  0 Aug05 ?        00:26:26 /u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/jdk/bin/java -Xmx128M -XX:MaxPermSize=128M -server -Djava.security.egd=file:///dev/./urandom -Dsun.lang.ClassLoader.allowArraySyntax=true -XX:-UseLargePages -XX:+UseLinuxPosixThreadCPUClocks -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:+UseCompressedOops -Dwatchdog.pid=31753 -cp /u01/app/oem13c/agent/agent_13.3.0.0.0/jdbc/lib/ojdbc7.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/ucp/lib/ucp.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/modules/jsch-0.1.53.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/modules/com.oracle.http_client.http_client_12.1.3.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/modules/oracle.xdk_12.1.3/xmlparserv2.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/modules/oracle.dms_12.1.3/dms.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/modules/oracle.odl_12.1.3/ojdl.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/modules/oracle.odl_12.1.3/ojdl2.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/lib/optic.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/sysman/jlib/log4j-core.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/jlib/gcagent_core.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/sysman/jlib/emagentSDK-intg.jar:/u01/app/oem13c/agent/agent_13.3.0.0.0/sysman/jlib/emagentSDK.jar oracle.sysman.gcagent.tmmain.TMMain
[oracle@xxxxx2 ~]$ /u01/app/oem13c/agent/agent_13.3.0.0.0/oracle_common/jdk/bin/jstack 31845 | grep "Timer-" | wc -l
7

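The document gives a rule-of-thumb value of fewer than 50. On the problem node the number of "Timer-" threads is 83, against 7 on the healthy node. Taken as a reference value, this suggests the node has very likely hit the bug.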

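Fix

This is caused by a bug. 13.2, 13.3 and 13.4 all have the problem, but the bug number differs per release, so the patch differs too.

For 13.3 the corresponding bug is Bug 28406747; applying that patch on the agent side is enough.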

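How to apply the patch

The first question is what exactly to patch. When I applied a PSU to the OMS earlier it was my first time, and even with experience applying PSUs to databases it took a bit of fiddling.

This time it is only a small one-off patch. According to the readme, one of the steps is to shut down the Management Agent, and that had me puzzled for quite a while.

Which Management Agent does this refer to?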

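Logically, the problem occurred on the agent running on the database server, so the patch should go onto the agent on the database server. However,

the word "Management" made me think of the agent on the OMS, and if it really meant the agents on the database servers, wouldn't that mean stopping and patching the agent on a great many machines?

And honestly, I wasn't even sure whether the agent on the OMS is the same thing as the agent on a database server (I later confirmed that it is).

Another round of Baidu and MOS followed, and this time nothing turned up.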

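Then it occurred to me that right after the OMS is set up, the OMS server itself already appears under the "Hosts" targets in the web console, so the agent on the OMS and the agent on a DB server should essentially be the same thing. I therefore tried stopping the agent on the OMS: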

$AGENT_HOME/bin/emctl stop agent

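Sure enough, the OEM web console could still be logged into, while under the "Hosts" targets the OMS machine itself was now shown as unhealthy. So they really are the same.

In other words, every agent has to be patched, one by one...

Then another question came to mind: once the agent on the OMS is patched, would a new agent pushed to other servers afterwards already include the newly applied patch?

Enough talk: patch the OMS agent first, then push a new agent to an unmonitored DB server and see what happens.

First of all, be sure to read the patch readme and follow its requirements step by step!

First, the agent's OPatch needs to be upgraded. The OPatch of the OMS agent had already been upgraded when I applied the PSU earlier, so this step was not needed again.

Second, set the environment variable: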

[oracle@oem13c agent]$ export ORACLE_HOME=/u01/app/oem13c/agent/agent_13.3.0.0.0
[oracle@oem13c agent]$ /u01/app/oem13c/agent/agent_13.3.0.0.0/OPatch/opatch version
OPatch Version: 13.9.3.3.0

OPatch succeeded.

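A short digression: the readme calls the directory /u01/app/oem13c/agent/agent_13.3.0.0.0 the agent core home, whereas

the environment variable AGENT_HOME is set to /u01/app/oem13c/agent/agent_inst, which is called the instance directory when the agent is pushed to a host.

Here /u01/app/oem13c/agent is the agent base directory, and AGENT_HOME is set to /u01/app/oem13c/agent/agent_inst because the emctl command is in the bin folder under that directory.

The one-off patch, however, is applied to the agent core home.

Back to the patching.

Third, stop the agent: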

[oracle@oem13c 28406747]$ export PATH=$ORACLE_HOME/bin:$ORACLE_HOME/OPatch:$PATH
[oracle@oem13c 28406747]$ emctl stop agent
Oracle Enterprise Manager Cloud Control 13c Release 3  
Copyright (c) 1996, 2018 Oracle Corporation.  All rights reserved.
Stopping agent ... stopped.
[oracle@oem13c 28406747]$ opatch lspatches
25237184;One-off
24470104;

OPatch succeeded.

Fourth, simply apply the patch:

[oracle@oem13c 28406747]$ opatch apply
Oracle Interim Patch Installer version 13.9.3.3.0
Copyright (c) 2020, Oracle Corporation.  All rights reserved.


Oracle Home       : /u01/app/oem13c/agent/agent_13.3.0.0.0
Central Inventory : /u01/app/oraInventory
   from           : /u01/app/oem13c/agent/agent_13.3.0.0.0/oraInst.loc
OPatch version    : 13.9.3.3.0
OUI version       : 13.9.1.0.0
Log file location : /u01/app/oem13c/agent/agent_13.3.0.0.0/cfgtoollogs/opatch/opatch2020-08-10_16-39-09PM_1.log


OPatch detects the Middleware Home as "/u01/app/oem13c/agent"

Verifying environment and performing prerequisite checks...
OPatch continues with these patches:   28406747  

Do you want to proceed? [y|n]
y
User Responded with: Y
All checks passed.
Backing up files...
Applying interim patch '28406747' to OH '/u01/app/oem13c/agent/agent_13.3.0.0.0'

Patching component oracle.sysman.agent.ic, 13.3.0.0.0...
Patch 28406747 successfully applied.
Log file location: /u01/app/oem13c/agent/agent_13.3.0.0.0/cfgtoollogs/opatch/opatch2020-08-10_16-39-09PM_1.log

OPatch succeeded.
[oracle@oem13c 28406747]$ opatch lspatches
28406747;
25237184;One-off
24470104;

OPatch succeeded.

Finally, start the agent:

[oracle@oem13c 28406747]$ emctl start agent

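With that, the one-off patch is applied successfully.

Afterwards I pushed a new agent to an unmonitored DB server and found that the agent on that server does not carry the new patch...

So the patch still has to be applied manually on every agent.

The steps are the same and not particularly complicated.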
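
Because the same steps are repeated on every agent host, they can be wrapped in a small script. This is only a sketch under several assumptions: the readme steps are the same as in the run above, OPatch has already been upgraded as the readme requires, and the patch has been unzipped locally. The function names are made up for illustration, and opatch apply still asks for confirmation interactively.

```shell
#!/usr/bin/env bash
# Succeeds if the patch ID in $1 appears in "opatch lspatches" output read from stdin.
# Assumes the "ID;description" line format shown in the transcript above.
patch_listed() {
    grep -q "^$1;"
}

# Hypothetical wrapper around the manual steps above.
# $1 = agent core home, $2 = directory of the unzipped patch, $3 = patch ID
apply_agent_patch() {
    local core_home="$1" patch_dir="$2" patch_id="$3"
    export ORACLE_HOME="$core_home"
    export PATH="$ORACLE_HOME/bin:$ORACLE_HOME/OPatch:$PATH"

    # Skip hosts where the patch is already in the inventory.
    if opatch lspatches | patch_listed "$patch_id"; then
        echo "Patch $patch_id is already applied to $core_home, nothing to do."
        return 0
    fi

    emctl stop agent || return 1

    # Run opatch from the patch directory, as in the manual run.
    if ! (cd "$patch_dir" && opatch apply); then
        echo "opatch apply failed; the agent is left stopped for inspection." >&2
        return 1
    fi

    emctl start agent || return 1

    # Confirm the patch now shows up in the inventory.
    opatch lspatches | patch_listed "$patch_id"
}
```

For example: apply_agent_patch /u01/app/oem13c/agent/agent_13.3.0.0.0 /path/to/28406747 28406747, where /path/to/28406747 is a placeholder for wherever the patch was unzipped.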

Going forward, I will keep watching whether the number of "Timer-" threads becomes abnormal again and whether the alert recurs.

Original post: https://www.cnblogs.com/PiscesCanon/p/13469665.html