Problems encountered when starting cosbench on CentOS 7, part 3

When a cosbench test terminates for no apparent reason, only some of the time, and the mission log gives no useful information, remember to check the system log.


If you find the following call stack, note that the test most likely failed because the clocks on the controller and drivers are out of sync with the storage cluster.

2020-08-13 02:19:28,977 [ERROR] [AbstractAgent] - unexpected exception
java.lang.ArrayIndexOutOfBoundsException: -9626
     at com.intel.cosbench.bench.Counter.doAdd(Counter.java:65)
     at com.intel.cosbench.driver.model.OperatorContext.doAddSample(OperatorContext.java:76)
     at com.intel.cosbench.driver.model.OperatorContext.addSample(OperatorContext.java:70)
     at com.intel.cosbench.driver.agent.WorkAgent.onSampleCreated(WorkAgent.java:211)
     at com.intel.cosbench.driver.operator.Preparer.operate(Preparer.java:99)
     at com.intel.cosbench.driver.operator.AbstractOperator.operate(AbstractOperator.java:76)
     at com.intel.cosbench.driver.agent.WorkAgent.performOperation(WorkAgent.java:197)
     at com.intel.cosbench.driver.agent.WorkAgent.doWork(WorkAgent.java:177)
     at com.intel.cosbench.driver.agent.WorkAgent.execute(WorkAgent.java:134)
     at com.intel.cosbench.driver.agent.AbstractAgent.call(AbstractAgent.java:44)
     at com.intel.cosbench.driver.agent.AbstractAgent.call(AbstractAgent.java:1)
     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
     at java.lang.Thread.run(Thread.java:748)
2020-08-13 02:19:28,977 [ERROR] [MissionHandler] - detected workers [19, 20, 21, 22, 23, 24] have encountered errors
2020-08-13 02:19:28,979 [INFO] [MissionHandler] - mission M2E66EA747D has been terminated


If you find records like the following in the controller's system.log, the test was most likely terminated because the clocks on the controller and the drivers are out of sync.

2020-08-20 10:44:59,277 [WARN] [PingDriverRunner] - The driver driver1 at http://10.246.21.82:18088/driver is not reachable at the 1 time, with error message: Connection refused (Connection refused)
2020-08-20 11:22:57,348 [WARN] [AbstractCommandTasklet] - time drift is still longer than tolerable time drift 300 mSec after 3 times of synchronization

2020-08-20 17:47:37,351 [ERROR] [AbstractCommandTasklet] - driver report error: HTTP 400 - no such key defined: sizes

2020-08-20 17:47:37,359 [ERROR] [StageRunner] - detected tasks [t7, t8, t9, t10, t11, t12] have encountered errors
2020-08-20 17:47:37,365 [ERROR] [AbstractCommandTasklet] - driver report error: HTTP 400 - unrecognized request: org.apache.catalina.connector.RequestFacade@28481cc4

2020-08-20 17:47:37,366 [ERROR] [Aborter] - fail to abort driver
com.intel.cosbench.controller.tasklet.TaskletException
     at com.intel.cosbench.controller.tasklet.AbstractCommandTasklet.issueCommand(AbstractCommandTasklet.java:81)
     at com.intel.cosbench.controller.tasklet.Aborter.executeAbort(Aborter.java:53)
     at com.intel.cosbench.controller.tasklet.Aborter.execute(Aborter.java:42)
     at com.intel.cosbench.controller.tasklet.AbstractTasklet.call(AbstractTasklet.java:47)
     at com.intel.cosbench.controller.tasklet.AbstractTasklet.call(AbstractTasklet.java:1)
     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
     at java.lang.Thread.run(Thread.java:748)


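To investigate further, use the command below to make the controller and the drivers report their local time in one go, so that the difference is obvious at a glance. Otherwise it is hard to tell whether the gap between the machines is real or just the few seconds spent typing the command on each one.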

# date && ssh root@10.246.21.82 date && ssh root@10.246.21.83 date
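
The one-liner above shows the times side by side; to put a number on the gap, a small script can help. The following is a minimal sketch and not part of cosbench. The function names are made up for illustration, it assumes passwordless SSH as root and GNU date (for `%3N`), and it compares the remote clock against the midpoint of the SSH round trip, so the result is only as accurate as the SSH latency allows. The 300 ms default mirrors the tolerance printed in the warning above.

```shell
# Print the absolute difference between two epoch timestamps given in milliseconds.
drift_ms() {
    local a=$1 b=$2
    local d=$(( a - b ))
    echo $(( d < 0 ? -d : d ))
}

# Succeed if the drift (ms) is within the limit (default: 300 ms).
within_tolerance() {
    local drift=$1 limit=${2:-300}
    [ "$drift" -le "$limit" ]
}

# For each host, read the remote clock over SSH and compare it with the
# midpoint of the local timestamps taken just before and after the call.
check_hosts() {
    local host before after remote mid drift
    for host in "$@"; do
        before=$(date +%s%3N)
        remote=$(ssh "root@$host" date +%s%3N) || { echo "$host: unreachable"; continue; }
        after=$(date +%s%3N)
        mid=$(( (before + after) / 2 ))
        drift=$(drift_ms "$remote" "$mid")
        if within_tolerance "$drift"; then
            echo "$host: drift ${drift} ms (ok)"
        else
            echo "$host: drift ${drift} ms (exceeds tolerance)"
        fi
    done
}
```

Run from the controller, it could be called as `check_hosts 10.246.21.82 10.246.21.83`. Because SSH connection setup dominates the round trip, a reported drift of a few hundred milliseconds may be measurement noise; a drift of several seconds is a real clock problem.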


First, make sure the controller and the drivers are in the same time zone.

image

As you can see, this controller's time zone is UTC, and it should be changed to America/New_York, the same as the drivers.

# timedatectl list-timezones | grep York

# timedatectl set-timezone America/New_York


Use the following commands to sync the time on CentOS 7.

First check the NTP status:

image


Edit the NTP configuration file.

# vi /etc/ntp.conf

Add entries for the local NTP servers, such as the following two lines:

server 172.16.199.1

server 10.254.140.22
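
When the same entries have to be added on the controller and every driver, editing by hand can be replaced with a small helper. This is only a sketch: `add_ntp_server` is a hypothetical function name, and the file path is passed in rather than hard-coded. It appends a `server` line only when no identical entry exists, so running it twice is harmless.

```shell
# Append "server <addr>" to an ntp.conf-style file unless that exact
# server entry is already present.
add_ntp_server() {
    local conf=$1 addr=$2
    # awk exits 0 only if a line with fields "server <addr>" was found.
    if awk -v a="$addr" '$1 == "server" && $2 == a { found = 1 } END { exit !found }' "$conf"; then
        return 0
    fi
    printf 'server %s\n' "$addr" >> "$conf"
}
```

For the servers above, the calls would be `add_ntp_server /etc/ntp.conf 172.16.199.1` and `add_ntp_server /etc/ntp.conf 10.254.140.22`.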


Check the status of the ntp service:

# systemctl status ntpd

For example:

image


Stop the ntp service:

# systemctl stop ntpd

If the ntp service is not stopped, the time cannot be synced with the server, and the sync fails with the error "the NTP socket is in use, exiting".


Check the status of the ntp service:

image


Force a time sync with the ntp server.

# ntpdate 10.254.140.22

or

# ntpd -gq

The screenshot below shows the output after a successful time sync.

image

or:

image


Then start the ntp service again.

# systemctl start ntpd.service
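
The stop, sync, start sequence has to be repeated on every machine, so it may be convenient to wrap it. The following is a sketch under assumptions: `run`, `force_sync` and the `DRY_RUN` variable are names invented here, and the script uses the `ntpdate` variant of the sync. With `DRY_RUN=1` it only prints the commands it would execute.

```shell
# Execute a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Stop ntpd, force a one-off sync against the given server, then start
# ntpd again even if the sync failed. Returns the status of the sync step.
force_sync() {
    local server=$1 rc
    run systemctl stop ntpd && run ntpdate "$server"
    rc=$?
    run systemctl start ntpd
    return "$rc"
}
```

A typical call would be `force_sync 10.254.140.22`, run as root. Restarting ntpd unconditionally is a deliberate choice in this sketch, so that a failed sync does not leave the machine without a running time service.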

Check the NTP status once more; you can see that the time is now synced.

image
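
Besides the service status, `ntpq -pn` lists the peers ntpd is talking to. The peer marked with a leading `*` is the one currently selected for synchronization, and the ninth column is its offset in milliseconds. A small helper can pull out just that line; this is a sketch, and `selected_peer_offset` is a hypothetical name.

```shell
# Read `ntpq -pn` output on stdin and print "<peer address> <offset in ms>"
# for the selected peer (the line starting with "*").
# Exits non-zero when no peer has been selected yet.
selected_peer_offset() {
    awk '/^\*/ { print substr($1, 2), $9; found = 1 } END { exit !found }'
}
```

Usage would be `ntpq -pn | selected_peer_offset`. Right after ntpd starts it can take a few polling intervals before any peer is selected, so a non-zero exit status shortly after a restart does not necessarily mean the configuration is wrong.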


References

==============

https://github.com/intel-cloud/cosbench/issues/264

https://www.thegeekdiary.com/centos-rhel-how-to-configure-ntp-server-and-client/

https://www.golinuxhub.com/2017/12/how-to-forcefully-sync-date-and-time/

https://www.thegeekdiary.com/centos-rhel-6-how-to-force-a-ntp-sync-with-the-ntp-servers/

Original article: https://www.cnblogs.com/awpatp/p/13588732.html