记一次CMS unloading class 耗时长调查

2020-09-14T15:02:14.773+0800: 262550.148: [GC (Allocation Failure) 2020-09-14T15:02:14.773+0800: 262550.148: [ParNew: 292225K->5384K(306688K), 0.0080812 secs] 1913661K->1626821K(2063104K), 0.0082840 secs] [Times: user=0.03 sys=0.00, real=0.01 secs] 
CMS: abort preclean due to time 2020-09-14T15:02:15.293+0800: 262550.667: [CMS-concurrent-abortable-preclean: 4.087/5.007 secs] [Times: user=5.12 sys=0.00, real=5.01 secs] 
2020-09-14T15:02:15.311+0800: 262550.685: [GC (CMS Final Remark) [YG occupancy: 20470 K (306688 K)]2020-09-14T15:02:15.311+0800: 262550.685: [Rescan (parallel) , 0.0100449 secs]2020-09-14T15:02:15.321+0800: 262550.695: [weak refs processing, 0.0126935 secs]2020-09-14T15:02:15.334+0800: 262550.708: 
[class unloading, 37.4731903 secs]2020-09-14T15:02:52.807+0800: 262588.181: [scrub symbol table, 3.7824486 secs]2020-09-14T15:02:56.589+0800: 262591.964: [scrub string table, 0.0028450 secs][1 CMS-remark: 1621436K(1756416K)] 1641906K(2063104K), 41.3046642 secs] [Times: user=0.60 sys=0.00, real=41.30 secs] 
2020-09-14T15:02:56.616+0800: 262591.991: [CMS-concurrent-sweep-start]

从上面的日志可以看出，unloading class 的耗时30s+。

官方文档解释：https://blogs.oracle.com/poonam/long-class-unloading-pauses-with-jdk8

In this case too, the problem seemed not be with the JVM or more specifically with the class unloading step but more with the GC threads getting blocked out. It turned out that the class unloading phase was causing lot of paging activity to occur at the system level which took away the CPU time from the GC threads.

这个例子也一样，这个问题和jvm无关，更具体的说和类卸载无关。而是和GC线程阻塞有关。最后发现类卸载阶段产生了很多的页活动，导致把cpu时间从GC线程抢走。

In this particular case, the system was configured to have higher number of large pages and a limited memory resource was left for the regular small pages. And a limited number of regular pages available for the MetaSpace caused paging activity during the class unloading phase leading to long GC pauses. The solution for this problem was to use -XX:+UseLargePagesInMetaspace that enables the large pages support for the MetaSpace. This avoided the paging activity and the class-unloading times returned to normal.

这个特殊的例子，系统（os?）配置了大量的大内存页，导致留给正常内存页的内存减少。给元生代的正常内存页的数量有限，导致页活动很频繁，进而导致GC线程中断。解决方案：使用-XX:+UseLargePageInMetaspace, 开启元生代使用大内存页。

问题分析，从cpu使用时间开始入手，real的耗时是所有时间，包括user+sys+IO等待时间，paging/swapping等。为什么real那么高？查看同一时刻的服务器运行状况，发现大量的swapping 操作。检查系统配置i，发现系统开启了swap功能。至此，问题大概明白了。java应用的堆设置的很大，3天左右才会发生一次Major GC。操作系统为了保证系统的安全，会把长期不活跃的数据（垃圾数据）移到Swap。等到GC到来时，就需要把这些数据swap in，产生Swap洪峰，延长了GC耗时。

目前，内存容量够大，也便宜，建议关闭swap功能。以下两篇文章也是建议关闭swap:

https://dzone.com/articles/just-say-no-swapping

https://www.elastic.co/guide/en/elasticsearch/reference/current/setup-configuration-memory.html