OpLogMergeMessage-OutOfMemoryError-JavaHeapSpace

  • direct memory size

  • netty or oplog

  • 5.5kw * 20

  • 60G worker/ 26G MaxDirectMemorySize

  • 1/2 tasks per worker both error

  • some tasks can work well

  • because of memory and multithreads pattern caused by resource scrambling

  • gc-log:

2018-11-09T14:10:47.973+0800: 7393.560: [CMS-concurrent-sweep: 0.241/0.241 secs] [Times: user=0.48 sys=0.00, real=0.24 secs] 
2018-11-09T14:10:47.973+0800: 7393.560: [CMS-concurrent-reset-start]
2018-11-09T14:10:48.038+0800: 7393.625: [CMS-concurrent-reset: 0.065/0.065 secs] [Times: user=0.13 sys=0.00, real=0.07 secs] 
2018-11-09T14:10:50.038+0800: 7395.625: [GC (CMS Initial Mark) [1 CMS-initial-mark: 25626226K(26204160K)] 39382689K(40762048K), 0.0139416 secs] [Times: user=0.02 sys=0.01, real=0.01 secs] 
2018-11-09T14:10:50.052+0800: 7395.639: [CMS-concurrent-mark-start]
2018-11-09T14:10:50.427+0800: 7396.014: [CMS-concurrent-mark: 0.375/0.375 secs] [Times: user=2.59 sys=0.02, real=0.37 secs] 
2018-11-09T14:10:50.427+0800: 7396.014: [CMS-concurrent-preclean-start]
2018-11-09T14:10:50.457+0800: 7396.044: [CMS-concurrent-preclean: 0.030/0.030 secs] [Times: user=0.06 sys=0.00, real=0.03 secs] 
2018-11-09T14:10:50.457+0800: 7396.044: [CMS-concurrent-abortable-preclean-start]
2018-11-09T14:10:50.457+0800: 7396.044: [CMS-concurrent-abortable-preclean: 0.000/0.000 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
2018-11-09T14:10:50.458+0800: 7396.045: [GC (CMS Final Remark) [YG occupancy: 13756466 K (14557888 K)]2018-11-09T14:10:50.458+0800: 7396.045: [GC (CMS Final Remark) 2018-11-09T14:10:50.458+0800: 7396.045: [ParNew: 13756466K->13756466K(14557888K), 0.0000233 secs] 39382693K->39382693K(40762048K), 0.0000914 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
2018-11-09T14:10:50.458+0800: 7396.045: [Rescan (parallel) , 0.0138796 secs]2018-11-09T14:10:50.472+0800: 7396.059: [weak refs processing, 0.0000406 secs]2018-11-09T14:10:50.472+0800: 7396.059: [class unloading, 0.0087389 secs]2018-11-09T14:10:50.481+0800: 7396.068: [scrub symbol table, 0.0055956 secs]2018-11-09T14:10:50.487+0800: 7396.074: [scrub string table, 0.0005615 secs][1 CMS-remark: 25626226K(26204160K)] 39382693K(40762048K), 0.0290641 secs] [Times: user=0.30 sys=0.00, real=0.02 secs] 
2018-11-09T14:10:50.488+0800: 7396.075: [CMS-concurrent-sweep-start]
2018-11-09T14:10:50.729+0800: 7396.316: [CMS-concurrent-sweep: 0.241/0.241 secs] [Times: user=0.48 sys=0.00, real=0.24 secs] 
2018-11-09T14:10:50.729+0800: 7396.316: [CMS-concurrent-reset-start]
2018-11-09T14:10:50.794+0800: 7396.381: [CMS-concurrent-reset: 0.065/0.065 secs] [Times: user=0.13 sys=0.00, real=0.06 secs] 
2018-11-09T14:10:51.734+0800: 7397.321: [GC (Allocation Failure) 2018-11-09T14:10:51.734+0800: 7397.321: [ParNew: 14280769K->14280769K(14557888K), 0.0000297 secs]2018-11-09T14:10:51.734+0800: 7397.321: [CMS: 25626226K->25626226K(26204160K), 8.7144181 secs] 39906995K->39782608K(40762048K), [Metaspace: 37753K->37753K(38912K)], 8.7146944 secs] [Times: user=8.72 sys=0.00, real=8.72 secs] 
2018-11-09T14:11:00.449+0800: 7406.036: [Full GC (Allocation Failure) 2018-11-09T14:11:00.449+0800: 7406.036: [CMS: 25626226K->25626196K(26204160K), 6.1291271 secs] 39782608K->39782578K(40762048K), [Metaspace: 37753K->37753K(38912K)], 6.1292957 secs] [Times: user=6.13 sys=0.00, real=6.13 secs] 
2018-11-09T14:11:06.579+0800: 7412.166: [GC (CMS Initial Mark) [1 CMS-initial-mark: 25626196K(26204160K)] 39782578K(40762048K), 0.0017634 secs] [Times: user=0.01 sys=0.00, real=0.00 secs] 
2018-11-09T14:11:06.581+0800: 7412.168: [CMS-concurrent-mark-start]
2018-11-09T14:11:06.840+0800: 7412.427: [Full GC (Allocation Failure) 2018-11-09T14:11:06.840+0800: 7412.427: [CMS2018-11-09T14:11:07.867+0800: 7413.454: [CMS-concurrent-mark: 1.033/1.286 secs] [Times: user=5.11 sys=0.61, real=1.28 secs] 
 (concurrent mode failure): 26150484K->26150474K(26204160K), 7.8489326 secs] 40314100K->39782414K(40762048K), [Metaspace: 37784K->37784K(38912K)], 7.8491778 secs] [Times: user=11.81 sys=0.39, real=7.85 secs] 
2018-11-09T14:11:14.690+0800: 7420.277: [Full GC (Allocation Failure) 2018-11-09T14:11:14.690+0800: 7420.277: [CMS: 26150474K->26150474K(26204160K), 1.2736921 secs] 39782414K->39782404K(40762048K), [Metaspace: 37784K->37784K(38912K)], 1.2738487 secs] [Times: user=1.28 sys=0.00, real=1.27 secs] 

  • stdout
2018-11-09 14:09:01,703 INFO [pool-6-thread-1] com.tencent.angel.ml.factorizationmachinesWAIC.FMLearnerWAIC: for one batch with 400036 in 67002 ms..
2018-11-09 14:09:01,703 INFO [pool-6-thread-1] com.tencent.angel.ml.factorizationmachinesWAIC.FMLearnerWAIC: for one batch with stepRead is 0 ..
2018-11-09 14:09:05,694 INFO [pool-6-thread-1] com.tencent.angel.ml.factorizationmachinesWAIC.FMLearnerWAIC: dataBlock read finished with 41660237 ..
2018-11-09 14:09:07,408 INFO [pool-6-thread-1] com.tencent.angel.ml.factorizationmachinesWAIC.FMLearnerWAIC: Calculate Delta for update ...
2018-11-09 14:09:11,398 INFO [pool-6-thread-1] com.tencent.angel.ml.factorizationmachinesWAIC.FMLearnerWAIC: Begin to update parameter in PS ...
2018-11-09 14:09:11,406 INFO [pool-6-thread-1] com.tencent.angel.ml.factorizationmachinesWAIC.FMModel: Start to push w0 from PS ...
2018-11-09 14:11:06,588 FATAL [pool-5-thread-1] com.tencent.angel.psagent.matrix.oplog.cache.MatrixOpLogCache: merge OpLogMergeMessage [update=com.tencent.angel.ml.math.vector.DenseDoubleVector@77d861dc, toString()=OpLogMessage [matrixId=1, type=MERGE, context=com.tencent.angel.psagent.task.TaskContext@16aa8654TaskContext [index=0, matrix clocks=(matrixId=0,clock=2)(matrixId=1,clock=1)(matrixId=2,clock=1)], seqId=17]] falied, 
java.lang.OutOfMemoryError: Java heap space
	at it.unimi.dsi.fastutil.ints.Int2DoubleOpenHashMap.<init>(Int2DoubleOpenHashMap.java:158)
	at it.unimi.dsi.fastutil.ints.Int2DoubleOpenHashMap.<init>(Int2DoubleOpenHashMap.java:169)
	at com.tencent.angel.ml.math.vector.SparseDoubleVector.resize(SparseDoubleVector.java:495)
	at com.tencent.angel.ml.math.vector.SparseDoubleVector.plusBy(SparseDoubleVector.java:564)
	at com.tencent.angel.ml.math.vector.SparseDoubleVector.plusBy(SparseDoubleVector.java:555)
	at com.tencent.angel.ml.math.vector.SparseDoubleVector.plusBy(SparseDoubleVector.java:35)
	at com.tencent.angel.ml.math.matrix.RowbaseMatrix.plusBy(RowbaseMatrix.java:126)
	at com.tencent.angel.psagent.matrix.oplog.cache.MatrixOpLog.merge(MatrixOpLog.java:160)
	at com.tencent.angel.psagent.matrix.oplog.cache.MatrixOpLogCache$Merger.run(MatrixOpLogCache.java:444)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
2018-11-09 14:11:15,964 FATAL [pool-5-thread-2] com.tencent.angel.psagent.matrix.oplog.cache.MatrixOpLogCache: merge OpLogMergeMessage [update=com.tencent.angel.ml.math.vector.DenseDoubleVector@316dd3c4, toString()=OpLogMessage [matrixId=1, type=MERGE, context=com.tencent.angel.psagent.task.TaskContext@16aa8654TaskContext [index=0, matrix clocks=(matrixId=0,clock=2)(matrixId=1,clock=1)(matrixId=2,clock=1)], seqId=18]] falied, 
java.lang.OutOfMemoryError: Java heap space
	at it.unimi.dsi.fastutil.ints.Int2DoubleOpenHashMap.<init>(Int2DoubleOpenHashMap.java:158)
	at it.unimi.dsi.fastutil.ints.Int2DoubleOpenHashMap.<init>(Int2DoubleOpenHashMap.java:169)
	at com.tencent.angel.ml.math.vector.SparseDoubleVector.resize(SparseDoubleVector.java:495)
	at com.tencent.angel.ml.math.vector.SparseDoubleVector.plusBy(SparseDoubleVector.java:564)
	at com.tencent.angel.ml.math.vector.SparseDoubleVector.plusBy(SparseDoubleVector.java:555)
	at com.tencent.angel.ml.math.vector.SparseDoubleVector.plusBy(SparseDoubleVector.java:35)
	at com.tencent.angel.ml.math.matrix.RowbaseMatrix.plusBy(RowbaseMatrix.java:126)
	at com.tencent.angel.psagent.matrix.oplog.cache.MatrixOpLog.merge(MatrixOpLog.java:160)
	at com.tencent.angel.psagent.matrix.oplog.cache.MatrixOpLogCache$Merger.run(MatrixOpLogCache.java:444)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
2018-11-09 14:11:15,969 INFO [pool-5-thread-1] com.tencent.angel.psagent.PSAgent: psagent falied
2018-11-09 14:11:15,985 INFO [pool-5-thread-1] com.tencent.angel.worker.Worker: worker failed message : merge OpLogMergeMessage [update=com.tencent.angel.ml.math.vector.DenseDoubleVector@77d861dc, toString()=OpLogMessage [matrixId=1, type=MERGE, context=com.tencent.angel.psagent.task.TaskContext@16aa8654TaskContext [index=0, matrix clocks=(matrixId=0,clock=2)(matrixId=1,clock=1)(matrixId=2,clock=1)], seqId=17]] falied, Java heap space, send it to appmaster success
2018-11-09 14:11:15,985 INFO [pool-5-thread-1] com.tencent.angel.worker.Worker: start to close all modules in worker
2018-11-09 14:11:15,985 INFO [pool-5-thread-1] com.tencent.angel.worker.Worker: stop workerService
2018-11-09 14:11:15,985 INFO [pool-5-thread-1] com.tencent.angel.worker.WorkerService: stop rpc server
2018-11-09 14:11:15,985 INFO [pool-5-thread-1] com.tencent.angel.ipc.NettyServer: Stopping server on 20586
2018-11-09 14:11:15,985 ERROR [Worker Heartbeat] com.tencent.angel.worker.Worker: report to appmaster failed, err: 
java.lang.NullPointerException
	at com.tencent.angel.worker.Worker.heartbeat(Worker.java:341)
	at com.tencent.angel.worker.Worker.access$200(Worker.java:65)
	at com.tencent.angel.worker.Worker$1.run(Worker.java:303)
	at java.lang.Thread.run(Thread.java:745)
原文地址:https://www.cnblogs.com/suanec/p/9935539.html