Spark MLlib LDA: a simple example

Our public-opinion monitoring system uses LDA topic clustering to produce its daily hot words.

The original version was a Python project: Jieba for word segmentation and Gensim for LDA.

The project worked well, but it had a few problems:


1. The product is built on Elasticsearch, where segmentation is done by Lucene analyzers. Jieba segmentation on the Python side does not produce the same tokens as Lucene (or keeping them consistent requires extra work). When the requirement was only to display daily hot words, the inconsistency didn't matter; the new requirement is for LDA to integrate seamlessly with the data already in ES. Integrating Jieba into ES and re-segmenting the full data set is unrealistic in both effort and technical difficulty, so the practical option is to change the segmentation used for LDA. (In practice, for extracting topics and hot words with LDA, the choice of segmentation algorithm makes almost no difference.)

2. The Python project is limited to single-node computation and does not scale well.

Using Lucene for segmentation and then running LDA on the result means that switching to the Java stack is the simplest option; a rough sketch of the segmentation side is shown below.
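As an illustration only (not code from the original project): segmentation on the JVM side with a Lucene analyzer could look roughly like the following. SmartChineseAnalyzer is just one possible analyzer, and luceneTokenize is a hypothetical helper name.

import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper: segment one document with a Lucene analyzer so that the
// offline LDA job tokenizes text the same way the search side does.
def luceneTokenize(text: String): Seq[String] = {
  val analyzer = new SmartChineseAnalyzer()
  val stream = analyzer.tokenStream("content", text)
  val term = stream.addAttribute(classOf[CharTermAttribute])
  val tokens = ArrayBuffer.empty[String]
  stream.reset()
  while (stream.incrementToken()) tokens += term.toString
  stream.end()
  stream.close()
  analyzer.close()
  tokens.toSeq
}

In a real pipeline the analyzer would be shared rather than created per document, and the analyzer chosen should match whatever is configured in Elasticsearch.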

There are plenty of LDA implementations on the JVM as well; this post is only a feasibility check.

The candidate examined here is spark-mllib, which ships an official LDA implementation:

http://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda


The example source ships with the Spark distribution:

ls examples/src/main/scala/org/apache/spark/examples/mllib/LDAExample.scala

Running it without arguments prints the usage:

root@wx-social-consume2:/usr/spark-2.4.0# ./bin/run-example mllib.LDAExample
2019-03-21 10:28:07 WARN Utils:66 - Your hostname, wx-social-consume2 resolves to a loopback address: 127.0.0.1; using 10.10.3.47 instead (on interface em1)
2019-03-21 10:28:07 WARN Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
2019-03-21 10:28:07 WARN NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Error: Missing argument <input>...
LDAExample: an example LDA app for plain text data.
Usage: LDAExample [options] <input>...

--k <value> number of topics. default: 20
--maxIterations <value> number of iterations of learning. default: 10
--docConcentration <value>
amount of topic smoothing to use (> 1.0) (-1=auto). default: -1.0
--topicConcentration <value>
amount of term (word) smoothing to use (> 1.0) (-1=auto). default: -1.0
--vocabSize <value> number of distinct word types to use, chosen by frequency. (-1=all) default: 10000
--stopwordFile <value> filepath for a list of stopwords. Note: This must fit on a single machine. default:
--algorithm <value> inference algorithm to use. em and online are supported. default: em
--checkpointDir <value> Directory for checkpointing intermediate results. Checkpointing helps with recovery and eliminates temporary shuffle files on disk. default: None
--checkpointInterval <value>
Iterations between each checkpoint. Only used if checkpointDir is set. default: 10
<input>... input paths (directories) to plain text corpora. Each text file line should hold 1 document.
2019-03-21 10:28:10 INFO ShutdownHookManager:54 - Shutdown hook called
2019-03-21 10:28:10 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-0022e162-fa2a-4e0e-8e89-c78ea7962dd1
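To show how the options above fit together, here is a hypothetical invocation (stopwords.txt is just a placeholder path, and the option values are arbitrary):

./bin/run-example mllib.LDAExample --k 5 --maxIterations 20 --vocabSize 5000 --stopwordFile stopwords.txt --algorithm online data/mllib/sample_lda_data.txt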


The official example
Refer to the LDA Scala docs and DistributedLDAModel Scala docs for details on the API.

import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("data/mllib/sample_lda_data.txt")
val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
// Index documents with unique IDs
val corpus = parsedData.zipWithIndex.map(_.swap).cache()

// Cluster the documents into three topics using LDA
val ldaModel = new LDA().setK(3).run(corpus)

// Output topics. Each is a distribution over words (matching word count vectors)
println(s"Learned topics (as distributions over vocab of ${ldaModel.vocabSize} words):")
val topics = ldaModel.topicsMatrix
for (topic <- Range(0, 3)) {
  print(s"Topic $topic :")
  for (word <- Range(0, ldaModel.vocabSize)) {
    print(s" ${topics(word, topic)}")
  }
  println()
}

// Save and load model.
ldaModel.save(sc, "target/org/apache/spark/LatentDirichletAllocationExample/LDAModel")
val sameModel = DistributedLDAModel.load(sc,
  "target/org/apache/spark/LatentDirichletAllocationExample/LDAModel")

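One note beyond the official snippet: with the default EM optimizer, the returned model is a DistributedLDAModel, which also keeps the per-document topic mixtures. That is the part relevant to the hot-words use case, where topics need to be tied back to documents. A minimal sketch (variable names are illustrative):

// The default EM optimizer yields a DistributedLDAModel, so the cast is safe here.
val distModel = ldaModel.asInstanceOf[DistributedLDAModel]
val docTopics = distModel.topicDistributions  // RDD[(docId, topic-mixture vector)]
docTopics.take(3).foreach { case (docId, dist) =>
  println(s"doc $docId -> ${dist.toArray.mkString(", ")}")
}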

The input file is "data/mllib/sample_lda_data.txt". Run the example against it:

root@wx-social-consume2:/usr/spark-2.4.0# ./bin/run-example mllib.LDAExample data/mllib/sample_lda_data.txt
2019-03-21 10:29:55 WARN Utils:66 - Your hostname, wx-social-consume2 resolves to a loopback address: 127.0.0.1; using 10.10.3.47 instead (on interface em1)
2019-03-21 10:29:55 WARN Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
2019-03-21 10:29:56 WARN NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-03-21 10:29:58 INFO SparkContext:54 - Running Spark version 2.4.0
2019-03-21 10:29:58 INFO SparkContext:54 - Submitted application: LDAExample with {
input: List(data/mllib/sample_lda_data.txt),
k: 20,
maxIterations: 10,
docConcentration: -1.0,
topicConcentration: -1.0,
vocabSize: 10000,
stopwordFile: ,
algorithm: em,
checkpointDir: None,
checkpointInterval: 10
}
2019-03-21 10:29:58 INFO SecurityManager:54 - Changing view acls to: root
2019-03-21 10:29:58 INFO SecurityManager:54 - Changing modify acls to: root
2019-03-21 10:29:58 INFO SecurityManager:54 - Changing view acls groups to:
2019-03-21 10:29:58 INFO SecurityManager:54 - Changing modify acls groups to:
2019-03-21 10:29:58 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
2019-03-21 10:29:58 INFO Utils:54 - Successfully started service 'sparkDriver' on port 63504.
2019-03-21 10:29:58 INFO SparkEnv:54 - Registering MapOutputTracker
2019-03-21 10:29:58 INFO SparkEnv:54 - Registering BlockManagerMaster
2019-03-21 10:29:58 INFO BlockManagerMasterEndpoint:54 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
2019-03-21 10:29:58 INFO BlockManagerMasterEndpoint:54 - BlockManagerMasterEndpoint up
2019-03-21 10:29:58 INFO DiskBlockManager:54 - Created local directory at /tmp/blockmgr-ac172871-a0e5-41af-9e76-a09ea01c3428
2019-03-21 10:29:58 INFO MemoryStore:54 - MemoryStore started with capacity 366.3 MB
2019-03-21 10:29:58 INFO SparkEnv:54 - Registering OutputCommitCoordinator
2019-03-21 10:29:59 INFO log:192 - Logging initialized @4022ms
2019-03-21 10:29:59 INFO Server:351 - jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
2019-03-21 10:29:59 INFO Server:419 - Started @4109ms
2019-03-21 10:29:59 WARN Utils:66 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
2019-03-21 10:29:59 INFO AbstractConnector:278 - Started ServerConnector@1386313f{HTTP/1.1,[http/1.1]}{0.0.0.0:4041}
2019-03-21 10:29:59 INFO Utils:54 - Successfully started service 'SparkUI' on port 4041.
2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1921994e{/jobs,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1a8df0b3{/jobs/json,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7c112f5f{/jobs/job,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4fd05028{/jobs/job/json,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3a2d3909{/stages,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4fb392c4{/stages/json,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@194d329e{/stages/stage,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@541179e7{/stages/stage/json,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@24386839{/stages/pool,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7b32b129{/stages/pool/json,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@439e3cb4{/storage,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1c9fbb61{/storage/json,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7b81616b{/storage/rdd,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@15d42ccb{/storage/rdd/json,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@279dd959{/environment,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@46383a78{/environment/json,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@36c281ed{/executors,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@244418a{/executors/json,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4b5a078a{/executors/threadDump,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4c361f63{/executors/threadDump/json,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6ed922e1{/static,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@35f3a22c{/,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1a0c5e9{/api,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@32e652b6{/jobs/job/kill,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4ba02375{/stages/stage/kill,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at http://10.10.3.47:4041
2019-03-21 10:29:59 INFO SparkContext:54 - Added JAR file:///usr/spark-2.4.0/examples/jars/scopt_2.11-3.7.0.jar at spark://10.10.3.47:63504/jars/scopt_2.11-3.7.0.jar with timestamp 1553164199210
2019-03-21 10:29:59 INFO SparkContext:54 - Added JAR file:///usr/spark-2.4.0/examples/jars/spark-examples_2.11-2.4.0.jar at spark://10.10.3.47:63504/jars/spark-examples_2.11-2.4.0.jar with timestamp 1553164199211
2019-03-21 10:29:59 INFO Executor:54 - Starting executor ID driver on host localhost
2019-03-21 10:29:59 INFO Utils:54 - Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 29540.
2019-03-21 10:29:59 INFO NettyBlockTransferService:54 - Server created on 10.10.3.47:29540
2019-03-21 10:29:59 INFO BlockManager:54 - Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
2019-03-21 10:29:59 INFO BlockManagerMaster:54 - Registering BlockManager BlockManagerId(driver, 10.10.3.47, 29540, None)
2019-03-21 10:29:59 INFO BlockManagerMasterEndpoint:54 - Registering block manager 10.10.3.47:29540 with 366.3 MB RAM, BlockManagerId(driver, 10.10.3.47, 29540, None)
2019-03-21 10:29:59 INFO BlockManagerMaster:54 - Registered BlockManager BlockManagerId(driver, 10.10.3.47, 29540, None)
2019-03-21 10:29:59 INFO BlockManager:54 - Initialized BlockManager: BlockManagerId(driver, 10.10.3.47, 29540, None)
2019-03-21 10:29:59 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6587305a{/metrics/json,null,AVAILABLE,@Spark}

Corpus summary:
Training set size: 12 documents
Vocabulary size: 10 terms
Training set size: 62 tokens
Preprocessing time: 5.008004889 sec

Finished training LDA model. Summary:
Training time: 3.565145577 sec
Training data average log likelihood: -20.14862821928427

20 topics:
TOPIC 0
0 0.2776616245970018
1 0.2467437437143153
2 0.162799944092254
3 0.14357469330901143
4 0.06928101473026803
9 0.04699050355830519
5 0.030203661449368247
6 0.007640498460509008
7 0.007552251606432597
8 0.007552064482534436

TOPIC 1
0 0.36469051153508564
1 0.18646127739213272
2 0.15522432783704704
3 0.13106181033451642
4 0.06584596207691033
9 0.04395627892970295
5 0.030062711796482375
8 0.007597985215275549
7 0.00759765588135554
6 0.007501479001491501

TOPIC 2
0 0.31931887563375516
1 0.1962082806692423
2 0.17231677857837657
3 0.14159208272582544
4 0.07269228847838642
9 0.044804492369582984
5 0.030074171215304125
7 0.007707667829625457
8 0.007707603060281631
6 0.007577759439619957

TOPIC 3
0 0.33172427138380095
1 0.2099908538476618
2 0.1529315801366396
3 0.13055578109706914
4 0.07654434907166915
9 0.045380994421043694
5 0.03023101958436096
8 0.007550809491464853
7 0.007550747971239613
6 0.00753959299505035

TOPIC 4
0 0.32678177168416583
1 0.2207731082949569
2 0.1515085115377772
3 0.13531105489561282
4 0.06780498015453475
9 0.04539833136958728
5 0.02983504958532268
6 0.007557895637294794
7 0.007514688233324302
8 0.007514608607423405

TOPIC 5
0 0.3239146979457089
1 0.21573894993478243
2 0.14525808340440868
3 0.14073654896536467
4 0.07677900991703532
9 0.04501240647651081
5 0.03001971166224193
6 0.007555937737508214
7 0.00749238505224968
8 0.007492268904189478

TOPIC 6
0 0.31540967245137175
1 0.2455372478674326
2 0.15387618434880607
3 0.11776433445702658
4 0.06958508618823488
9 0.0454694806523595
5 0.029862323423880312
6 0.007550844343998941
7 0.007472470469363186
8 0.007472355797526257

TOPIC 7
0 0.3288373150630124
1 0.1984831879713159
2 0.14673351329227885
3 0.13721294614635954
4 0.09099655027113343
9 0.04408525595608986
5 0.031035301494144973
8 0.007542328434232405
7 0.007542260756614362
6 0.007531340614818258

TOPIC 8
0 0.3193586979466173
2 0.19372032290375615
1 0.17385196290359722
3 0.15067297324203144
4 0.06431584983802775
9 0.04459867057873216
5 0.030109171006041484
7 0.007889712095309039
8 0.007889665041634011
6 0.007592974444253261

TOPIC 9
0 0.30194242995872983
1 0.19005321586219182
2 0.18325896599060398
3 0.13814239333008232
4 0.08613769346253837
9 0.04615261626883708
5 0.031097388101704492
8 0.007811116080942582
7 0.00781095323650845
6 0.007593227707861054

TOPIC 10
0 0.3018333962608689
1 0.2101327107384649
2 0.17976426989184977
3 0.13028433691427393
4 0.0785628713958075
9 0.04589483819874114
5 0.030473519782385595
8 0.007732641246914153
7 0.007732610576531091
6 0.007588804994162942

TOPIC 11
0 0.2731844012758353
1 0.2437828502953355
2 0.15602289141674863
3 0.14124037120394928
4 0.08621787369766198
9 0.04595175938704448
5 0.03092529389836489
6 0.007624331986973666
8 0.007525230917852694
7 0.007524995920233587

TOPIC 12
0 0.2754166406308875
1 0.23790442921451685
2 0.16407059546150252
3 0.15224175554500483
4 0.06911748308147463
9 0.04812626048120002
5 0.030286734901316135
6 0.007658288924620451
8 0.007588920737731278
7 0.0075888910217458685

TOPIC 13
0 0.30106045484425825
1 0.24631760902734479
3 0.14410465573241527
2 0.12524822006978883
4 0.08639636802254201
9 0.04423754911967344
5 0.030456963732366213
6 0.007563767806759953
7 0.00730722647659661
8 0.0073071851682546506

TOPIC 14
0 0.31562044416659885
1 0.2526998226062497
2 0.14535277752245734
3 0.12218888456011873
4 0.0662622160602047
9 0.04569769393216447
5 0.02981990651641202
6 0.007553483942629782
7 0.007402409879010876
8 0.00740236081415358

TOPIC 15
0 0.32058662644832936
1 0.20394629563372832
2 0.14716518634378675
3 0.1428368219219527
4 0.08608013435759047
9 0.046286592753669746
5 0.030458243185553725
6 0.007557253951254248
8 0.007541483951171858
7 0.007541361452962736

TOPIC 16
0 0.34720661993137836
1 0.19335921917713203
2 0.1574642973682748
3 0.1315805755079897
4 0.07262519571107433
9 0.044460628102199806
5 0.03055492723901837
7 0.0076115120495338995
8 0.007611307835359196
6 0.0075257170780395

TOPIC 17
0 0.276113881630822
1 0.23014600960245993
2 0.1791490549120877
3 0.13887566068288787
4 0.0755460207226412
9 0.04662339718458672
5 0.03051544813778674
7 0.007695876502113754
8 0.007695757393934347
6 0.00763889323067976

TOPIC 18
0 0.3143607517778961
1 0.23372981830486098
2 0.15462561577264133
3 0.12748419732486796
4 0.07250495549430869
9 0.04461512010586714
5 0.030101883279393397
6 0.007561984484167818
8 0.007507862285458234
7 0.007507811170538331

TOPIC 19
0 0.27641365880548574
1 0.25811740868556055
2 0.15554256853273
3 0.1299929833331776
4 0.08206677473861888
9 0.04536961458219285
5 0.029948048909439938
6 0.007602171935254187
7 0.007473418458492894
8 0.007473352019047361


The official example runs fine. Next, run it again with only 3 topics:

./bin/run-example mllib.LDAExample data/mllib/sample_lda_data.txt --k=3

Corpus summary:
Training set size: 12 documents
Vocabulary size: 10 terms
Training set size: 62 tokens
Preprocessing time: 3.952237148 sec

Finished training LDA model. Summary:
Training time: 2.096613705 sec
Training data average log likelihood: -19.94356723200465

3 topics:
TOPIC 0
0 0.3859910534194794
1 0.27784612698328287
3 0.11309885992982183
2 0.07754302744583665
4 0.05944898611629052
9 0.03956337141492645
5 0.026961271416303733
6 0.00729179016688132
7 0.006153234509962162
8 0.006102278597215016

TOPIC 1
0 0.27760966762034767
2 0.20133458026264603
1 0.17674769647471528
3 0.17119003866520263
4 0.05856966790108555
9 0.054142861315096644
5 0.036444242827270906
6 0.008131736037023057
7 0.008064284551869805
8 0.007765224344742546

TOPIC 2
0 0.26781477963691924
1 0.20419276965972555
2 0.1988283635339889
3 0.12491741400170082
4 0.10934994319103938
9 0.0426866734196486
5 0.027519743427499636
8 0.008867804538854353
7 0.008517400571616986
6 0.007305108019006569

Now look at the contents of sample_lda_data.txt:
root@wx-social-consume2:/usr/spark-2.4.0# cat data/mllib/sample_lda_data.txt
1 2 6 0 2 3 1 1 0 0 3
1 3 0 1 3 0 0 2 0 0 1
1 4 1 0 0 4 9 0 1 2 0
2 1 0 3 0 0 5 0 2 3 9
3 1 1 9 3 0 2 0 0 1 3
4 2 0 3 4 5 1 1 1 4 0
2 1 0 3 0 0 5 0 2 2 9
1 1 1 9 2 1 2 0 0 1 3
4 4 0 3 4 2 1 3 0 0 0
2 8 2 0 3 0 2 0 2 7 2
1 1 1 9 0 2 2 0 0 3 3
4 1 0 0 4 5 1 3 0 1 0

Each row of this matrix has 11 values, so why does each topic list only 10 terms? Is my understanding wrong?
TOPIC 2
0 0.26781477963691924
1 0.20419276965972555
2 0.1988283635339889
3 0.12491741400170082
4 0.10934994319103938
9 0.0426866734196486
5 0.027519743427499636
8 0.008867804538854353
7 0.008517400571616986
6 0.007305108019006569

Looking at the code:

val topicIndices = ldaModel.describeTopics(maxTermsPerTopic = 10)

It only takes the top 10 terms per topic, so the understanding holds. (The corpus summary above also reports "Vocabulary size: 10 terms": LDAExample treats each line as plain text, so the distinct tokens here are just the digits 0-9.)
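For completeness, this is roughly how such term indices can be mapped back to readable terms. vocabArray stands for the vocabulary built during preprocessing; the names are illustrative, not a quote of LDAExample's source:

// Sketch: resolve term indices to vocabulary strings and print the top terms per topic.
val topicIndices = ldaModel.describeTopics(maxTermsPerTopic = 10)
topicIndices.zipWithIndex.foreach { case ((termIndices, termWeights), topicId) =>
  println(s"TOPIC $topicId")
  termIndices.zip(termWeights).foreach { case (termIdx, weight) =>
    println(s"${vocabArray(termIdx)}\t$weight")
  }
  println()
}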


So far this is still offline, single-node computation.

Original article: https://www.cnblogs.com/zihunqingxin/p/10591799.html