How to execute Mr.LDA

Since Mr.LDA runs on Hadoop, its outcomes live in HDFS, and the key to getting readable results is to convert them to a proper format. The detailed methods are in the original Mr.LDA project and can be followed by referring to its README.md. The main steps to train on a corpus are as follows:

1.Prepare the corpus

Two points need attention.

  • Firstly, the format of the corpus is the same as lda-c's, so we have to convert the raw corpus to that format with some code of our own (a minimal sketch is given at the end of this step).
  • Secondly, to be processed on Hadoop, the corpus has to be parsed once more. The code for this is available in the original Mr.LDA, and all we need to do is write a shell command like this:
$ hadoop jar target/mrlda-0.9.0-SNAPSHOT-fatjar.jar cc.mrlda.ParseCorpus \
    -input ap-sample.txt -output ap-sample-parsed

A completely parsed corpus is separated into several parts by property, like this:

$ hadoop fs -ls ap-sample-parsed
ap-sample-parsed/document
ap-sample-parsed/term
ap-sample-parsed/title

The corpus we use to run Mr.LDA in the next step comes from this folder.
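
For the first point above, here is a minimal sketch of the conversion to lda-c format, in which each document is one line of the form "N id1:count1 id2:count2 ..." with N being the number of unique terms. The file names and the whitespace tokenization are assumptions for illustration, not part of the original post:

# Minimal sketch (assumed file names): convert one-document-per-line
# raw text into lda-c format: "N id1:cnt1 id2:cnt2 ..." per document.
from collections import Counter

vocab = {}  # word -> integer term id

with open('ap-sample.txt') as src, open('ap-sample.ldac', 'w') as dst:
    for line in src:
        counts = Counter(line.lower().split())  # naive whitespace tokenization
        if not counts:
            continue  # skip empty lines
        for word in counts:
            vocab.setdefault(word, len(vocab))  # assign ids on first sight
        pairs = ' '.join('%d:%d' % (vocab[w], c) for w, c in counts.items())
        dst.write('%d %s\n' % (len(counts), pairs))

# keep the vocabulary so term ids can be mapped back to words later
with open('vocab.txt', 'w') as out:
    for word, _ in sorted(vocab.items(), key=lambda item: item[1]):
        out.write(word + '\n')

Keeping vocab.txt around lets us map term ids back to words when we inspect the topics later.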

2.Run "vanilla" LDA

This step takes a long time, about 1 or 2 hours, so we run it in the background with nohup; progress can be followed with tail -f lda.log. Set the parameters and run it like this (here -topic is the number of topics, -iteration the number of iterations, and -term, -mapper, and -reducer set the vocabulary size and the numbers of mappers and reducers):

$ nohup hadoop jar target/mrlda-0.9.0-SNAPSHOT-fatjar.jar \
    cc.mrlda.VariationalInference \
    -input ap-sample-parsed/document -output ap-sample-lda \
    -term 10000 -topic 20 -iteration 50 -mapper 50 -reducer 20 >& lda.log &

3.Convert the outcomes to a proper format

The outcomes are stored in HDFS and are not directly readable. To get readable data, we must convert them to a proper format.
Note that the conversion method needs the SciPy module in Python, which is used to read data from MATLAB and similar formats. To add the module we only need to type:

$ sudo apt-get install python-scipy
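
Once installed, SciPy can read MATLAB-style files through its io module. The sketch below only illustrates that call; the file name beta.mat and the variable key beta are hypothetical placeholders, not outputs of Mr.LDA:

from scipy import io

# 'beta.mat' and the key 'beta' are hypothetical placeholders
data = io.loadmat('beta.mat')  # returns a dict mapping variable names to arrays
beta = data['beta']
print(beta.shape)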

Then we can view the alpha and beta files in the terminal by using the tools of the original Mr.LDA. One question remains here: how to export the alpha, beta, and other files as the final outcomes.
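
As a generic illustration of what to do with the beta matrix once it has been exported, the sketch below prints the top words of each topic. It assumes beta is a num_topics x vocab_size array of (log-)probabilities and vocab is an id-to-word list such as the one saved in step 1; this is not Mr.LDA's own tooling, since the post does not specify the on-disk format:

import numpy as np

# beta: (num_topics x vocab_size) array; vocab: list mapping term id -> word
def top_words(beta, vocab, n=10):
    for k, row in enumerate(beta):
        top = np.argsort(row)[::-1][:n]  # ids of the n most probable terms
        print('topic %d: %s' % (k, ' '.join(vocab[i] for i in top)))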

4.About evaluation of machine learning

The key to evaluating any machine learning algorithm is to split the corpus into three datasets: a training set, a development set, and a test set. The training set is used to fit the model, the development set is used to select parameters, and the test set is used for evaluation. For this task, since we do not focus on tuning parameters, we use only the training set and the test set.
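
As a minimal sketch of such a split (the 90/10 ratio, the fixed seed, and the file names are assumptions for illustration, not values from the original post):

import random

# read the corpus, one document per line
with open('ap-sample.txt') as f:
    docs = f.readlines()

random.seed(42)       # fixed seed so the split is reproducible
random.shuffle(docs)

split = int(0.9 * len(docs))  # 90% training, 10% test
with open('ap-train.txt', 'w') as f:
    f.writelines(docs[:split])
with open('ap-test.txt', 'w') as f:
    f.writelines(docs[split:])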

Original post: https://www.cnblogs.com/cyno/p/4182026.html