Learning Mahout (Part 2)

This post continues from the previous one.

It shows how to run the "Hello world" example that ships with Mahout.

I installed Mahout in /opt/hadoop/mahout-distribution-0.9.

cd /opt/hadoop/mahout-distribution-0.9/examples/bin
vi cluster-syntheticcontrol.sh

Search the script for the "curl" command. My Ubuntu install does not have curl, so that line needs a small change.

Original:

curl http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data -o ${WORK_DIR}/synthetic_control.data

Changed to:

#curl http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data -o ${WORK_DIR}/synthetic_control.data
wget http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data
mv synthetic_control.data ${WORK_DIR}

All this step does is download the synthetic_control.data file, so wget does the job just as well.
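If you would rather not hard-code one tool, the download step can be sketched as a small helper that uses whichever of curl or wget is installed. This is only a sketch: fetch_file is a name I made up, it is not part of the Mahout script.

```shell
# Hypothetical helper (not part of cluster-syntheticcontrol.sh):
# download URL to DEST with curl if present, otherwise with wget.
fetch_file() {
  local url="$1" dest="$2"
  if command -v curl >/dev/null 2>&1; then
    # -f: fail on HTTP errors, -L: follow redirects
    curl -fL "$url" -o "$dest"
  elif command -v wget >/dev/null 2>&1; then
    # -O writes straight to DEST, so no separate mv is needed
    wget "$url" -O "$dest"
  else
    echo "neither curl nor wget is installed" >&2
    return 1
  fi
}

# Example call, with the same URL and variable the script uses:
# fetch_file http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data \
#            "${WORK_DIR}/synthetic_control.data"
```

The function only defines the fallback; nothing is downloaded until you call it.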

Run the script:

./cluster-syntheticcontrol.sh

It prints a menu asking which clustering algorithm to use. I have not looked into the differences in any depth; here I chose 2.
Please select a number to choose the corresponding clustering algorithm
1. canopy clustering
2. kmeans clustering
3. fuzzykmeans clustering
Enter your choice : 2

After you press Enter, it runs:

/opt/hadoop/mahout-distribution-0.9/bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

If you open the /opt/hadoop/mahout-distribution-0.9/bin/mahout script, you will see that, after setting up the environment variables, what it actually calls is

${HADOOP_HOME}/bin/hadoop jar mahout-examples-0.9-job.jar \
               org.apache.mahout.driver.MahoutDriver \
               org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

Before the MapReduce job runs, the script first puts the data file on HDFS. By default it goes under /user/${user}/testdata.
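To check that the input really landed there, you can list the directory. The sketch below wraps "hadoop fs -ls"; hdfs_ls and the HADOOP override variable are names I made up for illustration. A relative HDFS path such as testdata resolves against the current user's home directory, i.e. /user/<user>/testdata.

```shell
# Hypothetical wrapper: list HDFS paths with the hadoop launcher.
# HADOOP is an assumed override; set it to ${HADOOP_HOME}/bin/hadoop
# if hadoop is not on your PATH.
hdfs_ls() {
  "${HADOOP:-hadoop}" fs -ls "$@"
}

# Example call: show the input the script staged for the job.
# hdfs_ls testdata
```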

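The job writes its results to /user/${user}/output/clusteredPoints/part-m-00000 on HDFS. This is a Hadoop SequenceFile, so it cannot be opened and read directly; it has to be converted first.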

bin/mahout  seqdumper --input /user/root/output/clusteredPoints/ --output chenfool.txt

This writes the clustering results to the local filesystem, in this case to the file chenfool.txt.
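The command above hard-codes /user/root because I ran everything as root. A minimal sketch for building the path from the current user instead, assuming your HDFS user name is the same as your local login name; clustered_points_path is a made-up helper name.

```shell
# Hypothetical helper: print the HDFS directory holding the clustered points.
# Uses the given user name, or the local login name if none is given
# (assumption: HDFS user == local user).
clustered_points_path() {
  local user="${1:-$(whoami)}"
  echo "/user/${user}/output/clusteredPoints/"
}

# Example call, run from the Mahout install directory:
# bin/mahout seqdumper --input "$(clustered_points_path)" --output chenfool.txt
```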