An analysis of how mahout invokes hadoop, and ideas for building a unified platform

The dispatch code of the mahout script in mahout's bin directory

 if [ "$MAHOUT_JOB" = "" ] ; then
    echo "ERROR: Could not find mahout-examples-*.job in $MAHOUT_HOME or $MAHOUT_HOME/examples/target, please run 'mvn install' to create the .job file"
    exit 1
  else
    case "$1" in
    (hadoop)
      shift
      export HADOOP_CLASSPATH=$MAHOUT_CONF_DIR:${HADOOP_CLASSPATH}:$CLASSPATH
      exec "$HADOOP_BINARY" "$@"
      ;;
    (classpath)
      echo $CLASSPATH
      ;;
    (*)
      echo "MAHOUT-JOB: $MAHOUT_JOB"
      export HADOOP_CLASSPATH=$MAHOUT_CONF_DIR:${HADOOP_CLASSPATH}
      exec "$HADOOP_BINARY" jar $MAHOUT_JOB $CLASS "$@"
    esac
  fi

Clearly, mahout hands the job to hadoop: in the default branch it runs "hadoop jar" on the mahout job file.
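To see what each branch of the case statement above would run, here is a minimal dry-run sketch. describe_dispatch is a hypothetical helper of mine, not part of mahout, and the variable values are placeholders; it only prints the command line instead of exec-ing it.

```shell
# Dry-run sketch of the dispatch above: print the command instead of exec-ing it.
# describe_dispatch is a hypothetical helper, not part of the mahout script.
describe_dispatch() {
  case "$1" in
    hadoop)
      # "mahout hadoop ..." passes everything after "hadoop" to the hadoop binary
      shift
      echo "$HADOOP_BINARY $*"
      ;;
    classpath)
      echo "$CLASSPATH"
      ;;
    *)
      # default: run the mahout job file through "hadoop jar"
      echo "$HADOOP_BINARY jar $MAHOUT_JOB $CLASS $*"
      ;;
  esac
}

# Placeholder values, assumed for illustration only
HADOOP_BINARY=/usr/bin/hadoop
MAHOUT_JOB=mahout-examples.job
CLASS=example.Driver

describe_dispatch kmeans -i input -o output
describe_dispatch hadoop fs -ls /
```

Note that in the default branch the first argument (here kmeans) is not shifted away, so it reaches the driver class as its first argument.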

Now look at the execution part of the hadoop script in hadoop's bin directory

 exec "$JSVC" -Dproc_$COMMAND -outfile "$JSVC_OUTFILE" \
               -errfile "$JSVC_ERRFILE" \
               -pidfile "$HADOOP_SECURE_DN_PID" \
               -nodetach \
               -user "$HADOOP_SECURE_DN_USER" \
               -cp "$CLASSPATH" \
               $JAVA_HEAP_MAX $HADOOP_OPTS \
               org.apache.hadoop.hdfs.server.datanode.SecureDataNodeStarter "$@"
else
  # run it
  exec "$JAVA" -Dproc_$COMMAND $JAVA_HEAP_MAX $HADOOP_OPTS -classpath "$CLASSPATH" $CLASS "$@"

So hadoop either starts the secure datanode through JSVC, or uses java to load the requested Java class directly

There is a catch, though: some mahout algorithms do not depend on hadoop

if [ ! -x "$HADOOP_BINARY" ] || [ "$MAHOUT_LOCAL" != "" ] ; then
  if [ ! -x "$HADOOP_BINARY" ] ; then
    echo "hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally"
  elif [ "$MAHOUT_LOCAL" != "" ] ; then
    echo "MAHOUT_LOCAL is set, running locally"
  fi
    CLASSPATH="${CLASSPATH}:${MAHOUT_HOME}/lib/hadoop/*"
    case $1 in
    (classpath)
      echo $CLASSPATH
      ;;
    (*)
      exec "$JAVA" $JAVA_HEAP_MAX $MAHOUT_OPTS -classpath "$CLASSPATH" $CLASS "$@"
    esac
else
  echo "Running on hadoop, using $HADOOP_BINARY and HADOOP_CONF_DIR=$HADOOP_CONF_DIR"
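When hadoop is not used (the hadoop binary is not executable, or MAHOUT_LOCAL is set), the script simply uses java to load the class and run the algorithm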
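The local-versus-hadoop decision in the excerpt above boils down to one test. A minimal sketch, assuming a hypothetical helper named pick_run_mode (my own name, not something mahout provides) that takes the hadoop binary path and the value of MAHOUT_LOCAL:

```shell
# pick_run_mode is a hypothetical helper that mirrors the test in the excerpt.
# $1: path to the hadoop binary (may be empty)
# $2: value of MAHOUT_LOCAL (may be empty)
pick_run_mode() {
  if [ ! -x "$1" ] || [ "$2" != "" ]; then
    # no usable hadoop binary, or the user asked for local mode
    echo local
  else
    echo hadoop
  fi
}

# Report the mode for the current environment
pick_run_mode "${HADOOP_BINARY:-}" "${MAHOUT_LOCAL:-}"
```

Any non-empty value of MAHOUT_LOCAL forces local mode, even when a working hadoop binary is available.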


So how can mahout and hadoop be used through one entry point, without logging in to the server?

A quick, low-effort idea is to write a website that takes the shell command to run as web parameters and then executes it on the server
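If the site really executes whatever it receives, anyone who can reach it can run arbitrary commands, so the server side should at least check what was submitted. A minimal sketch of such a check, assuming the web layer has already split the request into separate arguments; is_allowed_command and run_submitted are hypothetical names of mine, and the allowed character set is only an example:

```shell
# is_allowed_command is a hypothetical helper: it accepts only the two launcher
# names and rejects any argument containing characters outside a small safe set.
is_allowed_command() {
  case "$1" in
    mahout|hadoop) ;;
    *) return 1 ;;
  esac
  shift
  for arg in "$@"; do
    case "$arg" in
      # anything other than letters, digits and _ . / = : - is refused
      *[!A-Za-z0-9_./=:-]*) return 1 ;;
    esac
  done
  return 0
}

# run_submitted runs the command only if it passes the check above.
# The arguments are executed directly, never passed through eval.
run_submitted() {
  if is_allowed_command "$@"; then
    "$@"
  else
    echo "rejected: $*" >&2
    return 1
  fi
}
```

With this in place, a call such as run_submitted mahout kmeans -i input -o output would be passed through, while run_submitted rm -rf / or an argument containing ";" would be refused. This is only a first line of defence; authentication and limiting which paths may be used are still needed.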

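A more laborious option is to write a Java program that replaces the existing shell scripts in mahout and hadoop and manages the jar packages directly. This requires some understanding of how both work internally, or at the very least a careful reading of their launch scripts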


I lean entirely toward the first idea, especially if you are not fluent in Java and do not want to read the shell scripts

That said, the second approach is more extensible


Original article: https://www.cnblogs.com/AI001/p/3996909.html