Hama: BSP and Graph Tutorial

1. BSP

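Hama provides a pure BSP model that supports message passing and global communication. A BSP program consists of a sequence of supersteps, and each superstep has three parts:

  1) Local computation

  2) Process communication

  3) Barrier synchronization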

For a wide range of scientific computing problems, the BSP model can be used to write high-performance parallel algorithms.
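To make the three phases concrete, here is a minimal plain-Java sketch that does not use Hama at all. The class name SuperstepSketch, the ring topology, and the "find the largest peer id" task are all assumptions chosen for illustration. Each thread plays one peer, and a java.util.concurrent.CyclicBarrier stands in for the barrier synchronization.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CyclicBarrier;

// Plain-Java illustration of supersteps; none of these names come from Hama.
public class SuperstepSketch {

  /** Runs numPeers threads for numPeers supersteps; returns the value each peer ends with. */
  public static int[] run(int numPeers) {
    // Two inbox sets selected by superstep parity: a peer reads the set that was
    // filled during the previous superstep while writing into the other one.
    List<List<Queue<Integer>>> inboxes = new ArrayList<>();
    for (int parity = 0; parity < 2; parity++) {
      List<Queue<Integer>> set = new ArrayList<>();
      for (int p = 0; p < numPeers; p++) {
        set.add(new ConcurrentLinkedQueue<>());
      }
      inboxes.add(set);
    }

    CyclicBarrier barrier = new CyclicBarrier(numPeers);
    int[] result = new int[numPeers];
    Thread[] threads = new Thread[numPeers];

    for (int p = 0; p < numPeers; p++) {
      final int id = p;
      threads[p] = new Thread(() -> {
        int value = id; // each peer starts out knowing only its own id
        try {
          for (int superstep = 0; superstep < numPeers; superstep++) {
            // 1) Local computation: keep the largest value seen so far.
            Queue<Integer> inbox = inboxes.get(superstep % 2).get(id);
            Integer received;
            while ((received = inbox.poll()) != null) {
              value = Math.max(value, received);
            }
            // 2) Communication: pass the value to the next peer in the ring.
            inboxes.get((superstep + 1) % 2).get((id + 1) % numPeers).add(value);
            // 3) Barrier synchronization: wait until every peer has sent.
            barrier.await();
          }
        } catch (InterruptedException | BrokenBarrierException e) {
          throw new IllegalStateException(e);
        }
        result[id] = value;
      });
      threads[p].start();
    }

    for (Thread t : threads) {
      try {
        t.join();
      } catch (InterruptedException e) {
        throw new IllegalStateException(e);
      }
    }
    return result;
  }
}
```

Calling SuperstepSketch.run(4) should leave every peer holding the largest id, because after numPeers supersteps each value has travelled all the way around the ring.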

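Create your own BSP class by extending org.apache.hama.bsp.BSP.

The subclass must implement the following method:

  // M is bounded on the BSP class itself (M extends Writable)
  public abstract void bsp(BSPPeer<K1, V1, K2, V2, M> peer) throws IOException, SyncException, InterruptedException;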

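Every BSP program consists of a series of supersteps, but the bsp() method is called only once, which differs from MapReduce. Before and after the computation you can optionally implement the setup() and cleanup() methods to prepare or post-process the data of the computation. cleanup() is meant to run when the computation finishes or when it fails.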
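The lifecycle described above can be sketched in plain Java. MiniTask and runTask below are hypothetical stand-ins, not Hama classes; they only illustrate the ordering the text describes: setup() once, bsp() once (with all supersteps inside it), and cleanup() whether or not bsp() fails.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical lifecycle sketch; MiniTask is NOT org.apache.hama.bsp.BSP.
public class LifecycleSketch {

  /** Stand-in for a BSP task with optional setup/cleanup hooks. */
  public abstract static class MiniTask {
    public void setup(List<String> log) { log.add("setup"); }
    public abstract void bsp(List<String> log) throws Exception;
    public void cleanup(List<String> log) { log.add("cleanup"); }
  }

  /** Assumed runner: setup once, bsp once, cleanup on success and on failure. */
  public static List<String> runTask(MiniTask task) {
    List<String> log = new ArrayList<>();
    task.setup(log);
    try {
      task.bsp(log); // called exactly once; the supersteps live inside it
    } catch (Exception e) {
      log.add("failed: " + e.getMessage());
    } finally {
      task.cleanup(log);
    }
    return log;
  }

  /** Runs a three-superstep task, optionally failing in the second superstep. */
  public static List<String> demo(boolean fail) {
    return runTask(new MiniTask() {
      @Override
      public void bsp(List<String> log) throws Exception {
        for (int superstep = 0; superstep < 3; superstep++) {
          log.add("superstep " + superstep);
          if (fail && superstep == 1) {
            throw new IllegalStateException("boom");
          }
        }
      }
    });
  }
}
```

The contrast with MapReduce is that the loop over supersteps sits inside a single bsp() call instead of the framework invoking a callback once per record.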

Configuring the job:

  HamaConfiguration conf = new HamaConfiguration();
  BSPJob job = new BSPJob(conf, MyBSP.class);
  job.setJobName("My BSP program");
  job.setBspClass(MyBSP.class);
  job.setInputFormat(NullInputFormat.class);
  job.setOutputKeyClass(Text.class);
  ...
  job.waitForCompletion(true);

User Interface

Input and Output

When setting up a BSPJob, the input and output paths are specified as follows:

  job.setInputPath(new Path("/tmp/sequence.dat"));
  job.setInputFormat(org.apache.hama.bsp.SequenceFileInputFormat.class);
  // or,
  SequenceFileInputFormat.addInputPath(job, new Path("/tmp/sequence.dat"));
  // or,
  SequenceFileInputFormat.addInputPaths(job, "/tmp/seq1.dat,/tmp/seq2.dat,/tmp/seq3.dat");
  
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  job.setOutputFormat(TextOutputFormat.class);
  FileOutputFormat.setOutputPath(job, new Path("/tmp/result"));

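Any one of the three input variants above can be used to set the input.

Next comes reading the input data and writing the output data. The bsp() method takes a BSPPeer as its argument. BSPPeer contains the communication, counter, and IO interfaces. The following code reads a file: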

  @Override
  public final void bsp(
      BSPPeer<LongWritable, Text, Text, LongWritable, Text> peer)
      throws IOException, InterruptedException, SyncException {
      
      // this method reads the next key value record from file
      KeyValuePair<LongWritable, Text> pair = peer.readNext();

      // the following lines do the same:
      LongWritable key = new LongWritable();
      Text value = new Text();
      peer.readNext(key, value);
      
      // write
      peer.write(value, key);
  }

The input file can be read repeatedly:

  for (int i = 0; i < 5; i++) {
    LongWritable key = new LongWritable();
    Text value = new Text();
    while (peer.readNext(key, value)) {
      // read everything
    }
    // reopens the input
    peer.reopenInput();
  }

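Communication:

Method	Description
send(String peerName, BSPMessage msg)	Sends a message to another peer
getCurrentMessage()	Returns a received message
getNumCurrentMessages()	Returns the number of received messages
sync()	Barrier synchronization
getPeerName()	Returns the name of this peer
getAllPeerNames()	Returns the names of all peers
getSuperstepCount()	Returns the superstep count

These methods are flexible. The following code sends a message to every peer: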

  @Override
  public void bsp(
      BSPPeer<NullWritable, NullWritable, Text, DoubleWritable, Text> peer)
      throws IOException, SyncException, InterruptedException {
    for (String peerName : peer.getAllPeerNames()) {
      // Text has no (String, long) constructor, so the timestamp is appended to the string
      peer.send(peerName,
        new Text("Hello from " + peer.getPeerName() + " at " + System.currentTimeMillis()));
    }

    peer.sync();
  }
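The relationship between send(), sync(), and the receive methods can be illustrated without a cluster. The sketch below is a single-threaded plain-Java stand-in, not the Hama API: MailboxSketch and the "peer-N" names are assumptions, and the receive methods take a peer name only because one object simulates all peers here. It shows the usual BSP behaviour in which messages sent during a superstep become readable only after the barrier.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Single-threaded stand-in for peer messaging; method names mirror the table
// above, but this is an illustration, not the Hama implementation.
public class MailboxSketch {
  private final Map<String, List<String>> outgoing = new LinkedHashMap<>();  // buffered until sync()
  private final Map<String, Queue<String>> delivered = new LinkedHashMap<>(); // readable messages
  private long superstep = 0;

  public MailboxSketch(int numPeers) {
    for (int i = 0; i < numPeers; i++) {
      String name = "peer-" + i; // hypothetical naming scheme
      outgoing.put(name, new ArrayList<>());
      delivered.put(name, new ArrayDeque<>());
    }
  }

  public List<String> getAllPeerNames() { return new ArrayList<>(delivered.keySet()); }

  public void send(String peerName, String msg) { outgoing.get(peerName).add(msg); }

  public int getNumCurrentMessages(String peerName) { return delivered.get(peerName).size(); }

  /** Returns the next received message, or null when none is left. */
  public String getCurrentMessage(String peerName) { return delivered.get(peerName).poll(); }

  /** Barrier: hand over everything that was sent, then start the next superstep. */
  public void sync() {
    for (String name : delivered.keySet()) {
      delivered.get(name).addAll(outgoing.get(name));
      outgoing.get(name).clear();
    }
    superstep++;
  }

  public long getSuperstepCount() { return superstep; }

  /** Every peer greets every peer; returns peer-0's message count before and after sync(). */
  public static int[] broadcastDemo(int numPeers) {
    MailboxSketch net = new MailboxSketch(numPeers);
    for (String sender : net.getAllPeerNames()) {
      for (String receiver : net.getAllPeerNames()) {
        net.send(receiver, "Hello from " + sender);
      }
    }
    int before = net.getNumCurrentMessages("peer-0");
    net.sync();
    int after = net.getNumCurrentMessages("peer-0");
    return new int[] {before, after};
  }
}
```

With three peers, broadcastDemo(3) reports no readable messages before the barrier and three afterwards, one from each peer including the receiver itself.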

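Synchronization:

Once all processes have entered the synchronization state, the next superstep begins. Note that calling sync() does not end the BSP job. As mentioned above, all the communication methods are very flexible. For example, you can call sync() inside a for loop to control the order of the iterations.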

  @Override
  public void bsp(
      BSPPeer<NullWritable, NullWritable, Text, DoubleWritable, Text> peer)
      throws IOException, SyncException, InterruptedException {
    for (int i = 0; i < 100; i++) {
      // send some messages
      peer.sync();
    }
  }

Finally, here is a complete example that estimates the value of PI:

  private static Path TMP_OUTPUT = new Path("/tmp/pi-" + System.currentTimeMillis());

  public static class MyEstimator extends
      BSP<NullWritable, NullWritable, Text, DoubleWritable, DoubleWritable> {
    public static final Log LOG = LogFactory.getLog(MyEstimator.class);
    private String masterTask;
    private static final int iterations = 10000;

    @Override
    public void bsp(
        BSPPeer<NullWritable, NullWritable, Text, DoubleWritable, DoubleWritable> peer)
        throws IOException, SyncException, InterruptedException {

      int in = 0;
      for (int i = 0; i < iterations; i++) {
        double x = 2.0 * Math.random() - 1.0, y = 2.0 * Math.random() - 1.0;
        if ((Math.sqrt(x * x + y * y) < 1.0)) {
          in++;
        }
      }

      double data = 4.0 * in / iterations;

      peer.send(masterTask, new DoubleWritable(data));
      peer.sync();
    }

    @Override
    public void setup(
        BSPPeer<NullWritable, NullWritable, Text, DoubleWritable, DoubleWritable> peer)
        throws IOException {
      // Choose one as a master
      this.masterTask = peer.getPeerName(peer.getNumPeers() / 2);
    }

    @Override
    public void cleanup(
        BSPPeer<NullWritable, NullWritable, Text, DoubleWritable, DoubleWritable> peer)
        throws IOException {
      if (peer.getPeerName().equals(masterTask)) {
        double pi = 0.0;
        int numPeers = peer.getNumCurrentMessages();
        DoubleWritable received;
        while ((received = peer.getCurrentMessage()) != null) {
          pi += received.get();
        }

        pi = pi / numPeers;
        peer.write(new Text("Estimated value of PI is"), new DoubleWritable(pi));
      }
    }
  }

  static void printOutput(HamaConfiguration conf) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    FileStatus[] files = fs.listStatus(TMP_OUTPUT);
    for (int i = 0; i < files.length; i++) {
      if (files[i].getLen() > 0) {
        FSDataInputStream in = fs.open(files[i].getPath());
        IOUtils.copyBytes(in, System.out, conf, false);
        in.close();
        break;
      }
    }

    fs.delete(TMP_OUTPUT, true);
  }

  public static void main(String[] args) throws InterruptedException,
      IOException, ClassNotFoundException {
    // BSP job configuration
    HamaConfiguration conf = new HamaConfiguration();

    BSPJob bsp = new BSPJob(conf, PiEstimator.class);
    // Set the job name
    bsp.setJobName("Pi Estimation Example");
    bsp.setBspClass(MyEstimator.class);
    bsp.setInputFormat(NullInputFormat.class);
    bsp.setOutputKeyClass(Text.class);
    bsp.setOutputValueClass(DoubleWritable.class);
    bsp.setOutputFormat(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(bsp, TMP_OUTPUT);

    BSPJobClient jobClient = new BSPJobClient(conf);
    ClusterStatus cluster = jobClient.getClusterStatus(true);

    if (args.length > 0) {
      bsp.setNumBspTask(Integer.parseInt(args[0]));
    } else {
      // Set to maximum
      bsp.setNumBspTask(cluster.getMaxTasks());
    }

    long startTime = System.currentTimeMillis();
    if (bsp.waitForCompletion(true)) {
      printOutput(conf);
      System.out.println("Job Finished in "
          + (System.currentTimeMillis() - startTime) / 1000.0 + " seconds");
    }
  }
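The arithmetic behind the example can be checked on its own. A point drawn uniformly from the square [-1, 1] x [-1, 1] lands inside the unit circle with probability PI / 4, so 4 * in / iterations estimates PI, and the master averages the per-peer estimates. Below is a plain-Java sketch of just that math; the class PiSketch, the fixed seeds, and the sequential loop standing in for the peers are assumptions, not part of the Hama example.

```java
import java.util.Random;

// Plain-Java sketch of the Monte Carlo arithmetic; no Hama classes involved.
public class PiSketch {

  /** One peer's estimate: 4 * (fraction of random points that fall inside the unit circle). */
  public static double estimate(int iterations, long seed) {
    Random random = new Random(seed); // seeded so runs are repeatable (an assumption of this sketch)
    int in = 0;
    for (int i = 0; i < iterations; i++) {
      double x = 2.0 * random.nextDouble() - 1.0;
      double y = 2.0 * random.nextDouble() - 1.0;
      // Comparing the squared distance avoids the square root without changing the result.
      if (x * x + y * y < 1.0) {
        in++;
      }
    }
    return 4.0 * in / iterations;
  }

  /** The master's step: average the estimates received from all peers. */
  public static double average(double[] estimates) {
    double sum = 0.0;
    for (double e : estimates) {
      sum += e;
    }
    return sum / estimates.length;
  }

  /** Simulates numPeers peers one after another and returns the averaged estimate. */
  public static double run(int numPeers, int iterations) {
    double[] estimates = new double[numPeers];
    for (int p = 0; p < numPeers; p++) {
      estimates[p] = estimate(iterations, 42L + p);
    }
    return average(estimates);
  }
}
```

More iterations or more peers tighten the estimate, since the standard error shrinks with the square root of the total number of sampled points.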

2. Graph

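Hama provides a Graph package that supports vertex-centric graph computation, so Google Pregel-style applications can be implemented with little code.

Vertex API

Implementing a Hama Graph application involves subclassing the predefined Vertex class. Its template arguments define three value types: vertices, edges, and messages: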

  public abstract class Vertex<V extends Writable, E extends Writable, M extends Writable>
      implements VertexInterface<V, E, M> {

    public void compute(Iterator<M> messages) throws IOException;
    ..

  }

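The user overrides the compute() method, which is executed at every active vertex in each superstep. compute() can query information about the current vertex and its edges, and can send messages to other vertices.

VertexReader API

Create your own VertexReader for your file format by extending org.apache.hama.graph.VertexInputReader, for example: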

  public static class PagerankTextReader extends
      VertexInputReader<LongWritable, Text, Text, NullWritable, DoubleWritable> {

    /**
     * Input file format.
     * The text file essentially should look like: <br/>
     * VERTEX_ID\t(n-tab separated VERTEX_IDs)<br/>
     * E.G:<br/>
     * 1\t2\t3\t4<br/>
     * 2\t3\t1<br/>
     * etc.
     * <p>
     * Parses a vertex. As in Hadoop, the input is read one line at a time.
     * Each line is split on tabs into a String array, which is then
     * converted into an instance of the Vertex class.
     */
    @Override
    public boolean parseVertex(LongWritable key, Text value,
        Vertex<Text, NullWritable, DoubleWritable> vertex) throws Exception {
      String[] split = value.toString().split("\t");
      for (int i = 0; i < split.length; i++) {
        if (i == 0) {
          vertex.setVertexID(new Text(split[i]));
        } else {
          vertex
              .addEdge(new Edge<Text, NullWritable>(new Text(split[i]), null));
        }
      }
      return true;
    }

  }
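The parsing step itself is easy to try outside Hama. Here is a small plain-Java sketch of the same tab-splitting logic; AdjacencyLineSketch and ParsedVertex are hypothetical names used only for illustration, with plain strings in place of the Text and Edge types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-alone version of the line format handled by parseVertex above.
public class AdjacencyLineSketch {

  /** Hypothetical holder for one parsed line: a vertex id plus its outgoing neighbors. */
  public static final class ParsedVertex {
    public final String id;
    public final List<String> neighbors;

    ParsedVertex(String id, List<String> neighbors) {
      this.id = id;
      this.neighbors = neighbors;
    }
  }

  /** Splits "VERTEX_ID\tNEIGHBOR\tNEIGHBOR..." into the id and the neighbor list. */
  public static ParsedVertex parse(String line) {
    String[] split = line.split("\t");
    // The first field is the vertex id; every remaining field is an edge target.
    List<String> neighbors = new ArrayList<>(Arrays.asList(split).subList(1, split.length));
    return new ParsedVertex(split[0], neighbors);
  }
}
```

For the sample line 1\t2\t3\t4 this yields vertex "1" with neighbors 2, 3, and 4; a line holding only an id yields an empty neighbor list.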

A PageRank example, simple enough to need no further explanation:

  public static class PageRankVertex extends
      Vertex<Text, NullWritable, DoubleWritable> {

    @Override
    public void compute(Iterator<DoubleWritable> messages) throws IOException {
      if (this.getSuperstepCount() == 0) {
        this.setValue(new DoubleWritable(1.0 / (double) this.getNumVertices()));
      }

      if (this.getSuperstepCount() >= 1) {
        double sum = 0;
        while (messages.hasNext()) {
          DoubleWritable msg = messages.next();
          sum += msg.get();
        }

        double ALPHA = (1 - 0.85) / (double) this.getNumVertices();
        this.setValue(new DoubleWritable(ALPHA + (0.85 * sum)));
      }

      if (this.getSuperstepCount() < this.getMaxIteration()) {
        int numEdges = this.getOutEdges().size();
        sendMessageToNeighbors(new DoubleWritable(this.getValue().get()
            / numEdges));
      }
    }
  }
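To see what the compute() logic above does across supersteps, it can be replayed on a tiny in-memory graph. The following plain-Java sketch is an assumption-laden simulation, not Hama code: PageRankSketch is a made-up name, the graph is an adjacency array, and one array of message sums per superstep stands in for the message queues and the barrier. It also assumes every vertex has at least one outgoing edge.

```java
// Sequential replay of the vertex program on a small graph; not the Hama runtime.
public class PageRankSketch {

  /**
   * outEdges[v] lists the targets of v's outgoing edges.
   * Runs supersteps 0..maxIterations and returns the final vertex values.
   */
  public static double[] run(int[][] outEdges, int maxIterations) {
    int n = outEdges.length;
    double damping = 0.85;
    double[] value = new double[n];
    double[] inbox = new double[n]; // sum of the messages delivered to each vertex

    for (int superstep = 0; superstep <= maxIterations; superstep++) {
      double[] nextInbox = new double[n];
      for (int v = 0; v < n; v++) {
        if (superstep == 0) {
          value[v] = 1.0 / n; // initial rank
        } else {
          value[v] = (1 - damping) / n + damping * inbox[v];
        }
        if (superstep < maxIterations) {
          // Same as sendMessageToNeighbors: share the rank equally among out-edges.
          for (int target : outEdges[v]) {
            nextInbox[target] += value[v] / outEdges[v].length;
          }
        }
      }
      inbox = nextInbox; // plays the role of the barrier: messages arrive next superstep
    }
    return value;
  }
}
```

On a three-vertex cycle every vertex keeps the value 1/3 by symmetry, and on any graph without dangling vertices the values keep summing to 1, which makes both handy sanity checks.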

References:

1. http://hama.apache.org/hama_bsp_tutorial.html

2. http://hama.apache.org/hama_graph_tutorial.html

When reposting, please keep this link: http://www.cnblogs.com/Deron/archive/2013/06/09/3128135.html