spark Graphx 之 Connected Components

一、Connected Components算法

  Connected Components即连通体算法用id标注图中每个连通体,将连通体中序号最小的顶点的id作为连通体的id。如果在图G中,任意2个顶点之间都存在路径,那么称G为连通图,否则称该图为非连通图,则其中的极大连通子图称为连通体,如下图所示,该图中有两个连通体:

二、示例

followers.txt (起点id,终点id)

4 1
1 2
6 3
7 3
7 6
6 7
3 7

users.txt (id,first name,full name)

1,BarackObama,Barack Obama
2,ladygaga,Goddess of Love
3,jeresig,John Resig
4,justinbieber,Justin Bieber
6,matei_zaharia,Matei Zaharia
7,odersky,Martin Odersky
8,anonsys
import org.apache.spark.graphx.{Graph, GraphLoader, VertexId, VertexRDD}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Connected_Components {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local")
    val sc: SparkContext = new SparkContext(conf)
    //读取followers.txt文件创建图
    val graph: Graph[Int, Int] = GraphLoader.edgeListFile(sc,"src/main/resources/connected/followers.txt")
    //计算连通体
    val components: Graph[VertexId, Int] = graph.connectedComponents()
    val vertices: VertexRDD[VertexId] = components.vertices
    /**
     * vertices:
     * (4,1)
     * (1,1)
     * (6,3)
     * (3,3)
     * (7,3)
     * (2,1)
     * 是一个tuple类型,key分别为所有的顶点id,value为key所在的连通体id(连通体中顶点id最小值)
     */
    //读取users.txt文件转化为(key,value)形式
    val users: RDD[(VertexId, String)] = sc.textFile("src/main/resources/connected/users.txt").map(line => {
      val fields: Array[String] = line.split(",")
      (fields(0).toLong, fields(1))
    })
    /**
     * users:
     * (1,BarackObama)
     * (2,ladygaga)
     * (3,jeresig)
     * (4,justinbieber)
     * (6,matei_zaharia)
     * (7,odersky)
     * (8,anonsys)
     */
    users.join(vertices).map{
      case(id,(username,vertices))=>(vertices,username)
    }.groupByKey().map(t=>{
      t._1+"->"+t._2.mkString(",")
    }).foreach(println(_))
    /**
     * 得到结果为:
     * 1->justinbieber,BarackObama,ladygaga
     * 3->matei_zaharia,jeresig,odersky
     */
  }
}

最终计算得到这个关系网络有两个社区。

原文地址:https://www.cnblogs.com/sabertobih/p/13801855.html