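GraphX Source Code Analysis (Graph Construction)

0. Graph Construction

The Graph object is the user's entry point for all operations and consists of two parts: edges and vertices. Every edge is defined by its two endpoint vertices, so the vertices that appear in the edges form the complete vertex set. Collected naively, that set contains duplicates; after deduplication it becomes the VertexRDD.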
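The deduplication idea can be sketched in plain Scala without Spark. This is only an illustration: the in-memory Seq stands in for an RDD, and the names sampleEdges, allEndpoints and vertexIds are made up for this example.

```scala
// Hypothetical in-memory edge list standing in for an RDD of (srcId, dstId) pairs.
val sampleEdges: Seq[(Long, Long)] =
  Seq((0L, 1L), (2L, 3L), (4L, 1L), (5L, 1L), (8L, 2L), (3L, 5L))

// Every endpoint of every edge, duplicates included.
val allEndpoints: Seq[Long] = sampleEdges.flatMap { case (src, dst) => Seq(src, dst) }

// Deduplicated (and sorted for readability): the vertex set a VertexRDD would represent.
val vertexIds: Seq[Long] = allEndpoints.distinct.sorted
```

Six edges contribute twelve endpoints, but only seven distinct vertex IDs remain after deduplication.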

1. Methods for Building a Graph

Building from a collection of edges (Graph.fromEdges)

def fromEdges[VD: ClassTag, ED: ClassTag](
      edges: RDD[Edge[ED]],
      defaultValue: VD,
      edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
      vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, ED]

Building from tuples of source and destination vertex IDs (Graph.fromEdgeTuples)

  def fromEdgeTuples[VD: ClassTag](
      rawEdges: RDD[(VertexId, VertexId)],
      defaultValue: VD,
      uniqueEdges: Option[PartitionStrategy] = None,
      edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
      vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, Int]

Building from RDDs of vertices and edges that carry attributes (Graph(), i.e. Graph.apply)

  def apply[VD: ClassTag, ED: ClassTag](
      vertices: RDD[(VertexId, VD)],
      edges: RDD[Edge[ED]],
      defaultVertexAttr: VD = null.asInstanceOf[VD],
      edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
      vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, ED]

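All three methods end up calling an apply() overload on the GraphImpl companion object, which mainly builds the edgeRDD and the vertexRDD; the vertexRDD is built on top of the edgeRDD. The overload that takes explicit vertices is shown below.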

  def apply[VD: ClassTag, ED: ClassTag](
      vertices: RDD[(VertexId, VD)],
      edges: RDD[Edge[ED]],
      defaultVertexAttr: VD,
      edgeStorageLevel: StorageLevel,
      vertexStorageLevel: StorageLevel): GraphImpl[VD, ED] = {
    val edgeRDD = EdgeRDD.fromEdges(edges)(classTag[ED], classTag[VD])
      .withTargetStorageLevel(edgeStorageLevel)
    val vertexRDD = VertexRDD(vertices, edgeRDD, defaultVertexAttr)
      .withTargetStorageLevel(vertexStorageLevel)
    GraphImpl(vertexRDD, edgeRDD)
  }

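2. Building the EdgeRDD

2.1 Loading a Text File from HDFS

Load the text from the distributed file system (HDFS) and process it line by line into tuples of the form (srcId, dstId).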

    val rawEdgesRdd: RDD[(Long, Long)] = sc.textFile(input, partNum).repartition(partNum).map { line =>
      val sd = line.split(",")
      val src = sd(0).toLong
      val dst = sd(1).toLong
      (src, dst)  // return the tuple; without this line the block would evaluate to Unit
    }.distinct()

The data format is as follows:

0,1
2,3
4,1
5,1
8,2
3,5
...
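The per-line parsing can be tried out without a cluster. The sketch below applies the same split-and-convert logic to an in-memory list; the helper name parseEdgeLine and the sample list (with one deliberately repeated line) are assumptions made for illustration.

```scala
// Parse one "src,dst" line into a tuple; the helper name is hypothetical.
def parseEdgeLine(line: String): (Long, Long) = {
  val fields = line.split(",")
  (fields(0).trim.toLong, fields(1).trim.toLong)
}

// The sample lines from above, plus a repeated "0,1" to show deduplication.
val sampleLines: Seq[String] = Seq("0,1", "2,3", "4,1", "5,1", "8,2", "3,5", "0,1")

// distinct drops the repeated line, mirroring the distinct() call on the RDD.
val parsedEdges: Seq[(Long, Long)] = sampleLines.map(parseEdgeLine).distinct
```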

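2.2 Detailed Construction Process

Step 1: Graph.fromEdges(edges)

The construction of the EdgeRDD starts from an already built RDD[Edge[ED]]. Edge is defined in Edge.scala and stores the three fields of an edge: srcId, dstId and attr.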

case class Edge[@specialized(Char, Int, Boolean, Byte, Long, Float, Double) ED] (
    var srcId: VertexId = 0,
    var dstId: VertexId = 0,
    var attr: ED = null.asInstanceOf[ED])
  extends Serializable

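Step 2: EdgeRDD.fromEdges(edges)

Iterate over every partition of the RDD[Edge[ED]] and rebuild the way the edges are stored.

Step 3: EdgePartitionBuilder[ED, VD]

EdgePartitionBuilder builds the physical storage structure of the edges (an EdgePartition). The relationships between the individual storage structures are shown in the following diagram:

(Erratum: in the localDstIds table, the local value in the last row is given as 4; it should be 5.)

The source code is as follows: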

private[graphx]
class EdgePartitionBuilder[@specialized(Long, Int, Double) ED: ClassTag, VD: ClassTag](
    size: Int = 64) {
  private[this] val edges = new PrimitiveVector[Edge[ED]](size)

  /* Add one edge to the builder. */
  def add(src: VertexId, dst: VertexId, d: ED) {
    edges += Edge(src, dst, d)
  }

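  // After all add calls have finished, toEdgePartition below is called to produce the EdgePartition.
  // What follows is the storage layout GraphX uses for graph data inside a partition.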
  def toEdgePartition: EdgePartition[ED, VD] = {
    val edgeArray = edges.trim().array
    new Sorter(Edge.edgeArraySortDataFormat[ED])
      .sort(edgeArray, 0, edgeArray.length, Edge.lexicographicOrdering) // sort the edges: by source ID first, then by destination ID
    val localSrcIds = new Array[Int](edgeArray.size)
    val localDstIds = new Array[Int](edgeArray.size)
    val data = new Array[ED](edgeArray.size)  // edge attributes (e.g. weights)
    val index = new GraphXPrimitiveKeyOpenHashMap[VertexId, Int]  // srcId -> offset of the first edge with that srcId
    val global2local = new GraphXPrimitiveKeyOpenHashMap[VertexId, Int]
    val local2global = new PrimitiveVector[VertexId] // global IDs of all source and destination vertices, by local index
    var vertexAttrs = Array.empty[VD]  // vertex attributes

    // Copy edges into columnar structures, tracking the beginnings of source vertex id clusters and
    // adding them to the index. Also populate a map from vertex id to a sequential local offset.

    // Build the edge structures
    if (edgeArray.length > 0) {
      index.update(edgeArray(0).srcId, 0)
      var currSrcId: VertexId = edgeArray(0).srcId
      var currLocalId = -1
      var i = 0
      while (i < edgeArray.size) {
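        val srcId = edgeArray(i).srcId  // source ID of the i-th edge
        val dstId = edgeArray(i).dstId  // destination ID of the i-th edge

        // Local indices are assigned in increasing order.
        // changeValue: if srcId is not yet a key, the block in braces runs and the new
        // currLocalId becomes its value in global2local; otherwise the stored value is returned.
        // local2global records each vertex ID only once.
        // localSrcIds stores the source's index from global2local, i.e. its currLocalId.
        localSrcIds(i) = global2local.changeValue(srcId,
          { currLocalId += 1; local2global += srcId; currLocalId }, identity) // identity: keep the existing value

        // Same for the destination: store dstId with a fresh currLocalId in global2local
        // if it is new, and record the resulting local index in localDstIds.
        localDstIds(i) = global2local.changeValue(dstId,
          { currLocalId += 1; local2global += dstId; currLocalId }, identity)

        data(i) = edgeArray(i).attr  // attribute of the i-th edge

        // index records the offset at which each source ID first appears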
        if (srcId != currSrcId) {
          currSrcId = srcId
          index.update(currSrcId, i)
        }

        i += 1
      }
      vertexAttrs = new Array[VD](currLocalId + 1)
    }

    new EdgePartition(
      localSrcIds, localDstIds, data, index, global2local, local2global.trim().array, vertexAttrs,
      None)
  }
}

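Step 4: toEdgePartition

Within each partition the edges are sorted, first by source vertex and then by destination vertex: new Sorter(Edge.edgeArraySortDataFormat[ED]).sort(edgeArray, 0, edgeArray.length, Edge.lexicographicOrdering). (Spark's Sorter is a TimSort wrapper rather than a quicksort.) The edges are sorted because they are stored in arrays, which occupy contiguous memory and are faster to access sequentially; sorting also places all edges with the same srcId next to each other, which is what allows index to record only the first offset per source.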
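The ordering itself is easy to reproduce on a small list. This sketch uses a simplified stand-in for Edge (SimpleEdge is a hypothetical name, with the attribute fixed to Int) and the standard library's sortBy rather than Spark's Sorter; only the resulting order is meant to match.

```scala
// Simplified stand-in for GraphX's Edge; the attribute type is fixed to Int here.
final case class SimpleEdge(srcId: Long, dstId: Long, attr: Int)

val unsortedEdges: Seq[SimpleEdge] = Seq(
  SimpleEdge(4L, 1L, 10),
  SimpleEdge(0L, 1L, 20),
  SimpleEdge(4L, 0L, 30),
  SimpleEdge(2L, 3L, 40))

// Lexicographic order: compare srcId first, and dstId only to break ties.
val sortedEdges: Seq[SimpleEdge] = unsortedEdges.sortBy(e => (e.srcId, e.dstId))
```

After sorting, the two edges leaving vertex 4 sit next to each other, with (4, 0) ahead of (4, 1).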

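The internal storage consists of the following 7 data structures, introduced from the simplest to the most involved.
(1) data: an array holding the attr of every edge in the current partition.

(2) vertexAttrs: the array for vertex attributes; toEdgePartition only allocates it (one slot per local vertex), so it holds no attribute values yet.

(3) index: maps each distinct srcId to the offset of its first occurrence in the sorted edge arrays.

(4, 5) localSrcIds/localDstIds: the local indices returned by global2local.changeValue(). The actual vertex ID is called the global ID, and its index within the partition is called the local ID.

(6) global2local: a GraphXPrimitiveKeyOpenHashMap, a map implementation private to Spark, that stores the mapping from VertexId to local index. It covers every srcId and dstId in the current partition.

(7) local2global: an array of the VertexIds in the partition, ordered by local index. Each ID is appended only once, when it is first seen, so the array gives the actual global ID for every local index.
Usage:
1. From a local index to a VertexId
  localSrcIds/localDstIds -> local index -> local2global -> VertexId
2. From a VertexId to a local index and on to an attribute
  VertexId -> global2local -> local index -> vertexAttrs -> vertex attr
  srcId -> index -> edge offset -> data -> edge attr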
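To make the relationships between these structures concrete, here is a simplified model of the construction loop in plain Scala. It is a sketch under several assumptions: standard mutable collections replace GraphXPrimitiveKeyOpenHashMap and PrimitiveVector, the attribute type is fixed to Int, vertexAttrs is left out, and the names ColumnarEdges, buildColumnar and localIdOf are invented for this example.

```scala
import scala.collection.mutable

// Simplified model of the columnar layout; not the real EdgePartition class.
final case class ColumnarEdges(
    localSrcIds: Array[Int],
    localDstIds: Array[Int],
    data: Array[Int],
    index: Map[Long, Int],
    global2local: Map[Long, Int],
    local2global: Array[Long])

def buildColumnar(edges: Seq[(Long, Long, Int)]): ColumnarEdges = {
  // Sort by source ID, then destination ID, as toEdgePartition does.
  val sorted = edges.sortBy(e => (e._1, e._2)).toVector
  val localSrcIds = new Array[Int](sorted.size)
  val localDstIds = new Array[Int](sorted.size)
  val data = new Array[Int](sorted.size)
  val index = mutable.LinkedHashMap.empty[Long, Int]
  val global2local = mutable.LinkedHashMap.empty[Long, Int]
  val local2global = mutable.ArrayBuffer.empty[Long]

  // Assign the next local index the first time a vertex ID is seen;
  // later lookups return the stored index (the role of changeValue + identity).
  def localIdOf(vid: Long): Int =
    global2local.getOrElseUpdate(vid, { local2global += vid; local2global.size - 1 })

  var i = 0
  while (i < sorted.size) {
    val (srcId, dstId, attr) = sorted(i)
    localSrcIds(i) = localIdOf(srcId)
    localDstIds(i) = localIdOf(dstId)
    data(i) = attr
    // Record only the first offset of each source ID.
    if (!index.contains(srcId)) index(srcId) = i
    i += 1
  }
  ColumnarEdges(localSrcIds, localDstIds, data, index.toMap, global2local.toMap, local2global.toArray)
}

// The sample edges from section 2.1, with made-up attribute values.
val partition: ColumnarEdges = buildColumnar(Seq(
  (0L, 1L, 10), (2L, 3L, 20), (4L, 1L, 30), (5L, 1L, 40), (8L, 2L, 50), (3L, 5L, 60)))
```

With this input, partition.local2global(partition.localDstIds(2)) recovers the global destination ID of the third sorted edge, and partition.data(partition.index(4L)) returns the attribute of the first edge whose source is vertex 4.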

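3. Building the VertexRDD

Step 1: VertexRDD.fromEdges()

The entry point for building the VertexRDD is: val vertices = VertexRDD.fromEdges(edgesCached, edgesCached.partitions.size, defaultVertexAttr).withTargetStorageLevel(vertexStorageLevel). The vertices are built on the basis of EdgeRDD[ED, VD]. So that edges can be found from a vertex, each vertex has to keep information about the edges it belongs to, namely their partition IDs (pid); this information is stored in the routing table, RoutingTablePartition.

The physical storage structure is shown below:

Step 2: RoutingTablePartition.edgePartitionToMsgs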

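This method returns an iterator of RoutingTableMessage, which is a tuple of a vertex ID and an Int: (VertexId, Int). To save memory, the edgePartitionId and a flag are packed into a single 32-bit Int. The top two bits (bits 31-30, counting from 0) hold the flag, 01: isSrcId, 10: isDstId; the low 30 bits (bits 29-0) hold the edge partition ID.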
```
val vid2pid = edges.partitionsRDD.mapPartitions(_.flatMap(Function.tupled(RoutingTablePartition.edgePartitionToMsgs))).setName("VertexRDD.createRoutingTables - vid2pid (aggregation)")
```
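The packing of the flag and the partition ID into one Int can be sketched as follows. The function and constant names (packMessage, pidOf, flagOf, SrcFlag, DstFlag) are hypothetical and chosen for this example; only the bit layout, a 2-bit flag above a 30-bit partition ID, follows the description above.

```scala
val SrcFlag: Int = 0x1  // 01: the vertex appears as a source
val DstFlag: Int = 0x2  // 10: the vertex appears as a destination

// Pack the 2-bit flag into bits 31-30 and the partition ID into bits 29-0.
def packMessage(vid: Long, pid: Int, flag: Int): (Long, Int) =
  (vid, (flag << 30) | (pid & 0x3FFFFFFF))

// Mask off the flag bits to recover the partition ID.
def pidOf(msg: (Long, Int)): Int = msg._2 & 0x3FFFFFFF

// Unsigned shift so that a set top bit does not turn the flag negative.
def flagOf(msg: (Long, Int)): Int = msg._2 >>> 30

val sampleMessage: (Long, Int) = packMessage(7L, 5, DstFlag)
```

A consequence of this layout is that the partition ID has to fit in 30 bits.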
Step 3: RoutingTablePartition.fromMsgs

(1) Repartition the routing table messages generated above; the number of partitions stays the same as the number of edge partitions.
```
val numEdgePartitions = edges.partitions.size
vid2pid.partitionBy(vertexPartitioner).mapPartitions(iter => Iterator(RoutingTablePartition.fromMsgs(numEdgePartitions, iter)), preservesPartitioning = true)
```
(2) In the new partitions, mapPartitions decodes each RoutingTableMessage into vid, edge pid and isSrcId/isDstId. These three items are then repackaged into three data structures: pid2vid, srcFlags and dstFlags.
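A rough model of this regrouping, assuming the messages have already been decoded into (vid, pid, flag) triples and using ordinary Scala collections in place of Spark's primitive vectors and bit sets. RoutingSketch and groupMessages are hypothetical names for this illustration.

```scala
import scala.collection.mutable.ArrayBuffer

// One entry per edge partition: the vertex IDs routed to it and their role flags.
final case class RoutingSketch(
    pid2vid: Vector[Vector[Long]],
    srcFlags: Vector[Vector[Boolean]],
    dstFlags: Vector[Vector[Boolean]])

def groupMessages(numEdgePartitions: Int, msgs: Seq[(Long, Int, Int)]): RoutingSketch = {
  val pid2vid = Vector.fill(numEdgePartitions)(ArrayBuffer.empty[Long])
  val srcFlags = Vector.fill(numEdgePartitions)(ArrayBuffer.empty[Boolean])
  val dstFlags = Vector.fill(numEdgePartitions)(ArrayBuffer.empty[Boolean])
  for ((vid, pid, flag) <- msgs) {
    pid2vid(pid) += vid
    srcFlags(pid) += ((flag & 0x1) != 0)  // low bit set: vertex is a source in that partition
    dstFlags(pid) += ((flag & 0x2) != 0)  // high bit set: vertex is a destination there
  }
  RoutingSketch(pid2vid.map(_.toVector), srcFlags.map(_.toVector), dstFlags.map(_.toVector))
}

// Four made-up messages spread over two edge partitions.
val routing: RoutingSketch =
  groupMessages(2, Seq((1L, 0, 0x2), (5L, 0, 0x1), (5L, 1, 0x2), (3L, 1, 0x3)))
```

Here vertex 5 is listed under both edge partitions, as a source in partition 0 and as a destination in partition 1.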

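(3) ShippableVertexPartition

Based on the routingTables above, the routing table data is repackaged as a ShippableVertexPartition. ShippableVertexPartition merges the attr objects of duplicate vertices and fills in missing attr objects. The resulting object is ShippableVertexPartition(map.keySet, map._values, map.keySet.getBitSet, routingTable), which comprises the key set, the values and the routingTable.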
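The merge-then-fill behaviour described here can be illustrated with immutable maps. This is a minimal sketch, not the actual implementation: buildVertexAttrs and its parameters are hypothetical, and the merge function in the usage line (string concatenation) is chosen only to make the merge visible.

```scala
def buildVertexAttrs[VD](
    routedIds: Seq[Long],            // vertex IDs known from the routing table
    explicitAttrs: Seq[(Long, VD)],  // user-supplied (id, attr) pairs, possibly with duplicates
    defaultAttr: VD,
    merge: (VD, VD) => VD): Map[Long, VD] = {
  // Merge the attributes of vertices that were supplied more than once.
  val merged: Map[Long, VD] =
    explicitAttrs.groupBy(_._1).map { case (vid, pairs) => vid -> pairs.map(_._2).reduce(merge) }
  // Fill in the default for vertices that only appear in edges.
  routedIds.foldLeft(merged) { (acc, vid) =>
    if (acc.contains(vid)) acc else acc.updated(vid, defaultAttr)
  }
}

val vertexAttrMap: Map[Long, String] = buildVertexAttrs[String](
  Seq(1L, 2L, 3L, 5L),
  Seq((1L, "a"), (1L, "b"), (2L, "c")),
  "?",
  (a, b) => a + b)
```

Vertex 1 ends up with the merged attribute, while vertices 3 and 5, which had no explicit attribute, receive the default.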

(4) new VertexRDDImpl()
Once this object has been created, the VertexRDD is complete.

4. Generating the Graph Object

The edgeRDD and vertexRDD built above are combined into the Graph:

new GraphImpl(vertices, new ReplicatedVertexView(edges.asInstanceOf[EdgeRDDImpl[ED, VD]]))

[End]
