Spark Operations: Transformations (Part 2)

  • Basic transformations

  • Key-value transformations

 

Basic transformations

  • mapPartitions[U](f:(Iterator[T]) => Iterator[U], preservesPartitioning: Boolean=false): RDD[U]

    mapPartitions is similar to map, except that the mapping function takes an iterator over each partition of the RDD rather than each individual element. In other words, map applies the mapping function to every element of the RDD, while mapPartitions applies it to the elements of each partition, effectively grouping the work by partition. The preservesPartitioning parameter indicates whether the partitioning information of the parent RDD is preserved.

    This operation suits mappings that would otherwise create extra objects over and over, and in such cases it is more efficient than map. For example, when writing the elements of an RDD to a database over a JDBC connection, map would have to open a connection for every element, which is very expensive, whereas mapPartitions only needs one connection per partition.
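
    As an illustrative sketch of that scenario (createConnection and writeRow are hypothetical placeholder helpers, not a real API), a single connection can serve a whole partition:

// Hypothetical sketch: one JDBC connection per partition instead of one per element
val written = rdd.mapPartitions { iter =>
  val conn = createConnection()                                       // hypothetical helper: open one connection
  val rows = iter.map { elem => writeRow(conn, elem); elem }.toList   // write and materialize before closing
  conn.close()
  rows.iterator
}

    The .toList forces the writes to run before the connection is closed; because the iterator is lazy, returning it unmaterialized would close the connection too early.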

scala> var rdd = sc.parallelize(1 to 10, 4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[52] at parallelize at <console>:24

# Sum the elements within each partition of rdd
scala> var rdd2 = rdd.mapPartitions(x => {var result = List[Int](); var i=0; while(x.hasNext){i+=x.next;}; result.::(i).iterator})
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[68] at mapPartitions at <console>:25

scala> rdd2.collect
res44: Array[Int] = Array(3, 12, 13, 27)
  • mapPartitionsWithIndex[U](f:(Int, Iterator[T]) => Iterator[U], preservesPartitioning: Boolean=false):RDD[U]

This operation is similar to mapPartitions, except that the mapping function takes an extra parameter: the partition index.

scala> var rdd = sc.parallelize(1 to 10, 4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[52] at parallelize at <console>:24

# index is the partition index; sum the elements within each partition of rdd and prepend the partition index to the result
scala> var rdd2 = rdd.mapPartitionsWithIndex((index,x) => {var result = List[String](); var i=0; while(x.hasNext){i+=x.next;}; result.::(index + "|" + i).iterator})
rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[69] at mapPartitionsWithIndex at <console>:25

scala> rdd2.collect
res48: Array[String] = Array(0|3, 1|12, 2|13, 3|27)
  • zip[U](other:RDD[U]):RDD[(T,U)]

zip combines two RDDs into an RDD of key/value pairs. The two RDDs are required to have the same number of partitions and the same number of elements; otherwise an error is thrown.

scala> var rdd1 = sc.makeRDD(1 to 3, 2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[70] at makeRDD at <console>:24

scala> var rdd2 = sc.makeRDD(4 to 6, 2)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[71] at makeRDD at <console>:24

scala> var rdd3 = sc.makeRDD(7 to 9, 1)
rdd3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[72] at makeRDD at <console>:24

scala> rdd1.zip(rdd2).collect
res49: Array[(Int, Int)] = Array((1,4), (2,5), (3,6))

scala> rdd1.zip(rdd3).collect
java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions: List(2, 1)
  at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
  at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
  ... 49 elided
  • zipPartitions[B,V](rdd2:RDD[B])(f:(Iterator[T], Iterator[B]) => Iterator[V]):RDD[V]

  • zipPartitions[B,V](rdd2:RDD[B], preservesPartitioning:Boolean)(f:(Iterator[T], Iterator[B]) => Iterator[V]):RDD[V]

  • zipPartitions[B,C,V](rdd2:RDD[B], rdd3:RDD[C])(f:(Iterator[T], Iterator[B], Iterator[C]) => Iterator[V]):RDD[V]

  • zipPartitions[B,C,V](rdd2:RDD[B], rdd3:RDD[C], preservesPartitioning:Boolean)(f:(Iterator[T], Iterator[B], Iterator[C]) => Iterator[V]):RDD[V]

  • zipPartitions[B,C,D,V](rdd2:RDD[B], rdd3:RDD[C], rdd4:RDD[D])(f:(Iterator[T], Iterator[B], Iterator[C], Iterator[D]) => Iterator[V]):RDD[V]

  • zipPartitions[B,C,D,V](rdd2:RDD[B], rdd3:RDD[C], rdd4:RDD[D], preservesPartitioning:Boolean)(f:(Iterator[T], Iterator[B], Iterator[C], Iterator[D]) => Iterator[V]):RDD[V]

zipPartitions combines multiple RDDs into a new RDD partition by partition. The RDDs being combined must have the same number of partitions, but the number of elements within each partition does not have to match.
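
A minimal sketch (reusing rdd1 and rdd2 from the zip example above, both with 2 partitions): the function receives one iterator per RDD for each partition.

# Combine the two RDDs partition by partition, emitting each partition's two sums joined by "|"
scala> rdd1.zipPartitions(rdd2)((it1, it2) => Iterator(s"${it1.sum}|${it2.sum}")).collect

With the range partitioning shown earlier (partition 0 holds 1 and 4, partition 1 holds 2, 3 and 5, 6), this should return Array(1|4, 5|11).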

  • zipWithIndex(): RDD[(T, Long)]

zipWithIndex pairs each element of the RDD with that element's index within the RDD, producing key/value pairs.

scala> rdd1.zipWithIndex().collect
res51: Array[(Int, Long)] = Array((1,0), (2,1), (3,2))
  • zipWithUniqueId(): RDD[(T, Long)]

zipWithUniqueId pairs each element of the RDD with a unique ID, computed as follows:

  1. The ID of the first element in each partition is the partition's index.

  2. The ID of the Nth element in each partition is the previous element's unique ID plus the number of partitions in the RDD.

scala> rdd1.zipWithUniqueId().collect
res52: Array[(Int, Long)] = Array((1,0), (2,1), (3,3))
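
Here rdd1 has 2 partitions: partition 0 holds 1 and partition 1 holds 2 and 3, so 1 gets ID 0 (its partition index), 2 gets ID 1, and 3 gets ID 1 + 2 = 3 (the previous element's ID plus the partition count).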

 

 
