- Basic transformation operations
- Key/Value transformation operations

Basic transformation operations

- mapPartitions[U](f: (Iterator[T]) => Iterator[U], preservesPartitioning: Boolean = false): RDD[U]
mapPartitions is similar to map, except that the mapping function receives an iterator over each partition of the RDD instead of each individual element: where map applies the function to every element of the RDD, mapPartitions applies it once per partition, effectively grouping the elements by partition. The preservesPartitioning parameter indicates whether the parent RDD's partitioning information should be preserved.
This operation suits mappings that need to create expensive auxiliary objects, where it is more efficient than map. For example, when writing the elements of an RDD to a database over JDBC, map would have to open one connection per element, which is costly, whereas mapPartitions opens only one connection per partition.
scala> var rdd = sc.parallelize(1 to 10, 4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[52] at parallelize at <console>:24
# Sum the elements within each partition of the RDD
scala> var rdd2 = rdd.mapPartitions(x => {var result = List[Int](); var i=0; while(x.hasNext){i+=x.next;}; result.::(i).iterator})
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[68] at mapPartitions at <console>:25
scala> rdd2.collect
res44: Array[Int] = Array(3, 12, 13, 27)
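The connection-saving argument can be illustrated without a cluster. In the sketch below (names hypothetical), a counter stands in for real JDBC connection setup, and nested lists stand in for an RDD's partitions; only the number of times setup runs differs between the two strategies.

```scala
// Hypothetical stand-in for JDBC connection setup: we only count how many
// times a connection would be opened under each strategy.
def countConnections(partitions: List[List[Int]], perPartition: Boolean): Int = {
  var opened = 0
  def openConnection(): Unit = opened += 1     // pretend connection setup
  if (perPartition)
    // mapPartitions-style: one connection per partition, reused for its elements
    partitions.foreach { part => openConnection(); part.foreach(_ => ()) }
  else
    // map-style: one connection per element
    partitions.foreach(_.foreach(_ => openConnection()))
  opened
}

val parts = (1 to 10).toList.grouped(3).toList // 4 groups standing in for 4 partitions
// countConnections(parts, perPartition = false) == 10  (map-style)
// countConnections(parts, perPartition = true)  == 4   (mapPartitions-style)
```

With 10 elements in 4 partitions, the per-element strategy opens 10 connections and the per-partition strategy only 4; the gap widens as partitions grow.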
- mapPartitionsWithIndex[U](f: (Int, Iterator[T]) => Iterator[U], preservesPartitioning: Boolean = false): RDD[U]
This operation is similar to mapPartitions, except that the mapping function takes an additional partition-index argument.
scala> var rdd = sc.parallelize(1 to 10, 4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[52] at parallelize at <console>:24
# index is the partition index; sum the elements within each partition and prepend the partition index to the result
scala> var rdd2 = rdd.mapPartitionsWithIndex((index,x) => {var result = List[String](); var i=0; while(x.hasNext){i+=x.next;}; result.::(index + "|" + i).iterator})
rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[69] at mapPartitionsWithIndex at <console>:25
scala> rdd2.collect
res48: Array[String] = Array(0|3, 1|12, 2|13, 3|27)
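The partition boundaries behind the output above can be reproduced with plain Scala. The sketch below assumes the position-based slicing that parallelize applies to a range (slice i spans positions i*length/numSlices until (i+1)*length/numSlices, a simplification of Spark's actual slicing code) and rebuilds the same "index|sum" strings:

```scala
// Split a sequence into numSlices contiguous slices, the way parallelize
// distributes a range across partitions (assumed scheme, see lead-in).
def slices(data: Seq[Int], numSlices: Int): Seq[Seq[Int]] =
  (0 until numSlices).map { i =>
    val start = i * data.length / numSlices
    val end = (i + 1) * data.length / numSlices
    data.slice(start, end)
  }

// Rebuild the per-partition "index|sum" strings from the example above.
val labeled = slices(1 to 10, 4).zipWithIndex.map { case (part, idx) => s"$idx|${part.sum}" }
// labeled == Vector("0|3", "1|12", "2|13", "3|27")
```

This makes the grouping explicit: 1 to 10 over 4 partitions lands as (1,2), (3,4,5), (6,7), (8,9,10), whose sums are exactly 3, 12, 13, 27.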
- zip[U](other: RDD[U]): RDD[(T, U)]
zip combines two RDDs into an RDD of key/value pairs. Both RDDs must have the same number of partitions and the same number of elements in each partition; otherwise an error is thrown.
scala> var rdd1 = sc.makeRDD(1 to 3, 2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[70] at makeRDD at <console>:24
scala> var rdd2 = sc.makeRDD(4 to 6, 2)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[71] at makeRDD at <console>:24
scala> var rdd3 = sc.makeRDD(7 to 9, 1)
rdd3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[72] at makeRDD at <console>:24
scala> rdd1.zip(rdd2).collect
res49: Array[(Int, Int)] = Array((1,4), (2,5), (3,6))
scala> rdd1.zip(rdd3).collect
java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions: List(2, 1)
at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
... 49 elided
- zipPartitions[B,V](rdd2: RDD[B])(f: (Iterator[T], Iterator[B]) => Iterator[V]): RDD[V]
- zipPartitions[B,V](rdd2: RDD[B], preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B]) => Iterator[V]): RDD[V]
- zipPartitions[B,C,V](rdd2: RDD[B], rdd3: RDD[C])(f: (Iterator[T], Iterator[B], Iterator[C]) => Iterator[V]): RDD[V]
- zipPartitions[B,C,V](rdd2: RDD[B], rdd3: RDD[C], preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B], Iterator[C]) => Iterator[V]): RDD[V]
- zipPartitions[B,C,D,V](rdd2: RDD[B], rdd3: RDD[C], rdd4: RDD[D])(f: (Iterator[T], Iterator[B], Iterator[C], Iterator[D]) => Iterator[V]): RDD[V]
- zipPartitions[B,C,D,V](rdd2: RDD[B], rdd3: RDD[C], rdd4: RDD[D], preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B], Iterator[C], Iterator[D]) => Iterator[V]): RDD[V]
zipPartitions combines multiple RDDs partition by partition into a new RDD. The RDDs must have the same number of partitions, but the number of elements within each partition need not match.
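The partition-wise pairing can be sketched with plain Scala collections, no cluster needed. In the helper below (hypothetical, two-RDD case only), each inner List stands in for one partition; note that the partition counts must match while the element counts inside the partitions may differ:

```scala
// Plain-Scala sketch of what zipPartitions does for two RDDs: pair up
// partitions by position and apply f to each pair of partition iterators.
def zipPartitionsLike[A, B, V](
    p1: List[List[A]],
    p2: List[List[B]])(f: (Iterator[A], Iterator[B]) => Iterator[V]): List[V] = {
  require(p1.size == p2.size, "same number of partitions required")
  p1.zip(p2).flatMap { case (a, b) => f(a.iterator, b.iterator) }
}

val r1 = List(List(1, 2), List(3))     // 2 partitions, sizes 2 and 1
val r2 = List(List(10), List(20, 30))  // 2 partitions, sizes 1 and 2
zipPartitionsLike(r1, r2)((x, y) => Iterator(x.sum + y.sum))
// == List(13, 53): per-partition sums (1+2+10) and (3+20+30)
```

Unequal partition sizes (2 vs 1, 1 vs 2) cause no error here, unlike zip, because the function sees whole partition iterators rather than paired elements.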
- zipWithIndex(): RDD[(T, Long)]
zipWithIndex pairs each element of the RDD with that element's index within the RDD.
scala> rdd1.zipWithIndex().collect
res51: Array[(Int, Long)] = Array((1,0), (2,1), (3,2))
- zipWithUniqueId(): RDD[(T, Long)]
zipWithUniqueId pairs each element of the RDD with a unique Long ID, computed as follows:
- the ID of the first element in each partition is that partition's index
- the ID of the Nth element in a partition is the previous element's ID plus the RDD's partition count
scala> rdd1.zipWithUniqueId().collect
res52: Array[(Int, Long)] = Array((1,0), (2,1), (3,3))
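The two ID rules above collapse to the formula id = partitionIndex + position * numPartitions, which the sketch below applies with plain Scala; the assumed partition layout List(List(1), List(2, 3)) matches how 1 to 3 splits across rdd1's two partitions:

```scala
// Compute zipWithUniqueId-style IDs from the stated rules:
//   first element of partition p  -> p
//   each following element        -> previous ID + number of partitions
def uniqueIds[T](partitions: List[List[T]]): List[(T, Long)] = {
  val n = partitions.size
  partitions.zipWithIndex.flatMap { case (part, p) =>
    part.zipWithIndex.map { case (elem, i) => (elem, p + i.toLong * n) }
  }
}

uniqueIds(List(List(1), List(2, 3)))
// == List((1,0), (2,1), (3,3)) — the same pairs as the transcript above
```

This also shows why the IDs are unique but not contiguous: (1,0) comes from partition 0, while partition 1 yields (2,1) and then (3, 1 + 2) = (3,3), skipping 2.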