- Collection and scalar actions
- Storage actions
Collection and scalar actions
- first(): T Returns the first element of the RDD, without sorting.
- count(): Long Returns the number of elements in the RDD.
- reduce(f: (T, T) => T): T Reduces the elements with the binary function f.
- collect(): Array[T] Returns the RDD's elements as an array on the driver.
- take(num: Int): Array[T] Returns the elements at indices 0 through num-1, without sorting.
- top(num: Int): Array[T] Returns the first num elements of the RDD according to the default (descending) or a specified ordering.
- takeOrdered(num: Int): Array[T] Similar to top, except that the elements are returned in the opposite (ascending) order; a sketch of passing a custom Ordering follows the examples below.
scala> var rdd = sc.makeRDD(Array(("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 5), ("C", 6), ("C", 7), ("C", 8), ("C", 9), ("D", 10)))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[60] at makeRDD at <console>:24
scala> rdd.collect
res50: Array[(String, Int)] = Array((A,1), (A,2), (A,3), (B,4), (B,5), (C,6), (C,7), (C,8), (C,9), (D,10))
scala> rdd.count()
res46: Long = 10
scala> rdd.first()
res45: (String, Int) = (A,1)
scala> rdd.reduce((x, y) => (x._1 + y._1, x._2 + y._2))
res49: (String, Int) = (AACCABBCCD,55)
scala> rdd.take(2)
res51: Array[(String, Int)] = Array((A,1), (A,2))
scala> rdd.top(1)
res54: Array[(String, Int)] = Array((D,10))
scala> rdd.takeOrdered(1)
res56: Array[(String, Int)] = Array((A,1))
scala> rdd.takeOrdered(2)
res57: Array[(String, Int)] = Array((A,1), (A,2))
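Both top and takeOrdered also accept an implicit Ordering[T]; for pairs the default compares the key first and then the value. A minimal sketch, assuming the same pair rdd as above and an illustrative byValue ordering, of ranking by the Int value instead:
// rank pairs by their Int value rather than by the default key-first tuple ordering
val byValue = Ordering.by[(String, Int), Int](_._2)
rdd.top(2)(byValue)         // expected: Array((D,10), (C,9)), the two largest values
rdd.takeOrdered(2)(byValue) // expected: Array((A,1), (A,2)), the two smallest values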
- aggregate[U](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U)(implicit arg0: ClassTag[U]): U
Aggregates the elements of the RDD: seqOp first folds the T-type elements of each partition into a value of type U, and combOp then merges the per-partition U values into a single U. Note that both seqOp and combOp make use of zeroValue.
// Define an rdd with 2 partitions: the first contains 1,2,3,4,5 and the second contains 6,7,8,9,10
scala> var rdd = sc.makeRDD(1 to 10, 2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[65] at makeRDD at <console>:24
scala> rdd.mapPartitionsWithIndex{
     |   (partIdx, iter) => {
     |     var part_map = scala.collection.mutable.Map[String, List[Int]]()
     |     while (iter.hasNext) {
     |       var part_name = "part_" + partIdx
     |       var elem = iter.next()
     |       if (part_map.contains(part_name)) {
     |         var elems = part_map(part_name)
     |         elems ::= elem
     |         part_map(part_name) = elems
     |       } else {
     |         part_map(part_name) = List[Int](elem)
     |       }
     |     }
     |     part_map.iterator
     |   }
     | }.collect
res59: Array[(String, List[Int])] = Array((part_0,List(5, 4, 3, 2, 1)), (part_1,List(10, 9, 8, 7, 6)))
// The final result of aggregate is 58: within each partition (x: Int, y: Int) => x + y is applied first, starting from the zeroValue 1,
// i.e. part_0 computes 1+1+2+3+4+5 = 16 and part_1 computes 1+6+7+8+9+10 = 41;
// the two partition results are then merged with (a: Int, b: Int) => a + b, again applying zeroValue, giving 1+16+41 = 58
scala> rdd.aggregate(1)(
| {(x: Int, y: Int) => x + y},
| {(a: Int, b: Int) => a + b}
| )
res61: Int = 58
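Because seqOp has type (U, T) => U, the result type U need not match the element type T. A minimal sketch, assuming the same rdd of 1 to 10, that computes a (sum, count) pair in one pass; here the zeroValue (0, 0) is neutral, so the number of partitions does not change the result:
// accumulate a (sum, count) pair; U = (Int, Int) while T = Int
val (sum, cnt) = rdd.aggregate((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),  // seqOp: fold one element into the partition accumulator
  (a, b) => (a._1 + b._1, a._2 + b._2)   // combOp: merge the per-partition accumulators
)
// for 1 to 10 this yields sum = 55 and cnt = 10, so the mean is sum.toDouble / cnt = 5.5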
- fold(zeroValue: T)(op: (T, T) => T): T
fold works like aggregate, except that seqOp and combOp are the same function op.
scala> rdd.fold(1)(
| (x, y) => x + y
| )
res63: Int = 58
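As with aggregate, zeroValue is applied once in every partition and once more in the final merge, so a non-neutral zeroValue makes the result depend on the number of partitions. A small sketch, assuming the same 1 to 10 data, that illustrates this:
// with 2 partitions the value 1 is added 3 times (2 partitions + 1 merge): 55 + 3 = 58
sc.makeRDD(1 to 10, 2).fold(1)(_ + _)
// with 5 partitions it is added 6 times: 55 + 6 = 61, so 0 is usually the safer zeroValue for a sum
sc.makeRDD(1 to 10, 5).fold(1)(_ + _)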
- lookup(key: K): Seq[V]
Applies to RDDs of (K, V) pairs; returns all V values associated with the given key K.
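A minimal lookup sketch, reusing the pair data from the beginning of this section under an illustrative name pairRdd:
val pairRdd = sc.makeRDD(Array(("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 5),
                               ("C", 6), ("C", 7), ("C", 8), ("C", 9), ("D", 10)))
pairRdd.lookup("A")   // expected: Seq(1, 2, 3), every value stored under key "A"
pairRdd.lookup("D")   // expected: Seq(10)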
- countByKey(): Map[K, Long]
Counts the number of elements for each key K in an RDD[(K, V)].
scala> var rdd = sc.makeRDD(Array(("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 5), ("C", 6), ("C", 7), ("C", 8), ("C", 9), ("D", 10)))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[67] at makeRDD at <console>:24
scala> rdd.countByKey()
res65: scala.collection.Map[String,Long] = Map(D -> 1, A -> 3, B -> 2, C -> 4)
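countByKey collects the counts to the driver as a Map, so it is only suitable when the number of distinct keys is small. A hedged sketch of an equivalent computation that keeps the counts distributed:
// same counts, but left as an RDD instead of a driver-side Map
rdd.mapValues(_ => 1L).reduceByKey(_ + _).collect
// expected: Array((D,1), (A,3), (B,2), (C,4)) in some partition-dependent order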
- foreach(f: (T) => Unit): Unit
- foreachPartition(f: (Iterator[T]) => Unit): Unit
foreach traverses every element of the RDD and applies the function f to it. foreachPartition is similar, except that it is invoked once per partition with an iterator over that partition's elements; a sketch follows the example below.
scala> var rdd = sc.makeRDD(Array(("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 5), ("C", 6), ("C", 7), ("C", 8), ("C", 9), ("D", 10)))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[67] at makeRDD at <console>:24
scala> rdd.foreach(println)
(A,1)
(A,3)
(C,8)
(C,6)
(C,9)
(B,4)
(A,2)
(B,5)
(D,10)
(C,7)
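For comparison, a minimal foreachPartition sketch; in local mode the println output appears on the driver, while on a cluster it goes to each executor's stdout:
rdd.foreachPartition { iter =>
  // f receives an Iterator over one whole partition rather than a single element
  val elems = iter.toList
  println(s"partition with ${elems.size} elements, value sum = ${elems.map(_._2).sum}")
}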
- sortBy[K](f: (T) => K, ascending: Boolean = true, numPartitions: Int = this.partitions.length): RDD[T]
Sorts the RDD's elements by the key K that the function f computes from each element, ascending by default.
scala> var rdd = sc.makeRDD(Array(("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 5), ("C", 6), ("C", 7), ("C", 8), ("C", 9), ("D", 10)))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[67] at makeRDD at <console>:24
scala> rdd.sortBy(x => x).collect
res68: Array[(String, Int)] = Array((A,1), (A,2), (A,3), (B,4), (B,5), (C,6), (C,7), (C,8), (C,9), (D,10))
scala> rdd.sortBy(x => x, false).collect
res70: Array[(String, Int)] = Array((D,10), (C,9), (C,8), (C,7), (C,6), (B,5), (B,4), (A,3), (A,2), (A,1))
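The key function f can extract any orderable key, not just the element itself. A small sketch, using the same rdd, that sorts by the Int value in descending order:
rdd.sortBy(x => x._2, ascending = false).collect
// expected: Array((D,10), (C,9), (C,8), (C,7), (C,6), (B,5), (B,4), (A,3), (A,2), (A,1))
// the optional numPartitions argument only controls how the sorted result is partitioned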