Spark RDD Operators

RDD Operators in Practice

Transformation Operators

map(function)

Applies the given function to every element of an RDD[T] and returns the results as a new RDD[U]. def map[U](f: T => U): org.apache.spark.rdd.RDD[U]

scala>  sc.parallelize(List(1,2,3,4,5),3).map(item => item*2+" " )
res1: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at map at <console>:25

scala>  sc.parallelize(List(1,2,3,4,5),3).map(item => item*2+" " ).collect
res2: Array[String] = Array("2 ", "4 ", "6 ", "8 ", "10 ")

filter(func)

Returns a new RDD containing only the elements for which the predicate returns true. def filter(f: T => Boolean): org.apache.spark.rdd.RDD[T]

scala>  sc.parallelize(List(1,2,3,4,5),3).filter(item=> item%2==0).collect
res3: Array[Int] = Array(2, 4)

flatMap(func)

Maps each element to a collection of zero or more elements, then flattens those collections into a single RDD. def flatMap[U](f: T => TraversableOnce[U]): org.apache.spark.rdd.RDD[U]

scala>  sc.parallelize(List("ni hao","hello spark"),3).flatMap(line=>line.split("\\s+")).collect
res4: Array[String] = Array(ni, hao, hello, spark)

mapPartitions(func)

Similar to map, but runs once on each partition (block) of the RDD, so when called on an RDD of type T, func must be of type Iterator[T] => Iterator[U].

def mapPartitions[U](f: Iterator[T] => Iterator[U],preservesPartitioning: Boolean): org.apache.spark.rdd.RDD[U]

scala>  sc.parallelize(List(1,2,3,4,5),3).mapPartitions(items=> for(i<-items;if(i%2==0)) yield i*2 ).collect()
res7: Array[Int] = Array(4, 8)
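To make the per-partition behaviour concrete, here is a minimal sketch that collapses each partition into a single value, its sum. It assumes a local master; the app name and the name perPartitionSums are illustrative and not part of the original example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the spark-shell context if one exists, otherwise starts a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("mapPartitions-sketch").setMaster("local[2]"))

// func receives a whole partition as one Iterator and must return an Iterator,
// so each of the 3 partitions contributes exactly one element here: its sum.
val perPartitionSums = sc.parallelize(List(1, 2, 3, 4, 5), 3)
  .mapPartitions(items => Iterator(items.sum))
  .collect()

println(perPartitionSums.mkString(", "))
```

Because func is called once per partition rather than once per element, any expensive setup placed at the top of func is paid three times here instead of five.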

mapPartitionsWithIndex(func)

Similar to mapPartitions, but also passes func an integer holding the index of the partition, so when called on an RDD of type T, func must be of type (Int, Iterator[T]) => Iterator[U].

def mapPartitionsWithIndex[U](f: (Int, Iterator[T]) => Iterator[U],preservesPartitioning: Boolean): org.apache.spark.rdd.RDD[U]

scala>  sc.parallelize(List(1,2,3,4,5),3).mapPartitionsWithIndex((p,items)=> for(i<-items) yield (p,i)).collect
res11: Array[(Int, Int)] = Array((0,1), (1,2), (1,3), (2,4), (2,5))

sample(withReplacement, fraction, seed)

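Samples a fraction of the data. withReplacement controls whether an element may be picked more than once, fraction is the expected size of the sample relative to the RDD (not an exact count), and seed fixes the random number generator.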

def sample(withReplacement: Boolean,fraction: Double,seed: Long): org.apache.spark.rdd.RDD[T]

scala> sc.parallelize(List(1,2,3,4,5,6,7),3).sample(false,0.7,1L).collect
res13: Array[Int] = Array(1, 4, 6, 7)
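The role of the seed can be checked with a small sketch: sampling the same RDD twice with the same seed should yield the same elements. The data set of 100 integers, the seed 42 and the variable names are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the spark-shell context if one exists, otherwise starts a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("sample-sketch").setMaster("local[2]"))

val data = sc.parallelize(1 to 100, 4)

// Same RDD, same partitioning, same seed: both samples contain the same elements.
val first  = data.sample(withReplacement = false, fraction = 0.3, seed = 42L).collect()
val second = data.sample(withReplacement = false, fraction = 0.3, seed = 42L).collect()

// fraction is only an expectation, so the size is near 30 but not guaranteed to be 30.
println(s"sampled ${first.length} of 100 elements")
println(first.sameElements(second))
```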

union(otherDataset)

Returns a new dataset containing the union of the elements in the source dataset and the argument. Duplicates are kept.

def union(other: org.apache.spark.rdd.RDD[T]): org.apache.spark.rdd.RDD[T]

scala> var rdd1=sc.parallelize(Array(("張三",1000),("李四",100),("趙六",300)))
scala> var rdd2=sc.parallelize(Array(("張三",1000),("王五",100),("溫七",300)))
scala> rdd1.union(rdd2).collect
res16: Array[(String, Int)] = Array((張三,1000), (李四,100), (趙六,300), (張三,1000), (王五,100), (溫七,300))

intersection(otherDataset)

Returns a new RDD containing the intersection of the elements in the source dataset and the argument.

def intersection(other: org.apache.spark.rdd.RDD[T],numPartitions: Int): org.apache.spark.rdd.RDD[T]

scala> var rdd1=sc.parallelize(Array(("張三",1000),("李四",100),("趙六",300)))
scala> var rdd2=sc.parallelize(Array(("張三",1000),("王五",100),("溫七",300)))
scala> rdd1.intersection(rdd2).collect
res17: Array[(String, Int)] = Array((張三,1000))

distinct([numPartitions])

Returns a new dataset containing the distinct elements of the source dataset.

scala>  sc.parallelize(List(1,2,3,3,5,7,2),3).distinct.collect
res19: Array[Int] = Array(3, 1, 7, 5, 2)

groupByKey([numPartitions])

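When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable[V]) pairs. Note: if you are grouping in order to run an aggregation (such as a sum or an average) over each key, reduceByKey or aggregateByKey will perform much better. Note: by default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass the optional numPartitions argument to set a different number of tasks.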

scala>  sc.parallelize(List("ni hao","hello spark"),3).flatMap(line=>line.split("\\s+")).map(word=>(word,1)).groupByKey(3).map(tuple=>(tuple._1,tuple._2.sum)).collect

reduceByKey(func, [numPartitions])

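When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs in which the values for each key are aggregated with the given reduce function func, which must be of type (V, V) => V.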

scala>  sc.parallelize(List("ni hao","hello spark"),3).flatMap(line=>line.split("\\s+")).map(word=>(word,1)).reduceByKey((v1,v2)=>v1+v2).collect()
res33: Array[(String, Int)] = Array((hao,1), (hello,1), (spark,1), (ni,1))

scala>  sc.parallelize(List("ni hao","hello spark"),3).flatMap(line=>line.split("\\s+")).map(word=>(word,1)).reduceByKey(_+_).collect()
res34: Array[(String, Int)] = Array((hao,1), (hello,1), (spark,1), (ni,1))
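Since every word in the examples above occurs only once, the aggregation itself is hard to see. The following sketch uses input with repeated words (made up for illustration) and counts them with both groupByKey and reduceByKey to show that the two produce the same result.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the spark-shell context if one exists, otherwise starts a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("wordcount-sketch").setMaster("local[2]"))

val pairs = sc.parallelize(List("a b a", "b a c"), 2)
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))

// groupByKey moves every (word, 1) pair through the shuffle, then sums afterwards.
val viaGroup = pairs.groupByKey().mapValues(_.sum).collect().toMap

// reduceByKey combines values inside each partition first, then merges the partial sums.
val viaReduce = pairs.reduceByKey(_ + _).collect().toMap

println(viaReduce)              // a -> 3, b -> 2, c -> 1 (map order may vary)
println(viaGroup == viaReduce)  // true
```

The results are identical; the difference lies in how much data crosses the shuffle, which is why reduceByKey is usually preferred for plain aggregations.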

aggregateByKey(zeroValue)(seqOp, combOp, [numPartitions])

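When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs in which the values for each key are aggregated using the given combine functions and a neutral "zero" value. The aggregated value type may differ from the input value type, while unnecessary allocations are avoided.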

scala>  sc.parallelize(List("ni hao","hello spark"),3).flatMap(line=>line.split("\\s+")).map(word=>(word,1)).aggregateByKey(0L)((z,v)=>z+v,(u1,u2)=>u1+u2).collect
res35: Array[(String, Long)] = Array((hao,1), (hello,1), (spark,1), (ni,1))
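A case where the result type really differs from the value type is a per-key average, which needs a (sum, count) pair as the intermediate value. This is a minimal sketch; the subject names and scores are invented for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the spark-shell context if one exists, otherwise starts a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("aggregateByKey-sketch").setMaster("local[2]"))

val scores = sc.parallelize(List(("math", 90), ("math", 70), ("art", 85)), 2)

val averages = scores
  .aggregateByKey((0, 0))(                          // zero value: (sum, count)
    (acc, score) => (acc._1 + score, acc._2 + 1),   // seqOp: fold one value into a partition-local pair
    (a, b) => (a._1 + b._1, a._2 + b._2))           // combOp: merge pairs from different partitions
  .mapValues { case (sum, n) => sum.toDouble / n }
  .collect()
  .toMap

println(averages) // math -> 80.0, art -> 85.0 (map order may vary)
```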

sortByKey([ascending], [numPartitions])

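When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by key in ascending or descending order, as specified by the boolean ascending argument.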

scala>  sc.parallelize(List("ni hao","hello spark"),3).flatMap(line=>line.split("\\s+")).map(word=>(word,1)).aggregateByKey(0L)((z,v)=>z+v,(u1,u2)=>u1+u2).sortByKey(false).collect()
res37: Array[(String, Long)] = Array((spark,1), (ni,1), (hello,1), (hao,1))

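sortBy(func,[ascending], [numPartitions])

sortBy can be called on any RDD[T], not only on (K, V) datasets. The sort key is chosen by func, of type T => U, and U must have an ordering (for example by implementing Ordered).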

scala>  sc.parallelize(List("ni hao","hello spark"),3).flatMap(line=>line.split("\\s+")).map(word=>(word,1)).aggregateByKey(0L)((z,v)=>z+v,(u1,u2)=>u1+u2).sortBy(_._2,true,2).collect
res42: Array[(String, Long)] = Array((hao,1), (hello,1), (spark,1), (ni,1))

join

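When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin and fullOuterJoin.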

scala> var rdd1=sc.parallelize(Array(("001","張三"),("002","李四"),("003","王五")))
scala> var rdd2=sc.parallelize(Array(("001",("apple",18.0)),("001",("orange",18.0))))
scala> rdd1.join(rdd2).collect
res43: Array[(String, (String, (String, Double)))] = Array((001,(張三,(apple,18.0))), (001,(張三,(orange,18.0))))
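In the join output above, keys 002 and 003 disappear because they have no match on the right side. The sketch below contrasts join with leftOuterJoin on a smaller, made-up data set; the user and order values are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the spark-shell context if one exists, otherwise starts a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("join-sketch").setMaster("local[2]"))

val users  = sc.parallelize(Seq(("001", "alice"), ("002", "bob")))
val orders = sc.parallelize(Seq(("001", "apple")))

// Inner join: only keys present on both sides survive.
val inner = users.join(orders).collect().toMap

// Left outer join: every left key survives; the right value is wrapped in an Option.
val left = users.leftOuterJoin(orders).collect().toMap

println(inner) // 001 -> (alice,apple)
println(left)  // 001 -> (alice,Some(apple)), 002 -> (bob,None)
```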

cogroup

When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable[V], Iterable[W])) tuples. This operation is also called groupWith.

scala> var rdd1=sc.parallelize(Array(("001","張三"),("002","李四"),("003","王五")))
scala> var rdd2=sc.parallelize(Array(("001","apple"),("001","orange"),("002","book")))
scala> rdd1.cogroup(rdd2).collect()
res46: Array[(String, (Iterable[String], Iterable[String]))] = Array((001,(CompactBuffer(張三),CompactBuffer(apple, orange))), (002,(CompactBuffer(李四),CompactBuffer(book))), (003,(CompactBuffer(王五),CompactBuffer())))

cartesian

When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).

scala> var rdd1=sc.parallelize(List("a","b","c"))
scala> var rdd2=sc.parallelize(List(1,2,3,4))
scala> rdd1.cartesian(rdd2).collect()
res47: Array[(String, Int)] = Array((a,1), (a,2), (a,3), (a,4), (b,1), (b,2), (b,3), (b,4), (c,1), (c,2), (c,3), (c,4))

coalesce(numPartitions)

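Reduces the number of partitions in the RDD to numPartitions. This operator is useful for cutting down the partition count after filtering a large dataset.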

scala>  sc.parallelize(List("ni hao","hello spark"),3).coalesce(1).partitions.length
res50: Int = 1

scala>  sc.parallelize(List("ni hao","hello spark"),3).coalesce(1).getNumPartitions
res51: Int = 1

repartition

Randomly reshuffles the data in the RDD to create either more or fewer partitions.

scala> sc.parallelize(List("a","b","c"),3).mapPartitionsWithIndex((index,values)=>for(i<-values) yield (index,i) ).collect
res52: Array[(Int, String)] = Array((0,a), (1,b), (2,c))

scala> sc.parallelize(List("a","b","c"),3).repartition(2).mapPartitionsWithIndex((index,values)=>for(i<-values) yield (index,i) ).collect
res53: Array[(Int, String)] = Array((0,a), (0,c), (1,b))
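The difference between coalesce and repartition shows up when asking for more partitions than the RDD currently has. This sketch assumes an RDD with two partitions; the variable names are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the spark-shell context if one exists, otherwise starts a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("partition-sketch").setMaster("local[2]"))

val rdd = sc.parallelize(1 to 10, 2)

// Without a shuffle, coalesce can only merge partitions, so asking for 4 leaves 2.
val grown = rdd.coalesce(4).getNumPartitions

// With shuffle = true the data is redistributed and 4 partitions are created.
val shuffled = rdd.coalesce(4, shuffle = true).getNumPartitions

// repartition always shuffles, so it can grow as well as shrink.
val repartitioned = rdd.repartition(4).getNumPartitions

// Shrinking works without a shuffle.
val shrunk = rdd.coalesce(1).getNumPartitions

println(s"$grown $shuffled $repartitioned $shrunk") // 2 4 4 1
```

Shrinking with coalesce avoids a shuffle, which is why it is the usual choice after a filter; repartition is the one to reach for when the partition count must grow.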

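Action Operators

collect

Mostly used in test environments: collect brings the results of the distributed computation back to the driver. The whole result has to fit in driver memory, so use it only on small datasets.

scala> sc.parallelize(List(1,2,3,4,5),3).collect().foreach(println)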

saveAsTextFile

Writes the computed results to a file system, typically HDFS.

scala>  sc.parallelize(List("ni hao","hello spark"),3).flatMap(_.split("\\s+")).map((_,1)).reduceByKey(_+_).sortBy(_._2,false,3).saveAsTextFile("hdfs:///wordcounts")

foreach

Iterates over every element in the RDD. It is typically used to write the elements to an external system, for example HBase.

scala> sc.parallelize(List("ni hao","hello spark"),3).flatMap(_.split("\\s+")).map((_,1)).reduceByKey(_+_).sortBy(_._2,false,3).foreach(println)
(hao,1)
(hello,1)
(spark,1)
(ni,1)

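Note: writing to an external system in this per-element style opens and closes a connection over and over, which hurts write throughput. foreachPartition is generally recommended instead.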

import org.apache.spark.rdd.RDD

val lineRDD: RDD[String] = sc.textFile("file:///E:/demo/words/t_word.txt")
lineRDD.flatMap(line=>line.split(" "))
    .map(word=>(word,1))
    .groupByKey()
    .map(tuple=>(tuple._1,tuple._2.sum))
    .sortBy(tuple=>tuple._2,false,3)
    .foreachPartition(items=>{
        // open the connection once per partition
        items.foreach(t=>println("saving to database: "+t))
        // close the connection
    })
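The saving can be made visible with a small sketch that counts how often a connection would be opened. No real database is involved: two accumulators (covered later in this article) stand in for "open a connection" and "write a record", and their names are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the spark-shell context if one exists, otherwise starts a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("foreachPartition-sketch").setMaster("local[2]"))

val opened  = sc.longAccumulator("connectionsOpened") // hypothetical counter
val written = sc.longAccumulator("recordsWritten")    // hypothetical counter

sc.parallelize(1 to 6, 3).foreachPartition { items =>
  opened.add(1)                       // stands in for opening one connection per partition
  items.foreach(_ => written.add(1))  // stands in for writing one record
  // a real sink would close its connection here
}

println(s"connections: ${opened.sum}, records: ${written.sum}") // connections: 3, records: 6
```

With foreach and a connection per element, the same six records would have cost six connections instead of three.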

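Shared Variables

Broadcast Variables

Normally, when many operations on an RDD use a variable defined in the driver, the driver sends that variable to the worker nodes again for every operation. If the variable holds a lot of data, this causes heavy network traffic and slows execution down. A broadcast variable lets the program efficiently send a large read-only value to the worker nodes: it is transferred to each worker node only once, and every operation on an executor reads the locally stored copy instead of fetching it again.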

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("demo").setMaster("local[2]")
val sc = new SparkContext(conf)

val userList = List(
    "001,張三,28,0",
    "002,李四,18,1",
    "003,王五,38,0",
    "004,zhaoliu,38,-1"
)
val genderMap = Map("0" -> "female", "1" -> "male")
val bcMap = sc.broadcast(genderMap)

sc.parallelize(userList,3)
.map(info=>{
    val prefix = info.substring(0, info.lastIndexOf(","))
    val gender = info.substring(info.lastIndexOf(",") + 1)
    val genderMapValue = bcMap.value
    val newGender = genderMapValue.getOrElse(gender, "unknown")
    prefix + "," + newGender
}).collect().foreach(println)

sc.stop()

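Accumulators

Spark's Accumulator lets tasks running on many nodes update one shared variable. It supports only addition, but that is enough for many tasks to update the same variable in parallel. Tasks can only add to an accumulator and cannot read its value; only the driver program can read it.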

scala> var count=sc.longAccumulator("count")
scala> sc.parallelize(List(1,2,3,4,5,6),3).foreach(item=> count.add(item))
scala> count.value
res1: Long = 21
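One thing to watch for: when an accumulator is updated inside a transformation rather than an action, nothing is added until an action forces the computation. The sketch below illustrates this under a local master with no task retries; the accumulator name seen is made up.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the spark-shell context if one exists, otherwise starts a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("accumulator-sketch").setMaster("local[2]"))

val seen = sc.longAccumulator("seen")

// map is a lazy transformation: defining it does not run the function.
val mapped = sc.parallelize(List(1, 2, 3, 4, 5, 6), 3).map { x => seen.add(1); x }
val before = seen.sum  // still 0, no job has run yet

mapped.count()         // the action triggers the map, and with it the updates
val after = seen.sum   // 6, one update per element

println(s"before: $before, after: $after")
```

Running a second action on mapped would execute the map again and add another six, so accumulators meant as exact counters are safer when updated inside actions such as foreach.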

Getting Started with Spark (4): Using Spark RDD Operators