Spark Programming Model (14): RDD Key-Value Transformation Operations - groupByKey, reduceByKey, reduceByKeyLocally

groupByKey

  • def groupByKey(): RDD[(K, Iterable[V])]
  • def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]
  • def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]
  • This function groups all V values for each key K in an RDD[K,V] into a single collection Iterable[V]
  • The parameter numPartitions specifies the number of partitions
  • The parameter partitioner specifies the partitioning function

    scala> var rdd1 = sc.makeRDD(Array(("A",0),("A",2),("B",1),("B",2),("C",1)))
    rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[89] at makeRDD at <console>:21
    
    scala> rdd1.groupByKey().collect
    res81: Array[(String, Iterable[Int])] = Array((A,CompactBuffer(0, 2)), (B,CompactBuffer(2, 1)), (C,CompactBuffer(1)))
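
CompactBuffer in the output above is Spark's internal Iterable[V] implementation. The other two overloads only change how the grouped result is partitioned; a minimal sketch, reusing rdd1 from the session above (the value names grouped2 and grouped3 are illustrative):

    import org.apache.spark.HashPartitioner

    // Explicit partition count: the grouped result is hash-partitioned into 2 partitions
    val grouped2 = rdd1.groupByKey(2)
    grouped2.partitions.size   // 2

    // Custom partitioner: any org.apache.spark.Partitioner can be supplied
    val grouped3 = rdd1.groupByKey(new HashPartitioner(3))
    grouped3.partitions.size   // 3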
    

reduceByKey

  • def reduceByKey(func: (V, V) => V): RDD[(K, V)]
  • def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
  • def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]
  • This function merges the V values for each key K in an RDD[K,V] using the given reduce function func
  • The parameter numPartitions specifies the number of partitions
  • The parameter partitioner specifies the partitioning function

    scala> var rdd1 = sc.makeRDD(Array(("A",0),("A",2),("B",1),("B",2),("C",1)))
    rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[91] at makeRDD at <console>:21
    
    scala> rdd1.partitions.size
    res82: Int = 15
    
    scala> var rdd2 = rdd1.reduceByKey((x,y) => x + y)
    rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[94] at reduceByKey at <console>:23
    
    scala> rdd2.collect
    res85: Array[(String, Int)] = Array((A,2), (B,3), (C,1))
    
    scala> rdd2.partitions.size
    res86: Int = 15
    
    scala> var rdd2 = rdd1.reduceByKey(new org.apache.spark.HashPartitioner(2),(x,y) => x + y)
    rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[95] at reduceByKey at <console>:23
    
    scala> rdd2.collect
    res87: Array[(String, Int)] = Array((B,3), (A,2), (C,1))
    
    scala> rdd2.partitions.size
    res88: Int = 2
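
The numPartitions overload, not shown in the transcript, is shorthand for supplying a HashPartitioner. Note also that, unlike groupByKey, reduceByKey combines values for the same key within each partition before the shuffle, which usually makes it the better choice for aggregations. A minimal sketch of the numPartitions overload, reusing rdd1 from above (rdd3 is an illustrative name):

    // Same sum-by-key reduction, with an explicit partition count of 2
    val rdd3 = rdd1.reduceByKey((x, y) => x + y, 2)
    rdd3.partitions.size   // 2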
    

reduceByKeyLocally

  • def reduceByKeyLocally(func: (V, V) => V): Map[K, V]
  • This function merges the V values for each key K in an RDD[K,V] using the given reduce function, but returns the results to the driver as a Map[K,V] instead of an RDD[K,V]

    scala> var rdd1 = sc.makeRDD(Array(("A",0),("A",2),("B",1),("B",2),("C",1)))
    rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[91] at makeRDD at <console>:21
    
    scala> rdd1.reduceByKeyLocally((x,y) => x + y)
    res90: scala.collection.Map[String,Int] = Map(B -> 3, A -> 2, C -> 1)
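
Note that reduceByKeyLocally is an action rather than a transformation: the merged results are returned to the driver as a local Map, so it is only appropriate when the set of distinct keys fits in driver memory. A roughly equivalent sketch using reduceByKey followed by collectAsMap:

    // Reduce on the cluster first, then collect the (small) result to the driver
    val localMap: scala.collection.Map[String, Int] =
      rdd1.reduceByKey(_ + _).collectAsMap()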
    