Spark Programming Model (12): RDD Key-Value Transformation Operations - partitionBy, mapValues, flatMapValues

partitionBy()

  • def partitionBy(partitioner: Partitioner): RDD[(K, V)]

  • This function repartitions the source RDD according to the given partitioner, producing a new ShuffledRDD (a sketch of a custom Partitioner follows the example below)

    scala> var rdd1 = sc.makeRDD(Array((1,"A"),(2,"B"),(3,"C"),(4,"D")),2)
    rdd1: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[23] at makeRDD at <console>:21
    
    scala> rdd1.partitions.size
    res20: Int = 2
    
    // inspect the elements in each partition of rdd1
    scala> rdd1.mapPartitionsWithIndex{
         |         (partIdx,iter) => {
         |           var part_map = scala.collection.mutable.Map[String,List[(Int,String)]]()
         |             while(iter.hasNext){
         |               var part_name = "part_" + partIdx;
         |               var elem = iter.next()
         |               if(part_map.contains(part_name)) {
         |                 var elems = part_map(part_name)
         |                 elems ::= elem
         |                 part_map(part_name) = elems
         |               } else {
         |                 part_map(part_name) = List[(Int,String)]{elem}
         |               }
         |             }
         |             part_map.iterator
         |            
         |         }
         |       }.collect
    res22: Array[(String, List[(Int, String)])] = Array((part_0,List((2,B), (1,A))), (part_1,List((4,D), (3,C))))
    // (2,B),(1,A) are in part_0; (4,D),(3,C) are in part_1
    
    // repartition with partitionBy
    scala> var rdd2 = rdd1.partitionBy(new org.apache.spark.HashPartitioner(2))
    rdd2: org.apache.spark.rdd.RDD[(Int, String)] = ShuffledRDD[25] at partitionBy at <console>:23
    
    scala> rdd2.partitions.size
    res23: Int = 2
    
    // inspect the elements in each partition of rdd2
    scala> rdd2.mapPartitionsWithIndex{
         |         (partIdx,iter) => {
         |           var part_map = scala.collection.mutable.Map[String,List[(Int,String)]]()
         |             while(iter.hasNext){
         |               var part_name = "part_" + partIdx;
         |               var elem = iter.next()
         |               if(part_map.contains(part_name)) {
         |                 var elems = part_map(part_name)
         |                 elems ::= elem
         |                 part_map(part_name) = elems
         |               } else {
         |                 part_map(part_name) = List[(Int,String)]{elem}
         |               }
         |             }
         |             part_map.iterator
         |         }
         |       }.collect
    res24: Array[(String, List[(Int, String)])] = Array((part_0,List((4,D), (2,B))), (part_1,List((3,C), (1,A))))
    // (4,D),(2,B) are in part_0; (3,C),(1,A) are in part_1
    
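partitionBy accepts any Partitioner implementation, not just HashPartitioner. Below is a minimal sketch of a custom partitioner; the class name OddEvenPartitioner and the glom-based inspection are illustrative additions, not part of the original example:

    import org.apache.spark.Partitioner

    // Hypothetical partitioner: even keys go to partition 0, odd keys to partition 1
    class OddEvenPartitioner extends Partitioner {
      override def numPartitions: Int = 2
      override def getPartition(key: Any): Int = key.asInstanceOf[Int] % 2
    }

    val rdd3 = rdd1.partitionBy(new OddEvenPartitioner)

    // glom() gathers each partition into an array, a more concise way to inspect
    // partition contents than the mapPartitionsWithIndex pattern used above
    rdd3.glom().collect().foreach(part => println(part.mkString(", ")))
    // Expected: partition 0 holds (2,B), (4,D); partition 1 holds (1,A), (3,C)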

mapValues()

  • def mapValues[U](f: (V) => U): RDD[(K, U)]

  • Same as map among the basic transformations, except that mapValues applies the function only to the V of each [K,V] pair, leaving keys unchanged (a note on partitioner preservation follows the example below)

    scala> var rdd1 = sc.makeRDD(Array((1,"A"),(2,"B"),(3,"C"),(4,"D")),2)
    rdd1: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[27] at makeRDD at <console>:21
    
    scala> rdd1.mapValues(x => x + "_").collect
    res26: Array[(Int, String)] = Array((1,A_), (2,B_), (3,C_), (4,D_))
    
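A property worth noting: because mapValues leaves keys untouched, the resulting RDD keeps its parent's partitioner, whereas a plain map does not (Spark cannot verify that the keys survived). A minimal sketch of the difference, with illustrative variable names:

    // Partition rdd1 by key, then compare how mapValues and map treat the partitioner
    val partitioned = rdd1.partitionBy(new org.apache.spark.HashPartitioner(2))

    // mapValues preserves the HashPartitioner, so downstream per-key operations
    // can avoid an extra shuffle
    println(partitioned.mapValues(_ + "_").partitioner)                  // prints Some(org.apache.spark.HashPartitioner@...)

    // map discards the partitioner, even though this particular function keeps the keys
    println(partitioned.map { case (k, v) => (k, v + "_") }.partitioner) // prints None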

flatMapValues()

  • def flatMapValues[U](f: (V) => TraversableOnce[U]): RDD[(K, U)]

  • Same as flatMap among the basic transformations, except that flatMapValues applies the function only to the V of each [K,V] pair (see the sketch after the example for the usual collection-valued case)

    scala> var rdd1 = sc.makeRDD(Array((1,"A"),(2,"B"),(3,"C"),(4,"D")),2)
    rdd1: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[27] at makeRDD at <console>:21
    
    scala> rdd1.flatMapValues(x => x + "_").collect
    res36: Array[(Int, Char)] = Array((1,A), (1,_), (2,B), (2,_), (3,C), (3,_), (4,D), (4,_))
    
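Note the result above: a String is itself a TraversableOnce[Char], so the value "A_" is flattened into individual characters. The more typical use passes a function that returns a collection explicitly, as in this minimal sketch (the sample data is illustrative):

    // Each element of the Array returned by split is paired with the original key
    val pairs = sc.makeRDD(Array((1, "A,B"), (2, "C,D,E")))
    pairs.flatMapValues(_.split(",")).collect()
    // Expected: Array((1,A), (1,B), (2,C), (2,D), (2,E))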