spark - Pair RDD (Key/Value Pairs)

- Create Pair RDD

  • from a regular RDD by calling map:
val pairs = lines.map(x => (x.split(" ")(0), x)) // key = first word of each line


  • transformations on a pair RDD (data: {(1,2),(3,4),(3,6)})
  1. reduceByKey => {(1,2), (3,10)} // (x,y) => x+y
  2. groupByKey => {(1,[2]), (3,[4,6])}
  3. mapValues => {(1,3), (3,5), (3,7)} // x => x+1
  4. flatMapValues => {(1,2), (1,3), (1,4), (1,5), (3,4), (3,5)} // x => (x to 5); (3,6) yields nothing since (6 to 5) is empty
  5. keys => {1,3,3}
  6. values => {2,4,6}
  7. sortByKey => {(1,2), (3,4), (3,6)} // ascending by key
  8. combineByKey(createCombiner, mergeValue, mergeCombiners): createCombiner builds the initial accumulator when a key is first seen in a partition, mergeValue folds further values into that per-partition accumulator, mergeCombiners merges accumulators from different partitions
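The three functions of combineByKey can be seen in the classic per-key average. A minimal sketch, assuming spark-core on the classpath and a local master; the data matches the example above:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("pairs").setMaster("local[*]"))
val rdd = sc.parallelize(Seq((1, 2), (3, 4), (3, 6)))

val sumCount = rdd.combineByKey(
  (v: Int) => (v, 1),                                           // createCombiner: first value seen for a key
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),        // mergeValue: fold a value into the per-partition accumulator
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)  // mergeCombiners: merge accumulators across partitions
)
val avgByKey = sumCount.mapValues { case (sum, count) => sum.toDouble / count }.collectAsMap()
// avgByKey: Map(1 -> 2.0, 3 -> 5.0)
sc.stop()
```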

  • transformations on two pair RDDs (rdd={(1,2),(3,4),(3,6)}, other={(3,9)})
  1. subtractByKey => {(1,2)}
  2. join => {(3, (4,9)), (3, (6,9))}
  3. rightOuterJoin => {(3,(Some(4),9)), (3,(Some(6),9))}
  4. leftOuterJoin => {(1,(2,None)),(3,(4,Some(9))),(3,(6,Some(9)))}
  5. cogroup => {(1,([2],[])),(3,([4, 6],[9]))}
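The two-RDD results above can be reproduced directly. A sketch assuming the same local SparkContext setup (sets are used in the comments because Spark does not guarantee element order in collect()):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("joins").setMaster("local[*]"))
val rdd   = sc.parallelize(Seq((1, 2), (3, 4), (3, 6)))
val other = sc.parallelize(Seq((3, 9)))

val onlyIn = rdd.subtractByKey(other).collect().toSet // Set((1,2))
val joined = rdd.join(other).collect().toSet          // Set((3,(4,9)), (3,(6,9)))
val left   = rdd.leftOuterJoin(other).collect().toSet // Set((1,(2,None)), (3,(4,Some(9))), (3,(6,Some(9))))
// cogroup pairs the full value groups of both RDDs per key
val grouped = rdd.cogroup(other)
  .mapValues { case (a, b) => (a.toSet, b.toSet) }
  .collectAsMap()                                     // 1 -> (Set(2), Set()), 3 -> (Set(4,6), Set(9))
sc.stop()
```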
  • actions on pair RDDs (countByKey, collectAsMap, lookup)
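The three actions in one sketch, on the same example data (note that collectAsMap keeps only one value per duplicate key, so it loses one of (3,4)/(3,6)):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("actions").setMaster("local[*]"))
val rdd = sc.parallelize(Seq((1, 2), (3, 4), (3, 6)))

val counts = rdd.countByKey()   // Map(1 -> 1, 3 -> 2)
val asMap  = rdd.collectAsMap() // duplicate keys collapse: only one value survives for key 3
val threes = rdd.lookup(3)      // all values for key 3: Seq(4, 6)
sc.stop()
```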
  • Partitioning
  1. partition before a transformation or action (beneficial when the partitioned data will be used multiple times); the partitioned RDD should be persisted right after partitionBy, otherwise it is re-evaluated (and re-shuffled) each time it is reused
  2. partitioning info is preserved by these operations: cogroup(), groupWith(), join(), leftOuterJoin(), rightOuterJoin(),
    groupByKey(), reduceByKey(), combineByKey(), partitionBy(), sort(), mapValues()
    (if the parent RDD has a partitioner), flatMapValues() (if the parent has a partitioner), and
    filter() (if the parent has a partitioner)
  3. for binary operations, the result's partitioner is the default HashPartitioner, unless a parent already has a partitioner, in which case that one is used (if both parents have one, the caller's wins)
  4. a custom partitioner extends Partitioner (note: hashCode can be negative, but a partition id must be non-negative) and implements:

  numPartitions: Int
  getPartition(key: Any): Int
  equals(other: Any): Boolean
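A minimal custom Partitioner sketch (the class name is illustrative). It shows the negative-hashCode pitfall from point 4: getPartition must return an id in [0, numPartitions), so negative remainders are shifted back into range:

```scala
import org.apache.spark.Partitioner

class PositiveHashPartitioner(override val numPartitions: Int) extends Partitioner {
  require(numPartitions > 0)

  def getPartition(key: Any): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod // hashCode may be negative; partition ids may not
  }

  // equals/hashCode let Spark detect that two RDDs share a partitioner
  // and skip the shuffle when joining them
  override def equals(other: Any): Boolean = other match {
    case p: PositiveHashPartitioner => p.numPartitions == numPartitions
    case _                          => false
  }
  override def hashCode: Int = numPartitions
}
```

Per point 1, use it with partitionBy and persist the result, e.g. `pairs.partitionBy(new PositiveHashPartitioner(8)).persist()`.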