- Create Pair RDD
- from a regular RDD, by calling map with a function that returns a (key, value) tuple.
val pairs = lines.map(x => (x.split(" ")(0), x))
- transformations on a single pair RDD (data: {(1,2),(3,4),(3,6)})
- reduceByKey => {(1,2), (3,10)}
- groupByKey => {(1,[2]), (3, [4, 6])}
- mapValues => {(1,3), (3,5), (3,7)} //x => x+1
- flatMapValues => {(1,2), (1,3), (1,4), (1,5), (3,4), (3,5)} // x => (x to 5); 6 to 5 is empty, so (3,6) yields nothing
- keys => {1,3,3}
- values => {2,4,6}
- sortByKey => {(1,2), (3,4), (3,6)} // ascending by key by default
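- combineByKey(createCombiner: builds the initial accumulator the first time a key is seen in a partition; mergeValue: folds another value for that key into the partition's accumulator; mergeCombiners: merges the accumulators for the same key from different partitions)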
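The single-RDD transformations above can be sketched as follows. This is a minimal sketch, assuming Spark is on the classpath and a local master is acceptable; the object name `PairRddTransformations` and the helpers `withLocalSpark`, `sample`, `sumByKey`, `expandToFive`, and `averageByKey` are hypothetical names chosen for illustration. The per-key average is just one possible use of combineByKey, not something the notes prescribe.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PairRddTransformations {
  // Hypothetical helper: runs `body` against a local SparkContext and always stops it.
  def withLocalSpark[T](body: SparkContext => T): T = {
    val conf = new SparkConf().setAppName("pair-rdd-transformations").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try body(sc) finally sc.stop()
  }

  // The sample data used in the notes: {(1,2),(3,4),(3,6)}.
  def sample(sc: SparkContext): RDD[(Int, Int)] =
    sc.parallelize(Seq((1, 2), (3, 4), (3, 6)))

  // reduceByKey: combine the values of each key, here by summing them.
  def sumByKey(rdd: RDD[(Int, Int)]): Map[Int, Int] =
    rdd.reduceByKey(_ + _).collectAsMap().toMap

  // flatMapValues: expand each value x into the range x to 5 (empty for x = 6).
  def expandToFive(rdd: RDD[(Int, Int)]): List[(Int, Int)] =
    rdd.flatMapValues(x => x to 5).collect().toList

  // combineByKey: per-key average using a (sum, count) accumulator.
  def averageByKey(rdd: RDD[(Int, Int)]): Map[Int, Double] =
    rdd.combineByKey(
      (v: Int) => (v, 1),                                           // createCombiner
      (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),        // mergeValue
      (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)  // mergeCombiners
    ).mapValues { case (sum, n) => sum.toDouble / n }.collectAsMap().toMap
}
```

reduceByKey is the special case of combineByKey in which the accumulator has the same type as the values; combineByKey is needed when, as with the (sum, count) pair here, the accumulator type differs.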
- transformations on two pair RDDs (rdd={(1,2),(3,4),(3,6)}, other={(3,9)})
- subtractByKey => {(1,2)}
- join => {(3, (4,9)), (3, (6,9))}
- rightOuterJoin => {(3,(Some(4),9)), (3,(Some(6),9))}
- leftOuterJoin => {(1,(2,None)),(3,(4,Some(9))),(3,(6,Some(9)))}
- cogroup => {(1,([2],[])),(3,([4, 6],[9]))}
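The two-RDD transformations above can be checked with a small sketch like the one below, assuming Spark is on the classpath and runs in local mode. The names `TwoPairRddSketch`, `withLocalSpark`, `Results`, and `run` are hypothetical. Results are collected into sets because the order of elements after a shuffle is not guaranteed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TwoPairRddSketch {
  // Hypothetical helper: runs `body` against a local SparkContext and always stops it.
  def withLocalSpark[T](body: SparkContext => T): T = {
    val conf = new SparkConf().setAppName("two-pair-rdd-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try body(sc) finally sc.stop()
  }

  // Hypothetical container for the collected results.
  case class Results(
    subtracted: Set[(Int, Int)],
    inner: Set[(Int, (Int, Int))],
    left: Set[(Int, (Int, Option[Int]))],
    right: Set[(Int, (Option[Int], Int))],
    cogrouped: Set[(Int, (List[Int], List[Int]))]
  )

  def run(sc: SparkContext): Results = {
    val rdd = sc.parallelize(Seq((1, 2), (3, 4), (3, 6)))
    val other = sc.parallelize(Seq((3, 9)))

    Results(
      // Keep only the pairs whose key does not appear in `other`.
      subtracted = rdd.subtractByKey(other).collect().toSet,
      // Inner join: only keys present on both sides.
      inner = rdd.join(other).collect().toSet,
      // Left outer join: every key of `rdd`; the right value is an Option.
      left = rdd.leftOuterJoin(other).collect().toSet,
      // Right outer join: every key of `other`; the left value is an Option.
      right = rdd.rightOuterJoin(other).collect().toSet,
      // cogroup: for each key, all values from both sides (converted to lists for comparison).
      cogrouped = rdd.cogroup(other)
        .mapValues { case (vs, ws) => (vs.toList.sorted, ws.toList.sorted) }
        .collect().toSet
    )
  }
}
```

Note where the `Option` lands: on the side that may be missing, so on the right for leftOuterJoin and on the left for rightOuterJoin.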
- actions on pair RDDs (countByKey, collectAsMap, lookup)
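A brief sketch of the three actions on the same sample data, assuming Spark is on the classpath and runs in local mode. The names `PairRddActions`, `withLocalSpark`, `sample`, `counts`, `valuesFor`, and `asMap` are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PairRddActions {
  // Hypothetical helper: runs `body` against a local SparkContext and always stops it.
  def withLocalSpark[T](body: SparkContext => T): T = {
    val conf = new SparkConf().setAppName("pair-rdd-actions").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try body(sc) finally sc.stop()
  }

  // The sample data used in the notes: {(1,2),(3,4),(3,6)}.
  def sample(sc: SparkContext): RDD[(Int, Int)] =
    sc.parallelize(Seq((1, 2), (3, 4), (3, 6)))

  // countByKey: number of elements per key, returned to the driver as a map.
  def counts(sc: SparkContext): Map[Int, Long] =
    sample(sc).countByKey().toMap

  // lookup: all values stored under one key (sorted here only to make comparison easy).
  def valuesFor(sc: SparkContext, key: Int): List[Int] =
    sample(sc).lookup(key).toList.sorted

  // collectAsMap: the whole RDD as a driver-side map.
  // With duplicate keys, only one value per key is kept.
  def asMap(sc: SparkContext): Map[Int, Int] =
    sample(sc).collectAsMap().toMap
}
```

All three bring results back to the driver, so they are only suitable when the result is small enough to fit in driver memory.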
- Partition
- call partitionBy before the transformations or actions that benefit from it (worthwhile when the partitioned RDD is used multiple times). Persist the RDD returned by partitionBy; otherwise the partitioning, including its shuffle, is re-evaluated each time the RDD is reused.
- partitioner information is kept on the result of these operations: cogroup(), groupWith(), join(), leftOuterJoin(), rightOuterJoin(), groupByKey(), reduceByKey(), combineByKey(), partitionBy(), sort(), mapValues() (if the parent RDD has a partitioner), flatMapValues() (if the parent has a partitioner), and filter() (if the parent has a partitioner).
- for binary operations, the output partitioner defaults to a HashPartitioner; if one parent has a partitioner, that one is used; if both do, the partitioner of the first parent (the caller) wins.
- custom partitioner: extend Partitioner (note: hashCode can be negative, but getPartition must return a value from 0 to numPartitions - 1) and implement:
numPartitions: Int
getPartition(key: Any): Int,
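The partitioning notes above can be sketched as follows, assuming Spark is on the classpath and runs in local mode. `NonNegativeModPartitioner`, `PartitionSketch`, `withLocalSpark`, and `partitionerChecks` are hypothetical names. The partitioner deliberately does little more than a hash partitioner would; its purpose is to show the negative-hashCode adjustment. The `equals` override is an addition beyond the two members listed in the notes: Spark compares partitioners with `equals` to decide whether two RDDs are partitioned the same way.

```scala
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// Hypothetical custom partitioner: key.hashCode modulo the partition count,
// shifted into the range 0 until numPartitions when the remainder is negative.
class NonNegativeModPartitioner(parts: Int) extends Partitioner {
  require(parts > 0, "need at least one partition")

  override def numPartitions: Int = parts

  override def getPartition(key: Any): Int = {
    val raw = if (key == null) 0 else key.hashCode % parts
    if (raw < 0) raw + parts else raw // % keeps the sign of the dividend in Scala
  }

  // Two instances are interchangeable when they have the same partition count.
  override def equals(other: Any): Boolean = other match {
    case p: NonNegativeModPartitioner => p.numPartitions == numPartitions
    case _ => false
  }

  override def hashCode: Int = numPartitions
}

object PartitionSketch {
  // Hypothetical helper: runs `body` against a local SparkContext and always stops it.
  def withLocalSpark[T](body: SparkContext => T): T = {
    val conf = new SparkConf().setAppName("partition-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try body(sc) finally sc.stop()
  }

  // Returns (mapValues kept the partitioner, map dropped the partitioner).
  def partitionerChecks(sc: SparkContext): (Boolean, Boolean) = {
    val p = new NonNegativeModPartitioner(4)
    // Persist right after partitionBy so the shuffle is not redone on every reuse.
    val partitioned = sc.parallelize(Seq((1, 2), (3, 4), (3, 6), (-7, 1)))
      .partitionBy(p)
      .persist()

    // mapValues cannot change keys, so the partitioner is kept.
    val kept = partitioned.mapValues(_ + 1).partitioner == Some(p)
    // map could change keys, so the result has no partitioner.
    val dropped = partitioned.map { case (k, v) => (k, v) }.partitioner.isEmpty
    (kept, dropped)
  }
}
```

This is why mapValues and flatMapValues are preferable to map and flatMap on a partitioned pair RDD whenever the keys are left unchanged.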