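Spark: Key-Value Pair Operations
- 1. pair RDD
- 2. Creating a pair RDD
- 3. pair RDD transformations
- 3.1 reduceByKey: aggregate by key
- 3.2 groupByKey: group by key
- 3.3 keys: get the keys
- 3.4 values: get the values
- 3.5 sortByKey: sort by key
- 3.6 mapValues: transform values
- 3.7 flatMapValues: flat-map over values
- 3.8 combineByKey: custom aggregation by key
- 3.9 subtractByKey: difference
- 3.10 join: inner join
- 3.11 rightOuterJoin: right outer join
- 3.12 leftOuterJoin: left outer join
- 3.13 cogroup: group both RDDs by key
- 3.14 Transformation cheat sheet
- 4. pair RDD transformations by category
- 5. pair RDD actions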
1. pair RDD
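Spark provides a set of dedicated operations for RDDs whose elements are key-value pairs. These RDDs are called pair RDDs. Pair RDDs are a building block of many programs because they expose operations that act on each key in parallel or regroup data across nodes. For example, a regular RDD has countByValue, while a pair RDD offers reduceByKey.
2. Creating a pair RDD
From the regular RDD transformations covered earlier and the definition of a pair RDD, it follows that a pair RDD can be built from a regular RDD with a map transformation that turns each element into a (key, value) tuple.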
val nums = sc.parallelize(List(1,2,3,4,5))
val pairNums = nums map (x => (x,1))
println(pairNums.collect.mkString(","))
3. pair RDD transformations
3.1 reduceByKey: aggregate by key
Format:
pairRDD reduceByKey ( => )
val nums = sc.parallelize(List(1,2,3,4,5,6,7,8,9))
val all = nums sample (true, 100000)
val pairNums = all map (x => (x,1))
val sumNums = pairNums reduceByKey(_+_)
println(nums.collect.mkString(","))
println(pairNums.collect.mkString(","))
println(sumNums.collect.mkString(","))
val result = sumNums map (x => (x._1, x._2 / 100000.toDouble))
println(result.collect.mkString(","))
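What if the sample size is raised to 10 billion?
Remarkably, even with 10 billion elements the job finished in 6.7 minutes.
3.2 groupByKey: group by key
Format: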
pairRDD groupByKey
val initNums = sc parallelize ( 0 to 9)
val pairNums= initNums sample(true,20) map (x => (x, x+"x"))
println(pairNums.collect.mkString(","))
println(pairNums.groupByKey.collect.mkString(","))
3.3 keys: get the keys
Format:
pairRDD keys
val initNums = sc parallelize ( 0 to 9)
val pairNums = initNums sample(true,20) map (x => (x, x+1))
println(pairNums.collect.mkString(","))
println(pairNums.keys.collect.mkString(","))
3.4 values: get the values
Format:
pairRDD values
val initNums = sc parallelize ( 0 to 9)
val pairNums= initNums sample(true,20) map (x => (x, x+"x"))
println(pairNums.collect.mkString(","))
println(pairNums.values.collect.mkString(","))
3.5 sortByKey: sort by key
Format:
pairRDD sortByKey
val nums = sc parallelize ( 0 to 9) map (x => (10 - x, x))
println(nums.collect.mkString(","))
println(nums.sortByKey().collect.mkString(","))
3.6 mapValues: transform values
Format:
pairRDD mapValues ( => )
val nums = sc parallelize (0 to 9) map (x => (x%4,x))
nums collect() foreach print
nums mapValues ( x => x * 10 ) collect() foreach print
3.7 flatMapValues: flat-map over values
Format:
pairRDD flatMapValues( => )
val nums = sc parallelize ( 0 to 9 ) map ( x => ( x, x ))
nums collect() foreach print
nums flatMapValues ( x => x to 10 ) collect() foreach print
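3.8 combineByKey: custom aggregation by key
Format:
pairRDD combineByKey( => , => , => )
First => : converts an element into the result type. Parameter: the element.
Second => : merges an element into the result within a partition. Parameters: result type, element.
Third => : merges results across partitions. Parameters: result type, result type.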
val nums = sc parallelize(List(("A",66),("B",56),("C",88),("D",99),("A",33),("B",67858),("C",8987),("D",11231)))
type Mt = (Int,Int)
nums.combineByKey(a => (a,1),(x:Mt,s)=>(x._1+s,x._2 + 1),(c:Mt,d:Mt)=>(c._1+d._1,c._2+d._2)) map{ case (key,value) => (key, value._1/value._2.toDouble)} collect() foreach print
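To make the roles of the three functions concrete, here is a minimal sketch that models them with plain Scala collections rather than Spark. The object name `CombineByKeyModel`, the helper signature, and the split of the data into two partitions are assumptions made for illustration; Spark decides the real partitioning and its implementation differs.

```scala
object CombineByKeyModel {
  // Plain-collections model: each inner Seq stands for one partition.
  def combineByKey[K, V, C](
      partitions: Seq[Seq[(K, V)]],
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {
    // First and second function: build one combiner per key inside each partition.
    val perPartition: Seq[Map[K, C]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        acc.get(k) match {
          case None    => acc + (k -> createCombiner(v)) // first time the key is seen
          case Some(c) => acc + (k -> mergeValue(c, v))  // key already has a combiner
        }
      }
    }
    // Third function: merge the per-partition combiners key by key.
    perPartition.foldLeft(Map.empty[K, C]) { (acc, m) =>
      m.foldLeft(acc) { case (a, (k, c)) =>
        a.get(k) match {
          case None      => a + (k -> c)
          case Some(old) => a + (k -> mergeCombiners(old, c))
        }
      }
    }
  }

  // Same data as the example above, split by hand into two hypothetical partitions.
  val partitions: Seq[Seq[(String, Int)]] = Seq(
    Seq(("A", 66), ("B", 56), ("C", 88), ("D", 99)),
    Seq(("A", 33), ("B", 67858), ("C", 8987), ("D", 11231)))

  // (sum, count) per key.
  val sumAndCount: Map[String, (Int, Int)] = combineByKey[String, Int, (Int, Int)](
    partitions,
    v => (v, 1),
    (c, v) => (c._1 + v, c._2 + 1),
    (c1, c2) => (c1._1 + c2._1, c1._2 + c2._2))

  // Average per key, as in the Spark example.
  val averages: Map[String, Double] =
    sumAndCount.map { case (k, (sum, n)) => (k, sum / n.toDouble) }
}
```

With this split, key "A" gets (66, 1) in the first partition and (33, 1) in the second, which merge into (99, 2) and an average of 49.5.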
3.9 subtractByKey: difference
Format:
pairRDD1 subtractByKey pairRDD2
val num1 = sc parallelize(List(("A",66),("B",78),("C",77),("D",88)))
val num2 = sc parallelize(List(("A",66),("C",77)))
num1 collect() foreach print
num2 collect() foreach print
num1 subtractByKey num2 collect() foreach print
3.10 join: inner join
Format:
pairRDD1 join pairRDD2
val num1 = sc parallelize(List(("A",66),("B",78),("C",77),("D",88)))
val num2 = sc parallelize(List(("A",778),("C",899)))
num1 collect() foreach print
num2 collect() foreach print
num1 join num2 collect() foreach print
3.11 rightOuterJoin: right outer join
Format:
pairRDD1 rightOuterJoin pairRDD2
val num1 = sc parallelize(List(("A",66),("B",78),("C",77),("D",88)))
val num2 = sc parallelize(List(("A",778),("C",899)))
num1 collect() foreach print
num2 collect() foreach print
num1 rightOuterJoin num2 collect() foreach print
3.12 leftOuterJoin: left outer join
Format:
pairRDD1 leftOuterJoin pairRDD2
val num1 = sc parallelize(List(("A",66),("B",78),("C",77),("D",88)))
val num2 = sc parallelize(List(("A",778),("C",899)))
num1 collect() foreach print
num2 collect() foreach print
num1 leftOuterJoin num2 collect() foreach print
3.13 cogroup: group both RDDs by key
Format:
pairRDD1 cogroup pairRDD2
val num1 = sc parallelize(List(("A",66),("B",78),("C",77),("D",88)))
val num2 = sc parallelize(List(("A",778),("C",899)))
num1 collect() foreach print
num2 collect() foreach print
num1 cogroup num2 collect() foreach print
3.14 Transformation cheat sheet
Operation | Method | Format |
---|---|---|
Aggregate by key | reduceByKey | pairRDD reduceByKey ( => ) |
Group by key | groupByKey | pairRDD groupByKey |
Get keys | keys | pairRDD keys |
Get values | values | pairRDD values |
Sort by key | sortByKey | pairRDD sortByKey |
Transform values | mapValues | pairRDD mapValues ( => ) |
Flat-map over values | flatMapValues | pairRDD flatMapValues( => ) |
Custom aggregation by key | combineByKey | pairRDD combineByKey( => , => , => ) |
Difference | subtractByKey | pairRDD1 subtractByKey pairRDD2 |
Inner join | join | pairRDD1 join pairRDD2 |
Right outer join | rightOuterJoin | pairRDD1 rightOuterJoin pairRDD2 |
Left outer join | leftOuterJoin | pairRDD1 leftOuterJoin pairRDD2 |
Group both RDDs by key | cogroup | pairRDD1 cogroup pairRDD2 |
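The four two-RDD operations differ mainly in which keys survive and in the shape of the values. The sketch below models those shapes with plain Scala collections on the num1/num2 data used above. It is not Spark code: the object name `JoinShapes` and the helper `valuesFor` are hypothetical, and element order in a real Spark result is not guaranteed.

```scala
object JoinShapes {
  val num1: Seq[(String, Int)] = Seq(("A", 66), ("B", 78), ("C", 77), ("D", 88))
  val num2: Seq[(String, Int)] = Seq(("A", 778), ("C", 899))

  // All values stored under key k on one side.
  private def valuesFor(side: Seq[(String, Int)], k: String): Seq[Int] =
    side.filter(_._1 == k).map(_._2)

  // join: only keys present on both sides survive.
  val inner: Seq[(String, (Int, Int))] =
    for ((k, v) <- num1; w <- valuesFor(num2, k)) yield (k, (v, w))

  // leftOuterJoin: every left pair survives; a missing right value is None.
  val leftOuter: Seq[(String, (Int, Option[Int]))] = num1.flatMap { case (k, v) =>
    val ws = valuesFor(num2, k)
    if (ws.isEmpty) Seq((k, (v, Option.empty[Int])))
    else ws.map(w => (k, (v, Option(w))))
  }

  // rightOuterJoin: every right pair survives; a missing left value is None.
  val rightOuter: Seq[(String, (Option[Int], Int))] = num2.flatMap { case (k, w) =>
    val vs = valuesFor(num1, k)
    if (vs.isEmpty) Seq((k, (Option.empty[Int], w)))
    else vs.map(v => (k, (Option(v), w)))
  }

  // cogroup: every key from either side, with all of its values from each side.
  val cogrouped: Map[String, (Seq[Int], Seq[Int])] =
    (num1 ++ num2).map(_._1).distinct.map { k =>
      (k, (valuesFor(num1, k), valuesFor(num2, k)))
    }.toMap
}
```

Here `inner` keeps only "A" and "C", `leftOuter` keeps all four keys with `None` for "B" and "D", and `cogrouped("B")` pairs `Seq(78)` with an empty sequence.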
4. pair RDD transformations by category
4.1 Element-wise operations
4.1.1 map
A pair RDD is still an RDD (of tuples), so every regular RDD operation is also available on it.
Format:
pairRDD map {case (key,value) => (key, value')}
val keys = sc parallelize( 1 to 6)
val pairs = keys map (x => (x,x*x))
pairs map {case (key,value) => (key, value * 10 )} collect() foreach print
4.1.2 filter
Format:
pairRDD filter {case (key,value) => Boolean}
val keys = sc parallelize( 1 to 6)
val pairs = keys map (x => (x,x*x*10))
pairs collect() foreach print
pairs filter {case (key,value) => value < 100 } collect() foreach print
4.1.3 keys
Format:
pairRDD keys
val keys = sc parallelize( 1 to 6)
val pairs = keys map (x => (x,x*x*10))
pairs collect() foreach print
println(pairs.keys.collect.mkString(","))
4.1.4 values
Format:
pairRDD values
val keys = sc parallelize( 1 to 6)
val pairs = keys map (x => (x,x*x*10))
pairs collect() foreach print
println(pairs.keys.collect.mkString(","))
println(pairs.values.collect.mkString(","))
4.1.5 mapValues
Format:
pairRDD mapValues ( => )
val keys = sc parallelize( 1 to 6)
val pairs = keys map (x => (x,x*x*10))
pairs collect() foreach print
pairs mapValues ( x => x /10 ) collect() foreach print
4.2 Aggregation operations
4.2.1 reduceByKey
Format:
pairRDD reduceByKey ( => )
val keys = sc parallelize (List("A","B","C","D"))
val values = sc parallelize (List(1,2,3,4))
val pairs = keys cartesian values
pairs collect() foreach print
pairs reduceByKey ((a,b) => ( if (a > b) a else b)) collect() foreach print
4.2.2 foldByKey
Format:
pairRDD foldByKey (value)( => )
val keys = sc parallelize (List("A","B","C","D"))
val values = sc parallelize (List(1,2,3,4))
val pairs = sc parallelize (keys cartesian values collect,1)
pairs collect() foreach print
pairs.foldByKey(3)((a,b) => (a+b)) collect()
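With a single partition, the initial value 3 is folded in once per key, so every key yields 3 + 1 + 2 + 3 + 4 = 13.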
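The result 13 depends on the data sitting in one partition, because foldByKey applies the initial value once per key in each partition that contains the key. The following is a minimal sketch of that behaviour with plain Scala collections, not Spark itself; the object name `FoldByKeyModel` and the two-partition split are assumptions for illustration.

```scala
object FoldByKeyModel {
  // Plain-collections model: each inner Seq stands for one partition.
  def foldByKey[K, V](partitions: Seq[Seq[(K, V)]], zero: V)(f: (V, V) => V): Map[K, V] = {
    // Inside each partition, the zero value seeds the fold once per key.
    val perPartition: Seq[Map[K, V]] = partitions.map { part =>
      part.groupBy(_._1).map { case (k, kvs) =>
        (k, kvs.map(_._2).foldLeft(zero)(f))
      }
    }
    // Partition results are merged with the same function, without the zero value.
    perPartition.flatten.groupBy(_._1).map { case (k, kvs) =>
      (k, kvs.map(_._2).reduce(f))
    }
  }

  // Keys A to D crossed with values 1 to 4, like the cartesian product above.
  val data: Seq[(String, Int)] =
    for (k <- Seq("A", "B", "C", "D"); v <- Seq(1, 2, 3, 4)) yield (k, v)

  // One partition: 3 + 1 + 2 + 3 + 4 = 13 for every key.
  val onePartition: Map[String, Int] = foldByKey(Seq(data), 3)(_ + _)

  // Two hypothetical partitions (values 1,2 and 3,4): (3 + 1 + 2) + (3 + 3 + 4) = 16.
  val twoPartitions: Map[String, Int] =
    foldByKey(Seq(data.filter(_._2 <= 2), data.filter(_._2 > 2)), 3)(_ + _)
}
```

This is why a neutral initial value (0 for addition, 1 for multiplication) is the safe choice: the result then no longer depends on how the data is partitioned.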
4.2.3 aggregateByKey
Format:
pairRDD aggregateByKey(value)( => , => )
val keys = sc parallelize (List("A","B","C","D"))
val values = sc parallelize (List(1,2,3,4))
val pairs = sc parallelize (keys cartesian values collect,1)
pairs collect() foreach print
pairs.aggregateByKey("")((a,b)=>(a+""+b),(s,t)=>s+t) collect()
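Unlike foldByKey, aggregateByKey lets the result type differ from the value type: the first function folds values into the result inside a partition, and the second merges results across partitions. The sketch below models this with plain Scala collections rather than Spark; the object name `AggregateByKeyModel` and the two-partition split are assumptions for illustration.

```scala
object AggregateByKeyModel {
  // Plain-collections model: each inner Seq stands for one partition.
  def aggregateByKey[K, V, U](partitions: Seq[Seq[(K, V)]], zero: U)(
      seqOp: (U, V) => U,
      combOp: (U, U) => U): Map[K, U] = {
    // seqOp folds the values of a key into the zero value inside each partition.
    val perPartition: Seq[Map[K, U]] = partitions.map { part =>
      part.groupBy(_._1).map { case (k, kvs) =>
        (k, kvs.map(_._2).foldLeft(zero)(seqOp))
      }
    }
    // combOp merges the per-partition results of the same key.
    perPartition.flatten.groupBy(_._1).map { case (k, kus) =>
      (k, kus.map(_._2).reduce(combOp))
    }
  }

  // Keys A to D crossed with values 1 to 4, like the cartesian product above.
  val data: Seq[(String, Int)] =
    for (k <- Seq("A", "B", "C", "D"); v <- Seq(1, 2, 3, 4)) yield (k, v)

  // Same call as above on a single partition: Int values become one String per key.
  val concatenated: Map[String, String] =
    aggregateByKey(Seq(data), "")((a, b) => a + b, (s, t) => s + t)

  // Two hypothetical partitions (values 1,2 and 3,4), tracking (min, max) per key.
  val minMax: Map[String, (Int, Int)] =
    aggregateByKey(
      Seq(data.filter(_._2 <= 2), data.filter(_._2 > 2)),
      (Int.MaxValue, Int.MinValue))(
      (mm, v) => (math.min(mm._1, v), math.max(mm._2, v)),
      (x, y) => (math.min(x._1, y._1), math.max(x._2, y._2)))
}
```

In this model `concatenated("A")` is "1234" and `minMax("A")` is (1, 4). In Spark itself the order in which partition results are merged is not fixed, so an order-sensitive merge such as string concatenation is only predictable with a single partition.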
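4.2.4 combineByKey
Format:
pairRDD combineByKey( => , => , => )
First => : converts an element into the result type. Parameter: the element.
Second => : merges an element into the result within a partition. Parameters: result type, element (this order is fixed).
Third => : merges results across partitions. Parameters: result type, result type.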
val keys = sc parallelize (List("A","B","C","D"))
val values = sc parallelize (List(1,2,3,4))
val pairs = sc parallelize (keys cartesian values collect,1)
pairs combineByKey(x=>x.toDouble,(a:Double,b:Int)=>(a + b.toDouble),(a:Double,b:Double)=>(a+b)) collect
4.3 Grouping operations
4.3.1 groupBy
Format:
pairRDD groupBy ( => )
val keys = sc parallelize (List("A","B","C","D"))
val values = sc parallelize (List(1,2,3,4))
val pairs = sc parallelize (keys cartesian values collect,1)
pairs groupBy{case(key,value) => key} map {case(key,value) => (key, value map (x => x._2))} collect
This is equivalent to groupByKey:
val keys = sc parallelize (List("A","B","C","D"))
val values = sc parallelize (List(1,2,3,4))
val pairs = sc parallelize (keys cartesian values collect,1)
pairs groupByKey() collect
4.3.2 groupByKey
Format:
pairRDD groupByKey
val keys = sc parallelize (List("A","B","C","D"))
val values = sc parallelize (List(1,2,3,4))
val pairs = sc parallelize (keys cartesian values collect,1)
pairs groupByKey() collect
4.3.3 cogroup
Format:
pairRDD1 cogroup pairRDD2 [cogroup pairRDD3 …]
val keys = sc parallelize (List("A","B","C","D"))
val values = sc parallelize (List(1,2,3,4))
val pairs1 = sc parallelize (keys cartesian values collect)
val keys = sc parallelize(List("B","C"))
val values = sc parallelize(5 to 8)
val pairs2 = sc parallelize (keys cartesian values collect)
pairs1 collect() foreach print
pairs2 collect() foreach print
pairs1 cogroup pairs2 collect() foreach print
4.4 Join operations
4.4.1 join
Format:
pairRDD1 join pairRDD2
val keys = sc parallelize (1 to 3)
val values = sc parallelize ('A' to 'C')
val pairs1 = sc parallelize ( keys cartesian values collect)
keys collect() foreach print
values collect() foreach print
pairs1 collect() foreach print
val keys = sc parallelize (2 to 4)
val values = sc parallelize ( 'M' to 'O')
val pairs2 = sc parallelize ( keys cartesian values collect)
keys collect() foreach print
values collect() foreach print
pairs2 collect() foreach print
pairs1 join pairs2 collect() foreach print
4.4.2 leftOuterJoin
Format:
pairRDD1 leftOuterJoin pairRDD2
val keys = sc parallelize ( 1 to 3)
val values = sc parallelize( 'A' to 'C')
keys collect() foreach print
values collect() foreach print
val pairs1 = keys cartesian values
val keys = sc parallelize ( 2 to 3)
val values = sc parallelize ('A' to 'D')
keys collect() foreach print
values collect() foreach print
val pairs2 = keys cartesian values
pairs1 collect() foreach print
pairs2 collect() foreach print
pairs1 leftOuterJoin pairs2 collect() foreach print
4.4.3 rightOuterJoin
Format:
pairRDD1 rightOuterJoin pairRDD2
val keys = sc parallelize ( 1 to 2)
val values = sc parallelize ( 'A' to 'C')
val pairs1 = keys cartesian values
keys collect() foreach print
values collect() foreach print
val keys = sc parallelize ( 2 to 4)
val values = sc parallelize ('B' to 'D')
val pairs2 = keys cartesian values
keys collect() foreach print
values collect() foreach print
pairs1 collect() foreach print
pairs2 collect() foreach print
pairs1 rightOuterJoin pairs2 collect() foreach print
4.5 Sorting operations
4.5.1 sortByKey
Format:
pairRDD sortByKey
val keys = sc parallelize (List(3,2,1))
val values = sc parallelize ( 'M' to 'O')
val pairs = keys cartesian values
keys collect() foreach print
values collect() foreach print
pairs collect() foreach print
pairs sortByKey() collect() foreach print
5. pair RDD actions
5.1 countByKey
Format:
pairRDD countByKey
val keys = sc parallelize( 1 to 8)
val values = sc parallelize ( 'A' to 'E')
val pairs = keys cartesian values
keys collect() foreach print
values collect() foreach print
pairs collect() foreach print
pairs countByKey
5.2 collectAsMap
Format:
pairRDD collectAsMap
val keys = sc parallelize( 1 to 7)
val values = sc parallelize( 'E' to 'H')
val pairs = keys cartesian values
keys collect() foreach print
values collect() foreach print
pairs collect() foreach print
pairs collectAsMap
5.3 lookup
Format:
pairRDD lookup key
val keys = sc parallelize ( 1 to 5)
val values = sc parallelize ( 'M' to 'Z')
val pair = keys cartesian values
keys collect() foreach print
values collect() foreach print
pair collect() foreach print
pair lookup 2
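All three actions above run on a cartesian product, where every key appears several times, and that matters for what they return. The sketch below models the return shapes with plain Scala collections, not Spark; the object name `PairActionsModel` and the smaller data set are assumptions for illustration.

```scala
object PairActionsModel {
  // Keys 1 to 3 crossed with values 'A' and 'B': each key appears twice.
  val pairs: Seq[(Int, Char)] = for (k <- 1 to 3; v <- 'A' to 'B') yield (k, v)

  // countByKey: number of pairs per key.
  val counts: Map[Int, Long] =
    pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.size.toLong) }

  // lookup(key): all values stored under one key.
  def lookup(key: Int): Seq[Char] = pairs.filter(_._1 == key).map(_._2)

  // collectAsMap: a Map holds one value per key, so duplicate keys collapse.
  val asMap: Map[Int, Char] = pairs.toMap
}
```

In this model `counts(1)` is 2, `lookup(2)` returns both 'A' and 'B', and `asMap` has only three entries. Spark's collectAsMap likewise keeps a single value per key, so it suits pair RDDs whose keys are unique.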