Table of Contents
- Spark API Documentation
- Value-type Transformation operators
- Transformation-map
- Transformation-mapPartitions
- Transformation-flatMap
- Transformation-union
- Transformation-distinct
- Transformation-filter
- Transformation-intersection
- Key-Value-type Transformation operators
- Transformation-groupByKey
- Transformation-reduceByKey
- Transformation-aggregateByKey
- Transformation-join
- Action operators
Core idea:
There are four types of operators on RDDs:
- Create
  - SparkContext.textFile()
  - SparkContext.parallelize()
- Transformation
  - Operates on one or more RDDs and outputs the transformed RDD
  - e.g. map, filter, groupBy
- Action
  - Triggers Spark to submit a job and returns the result to the Driver Program
  - e.g. reduce, countByKey
- Cache
  - cache: cache the RDD in memory
  - persist: persist the RDD at a specified storage level
Lazy evaluation: computation is only actually executed when an Action is encountered.
Ways to run Spark
- Run spark-shell on a CDH cluster
  - In the shell, enter spark-shell --master yarn-client
- Use Zeppelin
  - sudo docker run -p 8080:8080 --rm --name zeppelin apache/zeppelin:0.7.3
  - https://zeppelin.apache.org
- Submit jobs with spark-submit
Spark API Documentation
Visit the official documentation: https://spark.apache.org/docs/latest/
Value-type Transformation operators
Transformation-map
- map
  - def map[U](f: (T) ⇒ U)(implicit arg0: ClassTag[U]): RDD[U]
  - Produces a new RDD in which each element is obtained by applying the function f to the corresponding element of the parent RDD
  - The new RDD is a MappedRDD (shown as MapPartitionsRDD in recent Spark versions, as in the output below)
- Example
val rd1 = sc.parallelize(List(1, 2, 3, 4, 5, 6), 2)
val rd2 = rd1.map(x => x * 2)
rd2.collect()
rd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize
rd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[3] at map
res1: Array[Int] = Array(2, 4, 6, 8, 10, 12)
Transformation-mapPartitions
- mapPartitions
  - def mapPartitions[U](f: (Iterator[T]) => Iterator[U], preservesPartitioning: Boolean = false)(implicit arg0: ClassTag[U]): RDD[U]
  - Receives an iterator over each partition
  - Operates on every element of each partition
- Example
val rd1 = sc.parallelize(List("20180101", "20180102", "20180103", "20180104", "20180105",
"20180106"), 2)
val rd2 = rd1.mapPartitions(iter => {
  // The SimpleDateFormat is created once per partition, not once per element
  val dateFormat = new java.text.SimpleDateFormat("yyyyMMdd")
  iter.map(dateStr => dateFormat.parse(dateStr))
})
rd2.collect()
res1: Array[java.util.Date] = Array(Mon Jan 01 00:00:00 UTC 2018, Tue Jan 02 00:00:00 UTC 2018, Wed Jan 03 00:00:00 UTC 2018, Thu Jan 04 00:00:00 UTC 2018, Fri Jan 05 00:00:00 UTC 2018, Sat Jan 06 00:00:00 UTC 2018)
Transformation-flatMap
- flatMap
  - def flatMap[U](f: (T) ⇒ TraversableOnce[U])(implicit arg0: ClassTag[U]): RDD[U]
  - Transforms each element of the RDD into new elements via the function f
  - Flattens the result: all of the produced collections are merged into a single one
  - The new RDD is a FlatMappedRDD
- Example
val rd1 = sc.parallelize(Seq("I have a pen",
"I have an apple",
"I have a pen",
"I have a pineapple"), 2)
val rd2 = rd1.map(s => s.split(" "))
rd2.collect()
val rd3 = rd1.flatMap(s => s.split(" "))
rd3.collect()
rd3.partitions
res136: Array[Array[String]] = Array(Array(I, have, a, pen), Array(I, have, an, apple), Array(I, have, a, pen), Array(I, have, a, pineapple))
res137: Array[String] = Array(I, have, a, pen, I, have, an, apple, I, have, a, pen, I, have, a, pineapple)
Transformation-union
- union
  - def union(other: RDD[T]): RDD[T]
  - Merges two RDDs
  - The element types must match; duplicates are not removed
- Example
val rdd1 = sc.parallelize(Seq("Apple", "Banana", "Orange"))
val rdd2 = sc.parallelize(Seq("Banana", "Pineapple"))
val rdd3 = sc.parallelize(Seq("Durian"))
val rddUnion = rdd1.union(rdd2).union(rdd3)
rddUnion.collect()
res1: Array[String] = Array(Apple, Banana, Orange, Banana, Pineapple, Durian)
Transformation-distinct
- distinct
  - def distinct(): RDD[T]
  - Removes duplicate elements from the RDD
- Example
val rdd1 = sc.parallelize(Seq("Apple", "Banana", "Orange"))
val rdd2 = sc.parallelize(Seq("Banana", "Pineapple"))
val rdd3 = sc.parallelize(Seq("Durian"))
val rddUnion = rdd1.union(rdd2).union(rdd3)
val rddDistinct = rddUnion.distinct()
rddDistinct.collect()
res1: Array[String] = Array(Orange, Apple, Banana, Pineapple, Durian)
Transformation-filter
- filter
  - def filter(f: (T) ⇒ Boolean): RDD[T]
  - Filters the elements of the RDD
  - An element is kept when f returns true, and discarded otherwise
- Example
val rdd1 = sc.parallelize(Seq("Apple", "Banana", "Orange"))
val filteredRDD = rdd1.filter(item => item.length() >= 6)
filteredRDD.collect()
res1: Array[String] = Array(Banana, Orange)
Transformation-intersection
- intersection
  - def intersection(other: RDD[T]): RDD[T]
  - def intersection(other: RDD[T], numPartitions: Int): RDD[T]
  - def intersection(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]
  - Returns the intersection of the elements of the two RDDs
- Example
val rdd1 = sc.parallelize(Seq("Apple", "Banana", "Orange"))
val rdd2 = sc.parallelize(Seq("Banana", "Pineapple"))
val rddIntersection = rdd1.intersection(rdd2)
rddIntersection.collect()
res1: Array[String] = Array(Banana)
Key-Value-type Transformation operators
Transformation-groupByKey
- groupByKey
  - def groupByKey(): RDD[(K, Iterable[V])]
  - def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]
  - def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]
  - Groups the elements of an RDD[(K, V)] by identical keys
- Example
val scoreDetail = sc.parallelize(List(("xiaoming","A"), ("xiaodong","B"),
("peter","B"), ("liuhua","C"), ("xiaofeng","A")), 3)
scoreDetail.map(score_info => (score_info._2, score_info._1))
.groupByKey()
.collect()
.foreach(println(_))
scoreDetail: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[110] at parallelize
(A,CompactBuffer(xiaoming, xiaofeng))
(B,CompactBuffer(xiaodong, peter))
(C,CompactBuffer(liuhua))
Transformation-reduceByKey
- reduceByKey
  - def reduceByKey(func: (V, V) ⇒ V): RDD[(K, V)]
  - def reduceByKey(func: (V, V) ⇒ V, numPartitions: Int): RDD[(K, V)]
  - Merges the values of each key using the associative reduce function func
- Example
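A minimal word-count-style sketch (the pair data below is illustrative, not from the original notes):

```scala
val pairs = sc.parallelize(Seq(("apple", 1), ("banana", 1), ("apple", 1), ("orange", 1), ("apple", 1)), 2)
// reduceByKey merges values per key within each partition first (map-side combine),
// then merges the partial results across partitions
val counts = pairs.reduceByKey((a, b) => a + b)
counts.collect()
// Array((apple,3), (banana,1), (orange,1)) -- order may vary
```

Unlike groupByKey, reduceByKey combines values before the shuffle, so it moves far less data across the network.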
Transformation-aggregateByKey
- aggregateByKey
  - def aggregateByKey[U](zeroValue: U)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
  - Aggregates the values of each key, starting from zeroValue: seqOp folds a value into a partition-local accumulator, and combOp merges accumulators across partitions
- How do we compute the average per group for the following data?
[(A,110),(A,130),(A,120),(B,200),(B,206),(B,206),(C,150),(C,160),(C,170)]
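A common approach is to carry a (sum, count) pair as the accumulator and divide at the end; a sketch using the data above:

```scala
val scores = sc.parallelize(Seq(("A",110),("A",130),("A",120),
  ("B",200),("B",206),("B",206),("C",150),("C",160),("C",170)), 3)
// zeroValue (0, 0) = (running sum, running count)
val sumCount = scores.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),   // seqOp: fold one value into the partition-local accumulator
  (a, b)   => (a._1 + b._1, a._2 + b._2)  // combOp: merge accumulators across partitions
)
val averages = sumCount.mapValues { case (sum, count) => sum.toDouble / count }
averages.collect()
// Array((A,120.0), (B,204.0), (C,160.0)) -- order may vary
```

Because the accumulator type (Int, Int) differs from the value type Int, this cannot be expressed with reduceByKey alone.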
Transformation-join
- join
  - def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
  - def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))]
  - Performs an inner join of two pair RDDs on their keys: for each key present in both, outputs every pair of matching values
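A minimal sketch of join on two pair RDDs; the data and key values are illustrative:

```scala
val names  = sc.parallelize(Seq((1, "Alice"), (2, "Bob"), (3, "Carol")))
val scores = sc.parallelize(Seq((1, 90), (2, 85), (4, 70)))
// Inner join on the key: only keys present in both RDDs survive
val joined = names.join(scores)
joined.collect()
// Array((1,(Alice,90)), (2,(Bob,85))) -- order may vary
```

Spark also provides leftOuterJoin, rightOuterJoin, and fullOuterJoin for the outer-join variants.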
Action operators
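A short sketch of the Actions named earlier (reduce, countByKey) plus a few other common ones; each triggers job execution and returns its result to the driver (the data is illustrative):

```scala
val nums = sc.parallelize(1 to 6, 2)
nums.reduce(_ + _)   // aggregates all elements and returns 21 to the driver
nums.count()         // 6
nums.take(3)         // Array(1, 2, 3)
nums.collect()       // Array(1, 2, 3, 4, 5, 6) -- pulls the whole RDD to the driver

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
pairs.countByKey()   // Map(a -> 2, b -> 1)
```

collect should be used with care on large RDDs, since the entire result must fit in the driver's memory.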