Spark -- The Two Kinds of RDD Operators: Transformations and Actions

Transformation

map(func)

Returns a new RDD by applying a function to every element of this RDD.

/**
   * Return a new RDD by applying a function to all elements of this RDD.
   */
  def map[U: ClassTag](f: T => U): RDD[U]

For example, multiply each element of the RDD by 2:

scala> sc.parallelize(1 to 5).map(_*2).collect()
res0: Array[Int] = Array(2, 4, 6, 8, 10)

filter(func)

Applies a function to each element and returns a new dataset containing only the elements for which the function returns true.

/**
   * Return a new RDD containing only the elements that satisfy a predicate.
   */
  def filter(f: T => Boolean): RDD[T]

For example, keep only the elements greater than 2:

scala> sc.parallelize(1 to 5).filter(_>2).collect()
res2: Array[Int] = Array(3, 4, 5)

flatMap(func)

Similar to map, but each input item (element) can be mapped to zero or more output items (so func should return a Seq rather than a single item).

/**
   *  Return a new RDD by first applying a function to all elements of this
   *  RDD, and then flattening the results.
   */
  def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U]

For example, the first case below splits each single String element into an array of strings, while the second case takes elements that are already collections and emits their items directly:

scala> sc.parallelize(Array("redis redis spark","yarn hadoop spark")).flatMap(_.split(" ")).collect()
res17: Array[String] = Array(redis, redis, spark, yarn, hadoop, spark)

scala> sc.parallelize(Array(1 to 5, 5 to 10, 11 to 15)).flatMap(x=>x.map(y=>y)).collect()
res18: Array[Int] = Array(1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)

mapPartitions(func)

Similar to map, but runs separately on each partition of the RDD, so when running on an RDD of type T, func must be of type Iterator[T] => Iterator[U].
In other words, map's input function is applied to each element of the RDD, whereas mapPartitions' input function is applied to each partition, treating the contents of a partition as a whole.

/**
   * Return a new RDD by applying a function to each partition of this RDD.
   *
   * `preservesPartitioning` indicates whether the input function preserves the partitioner, which
   * should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
   */
  def mapPartitions[U: ClassTag](
      f: Iterator[T] => Iterator[U],
      preservesPartitioning: Boolean = false): RDD[U]

For example, the RDD's elements are sequences; compute the sum of all elements of the seq in each partition:

// Three partitions, each holding one Range element
val rdd = sc.parallelize(Array(1 to 5, 5 to 10, 11 to 15), 3)
val mapParRDD = rdd.mapPartitionsWithIndex((index, iter) => {
  var num = 0
  while (iter.hasNext) {
    val seq = iter.next()
    num += seq.sum              // accumulate the sum of this partition's seq
    println(s"$index-----$seq") // printed by the task that processes this partition
  }
  Iterator(num)                 // one output element (the partition's sum) per partition
})
mapParRDD.collect().foreach(println)

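The example above actually uses mapPartitionsWithIndex (covered next). A minimal sketch of mapPartitions itself, summing each partition of a plain Int RDD (names are illustrative):

// One output element (the partition's sum) per partition
val nums = sc.parallelize(1 to 10, 2)
nums.mapPartitions(iter => Iterator(iter.sum)).collect()
// expected: Array(15, 40) when the range splits evenly into 1..5 and 6..10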

mapPartitionsWithIndex(func)

Similar to mapPartitions, but additionally provides an Integer parameter for the partition index, so the function type is (Int, Iterator[T]) => Iterator[U].

/**
   * Return a new RDD by applying a function to each partition of this RDD, while tracking the index
   * of the original partition.
   *
   * `preservesPartitioning` indicates whether the input function preserves the partitioner, which
   * should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
   */
  def mapPartitionsWithIndex[U: ClassTag](
      f: (Int, Iterator[T]) => Iterator[U],
      preservesPartitioning: Boolean = false): RDD[U]

See the example above.

sample(withReplacement, fraction, seed)

Samples a fraction of the data using the given random number generator seed; the sampled elements can be drawn with or without replacement.

/**
   * Return a sampled subset of this RDD.
   *
   * @param withReplacement can elements be sampled multiple times (replaced when sampled out)
   * @param fraction expected size of the sample as a fraction of this RDD's size
   *  without replacement: probability that each element is chosen; fraction must be [0, 1]
   *  with replacement: expected number of times each element is chosen; fraction must be greater
   *  than or equal to 0
   * @param seed seed for the random number generator
   *
   * @note This is NOT guaranteed to provide exactly the fraction of the count
   * of the given [[RDD]].
   */
  def sample(
      withReplacement: Boolean,
      fraction: Double,
      seed: Long = Utils.random.nextLong): RDD[T]

For example:

scala> var sampleRDD = sc.parallelize(1 to 10)
sampleRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[49] at parallelize at <console>:24
scala> sampleRDD.sample(false,0.1).collect
res47: Array[Int] = Array(3, 5)
scala> sampleRDD.sample(true,0.2).collect
res66: Array[Int] = Array(1, 4, 4, 7, 9)

union(otherDataset)

Performs a union of two RDDs and returns the new RDD.

/**
   * Return the union of this RDD and another one. Any identical elements will appear multiple
   * times (use `.distinct()` to eliminate them).
   */
  def union(other: RDD[T]): RDD[T]

For example, merge two RDDs:

scala> var rdd1 = sc.parallelize(1 to 5)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[74] at parallelize at <console>:24
scala> var rdd2 = sc.parallelize(3 to 7)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[75] at parallelize at <console>:24
scala> rdd1.union(rdd2).collect().foreach(println)
1
2
3
4
5
3
4
5
6
7

intersection(otherDataset)

Returns the intersection of two RDDs.

/**
   * Return the intersection of this RDD and another one. The output will not contain any duplicate
   * elements, even if the input RDDs did.
   *
   * @note This method performs a shuffle internally.
   */
  def intersection(other: RDD[T]): RDD[T]

For example:

scala> rdd1.intersection(rdd2).collect().foreach(println)
4
3
5

distinct([numPartitions])

Removes duplicate elements from the RDD.

/**
   * Return a new RDD containing the distinct elements in this RDD.
   */
  def distinct(): RDD[T]

For example:

scala> sc.parallelize(Array(1,1,2,2,2,3,5)).distinct().collect()
res73: Array[Int] = Array(2, 1, 3, 5)

groupByKey([numPartitions])

When called on an RDD whose elements are K-V pairs, groupByKey returns an RDD whose elements are of type (K, Iterable<V>).
Note that if you are grouping in order to perform an aggregation over each key (such as a sum or average), using reduceByKey or aggregateByKey will give much better performance.
Note: by default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numPartitions argument to set a different number of tasks.

/**
   * Group the values for each key in the RDD into a single sequence. Hash-partitions the
   * resulting RDD with the existing partitioner/parallelism level. The ordering of elements
   * within each group is not guaranteed, and may even differ each time the resulting RDD is
   * evaluated.
   *
   * @note This operation may be very expensive. If you are grouping in order to perform an
   * aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
   * or `PairRDDFunctions.reduceByKey` will provide much better performance.
   */
  def groupByKey(): RDD[(K, Iterable[V])]

For example; note that arr, the collected result, is an array of tuples of type (String, Iterable[Int]):

val strArray = Array("redis redis spark yarn yarn","yarn hadoop hadoop")
val arr = sc.parallelize(strArray).flatMap(x=>x.split(" ")).map((_,1)).groupByKey().collect()
arr.foreach(x=>{
  println(s"key=${x._1}, iterable=${x._2}")
})

reduceByKey(func, [numPartitions])

Aggregates all values of each key using a reduce function. This method also merges locally on each mapper before sending results to a reducer, similar to a combiner in MapReduce.

/**
   * Merge the values for each key using an associative and commutative reduce function. This will
   * also perform the merging locally on each mapper before sending results to a reducer, similarly
   * to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/
   * parallelism level.
   */
  def reduceByKey(func: (V, V) => V): RDD[(K, V)]

For example, word count:

scala> val strArray = Array("redis redis spark yarn yarn","yarn hadoop hadoop")
strArray: Array[String] = Array(redis redis spark yarn yarn, yarn hadoop hadoop)

scala> sc.parallelize(strArray).flatMap(x=>x.split(" ")).map((_,1)).reduceByKey(_+_).collect().foreach(println)
(yarn,3)
(spark,1)
(hadoop,2)
(redis,2)

aggregateByKey(zeroValue)(seqOp, combOp, [numPartitions])

Aggregates the values of each key. This method can return a result of a different type.
First parameter zeroValue: the initial value for each key.
Second parameter seqOp: the function applied within each partition to fold the values of each key into the accumulator.
Third parameter combOp: the function that merges, per key, the accumulators produced by seqOp across partitions.

/**
   * Aggregate the values of each key, using given combine functions and a neutral "zero value".
   * This function can return a different result type, U, than the type of the values in this RDD,
   * V. Thus, we need one operation for merging a V into a U and one operation for merging two U's,
   * as in scala.TraversableOnce. The former operation is used for merging values within a
   * partition, and the latter is used for merging values between partitions. To avoid memory
   * allocation, both of these functions are allowed to modify and return their first argument
   * instead of creating a new U.
   */
  def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U,
      combOp: (U, U) => U): RDD[(K, U)]

For example. Note in particular that yarn comes out as 23. Why? Because both partition 0 and partition 1 contain yarn, so the initial value 10 is added in twice.

scala> val strArray = Array("redis redis spark yarn yarn","yarn hadoop hadoop")
strArray: Array[String] = Array(redis redis spark yarn yarn, yarn hadoop hadoop)

scala> var mapRDD = sc.parallelize(strArray).flatMap(x=>x.split(" ")).map((_,1))
mapRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[117] at map at <console>:26

scala> mapRDD.mapPartitionsWithIndex((index,ite)=>{
     |   ite.map(x=>(index,x))
     | }).collect().foreach(println)
(0,(redis,1))
(0,(redis,1))
(0,(spark,1))
(0,(yarn,1))
(0,(yarn,1))
(1,(yarn,1))
(1,(hadoop,1))
(1,(hadoop,1))

scala> mapRDD.aggregateByKey(10)((u,v)=>u+v,_+_).collect().foreach(println)
(yarn,23)
(spark,11)
(hadoop,12)
(redis,12)

Computing an average can be done with groupByKey or with reduceByKey (a variant using aggregateByKey itself is sketched after these two).

// Method 1: group with groupByKey, then compute the average in a map; easy to follow
var numRDD = sc.parallelize(1 to 5).map(x=>("num",x)).groupByKey().map(x=>{
  var ct=0
  var num=0
  x._2.foreach(a=>{
    num=a+num
    ct+=1
  })
  (x._1,num/ct)
})
numRDD.collect()

// Method 2: first map each value to a (value, 1) tuple, then sum both the values and the counts with reduceByKey and divide
sc.parallelize(1 to 5).map(x=>("num",x)).map(x=>(x._1,(x._2,1))).reduceByKey((a,b)=>{
  (a._1+b._1,a._2+b._2)
}).map(x=>(x._1,x._2._1/x._2._2)).collect()
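
Since this section is about aggregateByKey, a third variant (a sketch) computes the same average with aggregateByKey itself, carrying a (sum, count) pair as the accumulator:

// Method 3 (sketch): zeroValue (0, 0) is the per-key (sum, count) accumulator;
// seqOp folds one value into it, combOp merges accumulators across partitions
sc.parallelize(1 to 5).map(x=>("num",x))
  .aggregateByKey((0, 0))(
    (acc, v) => (acc._1 + v, acc._2 + 1),
    (a, b) => (a._1 + b._1, a._2 + b._2))
  .mapValues(x => x._1 / x._2)
  .collect()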

sortByKey([ascending], [numPartitions])

Sorts the RDD by key and returns an RDD in which the data within each partition is ordered; returning everything to the driver program yields a globally ordered result.
You can also pass ascending=false to sort in descending order.

/**
   * Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
   * `collect` or `save` on the resulting RDD will return or output an ordered list of records
   * (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
   * order of the keys).
   */
  // TODO: this currently doesn't work on P other than Tuple2!
  def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
      : RDD[(K, V)]

For example:

scala> val path = "/user/root/input/words.txt"
path: String = /user/root/input/words.txt
scala> val fileRDD = sc.textFile(path)
fileRDD: org.apache.spark.rdd.RDD[String] = /user/root/input/words.txt MapPartitionsRDD[1] at textFile at <console>:26
scala> fileRDD.flatMap(x=>x.split(" ")).map((_,1)).sortByKey(false).collect().foreach(println)
(spark.2,1)
(spark.2,1)
(spark.1,1)
(redis.4,1)
(redis.4,1)
(redis.3,1)
(redis.2,1)
(redis.1,1)
(redis.1,1)
(flume.4,1)
(flume.3,1)
(flume.3,1)

join(otherDataset, [numPartitions])

Joins two RDDs by key, analogous to a join between two database tables.
There are likewise leftOuterJoin, rightOuterJoin and fullOuterJoin.

/**
   * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
   * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
   * (k, v2) is in `other`. Performs a hash join across the cluster.
   */
  def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]

For example:

val path = "/user/root/input/words.txt"
val fileRDD = sc.textFile(path)
val wcRDD1 = fileRDD.flatMap(x=>x.split(" ")).map((_,1))

val strArray = Array("redis.1 redis.2 spark.2 yarn yarn flume.4","yarn hadoop hadoop")
val wcRDD2 = sc.parallelize(strArray).flatMap(x=>x.split(" ")).map((_,1))

wcRDD1.collect().foreach(println)
wcRDD2.collect().foreach(println)
wcRDD1.join(wcRDD2).collect().foreach(println)
wcRDD1.leftOuterJoin(wcRDD2).collect().foreach(println)
wcRDD1.rightOuterJoin(wcRDD2).collect().foreach(println)
wcRDD1.fullOuterJoin(wcRDD2).collect().foreach(println)

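The word-count pairs above produce fairly bulky join output; a smaller self-contained sketch (hypothetical data, collect order may vary) makes the four join flavors easier to trace:

val left  = sc.parallelize(Seq(("a", 1), ("b", 2)))
val right = sc.parallelize(Seq(("b", 20), ("c", 30)))

left.join(right).collect()            // keys present on both sides: (b,(2,20))
left.leftOuterJoin(right).collect()   // (a,(1,None)), (b,(2,Some(20)))
left.rightOuterJoin(right).collect()  // (b,(Some(2),20)), (c,(None,30))
left.fullOuterJoin(right).collect()   // (a,(Some(1),None)), (b,(Some(2),Some(20))), (c,(None,Some(30)))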

cogroup(otherDataset, [numPartitions])

If the input RDDs have types (K, V) and (K, W), the returned RDD has type (K, (Iterable[V], Iterable[W])). This operation is also known as groupWith.

/**
   * For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the
   * list of values for that key in `this` as well as `other`.
   */
  def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))]

For example, building on the join example code above:

wcRDD1.cogroup(wcRDD2).collect().foreach(println)

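Reusing the small left and right RDDs from the join sketch above, cogroup keeps every key that appears on either side and gathers all of its values:

// Roughly (collect order may vary):
// (a,(CompactBuffer(1),CompactBuffer()))
// (b,(CompactBuffer(2),CompactBuffer(20)))
// (c,(CompactBuffer(),CompactBuffer(30)))
left.cogroup(right).collect().foreach(println)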

cartesian(otherDataset)

Computes the Cartesian product of two RDDs: RDD[T] cartesian RDD[U] returns RDD[(T, U)].

/**
   * Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of
   * elements (a, b) where a is in `this` and b is in `other`.
   */
  def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)]

For example:

val numRDD1 = sc.parallelize(1 to 2)
val numRDD2 = sc.parallelize(3 to 5)
numRDD1.cartesian(numRDD2).collect().foreach(println)

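With numRDD1 holding 1, 2 and numRDD2 holding 3, 4, 5, the result contains the 2 × 3 = 6 pairs (1,3), (1,4), (1,5), (2,3), (2,4), (2,5), although the order in which collect returns them may vary.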

pipe(command, [envVars])

Calls an external program for each partition of the RDD. With pipe(), each element of the RDD is read from standard input as a string, any processing you need can be applied, and the result is written back to standard output as strings. This lets you cooperate with shell, Python or other languages to carry out a computation.

/**
   * Return an RDD created by piping elements to a forked external process.
   */
  def pipe(command: String, env: Map[String, String]): RDD[String]

For example, following the classic Hadoop Streaming example, we call a Python reduce.py script to perform the reduce step. Because the RDD's partitions are distributed across Executor processes on different machines, the script file must exist on every machine. We also pass in environment variables.

#!/usr/bin/env python2
# (the print statements below are Python 2 syntax; pipe() executes this file
#  directly, so it needs a shebang line and the executable bit)
import sys
import os

# A dict of <word, count> is not needed at all here; it would just waste memory
word2count = {}
# Temporary variables: word holds the current key, count accumulates its total
word = ""
count = 0
started = 0
# Read data line by line from standard input
for line in sys.stdin:
    # Strip leading and trailing whitespace
    line = line.strip()
    # Split into the word and its count
    newword, newcount = line.split()
    if word != newword:
        if started == 1:
            # Emit the previous word's total
            print "{}\t{}".format(word, count)
        word = newword
        count = int(newcount)
        started = 1
    else:
        count = count + int(newcount)
    # word2count[word] = word2count.get(word, 0) + int(count)

print "{}\t{}".format(word, count)
print "{}\t{}".format(os.getenv("red"), 1)
print "{}\t{}".format(os.getenv("azure"), 1)
# Do not output this way; it would lose the ordering of the map output
# for word, count in word2count.items():
#     print "{}\t{}".format(word, count)

val strArray = Array("redis redis spark yarn yarn","yarn hadoop hadoop")
val rdd = sc.parallelize(strArray).flatMap(x=>x.split(" ")).map(x=>x+"\t"+1)
val colors = Map("red" -> "#FF0000", "azure" -> "#F0FFFF")
rdd.pipe("/tmp/reduce.py",colors).collect().foreach(println)

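A simpler way to see pipe in action, without deploying a script to every executor, is to pipe through a standard shell command (a sketch; it assumes cat is available on the executors):

// Each element is written to cat's stdin, one per line; cat's stdout
// lines become the elements of the resulting RDD[String]
val words = sc.parallelize(Seq("redis", "spark", "yarn"), 2)
words.pipe("cat").collect()   // expected: Array(redis, spark, yarn)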

coalesce(numPartitions)

Reduces the number of partitions of the RDD to the given numPartitions. Running coalesce after filtering out a large amount of data with filter usually makes subsequent processing more efficient.
This results in a narrow dependency: for example, going from 1000 partitions to 100 partitions involves no shuffle; instead each of the 100 new partitions claims 10 of the current partitions. If a larger number of partitions is requested, the current number of partitions is kept.
However, if you do a drastic coalesce, e.g. to numPartitions = 1, the computation may end up running on fewer nodes than you would like. To avoid this you can pass shuffle=true; this adds a shuffle step, but it means the current upstream partitions are executed in parallel.
Note: with shuffle=true you really can coalesce to a larger number of partitions. This is useful when you have only a few partitions but some of them may be abnormally large. Calling coalesce(1000, shuffle = true) will redistribute the data across 1000 partitions using a hash partitioner. The optional partition coalescer, if passed, must be serializable.

/**
   * Return a new RDD that is reduced into `numPartitions` partitions.
   *
   * This results in a narrow dependency, e.g. if you go from 1000 partitions
   * to 100 partitions, there will not be a shuffle, instead each of the 100
   * new partitions will claim 10 of the current partitions. If a larger number
   * of partitions is requested, it will stay at the current number of partitions.
   *
   * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
   * this may result in your computation taking place on fewer nodes than
   * you like (e.g. one node in the case of numPartitions = 1). To avoid this,
   * you can pass shuffle = true. This will add a shuffle step, but means the
   * current upstream partitions will be executed in parallel (per whatever
   * the current partitioning is).
   *
   * @note With shuffle = true, you can actually coalesce to a larger number
   * of partitions. This is useful if you have a small number of partitions,
   * say 100, potentially with a few partitions being abnormally large. Calling
   * coalesce(1000, shuffle = true) will result in 1000 partitions with the
   * data distributed using a hash partitioner. The optional partition coalescer
   * passed in must be serializable.
   */
  def coalesce(numPartitions: Int, shuffle: Boolean = false,
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
              (implicit ord: Ordering[T] = null)
      : RDD[T]

For example:

scala> rdd.partitions.length
res15: Int = 2

scala> rdd.coalesce(1).partitions.length
res16: Int = 1

scala> rdd.coalesce(3).partitions.length
res17: Int = 2

scala> rdd.coalesce(3,true).partitions.length
res18: Int = 3

repartition(numPartitions)

Randomly reshuffles the data in the RDD to create either more or fewer partitions and balances the data across them. This always shuffles all data over the network, i.e. it always incurs a shuffle. If you only need to decrease the number of partitions, consider coalesce above instead.

/**
   * Return a new RDD that has exactly numPartitions partitions.
   *
   * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
   * a shuffle to redistribute data.
   *
   * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
   * which can avoid performing a shuffle.
   *
   * TODO Fix the Shuffle+Repartition data loss issue described in SPARK-23207.
   */
  def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]

For example:

scala> rdd.repartition(1).partitions.length
res20: Int = 1

repartitionAndSortWithinPartitions(partitioner)

Repartitions the RDD according to the given partitioner and, within each resulting partition, sorts the records by key, i.e. the data is ordered inside each partition.
If you want to repartition and then sort within each partition, calling repartitionAndSortWithinPartitions(partitioner) is more efficient than doing the two steps separately, because the sorting is pushed down into the shuffle machinery.

/**
   * Repartition the RDD according to the given partitioner and, within each resulting partition,
   * sort records by their keys.
   *
   * This is more efficient than calling `repartition` and then sorting within each partition
   * because it can push the sorting down into the shuffle machinery.
   */
  def repartitionAndSortWithinPartitions(partitioner: Partitioner): RDD[(K, V)]

For example:

val strArray = Array("redis redis spark yarn yarn","yarn hadoop hadoop")
val rdd = sc.parallelize(strArray).flatMap(x=>x.split(" ")).map((_,1))
val newRDD = rdd.repartitionAndSortWithinPartitions(new org.apache.spark.HashPartitioner(3))

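One way to inspect the result (a sketch) is to tag each element with its partition index; within each partition the keys should come out sorted:

newRDD.mapPartitionsWithIndex((index, iter) => iter.map(x => (index, x)))
  .collect()
  .foreach(println)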

Action

reduce(func)

Aggregates the elements of the RDD using a function func, which takes two arguments and returns one. The function should be commutative and associative so that it can be computed correctly in parallel.

/**
   * Reduces the elements of this RDD using the specified commutative and
   * associative binary operator.
   */
  def reduce(f: (T, T) => T): T

For example, compute a sum:

scala> sc.parallelize(1 to 10).reduce(_+_)
res34: Int = 55

collect()

Returns all elements of the RDD to the driver program as an array. This is best used only after filter has removed a large amount of data and the returned dataset is known to be sufficiently small, because the whole result is loaded into the memory of the driver program's node.

/**
   * Return an array that contains all of the elements in this RDD.
   *
   * @note This method should only be used if the resulting array is expected to be small, as
   * all the data is loaded into the driver's memory.
   */
  def collect(): Array[T]

For example:

scala> sc.parallelize(1 to 10).collect()
res35: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

count()

Returns the number of elements in the RDD.

/**
   * Return the number of elements in the RDD.
   */
  def count(): Long

For example, count how many lines a file has:

scala> sc.textFile("/user/root/input/words.txt").count()
res36: Long = 3

scala> sc.textFile("/user/root/input/words.txt").collect().foreach(println)
redis.1 redis.1 redis.2 redis.3 redis.4 redis.4
spark.1 spark.2 spark.2
flume.3 flume.3 flume.4

first()

Returns the first element of the RDD, equivalent to take(1).

/**
   * Return the first element in this RDD.
   */
  def first(): T

take(n)

Returns the first n elements of the RDD, scanning one partition at a time until enough elements have been gathered.

/**
   * Take the first num elements of the RDD. It works by first scanning one partition, and use the
   * results from that partition to estimate the number of additional partitions needed to satisfy
   * the limit.
   *
   * @note This method should only be used if the resulting array is expected to be small, as
   * all the data is loaded into the driver's memory.
   *
   * @note Due to complications in the internal implementation, this method will raise
   * an exception if called on an RDD of `Nothing` or `Null`.
   */
  def take(num: Int): Array[T]

For example:

scala> val numRDD = sc.parallelize(1 to 10,3)
numRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[51] at parallelize at <console>:24
scala> numRDD.take(4)
res39: Array[Int] = Array(1, 2, 3, 4)

scala> numRDD.take(1)
res40: Array[Int] = Array(1)

scala> numRDD.first()
res41: Int = 1

takeSample(withReplacement, num, [seed])

Returns an array containing a random sample of num elements of the dataset, with or without replacement, with an optional pre-specified random number generator seed.

/**
   * Return a fixed-size sampled subset of this RDD in an array
   *
   * @param withReplacement whether sampling is done with replacement
   * @param num size of the returned sample
   * @param seed seed for the random number generator
   * @return sample of specified size in an array
   *
   * @note this method should only be used if the resulting array is expected to be small, as
   * all the data is loaded into the driver's memory.
   */
  def takeSample(
      withReplacement: Boolean,
      num: Int,
      seed: Long = Utils.random.nextLong): Array[T]

For example:

scala> numRDD.takeSample(true,6)
res42: Array[Int] = Array(2, 9, 9, 10, 2, 7)

scala> numRDD.takeSample(false,6)
res43: Array[Int] = Array(5, 9, 6, 10, 3, 8)

takeOrdered(n, [ordering])

Returns the first n elements using either the natural ordering or a custom comparator.

/**
   * Returns the first k (smallest) elements from this RDD as defined by the specified
   * implicit Ordering[T] and maintains the ordering. This does the opposite of [[top]].
   * For example:
   * {{{
   *   sc.parallelize(Seq(10, 4, 2, 12, 3)).takeOrdered(1)
   *   // returns Array(2)
   *
   *   sc.parallelize(Seq(2, 3, 4, 5, 6)).takeOrdered(2)
   *   // returns Array(2, 3)
   * }}}
   *
   * @note This method should only be used if the resulting array is expected to be small, as
   * all the data is loaded into the driver's memory.
   *
   * @param num k, the number of elements to return
   * @param ord the implicit ordering for T
   * @return an array of top elements
   */
  def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T]

For example:

scala> numRDD.takeOrdered(3)
res45: Array[Int] = Array(1, 2, 3)
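
The optional ordering parameter is an implicit Ordering[T]; passing one explicitly is a quick way to change the sort direction (a sketch, equivalent to top):

// Explicit reverse ordering: the three largest elements, largest first
numRDD.takeOrdered(3)(Ordering[Int].reverse)   // expected: Array(10, 9, 8)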

top(n, [ordering])

Similar to takeOrdered, except that it is implemented directly on top of takeOrdered as takeOrdered(num)(ord.reverse).

/**
   * Returns the top k (largest) elements from this RDD as defined by the specified
   * implicit Ordering[T] and maintains the ordering. This does the opposite of
   * [[takeOrdered]]. For example:
   * {{{
   *   sc.parallelize(Seq(10, 4, 2, 12, 3)).top(1)
   *   // returns Array(12)
   *
   *   sc.parallelize(Seq(2, 3, 4, 5, 6)).top(2)
   *   // returns Array(6, 5)
   * }}}
   *
   * @note This method should only be used if the resulting array is expected to be small, as
   * all the data is loaded into the driver's memory.
   *
   * @param num k, the number of top elements to return
   * @param ord the implicit ordering for T
   * @return an array of top elements
   */
  def top(num: Int)(implicit ord: Ordering[T]): Array[T]

For example:

scala> numRDD.top(2)
res46: Array[Int] = Array(10, 9)

saveAsTextFile(path)

Writes the elements of the RDD as a text file (or set of text files) to a given directory on the local filesystem, HDFS or any other Hadoop-supported filesystem. Spark calls toString on each element to convert it into a line of text in the file.

/**
   * Save this RDD as a text file, using string representations of elements.
   */
  def saveAsTextFile(path: String): Unit

For example:

val strArray = Array("redis redis spark yarn yarn","yarn hadoop hadoop")
val rdd = sc.parallelize(strArray).flatMap(x=>x.split(" ")).map((_,1)).reduceByKey(_+_)
rdd.saveAsTextFile("/tmp/rdd.5")

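To check the result without a screenshot, the output directory can simply be read back (a sketch; the path matches the example above):

// Each pair was written with toString, e.g. "(yarn,3)", one line per element
sc.textFile("/tmp/rdd.5").collect().foreach(println)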

saveAsSequenceFile(path)

Only key-value RDDs can call this. Writes the RDD out as a Hadoop SequenceFile.

/**
   * Output the RDD as a Hadoop SequenceFile using the Writable types we infer from the RDD's key
   * and value types. If the key or value are Writable, then we use their classes directly;
   * otherwise we map primitive types such as Int and Double to IntWritable, DoubleWritable, etc,
   * byte arrays to BytesWritable, and Strings to Text. The `path` can be on any Hadoop-supported
   * file system.
   */
  def saveAsSequenceFile(
      path: String,
      codec: Option[Class[_ <: CompressionCodec]] = None): Unit

For example, save the RDD's data as a Hadoop SequenceFile with saveAsSequenceFile, then read the SequenceFile back with sequenceFile:

rdd.saveAsSequenceFile("/tmp/rdd.6")
scala> sc.sequenceFile[String,Int]("/tmp/rdd.6")
res59: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[77] at sequenceFile at <console>:26
scala> res59.collect()
res60: Array[(String, Int)] = Array((yarn,3), (spark,1), (hadoop,2), (redis,2))


saveAsObjectFile(path)

Writes out the RDD's data as serialized Java objects; it can be loaded back with SparkContext.objectFile().

/**
   * Save this RDD as a SequenceFile of serialized objects.
   */
  def saveAsObjectFile(path: String): Unit

For example, write the RDD's data out as serialized objects, then load it back with SparkContext.objectFile():

scala> import scala.Tuple2
scala> rdd.saveAsObjectFile("/tmp/rdd.7")
scala> sc.objectFile[Tuple2[String,Int]]("/tmp/rdd.7").collect()
res66: Array[(String, Int)] = Array((yarn,3), (spark,1), (hadoop,2), (redis,2))

countByKey()

Only available on RDDs of K-V pairs. Counts the elements for each key and returns the collected result to the driver program as a Map.
If the result would be very large, prefer rdd.mapValues(_ => 1L).reduceByKey(_ + _), which returns an RDD instead; that is exactly how countByKey is implemented underneath.

/**
   * Count the number of elements for each key, collecting the results to a local Map.
   *
   * @note This method should only be used if the resulting map is expected to be small, as
   * the whole thing is loaded into the driver's memory.
   * To handle very large results, consider using rdd.mapValues(_ => 1L).reduceByKey(_ + _), which
   * returns an RDD[T, Long] instead of a map.
   */
  def countByKey(): Map[K, Long] = self.withScope {
    self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap
  }

For example:

scala> val strArray = Array("redis redis spark yarn yarn","yarn hadoop hadoop")
strArray: Array[String] = Array(redis redis spark yarn yarn, yarn hadoop hadoop)

scala> val rdd = sc.parallelize(strArray).flatMap(x=>x.split(" ")).map((_,1))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[88] at map at <console>:27

scala> rdd.countByKey()
res68: scala.collection.Map[String,Long] = Map(yarn -> 3, spark -> 1, hadoop -> 2, redis -> 2)

foreach(func)

Runs the function func on each element of the RDD. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.
Note: modifying variables other than Accumulators outside of foreach() may lead to undefined behavior. See Understanding closures for more details.
foreach and foreachPartition are implemented almost identically.

/**
   * Applies a function f to all elements of this RDD.
   */
  def foreach(f: T => Unit): Unit = withScope {
    val cleanF = sc.clean(f)
    sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
  }

For example. Note that nothing is printed below; it is not that the RDD has no data, but that the output is printed on the executor nodes:

scala> rdd.foreach(println)

scala>

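Since the description above mentions Accumulators, a minimal sketch of the usual pattern: perform the side effect through an accumulator, which is updated on the executors and read back on the driver:

val acc = sc.longAccumulator("elements seen")
rdd.foreach(_ => acc.add(1L))
println(acc.value)   // total number of elements, visible on the driver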

Asynchronous versions of the Action operators

The Spark RDD API also exposes asynchronous versions of some actions, such as foreachAsync for foreach, which immediately returns a FutureAction to the caller instead of blocking until the action completes. This can be used to manage or wait for the asynchronous execution of the action.
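
A minimal sketch of the pattern (a FutureAction extends scala.concurrent.Future, so the standard Future utilities apply):

import scala.concurrent.Await
import scala.concurrent.duration._

val bigRDD = sc.parallelize(1 to 1000000, 8)
val futureCount = bigRDD.countAsync()   // returns a FutureAction[Long] immediately; the job runs in the background
// ... the driver can do other work here ...
val n = Await.result(futureCount, 10.minutes)
println(s"count = $n")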
