Table of Contents
- Transformation
- map(func)
- filter(func)
- flatMap(func)
- mapPartitions(func)
- mapPartitionsWithIndex(func)
- sample(withReplacement, fraction, seed)
- union(otherDataset)
- intersection(otherDataset)
- distinct([numPartitions])
- groupByKey([numPartitions])
- reduceByKey(func, [numPartitions])
- aggregateByKey(zeroValue)(seqOp, combOp, [numPartitions])
- sortByKey([ascending], [numPartitions])
- join(otherDataset, [numPartitions])
- cogroup(otherDataset, [numPartitions])
- cartesian(otherDataset)
- pipe(command, [envVars])
- coalesce(numPartitions)
- repartition(numPartitions)
- repartitionAndSortWithinPartitions(partitioner)
- Action
Transformation
map(func)
Applies a function to every element of the RDD and returns a new RDD.
/**
* Return a new RDD by applying a function to all elements of this RDD.
*/
def map[U: ClassTag](f: T => U): RDD[U]
For example, multiply each element of the RDD by 2:
scala> sc.parallelize(1 to 5).map(_*2).collect()
res0: Array[Int] = Array(2, 4, 6, 8, 10)
filter(func)
Applies a function to each element and returns a new dataset containing only the elements for which the function returns true.
/**
* Return a new RDD containing only the elements that satisfy a predicate.
*/
def filter(f: T => Boolean): RDD[T]
For example, keep only the elements greater than 2:
scala> sc.parallelize(1 to 5).filter(_>2).collect()
res2: Array[Int] = Array(3, 4, 5)
flatMap(func)
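Similar to map, but each input item can be mapped to zero or more output items, so func should return a Seq (any TraversableOnce, per the signature below) rather than a single item.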
/**
* Return a new RDD by first applying a function to all elements of this
* RDD, and then flattening the results.
*/
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U]
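For example, the first call below splits each String element into an array of words and flattens the result; the second call flattens elements that are already sequences.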
scala> sc.parallelize(Array("redis redis spark","yarn hadoop spark")).flatMap(_.split(" ")).collect()
res17: Array[String] = Array(redis, redis, spark, yarn, hadoop, spark)
scala> sc.parallelize(Array(1 to 5, 5 to 10, 11 to 15)).flatMap(x=>x.map(y=>y)).collect()
res18: Array[Int] = Array(1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
mapPartitions(func)
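Similar to map, but runs separately on each partition of the RDD, so when running on an RDD of type T, func must be of type Iterator[T] => Iterator[U].
In other words, the function passed to map is applied to each element of the RDD, whereas the function passed to mapPartitions is applied to each partition and processes the contents of that partition as a whole.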
/**
* Return a new RDD by applying a function to each partition of this RDD.
*
* `preservesPartitioning` indicates whether the input function preserves the partitioner, which
* should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
*/
def mapPartitions[U: ClassTag](
f: Iterator[T] => Iterator[U],
preservesPartitioning: Boolean = false): RDD[U]
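For example, given an RDD whose elements are sequences, compute the sum of all numbers in the sequences of each partition. The code uses mapPartitionsWithIndex, described in the next section, so that it can also print the partition index: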
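val rdd = sc.parallelize(Array(1 to 5, 5 to 10, 11 to 15), 3)
val mapParRDD = rdd.mapPartitionsWithIndex((index, iter) => {
  var num = 0
  while (iter.hasNext) {
    val seq = iter.next()
    // add every number of this sequence to the running total of the partition
    seq.foreach(x => num += x)
    // printed wherever the task runs (the executor's stdout on a cluster)
    println(s"$index-----$seq")
  }
  // one result per partition: the sum of its elements
  Iterator(num)
})
mapParRDD.collect().foreach(println)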
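The difference between map and mapPartitions can also be modelled without Spark. The following is only a sketch in plain Scala collections: a Seq[Seq[T]] stands in for an RDD with one inner Seq per partition, and MapPartitionsModel, mapElements and mapPartitions are hypothetical helper names, not Spark APIs.

```scala
object MapPartitionsModel {
  // map-like: f sees one element at a time.
  def mapElements[T, U](parts: Seq[Seq[T]])(f: T => U): Seq[Seq[U]] =
    parts.map(_.map(f))

  // mapPartitions-like: f sees a whole partition as an Iterator.
  def mapPartitions[T, U](parts: Seq[Seq[T]])(f: Iterator[T] => Iterator[U]): Seq[Seq[U]] =
    parts.map(p => f(p.iterator).toList)

  def main(args: Array[String]): Unit = {
    // Assumed layout: three partitions of different sizes.
    val parts = Seq(Seq(1, 2, 3), Seq(4, 5), Seq(6))
    // One output element per input element.
    println(mapElements(parts)(_ * 2))
    // One output element per partition: the sum of that partition.
    println(mapPartitions(parts)(it => Iterator(it.sum)))
  }
}
```

Because the function receives the whole iterator, per-partition setup work (for instance opening a connection) can be done once per partition instead of once per element.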
mapPartitionsWithIndex(func)
Similar to mapPartitions, but func also receives an integer parameter holding the partition index, so its type is (Int, Iterator[T]) => Iterator[U].
/**
* Return a new RDD by applying a function to each partition of this RDD, while tracking the index
* of the original partition.
*
* `preservesPartitioning` indicates whether the input function preserves the partitioner, which
* should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
*/
def mapPartitionsWithIndex[U: ClassTag](
f: (Int, Iterator[T]) => Iterator[U],
preservesPartitioning: Boolean = false): RDD[U]
See the example in the mapPartitions section above.
sample(withReplacement, fraction, seed)
Samples a fraction of the data, with or without replacement, using the given random number generator seed.
/**
* Return a sampled subset of this RDD.
*
* @param withReplacement can elements be sampled multiple times (replaced when sampled out)
* @param fraction expected size of the sample as a fraction of this RDD's size
* without replacement: probability that each element is chosen; fraction must be [0, 1]
* with replacement: expected number of times each element is chosen; fraction must be greater
* than or equal to 0
* @param seed seed for the random number generator
*
* @note This is NOT guaranteed to provide exactly the fraction of the count
* of the given [[RDD]].
*/
def sample(
withReplacement: Boolean,
fraction: Double,
seed: Long = Utils.random.nextLong): RDD[T]
For example:
scala> var sampleRDD = sc.parallelize(1 to 10)
sampleRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[49] at parallelize at <console>:24
scala> sampleRDD.sample(false,0.1).collect
res47: Array[Int] = Array(3, 5)
scala> sampleRDD.sample(true,0.2).collect
res66: Array[Int] = Array(1, 4, 4, 7, 9)
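The note that sample does not return exactly fraction * count elements follows from how sampling without replacement works: each element is kept independently with probability fraction. The sketch below models that idea in plain Scala, without Spark; SampleModel and sampleWithoutReplacement are hypothetical names, and the random number generator is not the one Spark uses, so the concrete samples will differ from Spark's.

```scala
import scala.util.Random

object SampleModel {
  // Bernoulli sampling: keep each element independently with probability `fraction`.
  def sampleWithoutReplacement[T](data: Seq[T], fraction: Double, seed: Long): Seq[T] = {
    require(fraction >= 0.0 && fraction <= 1.0, "fraction must be in [0, 1]")
    val rng = new Random(seed)
    data.filter(_ => rng.nextDouble() < fraction)
  }

  def main(args: Array[String]): Unit = {
    val data = 1 to 10
    // The sample size varies with the seed; only its expected value is fraction * size.
    for (seed <- 1L to 3L)
      println(s"seed=$seed -> ${sampleWithoutReplacement(data, 0.3, seed)}")
  }
}
```

Fixing the seed makes a sample reproducible, which is why the seed is exposed as a parameter.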
union(otherDataset)
Performs a union of two RDDs and returns a new RDD.
/**
* Return the union of this RDD and another one. Any identical elements will appear multiple
* times (use `.distinct()` to eliminate them).
*/
def union(other: RDD[T]): RDD[T]
For example, merge two RDDs:
scala> var rdd1 = sc.parallelize(1 to 5)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[74] at parallelize at <console>:24
scala> var rdd2 = sc.parallelize(3 to 7)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[75] at parallelize at <console>:24
scala> rdd1.union(rdd2).collect().foreach(println)
1
2
3
4
5
3
4
5
6
7
intersection(otherDataset)
Computes the intersection of two RDDs.
/**
* Return the intersection of this RDD and another one. The output will not contain any duplicate
* elements, even if the input RDDs did.
*
* @note This method performs a shuffle internally.
*/
def intersection(other: RDD[T]): RDD[T]
For example:
scala> rdd1.intersection(rdd2).collect().foreach(println)
4
3
5
distinct([numPartitions])
Removes duplicate elements from the RDD.
/**
* Return a new RDD containing the distinct elements in this RDD.
*/
def distinct(): RDD[T]
For example:
scala> sc.parallelize(Array(1,1,2,2,2,3,5)).distinct().collect()
res73: Array[Int] = Array(2, 1, 3, 5)
groupByKey([numPartitions])
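When called on an RDD of key-value pairs (K, V), returns an RDD of (K, Iterable<V>) pairs.
Note: if you are grouping in order to perform an aggregation (such as a sum or average) over each key, reduceByKey or aggregateByKey will give much better performance.
Note: by default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numPartitions argument to set a different number of tasks.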
/**
* Group the values for each key in the RDD into a single sequence. Hash-partitions the
* resulting RDD with the existing partitioner/parallelism level. The ordering of elements
* within each group is not guaranteed, and may even differ each time the resulting RDD is
* evaluated.
*
* @note This operation may be very expensive. If you are grouping in order to perform an
* aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
* or `PairRDDFunctions.reduceByKey` will provide much better performance.
*/
def groupByKey(): RDD[(K, Iterable[V])]
For example. Note that arr is a local Array (the result of collect()), and its elements are tuples of type (String, Iterable[Int]):
val strArray = Array("redis redis spark yarn yarn","yarn hadoop hadoop")
val arr = sc.parallelize(strArray).flatMap(x=>x.split(" ")).map((_,1)).groupByKey().collect()
arr.foreach[Unit](x=>{
println(s"key=${x._1}, iterable=${x._2}")
})
reduceByKey(func, [numPartitions])
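Aggregates all the values of each key using a reduce function. Values are merged locally on each mapper before the results are sent to a reducer, similar to a combiner in MapReduce.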
/**
* Merge the values for each key using an associative and commutative reduce function. This will
* also perform the merging locally on each mapper before sending results to a reducer, similarly
* to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/
* parallelism level.
*/
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
For example, a word count:
scala> val strArray = Array("redis redis spark yarn yarn","yarn hadoop hadoop")
strArray: Array[String] = Array(redis redis spark yarn yarn, yarn hadoop hadoop)
scala> sc.parallelize(strArray).flatMap(x=>x.split(" ")).map((_,1)).reduceByKey(_+_).collect().foreach(println)
(yarn,3)
(spark,1)
(hadoop,2)
(redis,2)
aggregateByKey(zeroValue)(seqOp, combOp, [numPartitions])
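Aggregates the values of each key. The result type can differ from the value type of the RDD.
First parameter, zeroValue: the initial value of the accumulator for each key; as the example below shows, it is applied once per key in every partition.
Second parameter, seqOp: merges the values of each key into the accumulator within a partition.
Third parameter, combOp: merges, per key, the accumulators that seqOp produced in different partitions.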
/**
* Aggregate the values of each key, using given combine functions and a neutral "zero value".
* This function can return a different result type, U, than the type of the values in this RDD,
* V. Thus, we need one operation for merging a V into a U and one operation for merging two U's,
* as in scala.TraversableOnce. The former operation is used for merging values within a
* partition, and the latter is used for merging values between partitions. To avoid memory
* allocation, both of these functions are allowed to modify and return their first argument
* instead of creating a new U.
*/
def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U,
combOp: (U, U) => U): RDD[(K, U)]
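For example. Note that yarn comes out as 23 below. Why? Because yarn appears in both partition 0 and partition 1, so the initial value 10 is added twice, once per partition: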
scala> val strArray = Array("redis redis spark yarn yarn","yarn hadoop hadoop")
strArray: Array[String] = Array(redis redis spark yarn yarn, yarn hadoop hadoop)
scala> var mapRDD = sc.parallelize(strArray).flatMap(x=>x.split(" ")).map((_,1))
mapRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[117] at map at <console>:26
scala> mapRDD.mapPartitionsWithIndex((index,ite)=>{
| ite.map(x=>(index,x))
| }).collect().foreach(println)
(0,(redis,1))
(0,(redis,1))
(0,(spark,1))
(0,(yarn,1))
(0,(yarn,1))
(1,(yarn,1))
(1,(hadoop,1))
(1,(hadoop,1))
scala> mapRDD.aggregateByKey(10)((u,v)=>u+v,_+_).collect().foreach(println)
(yarn,23)
(spark,11)
(hadoop,12)
(redis,12)
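The result yarn = 23 can be reproduced by hand with a small model of the two phases. This is a sketch in plain Scala collections, not the Spark implementation: AggregateByKeyModel and its members are hypothetical names, and the two partitions p0 and p1 are written out to match the partition layout printed above.

```scala
object AggregateByKeyModel {
  // The partition layout printed by mapPartitionsWithIndex in the example above.
  val p0: Seq[(String, Int)] = "redis redis spark yarn yarn".split(" ").toSeq.map((_, 1))
  val p1: Seq[(String, Int)] = "yarn hadoop hadoop".split(" ").toSeq.map((_, 1))

  def aggregateByKey[K, V, U](partitions: Seq[Seq[(K, V)]], zeroValue: U)(
      seqOp: (U, V) => U, combOp: (U, U) => U): Map[K, U] = {
    // Phase 1: inside each partition, fold the values of each key starting from zeroValue.
    val perPartition: Seq[Map[K, U]] = partitions.map { part =>
      part.groupBy(_._1).map { case (k, kvs) =>
        (k, kvs.map(_._2).foldLeft(zeroValue)(seqOp))
      }
    }
    // Phase 2: merge the per-partition accumulators of each key with combOp.
    perPartition.flatten.groupBy(_._1).map { case (k, kus) =>
      (k, kus.map(_._2).reduce(combOp))
    }
  }

  def main(args: Array[String]): Unit = {
    // yarn: (10 + 1 + 1) in partition 0 plus (10 + 1) in partition 1 = 23
    aggregateByKey(Seq(p0, p1), 10)(_ + _, _ + _).toSeq.sortBy(_._1).foreach(println)
  }
}
```

With all records in a single partition the zero value would be applied only once per key, so yarn would be 13. A zero value that is not neutral for combOp therefore makes the result depend on the partitioning.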
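To compute an average, you can use either groupByKey or reduceByKey. Both versions below use integer division, which is exact for this data (15 / 5 = 3).
// Method 1: group with groupByKey, then compute the average in map; easy to follow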
var numRDD = sc.parallelize(1 to 5).map(x=>("num",x)).groupByKey().map(x=>{
var ct=0
var num=0
x._2.foreach(a=>{
num=a+num
ct+=1
})
(x._1,num/ct)
})
numRDD.collect()
// Method 2: use map to record each value together with a count of 1 as a tuple, then sum both with reduceByKey and divide
sc.parallelize(1 to 5).map(x=>("num",x)).map(x=>(x._1,(x._2,1))).reduceByKey((a,b)=>{
(a._1+b._1,a._2+b._2)
}).map(x=>(x._1,x._2._1/x._2._2)).collect()
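When the average is not a whole number, dividing two Int values truncates the result. The sketch below shows the same (sum, count) technique with a Double division, modelled on plain Scala collections rather than on an RDD; AverageByKeyModel and averageByKey are hypothetical names.

```scala
object AverageByKeyModel {
  // Merge (sum, count) pairs per key, then divide as Double to avoid integer truncation.
  def averageByKey[K](pairs: Seq[(K, Int)]): Map[K, Double] =
    pairs
      .map { case (k, v) => (k, (v, 1)) } // value -> (sum, count)
      .groupBy(_._1)
      .map { case (k, kvs) =>
        val (sum, count) = kvs.map(_._2).reduce((a, b) => (a._1 + b._1, a._2 + b._2))
        (k, sum.toDouble / count)
      }

  def main(args: Array[String]): Unit = {
    // 1 + 2 + 3 + 4 = 10 over 4 values: Int division would give 2, Double division gives 2.5
    println(averageByKey((1 to 4).map(x => ("num", x))))
  }
}
```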
sortByKey([ascending], [numPartitions])
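Sorts the RDD by key and returns an RDD in which each partition holds a sorted range of the elements. Collecting the result to the driver program yields a globally ordered list.
Pass ascending=false to sort in descending order.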
/**
* Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
* `collect` or `save` on the resulting RDD will return or output an ordered list of records
* (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
* order of the keys).
*/
// TODO: this currently doesn't work on P other than Tuple2!
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
: RDD[(K, V)]
For example:
scala> val path = "/user/root/input/words.txt"
path: String = /user/root/input/words.txt
scala> val fileRDD = sc.textFile(path)
fileRDD: org.apache.spark.rdd.RDD[String] = /user/root/input/words.txt MapPartitionsRDD[1] at textFile at <console>:26
scala> fileRDD.flatMap(x=>x.split(" ")).map((_,1)).sortByKey(false).collect().foreach(println)
(spark.2,1)
(spark.2,1)
(spark.1,1)
(redis.4,1)
(redis.4,1)
(redis.3,1)
(redis.2,1)
(redis.1,1)
(redis.1,1)
(flume.4,1)
(flume.3,1)
(flume.3,1)
join(otherDataset, [numPartitions])
Joins two RDDs by key, comparable to a join between two database tables.
leftOuterJoin, rightOuterJoin and fullOuterJoin are available as well.
/**
* Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
* pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
* (k, v2) is in `other`. Performs a hash join across the cluster.
*/
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
For example:
val path = "/user/root/input/words.txt"
val fileRDD = sc.textFile(path)
val wcRDD1 = fileRDD.flatMap(x=>x.split(" ")).map((_,1))
val strArray = Array("redis.1 redis.2 spark.2 yarn yarn flume.4","yarn hadoop hadoop")
val wcRDD2 = sc.parallelize(strArray).flatMap(x=>x.split(" ")).map((_,1))
wcRDD1.collect().foreach(println)
wcRDD2.collect().foreach(println)
wcRDD1.join(wcRDD2).collect().foreach(println)
wcRDD1.leftOuterJoin(wcRDD2).collect().foreach(println)
wcRDD1.rightOuterJoin(wcRDD2).collect().foreach(println)
wcRDD1.fullOuterJoin(wcRDD2).collect().foreach(println)
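The join variants differ mainly in which keys survive and in where Option appears in the result type. The following sketch models an inner join and a left outer join on plain Scala sequences, with no Spark involved; JoinModel, its methods and the sample data left and right are hypothetical and chosen only for illustration.

```scala
object JoinModel {
  val left: Seq[(String, Int)] = Seq(("a", 1), ("b", 2))
  val right: Seq[(String, String)] = Seq(("a", "x"), ("a", "y"), ("c", "z"))

  // Inner join: only keys present on both sides; one output per matching pair of values.
  def join[K, V, W](l: Seq[(K, V)], r: Seq[(K, W)]): Seq[(K, (V, W))] =
    for ((k, v) <- l; (k2, w) <- r if k == k2) yield (k, (v, w))

  // Left outer join: every left record is kept; a missing right side becomes None.
  def leftOuterJoin[K, V, W](l: Seq[(K, V)], r: Seq[(K, W)]): Seq[(K, (V, Option[W]))] =
    l.flatMap { case (k, v) =>
      val matches = r.collect { case (k2, w) if k2 == k => w }
      if (matches.isEmpty) Seq((k, (v, Option.empty[W])))
      else matches.map(w => (k, (v, Some(w): Option[W])))
    }

  def main(args: Array[String]): Unit = {
    join(left, right).foreach(println)          // key "b" and key "c" are dropped
    leftOuterJoin(left, right).foreach(println) // key "b" is kept with None
  }
}
```

By the same pattern, rightOuterJoin wraps the left value in Option and fullOuterJoin wraps both sides.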
cogroup(otherDataset, [numPartitions])
When called on RDDs of type (K, V) and (K, W), returns an RDD of type (K, (Iterable[V], Iterable[W])). This operation is also available under the name groupWith.
/**
* For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the
* list of values for that key in `this` as well as `other`.
*/
def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))]
For example, reusing wcRDD1 and wcRDD2 from the join example above:
wcRDD1.cogroup(wcRDD2).collect().foreach(println)
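Unlike join, cogroup keeps every key that occurs on either side and groups the values of each side separately. A minimal model on plain Scala sequences, assuming small in-memory data; CogroupModel and its sample data are hypothetical names used only for this sketch.

```scala
object CogroupModel {
  val left: Seq[(String, Int)] = Seq(("a", 1), ("b", 2))
  val right: Seq[(String, String)] = Seq(("a", "x"), ("a", "y"), ("c", "z"))

  // For every key on either side, collect all left values and all right values.
  def cogroup[K, V, W](l: Seq[(K, V)], r: Seq[(K, W)]): Map[K, (Seq[V], Seq[W])] = {
    val keys = (l.map(_._1) ++ r.map(_._1)).distinct
    keys.map { k =>
      val leftValues = l.collect { case (k2, v) if k2 == k => v }
      val rightValues = r.collect { case (k2, w) if k2 == k => w }
      (k, (leftValues, rightValues))
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    // "b" has no right values and "c" has no left values, yet both keys appear.
    cogroup(left, right).toSeq.sortBy(_._1).foreach(println)
  }
}
```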
cartesian(otherDataset)
Computes the Cartesian product of two RDDs: RDD[T] cartesian RDD[U] returns RDD[(T, U)].
/**
* Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of
* elements (a, b) where a is in `this` and b is in `other`.
*/
def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)]
For example:
val numRDD1 = sc.parallelize(1 to 2)
val numRDD2 = sc.parallelize(3 to 5)
numRDD1.cartesian(numRDD2).collect().foreach(println)
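The example above prints every pairing of an element from the first RDD with an element from the second, so the result has 2 * 3 = 6 elements. The same pairs can be produced on local collections; this is a sketch with a hypothetical helper cartesian in a hypothetical object CartesianModel, and the order of the pairs returned by Spark may differ.

```scala
object CartesianModel {
  // All pairs (x, y) with x from a and y from b.
  def cartesian[T, U](a: Seq[T], b: Seq[U]): Seq[(T, U)] =
    for (x <- a; y <- b) yield (x, y)

  def main(args: Array[String]): Unit = {
    val pairs = cartesian(1 to 2, 3 to 5)
    println(pairs.size) // 6
    pairs.foreach(println)
  }
}
```

Because the output size is the product of the input sizes, cartesian becomes expensive quickly on large datasets.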
pipe(command, [envVars])
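Invokes an external program for each partition of the RDD. With pipe(), the external program reads the elements of the partition as strings from its standard input, performs whatever processing you need, and writes its results as strings to standard output. This lets Spark cooperate with shell scripts, Python and other languages to complete a computation.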
/**
* Return an RDD created by piping elements to a forked external process.
*/
def pipe(command: String, env: Map[String, String]): RDD[String]
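For example, taking the reducer from a Hadoop Streaming example, we call the Python script reduce.py to perform the reduce step. Because the partitions of an RDD are spread across the Executor processes on different machines, the script file must exist on every machine. The call also passes environment variables to the script.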
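#!/usr/bin/env python
# The shebang lets pipe() execute the file directly; the file must also be executable.
import sys
import os
# A <word, count> dict is not needed at all; it would waste a lot of memory
word2count = {}
# Temporary variables: the current key (word) and its running count
word = ""
count = 0
started = 0
# Read the data line by line from standard input
for line in sys.stdin:
    # Strip leading and trailing whitespace
    line = line.strip()
    # Get the word and its count
    newword, newcount = line.split()
    if word != newword:
        if started == 1:
            # Print the count of the previous word
            print("{}\t{}".format(word, count))
        word = newword
        count = int(newcount)
        started = 1
    else:
        count = count + int(newcount)
    # word2count[word] = word2count.get(word, 0) + int(count)
# Print the count of the last word
print("{}\t{}".format(word, count))
# Print the environment variables passed in through pipe()
print("{}\t{}".format(os.getenv("red"), 1))
print("{}\t{}".format(os.getenv("azure"), 1))
# Do not output like this: it would lose the ordering of the map output
# for word, count in word2count.items():
#     print("{}\t{}".format(word, count))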
val strArray = Array("redis redis spark yarn yarn","yarn hadoop hadoop")
val rdd = sc.parallelize(strArray).flatMap(x=>x.split(" ")).map(x=>x+"\t"+1)
val colors = Map("red" -> "#FF0000", "azure" -> "#F0FFFF")
rdd.pipe("/tmp/reduce.py",colors).collect().foreach(println)
coalesce(numPartitions)
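Reduces the number of partitions of the RDD to numPartitions. Calling coalesce after a filter that has removed a large amount of data usually makes the following operations more efficient.
This results in a narrow dependency. For example, going from 1000 partitions to 100 partitions does not trigger a shuffle; instead, each of the 100 new partitions claims 10 of the current partitions. If a larger number of partitions is requested, the current number of partitions is kept.
However, a drastic coalesce, such as numPartitions = 1, may cause the computation to run on fewer nodes than you would like. To avoid this, pass shuffle = true. This adds a shuffle step, but the current upstream partitions are then still executed in parallel.
Note: with shuffle = true, you really can coalesce to a larger number of partitions. This is useful when you have a small number of partitions, a few of which may be abnormally large. Calling coalesce(1000, shuffle = true) then distributes the data across 1000 partitions using a hash partitioner. The optional partition coalescer must be serializable.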
/**
* Return a new RDD that is reduced into `numPartitions` partitions.
*
* This results in a narrow dependency, e.g. if you go from 1000 partitions
* to 100 partitions, there will not be a shuffle, instead each of the 100
* new partitions will claim 10 of the current partitions. If a larger number
* of partitions is requested, it will stay at the current number of partitions.
*
* However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
* this may result in your computation taking place on fewer nodes than
* you like (e.g. one node in the case of numPartitions = 1). To avoid this,
* you can pass shuffle = true. This will add a shuffle step, but means the
* current upstream partitions will be executed in parallel (per whatever
* the current partitioning is).
*
* @note With shuffle = true, you can actually coalesce to a larger number
* of partitions. This is useful if you have a small number of partitions,
* say 100, potentially with a few partitions being abnormally large. Calling
* coalesce(1000, shuffle = true) will result in 1000 partitions with the
* data distributed using a hash partitioner. The optional partition coalescer
* passed in must be serializable.
*/
def coalesce(numPartitions: Int, shuffle: Boolean = false,
partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
(implicit ord: Ordering[T] = null)
: RDD[T]
For example:
scala> rdd.partitions.length
res15: Int = 2
scala> rdd.coalesce(1).partitions.length
res16: Int = 1
scala> rdd.coalesce(3).partitions.length
res17: Int = 2
scala> rdd.coalesce(3,true).partitions.length
res18: Int = 3
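The behaviour without a shuffle can be pictured as merging whole parent partitions into fewer new ones. The sketch below is a simplified model in plain Scala: CoalesceModel and coalesce are hypothetical names, a Seq[Seq[T]] stands in for the partitions, and the even, index-based grouping is an assumption made for illustration (Spark's real coalescer also takes data locality into account).

```scala
object CoalesceModel {
  // Merge whole parent partitions into fewer partitions; no record is re-hashed.
  def coalesce[T](parts: Seq[Seq[T]], numPartitions: Int): Seq[Seq[T]] = {
    require(numPartitions > 0, "numPartitions must be positive")
    if (numPartitions >= parts.size) parts // asking for more partitions changes nothing
    else
      parts.zipWithIndex
        .groupBy { case (_, i) => i * numPartitions / parts.size } // parent index -> new index
        .toSeq
        .sortBy(_._1)
        .map { case (_, group) => group.flatMap(_._1) }
  }

  def main(args: Array[String]): Unit = {
    val parts = (0 until 1000).map(i => Seq(i)) // 1000 partitions with one record each
    val merged = coalesce(parts, 100)
    println(merged.size)                        // 100
    println(merged.forall(_.size == 10))        // true: each new partition claims 10 parents
    println(coalesce(parts.take(2), 3).size)    // 2: the partition count is not increased
  }
}
```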
repartition(numPartitions)
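Randomly reshuffles the data in the RDD to create either more or fewer partitions and balances the data across them. This always shuffles all the data over the network. If you are decreasing the number of partitions, consider using coalesce, described above, which can avoid the shuffle.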
/**
* Return a new RDD that has exactly numPartitions partitions.
*
* Can increase or decrease the level of parallelism in this RDD. Internally, this uses
* a shuffle to redistribute data.
*
* If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
* which can avoid performing a shuffle.
*
* TODO Fix the Shuffle+Repartition data loss issue described in SPARK-23207.
*/
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]
For example:
scala> rdd.repartition(1).partitions.length
res20: Int = 1
repartitionAndSortWithinPartitions(partitioner)
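Repartitions the RDD according to the given partitioner and, within each resulting partition, sorts the records by key, so every partition is internally ordered.
If you need to repartition and then sort within each partition, calling repartitionAndSortWithinPartitions(partitioner) is more efficient than doing the two steps separately, because the sorting is pushed down into the shuffle.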
/**
* Repartition the RDD according to the given partitioner and, within each resulting partition,
* sort records by their keys.
*
* This is more efficient than calling `repartition` and then sorting within each partition
* because it can push the sorting down into the shuffle machinery.
*/
def repartitionAndSortWithinPartitions(partitioner: Partitioner): RDD[(K, V)]
For example:
val strArray = Array("redis redis spark yarn yarn","yarn hadoop hadoop")
val rdd = sc.parallelize(strArray).flatMap(x=>x.split(" ")).map((_,1))
val newRDD = rdd.repartitionAndSortWithinPartitions(new org.apache.spark.HashPartitioner(3))
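The example above does not print its result, so here is a small model of what the operation produces. It is a sketch in plain Scala, not Spark code: RepartitionSortModel, partitionOf and repartitionAndSort are hypothetical names, and partitionOf assumes the usual hash-partitioning rule of taking the key's hashCode modulo the partition count and making it non-negative.

```scala
object RepartitionSortModel {
  val sample: Seq[(String, Int)] =
    "redis redis spark yarn yarn yarn hadoop hadoop".split(" ").toSeq.map((_, 1))

  // Assumed hash-partitioning rule: non-negative hashCode modulo the partition count.
  def partitionOf(key: Any, numPartitions: Int): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }

  // Route every record to its partition, then sort each partition by key.
  def repartitionAndSort[K: Ordering, V](records: Seq[(K, V)], numPartitions: Int): Seq[Seq[(K, V)]] =
    (0 until numPartitions).map { p =>
      records.filter { case (k, _) => partitionOf(k, numPartitions) == p }.sortBy(_._1)
    }

  def main(args: Array[String]): Unit =
    repartitionAndSort(sample, 3).zipWithIndex.foreach { case (part, i) =>
      println(s"partition $i: $part")
    }
}
```

All records with the same key land in the same partition, and keys are ordered only within a partition, not across partitions.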
Action
reduce(func)
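Aggregates the elements of the RDD using a function func, which takes two arguments and returns one. The function should be commutative and associative so that it can be computed correctly in parallel.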
/**
* Reduces the elements of this RDD using the specified commutative and
* associative binary operator.
*/
def reduce(f: (T, T) => T): T
For example, a sum:
scala> sc.parallelize(1 to 10).reduce(_+_)
res34: Int = 55
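Why the function must be associative and commutative becomes visible once the data is split into partitions. The sketch below models a parallel reduce in plain Scala, reducing each partition first and then the partial results; ReduceModel and its reduce method are hypothetical names, and merging the partial results in partition order is a simplification of what a cluster would do.

```scala
object ReduceModel {
  // Reduce every non-empty partition first, then reduce the partial results.
  def reduce[T](parts: Seq[Seq[T]])(f: (T, T) => T): T =
    parts.filter(_.nonEmpty).map(_.reduce(f)).reduce(f)

  def main(args: Array[String]): Unit = {
    val onePartition = Seq(Seq(1, 2, 3, 4))
    val twoPartitions = Seq(Seq(1, 2), Seq(3, 4))
    // Addition is associative and commutative: the partitioning does not matter.
    println(reduce(onePartition)(_ + _))  // 10
    println(reduce(twoPartitions)(_ + _)) // 10
    // Subtraction is neither: the result depends on the partitioning.
    println(reduce(onePartition)(_ - _))  // ((1 - 2) - 3) - 4 = -8
    println(reduce(twoPartitions)(_ - _)) // (1 - 2) - (3 - 4) = 0
  }
}
```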
collect()
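Returns all the elements of the RDD to the driver program as an array. It is best used after a filter or another operation has reduced the data to a sufficiently small subset, because the entire returned dataset is loaded into the memory of the driver program node.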
/**
* Return an array that contains all of the elements in this RDD.
*
* @note This method should only be used if the resulting array is expected to be small, as
* all the data is loaded into the driver's memory.
*/
def collect(): Array[T]
For example:
scala> sc.parallelize(1 to 10).collect()
res35: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
count()
Returns the number of elements in the RDD.
/**
* Return the number of elements in the RDD.
*/
def count(): Long
For example, count the number of lines in a file:
scala> sc.textFile("/user/root/input/words.txt").count()
res36: Long = 3
scala> sc.textFile("/user/root/input/words.txt").collect().foreach(println)
redis.1 redis.1 redis.2 redis.3 redis.4 redis.4
spark.1 spark.2 spark.2
flume.3 flume.3 flume.4
first()
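Returns the first element of the RDD. It is similar to take(1), except that it returns the element itself rather than a one-element array.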
/**
* Return the first element in this RDD.
*/
def first(): T
take(n)
Returns the first n elements of the RDD, scanning partition by partition until enough elements have been collected.
/**
* Take the first num elements of the RDD. It works by first scanning one partition, and use the
* results from that partition to estimate the number of additional partitions needed to satisfy
* the limit.
*
* @note This method should only be used if the resulting array is expected to be small, as
* all the data is loaded into the driver's memory.
*
* @note Due to complications in the internal implementation, this method will raise
* an exception if called on an RDD of `Nothing` or `Null`.
*/
def take(num: Int): Array[T]
For example:
scala> val numRDD = sc.parallelize(1 to 10,3)
numRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[51] at parallelize at <console>:24
scala> numRDD.take(4)
res39: Array[Int] = Array(1, 2, 3, 4)
scala> numRDD.take(1)
res40: Array[Int] = Array(1)
scala> numRDD.first()
res41: Int = 1
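The partition-by-partition scan can be sketched as follows. This is a simplified model in plain Scala with hypothetical names (TakeModel, take, parts): it visits one partition per step and also reports how many partitions it touched, whereas Spark estimates how many additional partitions to scan in each round. The three-partition layout is an assumption made for the example.

```scala
import scala.collection.mutable.ArrayBuffer

object TakeModel {
  // Assumed layout of the numbers 1 to 10 over three partitions.
  val parts: Seq[Seq[Int]] = Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9, 10))

  // Returns the first n elements and the number of partitions that had to be scanned.
  def take[T](partitions: Seq[Seq[T]], n: Int): (Seq[T], Int) = {
    val buf = ArrayBuffer.empty[T]
    var scanned = 0
    while (buf.size < n && scanned < partitions.size) {
      buf ++= partitions(scanned).take(n - buf.size)
      scanned += 1
    }
    (buf.toList, scanned)
  }

  def main(args: Array[String]): Unit = {
    println(take(parts, 1)) // one element, found in the first partition
    println(take(parts, 4)) // four elements, which requires a second partition
  }
}
```

Small values of n therefore avoid computing most partitions, which is what makes take cheaper than collect on a large RDD.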
takeSample(withReplacement, num, [seed])
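Returns an array containing a random sample of num elements of the dataset, with or without replacement, optionally with a pre-specified random number generator seed.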
/**
* Return a fixed-size sampled subset of this RDD in an array
*
* @param withReplacement whether sampling is done with replacement
* @param num size of the returned sample
* @param seed seed for the random number generator
* @return sample of specified size in an array
*
* @note this method should only be used if the resulting array is expected to be small, as
* all the data is loaded into the driver's memory.
*/
def takeSample(
withReplacement: Boolean,
num: Int,
seed: Long = Utils.random.nextLong): Array[T]
For example:
scala> numRDD.takeSample(true,6)
res42: Array[Int] = Array(2, 9, 9, 10, 2, 7)
scala> numRDD.takeSample(false,6)
res43: Array[Int] = Array(5, 9, 6, 10, 3, 8)
takeOrdered(n, [ordering])
Returns the first n elements of the RDD using either their natural order or a custom comparator.
/**
* Returns the first k (smallest) elements from this RDD as defined by the specified
* implicit Ordering[T] and maintains the ordering. This does the opposite of [[top]].
* For example:
* {{{
* sc.parallelize(Seq(10, 4, 2, 12, 3)).takeOrdered(1)
* // returns Array(2)
*
* sc.parallelize(Seq(2, 3, 4, 5, 6)).takeOrdered(2)
* // returns Array(2, 3)
* }}}
*
* @note This method should only be used if the resulting array is expected to be small, as
* all the data is loaded into the driver's memory.
*
* @param num k, the number of elements to return
* @param ord the implicit ordering for T
* @return an array of top elements
*/
def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T]
For example:
scala> numRDD.takeOrdered(3)
res45: Array[Int] = Array(1, 2, 3)
top(n, [ordering])
Similar to takeOrdered, but returns the largest elements; internally it is implemented with takeOrdered, as takeOrdered(num)(ord.reverse).
/**
* Returns the top k (largest) elements from this RDD as defined by the specified
* implicit Ordering[T] and maintains the ordering. This does the opposite of
* [[takeOrdered]]. For example:
* {{{
* sc.parallelize(Seq(10, 4, 2, 12, 3)).top(1)
* // returns Array(12)
*
* sc.parallelize(Seq(2, 3, 4, 5, 6)).top(2)
* // returns Array(6, 5)
* }}}
*
* @note This method should only be used if the resulting array is expected to be small, as
* all the data is loaded into the driver's memory.
*
* @param num k, the number of top elements to return
* @param ord the implicit ordering for T
* @return an array of top elements
*/
def top(num: Int)(implicit ord: Ordering[T]): Array[T]
For example:
scala> numRDD.top(2)
res46: Array[Int] = Array(10, 9)
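The relationship between the two operators, top being takeOrdered with the ordering reversed, can be reproduced on a local collection. A minimal sketch in plain Scala; TopModel is a hypothetical name, and sorting the whole sequence is a simplification that is only reasonable for small in-memory data.

```scala
object TopModel {
  // The num smallest elements according to ord.
  def takeOrdered[T](xs: Seq[T], num: Int)(implicit ord: Ordering[T]): Seq[T] =
    xs.sorted(ord).take(num)

  // The num largest elements: the same operation with the ordering reversed.
  def top[T](xs: Seq[T], num: Int)(implicit ord: Ordering[T]): Seq[T] =
    takeOrdered(xs, num)(ord.reverse)

  def main(args: Array[String]): Unit = {
    val xs = Seq(10, 4, 2, 12, 3)
    println(takeOrdered(xs, 1)) // List(2)
    println(top(xs, 1))         // List(12)
  }
}
```

Passing a custom Ordering changes what counts as smallest or largest, for instance Ordering.by[String, Int](_.length) to rank strings by length.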
saveAsTextFile(path)
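Writes the elements of the RDD as a text file (or a set of text files) into a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark calls toString on each element to convert it to a line of text in the file.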
/**
* Save this RDD as a text file, using string representations of elements.
*/
def saveAsTextFile(path: String): Unit
For example:
val strArray = Array("redis redis spark yarn yarn","yarn hadoop hadoop")
val rdd = sc.parallelize(strArray).flatMap(x=>x.split(" ")).map((_,1)).reduceByKey(_+_)
rdd.saveAsTextFile("/tmp/rdd.5")
saveAsSequenceFile(path)
Available only on RDDs of key-value pairs. Writes the RDD out as a Hadoop SequenceFile.
/**
* Output the RDD as a Hadoop SequenceFile using the Writable types we infer from the RDD's key
* and value types. If the key or value are Writable, then we use their classes directly;
* otherwise we map primitive types such as Int and Double to IntWritable, DoubleWritable, etc,
* byte arrays to BytesWritable, and Strings to Text. The `path` can be on any Hadoop-supported
* file system.
*/
def saveAsSequenceFile(
path: String,
codec: Option[Class[_ <: CompressionCodec]] = None): Unit
For example, use saveAsSequenceFile to store the RDD as a Hadoop SequenceFile, then read the Hadoop SequenceFile back with sequenceFile:
rdd.saveAsSequenceFile("/tmp/rdd.6")
scala> sc.sequenceFile[String,Int]("/tmp/rdd.6")
res59: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[77] at sequenceFile at <console>:26
scala> res59.collect()
res60: Array[(String, Int)] = Array((yarn,3), (spark,1), (hadoop,2), (redis,2))
saveAsObjectFile(path)
Writes the elements of the RDD using Java serialization; the data can be loaded back with SparkContext.objectFile().
/**
* Save this RDD as a SequenceFile of serialized objects.
*/
def saveAsObjectFile(path: String): Unit
For example, write the RDD out as serialized objects, then load it back with SparkContext.objectFile():
scala> import scala.Tuple2
scala> rdd.saveAsObjectFile("/tmp/rdd.7")
scala> sc.objectFile[Tuple2[String,Int]]("/tmp/rdd.7").collect()
res66: Array[(String, Int)] = Array((yarn,3), (spark,1), (hadoop,2), (redis,2))
countByKey()
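Available only on RDDs of key-value pairs. Counts the number of elements for each key and returns the result to the driver program as a Map.
If the result is very large, use rdd.mapValues(_ => 1L).reduceByKey(_ + _) instead, which returns an RDD rather than a Map. As the source below shows, countByKey is implemented with exactly this expression followed by collect().toMap.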
/**
* Count the number of elements for each key, collecting the results to a local Map.
*
* @note This method should only be used if the resulting map is expected to be small, as
* the whole thing is loaded into the driver's memory.
* To handle very large results, consider using rdd.mapValues(_ => 1L).reduceByKey(_ + _), which
* returns an RDD[T, Long] instead of a map.
*/
def countByKey(): Map[K, Long] = self.withScope {
self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap
}
For example:
scala> val strArray = Array("redis redis spark yarn yarn","yarn hadoop hadoop")
strArray: Array[String] = Array(redis redis spark yarn yarn, yarn hadoop hadoop)
scala> val rdd = sc.parallelize(strArray).flatMap(x=>x.split(" ")).map((_,1))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[88] at map at <console>:27
scala> rdd.countByKey()
res68: scala.collection.Map[String,Long] = Map(yarn -> 3, spark -> 1, hadoop -> 2, redis -> 2)
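The two steps inside countByKey, replacing every value with 1L and then summing per key, can be followed on a local collection. This is a sketch in plain Scala with hypothetical names (CountByKeyModel, countByKey, words); it mirrors the mapValues and reduceByKey steps but involves no RDD.

```scala
object CountByKeyModel {
  val words: Seq[(String, Int)] =
    "redis redis spark yarn yarn yarn hadoop hadoop".split(" ").toSeq.map((_, 1))

  def countByKey[K, V](pairs: Seq[(K, V)]): Map[K, Long] =
    pairs
      .map { case (k, _) => (k, 1L) }                    // like mapValues(_ => 1L)
      .groupBy(_._1)
      .map { case (k, ones) => (k, ones.map(_._2).sum) } // like reduceByKey(_ + _)

  def main(args: Array[String]): Unit =
    countByKey(words).toSeq.sortBy(_._1).foreach(println)
}
```

The original values are ignored entirely: countByKey counts records per key, it does not sum their values.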
foreach(func)
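Runs the function func on each element of the RDD. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.
Note: modifying variables other than Accumulators that are defined outside of foreach() may result in undefined behavior. See Understanding closures for more details.
foreach and foreachPartition are implemented almost identically.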
/**
* Applies a function f to all elements of this RDD.
*/
def foreach(f: T => Unit): Unit = withScope {
val cleanF = sc.clean(f)
sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}
For example. Note that nothing is printed below. This is not because the RDD is empty, but because the elements are printed on the executor nodes:
scala> rdd.foreach(println)
scala>
Asynchronous versions of Action operators
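The Spark RDD API also exposes asynchronous versions of some actions, such as foreachAsync for foreach, which immediately return a FutureAction to the caller instead of blocking until the action completes. The FutureAction can be used to manage or wait for the asynchronous execution of the action.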
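The calling pattern for an asynchronous action resembles working with an ordinary Scala Future. The sketch below uses a plain scala.concurrent.Future as a stand-in and involves no Spark at all: AsyncActionModel and its countAsync method are hypothetical and only imitate the shape of an asynchronous count.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

object AsyncActionModel {
  // Hypothetical stand-in for an asynchronous action: the work runs on another thread.
  def countAsync(data: Seq[Int]): Future[Long] = Future {
    data.size.toLong
  }

  def main(args: Array[String]): Unit = {
    val future = countAsync(1 to 10)
    // The call above returns immediately; other work can happen here.
    println("action submitted")
    // Block only at the point where the result is actually needed.
    println(Await.result(future, 10.seconds))
  }
}
```

Callbacks or combinators on the future are an alternative to blocking when the caller only needs to react to completion.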