【spark,rdd,2】Basic RDD Transformation Operators

Transformation Meaning
map(func) Return a new distributed dataset formed by passing each element of the source through a function func.
filter(func) Return a new dataset formed by selecting those elements of the source on which func returns true.
flatMap(func) Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
mapPartitions(func) Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.
mapPartitionsWithIndex(func) Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T.
sample(withReplacement, fraction, seed) Sample a fraction fraction of the data, with or without replacement, using a given random number generator seed.
union(otherDataset) Return a new dataset that contains the union of the elements in the source dataset and the argument.
intersection(otherDataset) Return a new RDD that contains the intersection of elements in the source dataset and the argument.
distinct([numTasks]) Return a new dataset that contains the distinct elements of the source dataset.
groupByKey([numTasks]) When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. 
Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance. 
Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numTasks argument to set a different number of tasks.
reduceByKey(func, [numTasks]) When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]) When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different than the input value type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
sortByKey([ascending], [numTasks]) When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.
join(otherDataset, [numTasks]) When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
cogroup(otherDataset, [numTasks]) When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.
cartesian(otherDataset) When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).
pipe(command, [envVars]) Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin and lines output to its stdout are returned as an RDD of strings.
coalesce(numPartitions) Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.
repartition(numPartitions) Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.
repartitionAndSortWithinPartitions(partitioner) Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys. This is more efficient than calling repartition and then sorting within each partition because it can push the sorting down into the shuffle machinery.

These basic transformation operators can be applied to data of any type.

1.map
map applies the given function to every element of the source RDD to produce a new RDD; every element of the source RDD corresponds to exactly one element of the new RDD.
Input and output partitions correspond one to one: the new RDD has exactly as many partitions as the source RDD.

hadoop fs -cat /tmp/lxw1234/1.txt
hello world
hello spark
hello hive
 
 
// Read an HDFS file into an RDD
scala> var data = sc.textFile("/tmp/lxw1234/1.txt")
data: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21
 
// Apply the map operator
scala> var mapresult = data.map(line => line.split("\\s+"))
mapresult: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[2] at map at <console>:23
 
// Collect the result of map
scala> mapresult.collect
res0: Array[Array[String]] = Array(Array(hello, world), Array(hello, spark), Array(hello, hive))



2.flatMap
flatMap is also a transformation. Its first step is the same as map; it then flattens all of the resulting collections into a single output.

In other words, flatMap is roughly "map, then flatten": each input element is mapped to a collection of output elements, and those collections are concatenated into one RDD.

// Apply the flatMap operator
scala> var flatmapresult = data.flatMap(line => line.split("\\s+"))
flatmapresult: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at flatMap at <console>:23
 
// Collect the result of flatMap
scala> flatmapresult.collect
res1: Array[String] = Array(hello, world, hello, spark, hello, hive)
 

Note the following when using flatMap:
flatMap treats a String as a sequence of characters.
Consider the example below:

scala> data.map(_.toUpperCase).collect
res32: Array[String] = Array(HELLO WORLD, HELLO SPARK, HELLO HIVE, HI SPARK)
scala> data.flatMap(_.toUpperCase).collect
res33: Array[Char] = Array(H, E, L, L, O, , W, O, R, L, D, H, E, L, L, O, , S, P, A, R, K, H, E, L, L, O, , H, I, V, E, H, I, , S, P, A, R, K)
 

Now compare:

scala> data.map(x => x.split("\\s+")).collect
res34: Array[Array[String]] = Array(Array(hello, world), Array(hello, spark), Array(hello, hive), Array(hi, spark))
 
scala> data.flatMap(x => x.split("\\s+")).collect
res35: Array[String] = Array(hello, world, hello, spark, hello, hive, hi, spark)
 

This time the result is as expected: the strings are not broken down into characters.
That is because the map function here returns Array[String], not String.
flatMap only removes one level of nesting: it flattens a String into its characters, but it flattens an Array[String] only into its String elements, not all the way down to characters.
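
For reference, here is a minimal plain-Scala sketch (outside Spark) of the conversion assumed in this explanation: a String is implicitly viewed as a sequence of Chars, which is exactly what flatMap flattens.

// A String can be treated as a Seq[Char] via the standard implicit conversion,
// which is why flatMap over Strings breaks them into individual characters.
val chars: Seq[Char] = "hello"
println(chars.toList)              // List(h, e, l, l, o)

// An Array[String] is flattened only one level, into its String elements.
val words = Array("hello world", "hello spark").flatMap(_.split("\\s+"))
println(words.toList)              // List(hello, world, hello, spark)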


http://blog.csdn.net/u010824591/article/details/50732996


3.distinct
Removes duplicate elements from the RDD.

scala> data.flatMap(line => line.split("\\s+")).collect
res61: Array[String] = Array(hello, world, hello, spark, hello, hive, hi, spark)
 
scala> data.flatMap(line => line.split("\\s+")).distinct.collect
res62: Array[String] = Array(hive, hello, world, spark, hi)
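
As the table above notes for distinct([numTasks]), distinct also accepts an optional number of partitions for the result; a minimal sketch:

// distinct(numPartitions): control how many partitions the de-duplicated RDD has
data.flatMap(line => line.split("\\s+")).distinct(2).partitions.size   // 2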





5.coalesce

def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null): RDD[T]

This function repartitions the RDD; when a shuffle is performed, a HashPartitioner is used to redistribute the data.

The first parameter is the target number of partitions; the second indicates whether to shuffle, and defaults to false.

scala> var data = sc.textFile("/tmp/lxw1234/1.txt")
data: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[53] at textFile at <console>:21
 
scala> data.collect
res37: Array[String] = Array(hello world, hello spark, hello hive, hi spark)
 
scala> data.partitions.size
res38: Int = 2 // data has two partitions by default
 
scala> var rdd1 = data.coalesce(1)
rdd1: org.apache.spark.rdd.RDD[String] = CoalescedRDD[2] at coalesce at <console>:23
 
scala> rdd1.partitions.size
res1: Int = 1 // rdd1 now has a single partition
 
 
scala> var rdd1 = data.coalesce(4)
rdd1: org.apache.spark.rdd.RDD[String] = CoalescedRDD[3] at coalesce at <console>:23
 
scala> rdd1.partitions.size
res2: Int = 2 // to increase the number of partitions beyond the original count, shuffle must
              // be set to true; otherwise the partition count remains unchanged
 
scala> var rdd1 = data.coalesce(4,true)
rdd1: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at coalesce at <console>:23
 
scala> rdd1.partitions.size
res3: Int = 4



6. repartition


def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]

This function is simply coalesce with the shuffle parameter set to true.

 

scala> var rdd2 = data.repartition(1)
rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[11] at repartition at <console>:23
 
scala> rdd2.partitions.size
res4: Int = 1
 
scala> var rdd2 = data.repartition(4)
rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[15] at repartition at <console>:23
 
scala> rdd2.partitions.size
res5: Int = 4

7.randomSplit

def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]]

This function splits one RDD into multiple RDDs according to the given weights.

The weights parameter is an array of Doubles.

The second parameter is the seed for the random number generator; it can usually be left at its default.

scala> var rdd = sc.makeRDD(1 to 10,10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[16] at makeRDD at <console>:21
 
scala> rdd.collect
res6: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
 
scala> var splitRDD = rdd.randomSplit(Array(1.0,2.0,3.0,4.0))
splitRDD: Array[org.apache.spark.rdd.RDD[Int]] = Array(MapPartitionsRDD[17] at randomSplit at <console>:23,
MapPartitionsRDD[18] at randomSplit at <console>:23,
MapPartitionsRDD[19] at randomSplit at <console>:23,
MapPartitionsRDD[20] at randomSplit at <console>:23)
 
// Note that randomSplit returns an array of RDDs
scala> splitRDD.size
res8: Int = 4
// The weights array contains 4 values, so the original rdd is split into 4 RDDs.
// Elements are assigned to the 4 RDDs at random according to the weights 1.0, 2.0, 3.0, 4.0:
// an RDD with a higher weight receives elements with a higher probability.
// The weights do not have to sum to 1; they are normalized internally.
 
scala> splitRDD(0).collect
res10: Array[Int] = Array(1, 4)
 
scala> splitRDD(1).collect
res11: Array[Int] = Array(3)
 
scala> splitRDD(2).collect
res12: Array[Int] = Array(5, 9)
 
scala> splitRDD(3).collect
res13: Array[Int] = Array(2, 6, 7, 8, 10)


8.glom 

def glom(): RDD[Array[T]]

This function turns all the elements of type T in each partition into a single Array[T], so that every partition contains exactly one array element.


scala> var rdd = sc.makeRDD(1 to 10,3)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[38] at makeRDD at <console>:21
scala> rdd.partitions.size
res33: Int = 3 // the RDD has 3 partitions
scala> rdd.glom().collect
res35: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9, 10))
// glom gathers the elements of each partition into one array, so the result is 3 arrays
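
As a small follow-up sketch, glom is handy for per-partition statistics, for example the maximum value of each partition:

// Per-partition maximum: glom turns every partition into a single Array[Int]
rdd.glom().map(_.max).collect
// e.g. Array(3, 6, 10) for the three partitions above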

9.union
This function is straightforward: it concatenates two RDDs without removing duplicates.
scala> var rdd1 = sc.makeRDD(1 to 2,1)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[45] at makeRDD at <console>:21
 
scala> rdd1.collect
res42: Array[Int] = Array(1, 2)
 
scala> var rdd2 = sc.makeRDD(2 to 3,1)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[46] at makeRDD at <console>:21
 
scala> rdd2.collect
res43: Array[Int] = Array(2, 3)
 
scala> rdd1.union(rdd2).collect
res44: Array[Int] = Array(1, 2, 2, 3)
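
Since union performs no shuffle, the result simply keeps the partitions of both inputs, as a quick sketch shows:

// union does not shuffle: the result has the partitions of both inputs (1 + 1 = 2 here)
rdd1.union(rdd2).partitions.size   // 2
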
10.intersection

def intersection(other: RDD[T]): RDD[T]
def intersection(other: RDD[T], numPartitions: Int): RDD[T]
def intersection(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]

This function returns the intersection of two RDDs, with duplicates removed.
The numPartitions parameter specifies the number of partitions of the returned RDD.
The partitioner parameter specifies the partitioning function to use.

scala> var rdd1 = sc.makeRDD(1 to 2,1)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[45] at makeRDD at <console>:21
 
scala> rdd1.collect
res42: Array[Int] = Array(1, 2)
 
scala> var rdd2 = sc.makeRDD(2 to 3,1)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[46] at makeRDD at <console>:21
 
scala> rdd2.collect
res43: Array[Int] = Array(2, 3)
 
scala> rdd1.intersection(rdd2).collect
res45: Array[Int] = Array(2)
 
scala> var rdd3 = rdd1.intersection(rdd2)
rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[59] at intersection at <console>:25
 
scala> rdd3.partitions.size
res46: Int = 1
 
scala> var rdd3 = rdd1.intersection(rdd2,2)
rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[65] at intersection at <console>:25
 
scala> rdd3.partitions.size


11.subtract

def subtract(other: RDD[T]): RDD[T]
def subtract(other: RDD[T], numPartitions: Int): RDD[T]
def subtract(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]

Similar to intersection, but this function returns the elements that appear in this RDD and not in otherRDD, without removing duplicates.
The parameters have the same meaning as in intersection.

scala> var rdd1 = sc.makeRDD(Seq(1,2,2,3))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[66] at makeRDD at <console>:21
 
scala> rdd1.collect
res48: Array[Int] = Array(1, 2, 2, 3)
 
scala> var rdd2 = sc.makeRDD(3 to 4)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[67] at makeRDD at <console>:21
 
scala> rdd2.collect
res49: Array[Int] = Array(3, 4)
 
scala> rdd1.subtract(rdd2).collect
res50: Array[Int] = Array(1, 2, 2)


12.mapPartitions

def mapPartitions[U](f: (Iterator[T]) => Iterator[U], preservesPartitioning: Boolean = false)(implicit arg0: ClassTag[U]): RDD[U]

This function is similar to map, except that the mapping function receives an iterator over each partition of the RDD instead of individual elements. When the mapping needs to create expensive auxiliary objects repeatedly, mapPartitions can be much more efficient than map.

For example, when writing all the data of an RDD to a database over JDBC, using map could open a connection for every single element, which is very expensive; with mapPartitions only one connection per partition is needed (see the sketch below).

The preservesPartitioning parameter indicates whether the partitioner of the parent RDD is kept.
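
As an illustration of the per-partition connection pattern described above, here is a minimal sketch; the JDBC URL, credentials and table t(value) are hypothetical placeholders, not part of the original example.

// Write an RDD to a database with one JDBC connection per partition.
import java.sql.DriverManager

val nums = sc.parallelize(1 to 100, 4)

val written = nums.mapPartitions { iter =>
  // one connection per partition instead of one per element
  val conn = DriverManager.getConnection("jdbc:mysql://host:3306/db", "user", "pass")
  val stmt = conn.prepareStatement("INSERT INTO t(value) VALUES (?)")
  var n = 0
  iter.foreach { v =>
    stmt.setInt(1, v)
    stmt.executeUpdate()
    n += 1
  }
  stmt.close()
  conn.close()
  Iterator.single(n)   // number of rows written by this partition
}
written.collect()      // an action is needed to actually trigger the writes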


var rdd1 = sc.makeRDD(1 to 5,2)
// rdd1 has two partitions
scala> var rdd3 = rdd1.mapPartitions{ x => {
| var result = List[Int]()
| var i = 0
| while(x.hasNext){
| i += x.next()
| }
| result.::(i).iterator // prepend the partition sum to the list and return an iterator
| }}
rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[84] at mapPartitions at <console>:23
 
// rdd3 holds the sum of the values in each partition of rdd1
scala> rdd3.collect
res65: Array[Int] = Array(3, 12)
scala> rdd3.partitions.size
res66: Int = 2


12.2 mapPartitions must return an iterator.
// Fragment from a larger job: sampleNum, fraction, index and the ExtBitmap class are
// assumed to be defined elsewhere. The point is that the partition function builds one
// bitmap per partition and returns it as a single-element iterator.
sc.parallelize(1 to sampleNum).sample(false, fraction).mapPartitions(
  it => {
    val bitmap = new ExtBitmap()
    val bucket = sampleNum * index.toLong
    while (it.hasNext) {
      bitmap.set(it.next() + bucket)
    }
    Set(index -> bitmap).iterator
  }
)
12.3 mapPartitions can also return a custom Iterator.
// Fragment: `head` is assumed to be a header string defined elsewhere.
// The custom Iterator prepends the header to the first element of each partition.
.mapPartitions {
  it => new Iterator[String] {
    var first = true
    def hasNext = it.hasNext
    def next = {
      val n = it.next
      val res = if (first) s"$head\n$n" else n
      first = false
      res
    }
  }
}
12.4 mapPartitions for key salting (mitigating data skew).
// Fragment: br_topN is assumed to be a broadcast set of hot (skewed) keys and rNum the
// number of salt buckets. Hot keys get a random suffix so that their records are spread
// across rNum different reduce partitions.
import scala.util.Random

rdd.mapPartitions {
  it =>
    val random = new Random()
    it.map {
      case (k, v) =>
        val key = if (br_topN.value.contains(k)) k + "_" + random.nextInt(rNum) else k
        key -> v
    }
}



13.mapPartitionsWithIndex

def mapPartitionsWithIndex[U](f: (Int, Iterator[T]) => Iterator[U], preservesPartitioning: Boolean = false)(implicit arg0: ClassTag[U]): RDD[U]

This works like mapPartitions, except that the mapping function takes two parameters, the first being the index of the partition.

var rdd1 = sc.makeRDD(1 to 5,2)
// rdd1 has two partitions
var rdd2 = rdd1.mapPartitionsWithIndex{
(x,iter) => { // two parameters: x is the partition index, iter the partition iterator
var result = List[String]()
var i = 0
while(iter.hasNext){
i += iter.next()
}
result.::(x + "|" + i).iterator
}
}
// rdd2 sums the numbers in each partition of rdd1 and prefixes each sum with its partition index
scala> rdd2.collect
res13: Array[String] = Array(0|3, 1|12)
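
A small follow-up sketch: mapPartitionsWithIndex is also a convenient way to check which elements ended up in which partition.

// Label every element with the index of the partition that holds it
rdd1.mapPartitionsWithIndex { (idx, iter) =>
  iter.map(v => s"partition $idx -> $v")
}.collect
// e.g. Array(partition 0 -> 1, partition 0 -> 2, partition 1 -> 3, partition 1 -> 4, partition 1 -> 5)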




14.zip

def zip[U](other: RDD[U])(implicit arg0: ClassTag[U]): RDD[(T, U)]

zip combines two RDDs into an RDD of key/value pairs. It assumes that the two RDDs have the same number of partitions and the same number of elements; otherwise an exception is thrown.


scala> var rdd1 = sc.makeRDD(1 to 5,2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at makeRDD at <console>:21
 
scala> var rdd2 = sc.makeRDD(Seq("A","B","C","D","E"),2)
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[2] at makeRDD at <console>:21
 
scala> rdd1.zip(rdd2).collect
res0: Array[(Int, String)] = Array((1,A), (2,B), (3,C), (4,D), (5,E))
 
scala> rdd2.zip(rdd1).collect
res1: Array[(String, Int)] = Array((A,1), (B,2), (C,3), (D,4), (E,5))
 
scala> var rdd3 = sc.makeRDD(Seq("A","B","C","D","E"),3)
rdd3: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[5] at makeRDD at <console>:21
 
scala> rdd1.zip(rdd3).collect
java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions
// an exception is thrown if the two RDDs have different numbers of partitions


15.zipPartitions
zipPartitions combines multiple RDDs partition by partition into a new RDD. The RDDs being combined must have the same number of partitions, but there is no requirement on the number of elements within each partition.

This function has several overloads, which fall into three groups according to how many other RDDs are zipped together (one, two, or three); only the single-RDD group is shown here:

  • One other RDD as the parameter

def zipPartitions[B, V](rdd2: RDD[B])(f: (Iterator[T], Iterator[B]) => Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[V]): RDD[V]

def zipPartitions[B, V](rdd2: RDD[B], preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B]) => Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[V]): RDD[V]

The only difference between these two overloads is the preservesPartitioning parameter, i.e. whether the partitioner of the parent RDD is kept.

The mapping function f receives the iterators of the two RDDs' corresponding partitions as its parameters.
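
A minimal sketch of the single-RDD variant, assuming two RDDs with the same number of partitions (the element counts per partition may differ):

// zipPartitions pairs up corresponding partitions; f receives one iterator per RDD
val a = sc.makeRDD(1 to 5, 2)
val b = sc.makeRDD(Seq("A", "B", "C", "D", "E"), 2)

a.zipPartitions(b) { (it1, it2) =>
  it1.zip(it2).map { case (i, s) => i + "-" + s }
}.collect
// Array(1-A, 2-B, 3-C, 4-D, 5-E)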



16.sample
sample(withReplacement, fraction, seed): randomly samples a fraction of the data using the given random number generator seed. withReplacement controls whether sampling is done with replacement: true means with replacement, false without. Note that fraction is an expected fraction of the data, not an exact count.
Example 5: randomly sample about 50% of the RDD with replacement, using random seed 3 (the seed only makes the pseudo-random sampling reproducible; it does not determine which values are drawn first).
// setup omitted
    val rdd = sc.parallelize(1 to 10)
    val sample1 = rdd.sample(true,0.5,3)
    sample1.collect.foreach(x => print(x + " "))
    sc.stop
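
For contrast, a minimal sketch of sampling without replacement under the same fraction and seed (the exact elements returned depend on the seed and the Spark version):

    // withReplacement = false: each element appears at most once, and roughly
    // fraction * count elements are returned (not an exact count)
    val sample2 = rdd.sample(false, 0.5, 3)
    sample2.collect.foreach(x => print(x + " "))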