【spark,rdd,2】Basic RDD Transformation Operators

Transformation Meaning
map(func) Return a new distributed dataset formed by passing each element of the source through a function func.
filter(func) Return a new dataset formed by selecting those elements of the source on which func returns true.
flatMap(func) Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
mapPartitions(func) Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.
mapPartitionsWithIndex(func) Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T.
sample(withReplacement, fraction, seed) Sample a fraction fraction of the data, with or without replacement, using a given random number generator seed.
union(otherDataset) Return a new dataset that contains the union of the elements in the source dataset and the argument.
intersection(otherDataset) Return a new RDD that contains the intersection of elements in the source dataset and the argument.
distinct([numTasks]) Return a new dataset that contains the distinct elements of the source dataset.
groupByKey([numTasks]) When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. 
Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance. 
Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numTasks argument to set a different number of tasks.
reduceByKey(func, [numTasks]) When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]) When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different than the input value type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
sortByKey([ascending], [numTasks]) When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.
join(otherDataset, [numTasks]) When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
cogroup(otherDataset, [numTasks]) When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.
cartesian(otherDataset) When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).
pipe(command, [envVars]) Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin and lines output to its stdout are returned as an RDD of strings.
coalesce(numPartitions) Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.
repartition(numPartitions) Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.
repartitionAndSortWithinPartitions(partitioner) Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys. This is more efficient than calling repartition and then sorting within each partition because it can push the sorting down into the shuffle machinery.

These basic transformation operators can be applied to data of any type.

1.map
map applies the given function to every element of the source RDD to produce a new RDD; every element of the source RDD corresponds to exactly one element of the new RDD.
Input and output partitions correspond one to one: the new RDD has exactly as many partitions as the source RDD.

hadoop fs -cat /tmp/lxw1234/1.txt
hello world
hello spark
hello hive
 
 
// Read an HDFS file into an RDD
scala> var data = sc.textFile("/tmp/lxw1234/1.txt")
data: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21
 
// Apply the map operator
scala> var mapresult = data.map(line => line.split("\\s+"))
mapresult: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[2] at map at <console>:23
 
// Collect the result of map
scala> mapresult.collect
res0: Array[Array[String]] = Array(Array(hello, world), Array(hello, spark), Array(hello, hive))



2.flatMap
flatMap is also a transformation. Its first step is the same as map; it then flattens all of the resulting collections into a single output.

In other words, flatMap is roughly "map, then flatten": each input element is mapped to a collection of output elements, and those collections are concatenated into one RDD.

// Apply the flatMap operator
scala> var flatmapresult = data.flatMap(line => line.split("\\s+"))
flatmapresult: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at flatMap at <console>:23
 
// Collect the result of flatMap
scala> flatmapresult.collect
res1: Array[String] = Array(hello, world, hello, spark, hello, hive)
 

Note the following when using flatMap:
flatMap treats a String as a sequence of characters.
Consider the example below:

scala> data.map(_.toUpperCase).collect
res32: Array[String] = Array(HELLO WORLD, HELLO SPARK, HELLO HIVE, HI SPARK)
scala> data.flatMap(_.toUpperCase).collect
res33: Array[Char] = Array(H, E, L, L, O, , W, O, R, L, D, H, E, L, L, O, , S, P, A, R, K, H, E, L, L, O, , H, I, V, E, H, I, , S, P, A, R, K)
 

Now compare:

scala> data.map(x => x.split("\\s+")).collect
res34: Array[Array[String]] = Array(Array(hello, world), Array(hello, spark), Array(hello, hive), Array(hi, spark))
 
scala> data.flatMap(x => x.split("\\s+")).collect
res35: Array[String] = Array(hello, world, hello, spark, hello, hive, hi, spark)
 

This time the result is as expected: the strings are not broken down into characters.
That is because the map function here returns Array[String], not String.
flatMap only removes one level of nesting: it flattens a String into its characters, but it flattens an Array[String] only into its String elements, not all the way down to characters.
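
For reference, here is a minimal plain-Scala sketch (outside Spark) of the conversion assumed in this explanation: a String is implicitly viewed as a sequence of Chars, which is exactly what flatMap flattens.

// A String can be treated as a Seq[Char] via the standard implicit conversion,
// which is why flatMap over Strings breaks them into individual characters.
val chars: Seq[Char] = "hello"
println(chars.toList)              // List(h, e, l, l, o)

// An Array[String] is flattened only one level, into its String elements.
val words = Array("hello world", "hello spark").flatMap(_.split("\\s+"))
println(words.toList)              // List(hello, world, hello, spark)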


http://blog.csdn.net/u010824591/article/details/50732996


3.distinct
Removes duplicate elements from the RDD.

scala> data.flatMap(line => line.split("\\s+")).collect
res61: Array[String] = Array(hello, world, hello, spark, hello, hive, hi, spark)
 
scala> data.flatMap(line => line.split("\\s+")).distinct.collect
res62: Array[String] = Array(hive, hello, world, spark, hi)
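
As the table above notes for distinct([numTasks]), distinct also accepts an optional number of partitions for the result; a minimal sketch:

// distinct(numPartitions): control how many partitions the de-duplicated RDD has
data.flatMap(line => line.split("\\s+")).distinct(2).partitions.size   // 2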





5.coalesce

def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null): RDD[T]

This function repartitions the RDD; when a shuffle is performed, a HashPartitioner is used to redistribute the data.

The first parameter is the target number of partitions; the second indicates whether to shuffle, and defaults to false.

scala> var data = sc.textFile("/tmp/lxw1234/1.txt")
data: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[53] at textFile at <console>:21
 
scala> data.collect
res37: Array[String] = Array(hello world, hello spark, hello hive, hi spark)
 
scala> data.partitions.size
res38: Int = 2 // data has two partitions by default
 
scala> var rdd1 = data.coalesce(1)
rdd1: org.apache.spark.rdd.RDD[String] = CoalescedRDD[2] at coalesce at <console>:23
 
scala> rdd1.partitions.size
res1: Int = 1 // rdd1 now has a single partition
 
 
scala> var rdd1 = data.coalesce(4)
rdd1: org.apache.spark.rdd.RDD[String] = CoalescedRDD[3] at coalesce at <console>:23
 
scala> rdd1.partitions.size
res2: Int = 2 // to increase the number of partitions beyond the original count, shuffle must
              // be set to true; otherwise the partition count remains unchanged
 
scala> var rdd1 = data.coalesce(4,true)
rdd1: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at coalesce at <console>:23
 
scala> rdd1.partitions.size
res3: Int = 4



6. repartition


def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]

This function is simply coalesce with the shuffle parameter set to true.

 

scala> var rdd2 = data.repartition(1)
rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[11] at repartition at <console>:23
 
scala> rdd2.partitions.size
res4: Int = 1
 
scala> var rdd2 = data.repartition(4)
rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[15] at repartition at <console>:23
 
scala> rdd2.partitions.size
res5: Int = 4

7.randomSplit

def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]]

This function splits one RDD into multiple RDDs according to the given weights.

The weights parameter is an array of Doubles.

The second parameter is the seed for the random number generator; it can usually be left at its default.

scala> var rdd = sc.makeRDD(1 to 10,10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[16] at makeRDD at <console>:21
 
scala> rdd.collect
res6: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
 
scala> var splitRDD = rdd.randomSplit(Array(1.0,2.0,3.0,4.0))
splitRDD: Array[org.apache.spark.rdd.RDD[Int]] = Array(MapPartitionsRDD[17] at randomSplit at <console>:23,
MapPartitionsRDD[18] at randomSplit at <console>:23,
MapPartitionsRDD[19] at randomSplit at <console>:23,
MapPartitionsRDD[20] at randomSplit at <console>:23)
 
// Note that randomSplit returns an array of RDDs
scala> splitRDD.size
res8: Int = 4
// The weights array contains 4 values, so the original rdd is split into 4 RDDs.
// Elements are assigned to the 4 RDDs at random according to the weights 1.0, 2.0, 3.0, 4.0:
// an RDD with a higher weight receives elements with a higher probability.
// The weights do not have to sum to 1; they are normalized internally.
 
scala> splitRDD(0).collect
res10: Array[Int] = Array(1, 4)
 
scala> splitRDD(1).collect
res11: Array[Int] = Array(3)
 
scala> splitRDD(2).collect
res12: Array[Int] = Array(5, 9)
 
scala> splitRDD(3).collect
res13: Array[Int] = Array(2, 6, 7, 8, 10)


8.glom 

def glom(): RDD[Array[T]]

This function turns all the elements of type T in each partition into a single Array[T], so that every partition contains exactly one array element.


scala> var rdd = sc.makeRDD(1 to 10,3)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[38] at makeRDD at <console>:21
scala> rdd.partitions.size
res33: Int = 3 // the RDD has 3 partitions
scala> rdd.glom().collect
res35: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9, 10))
// glom gathers the elements of each partition into one array, so the result is 3 arrays
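
As a small follow-up sketch, glom is handy for per-partition statistics, for example the maximum value of each partition:

// Per-partition maximum: glom turns every partition into a single Array[Int]
rdd.glom().map(_.max).collect
// e.g. Array(3, 6, 10) for the three partitions above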

9.union
This function is straightforward: it concatenates two RDDs without removing duplicates.
scala> var rdd1 = sc.makeRDD(1 to 2,1)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[45] at makeRDD at <console>:21
 
scala> rdd1.collect
res42: Array[Int] = Array(1, 2)
 
scala> var rdd2 = sc.makeRDD(2 to 3,1)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[46] at makeRDD at <console>:21
 
scala> rdd2.collect
res43: Array[Int] = Array(2, 3)
 
scala> rdd1.union(rdd2).collect
res44: Array[Int] = Array(1, 2, 2, 3)
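
Since union performs no shuffle, the result simply keeps the partitions of both inputs, as a quick sketch shows:

// union does not shuffle: the result has the partitions of both inputs (1 + 1 = 2 here)
rdd1.union(rdd2).partitions.size   // 2
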
10.intersection

def intersection(other: RDD[T]): RDD[T]
def intersection(other: RDD[T], numPartitions: Int): RDD[T]
def intersection(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]

This function returns the intersection of two RDDs, with duplicates removed.
The numPartitions parameter specifies the number of partitions of the returned RDD.
The partitioner parameter specifies the partitioning function to use.

scala> var rdd1 = sc.makeRDD(1 to 2,1)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[45] at makeRDD at <console>:21
 
scala> rdd1.collect
res42: Array[Int] = Array(1, 2)
 
scala> var rdd2 = sc.makeRDD(2 to 3,1)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[46] at makeRDD at <console>:21
 
scala> rdd2.collect
res43: Array[Int] = Array(2, 3)
 
scala> rdd1.intersection(rdd2).collect
res45: Array[Int] = Array(2)
 
scala> var rdd3 = rdd1.intersection(rdd2)
rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[59] at intersection at <console>:25
 
scala> rdd3.partitions.size
res46: Int = 1
 
scala> var rdd3 = rdd1.intersection(rdd2,2)
rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[65] at intersection at <console>:25
 
scala> rdd3.partitions.size


11.subtract

def subtract(other: RDD[T]): RDD[T]
def subtract(other: RDD[T], numPartitions: Int): RDD[T]
def subtract(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]

Similar to intersection, but this function returns the elements that appear in this RDD and not in otherRDD, without removing duplicates.
The parameters have the same meaning as in intersection.

scala> var rdd1 = sc.makeRDD(Seq(1,2,2,3))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[66] at makeRDD at <console>:21
 
scala> rdd1.collect
res48: Array[Int] = Array(1, 2, 2, 3)
 
scala> var rdd2 = sc.makeRDD(3 to 4)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[67] at makeRDD at <console>:21
 
scala> rdd2.collect
res49: Array[Int] = Array(3, 4)
 
scala> rdd1.subtract(rdd2).collect
res50: Array[Int] = Array(1, 2, 2)


12.mapPartitions

def mapPartitions[U](f: (Iterator[T]) => Iterator[U], preservesPartitioning: Boolean = false)(implicit arg0: ClassTag[U]): RDD[U]

This function is similar to map, except that the mapping function receives an iterator over each partition of the RDD instead of individual elements. When the mapping needs to create expensive auxiliary objects repeatedly, mapPartitions can be much more efficient than map.

For example, when writing all the data of an RDD to a database over JDBC, using map could open a connection for every single element, which is very expensive; with mapPartitions only one connection per partition is needed (see the sketch below).

The preservesPartitioning parameter indicates whether the partitioner of the parent RDD is kept.
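
As an illustration of the per-partition connection pattern described above, here is a minimal sketch; the JDBC URL, credentials and table t(value) are hypothetical placeholders, not part of the original example.

// Write an RDD to a database with one JDBC connection per partition.
import java.sql.DriverManager

val nums = sc.parallelize(1 to 100, 4)

val written = nums.mapPartitions { iter =>
  // one connection per partition instead of one per element
  val conn = DriverManager.getConnection("jdbc:mysql://host:3306/db", "user", "pass")
  val stmt = conn.prepareStatement("INSERT INTO t(value) VALUES (?)")
  var n = 0
  iter.foreach { v =>
    stmt.setInt(1, v)
    stmt.executeUpdate()
    n += 1
  }
  stmt.close()
  conn.close()
  Iterator.single(n)   // number of rows written by this partition
}
written.collect()      // an action is needed to actually trigger the writes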


var rdd1 = sc.makeRDD(1 to 5,2)
// rdd1 has two partitions
scala> var rdd3 = rdd1.mapPartitions{ x => {
| var result = List[Int]()
| var i = 0
| while(x.hasNext){
| i += x.next()
| }
| result.::(i).iterator // prepend the partition sum to the list and return an iterator
| }}
rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[84] at mapPartitions at <console>:23
 
// rdd3 holds the sum of the values in each partition of rdd1
scala> rdd3.collect
res65: Array[Int] = Array(3, 12)
scala> rdd3.partitions.size
res66: Int = 2


12.2 mapPartitions must return an iterator.
// Fragment from a larger job: sampleNum, fraction, index and the ExtBitmap class are
// assumed to be defined elsewhere. The point is that the partition function builds one
// bitmap per partition and returns it as a single-element iterator.
sc.parallelize(1 to sampleNum).sample(false, fraction).mapPartitions(
  it => {
    val bitmap = new ExtBitmap()
    val bucket = sampleNum * index.toLong
    while (it.hasNext) {
      bitmap.set(it.next() + bucket)
    }
    Set(index -> bitmap).iterator
  }
)
12.3 mapPartitions can also return a custom Iterator.
// Fragment: `head` is assumed to be a header string defined elsewhere.
// The custom Iterator prepends the header to the first element of each partition.
.mapPartitions {
  it => new Iterator[String] {
    var first = true
    def hasNext = it.hasNext
    def next = {
      val n = it.next
      val res = if (first) s"$head\n$n" else n
      first = false
      res
    }
  }
}
12.4 mapPartitions for key salting (mitigating data skew).
// Fragment: br_topN is assumed to be a broadcast set of hot (skewed) keys and rNum the
// number of salt buckets. Hot keys get a random suffix so that their records are spread
// across rNum different reduce partitions.
import scala.util.Random

rdd.mapPartitions {
  it =>
    val random = new Random()
    it.map {
      case (k, v) =>
        val key = if (br_topN.value.contains(k)) k + "_" + random.nextInt(rNum) else k
        key -> v
    }
}



13.mapPartitionsWithIndex

def mapPartitionsWithIndex[U](f: (Int, Iterator[T]) => Iterator[U], preservesPartitioning: Boolean = false)(implicit arg0: ClassTag[U]): RDD[U]

This works like mapPartitions, except that the mapping function takes two parameters, the first being the index of the partition.

var rdd1 = sc.makeRDD(1 to 5,2)
// rdd1 has two partitions
var rdd2 = rdd1.mapPartitionsWithIndex{
(x,iter) => { // two parameters: x is the partition index, iter the partition iterator
var result = List[String]()
var i = 0
while(iter.hasNext){
i += iter.next()
}
result.::(x + "|" + i).iterator
}
}
// rdd2 sums the numbers in each partition of rdd1 and prefixes each sum with its partition index
scala> rdd2.collect
res13: Array[String] = Array(0|3, 1|12)
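
A small follow-up sketch: mapPartitionsWithIndex is also a convenient way to check which elements ended up in which partition.

// Label every element with the index of the partition that holds it
rdd1.mapPartitionsWithIndex { (idx, iter) =>
  iter.map(v => s"partition $idx -> $v")
}.collect
// e.g. Array(partition 0 -> 1, partition 0 -> 2, partition 1 -> 3, partition 1 -> 4, partition 1 -> 5)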




14.zip

def zip[U](other: RDD[U])(implicit arg0: ClassTag[U]): RDD[(T, U)]

zip combines two RDDs into an RDD of key/value pairs. It assumes that the two RDDs have the same number of partitions and the same number of elements; otherwise an exception is thrown.


scala> var rdd1 = sc.makeRDD(1 to 5,2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at makeRDD at <console>:21
 
scala> var rdd2 = sc.makeRDD(Seq("A","B","C","D","E"),2)
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[2] at makeRDD at <console>:21
 
scala> rdd1.zip(rdd2).collect
res0: Array[(Int, String)] = Array((1,A), (2,B), (3,C), (4,D), (5,E))
 
scala> rdd2.zip(rdd1).collect
res1: Array[(String, Int)] = Array((A,1), (B,2), (C,3), (D,4), (E,5))
 
scala> var rdd3 = sc.makeRDD(Seq("A","B","C","D","E"),3)
rdd3: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[5] at makeRDD at <console>:21
 
scala> rdd1.zip(rdd3).collect
java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions
// an exception is thrown if the two RDDs have different numbers of partitions


15.zipPartitions
zipPartitions combines multiple RDDs partition by partition into a new RDD. The RDDs being combined must have the same number of partitions, but there is no requirement on the number of elements within each partition.

This function has several overloads, which fall into three groups according to how many other RDDs are zipped together (one, two, or three); only the single-RDD group is shown here:

  • One other RDD as the parameter

def zipPartitions[B, V](rdd2: RDD[B])(f: (Iterator[T], Iterator[B]) => Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[V]): RDD[V]

def zipPartitions[B, V](rdd2: RDD[B], preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B]) => Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[V]): RDD[V]

The only difference between these two overloads is the preservesPartitioning parameter, i.e. whether the partitioner of the parent RDD is kept.

The mapping function f receives the iterators of the two RDDs' corresponding partitions as its parameters.
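
A minimal sketch of the single-RDD variant, assuming two RDDs with the same number of partitions (the element counts per partition may differ):

// zipPartitions pairs up corresponding partitions; f receives one iterator per RDD
val a = sc.makeRDD(1 to 5, 2)
val b = sc.makeRDD(Seq("A", "B", "C", "D", "E"), 2)

a.zipPartitions(b) { (it1, it2) =>
  it1.zip(it2).map { case (i, s) => i + "-" + s }
}.collect
// Array(1-A, 2-B, 3-C, 4-D, 5-E)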



16.sample
sample(withReplacement, fraction, seed): randomly samples a fraction of the data using the given random number generator seed. withReplacement controls whether sampling is done with replacement: true means with replacement, false without. Note that fraction is an expected fraction of the data, not an exact count.
Example 5: randomly sample about 50% of the RDD with replacement, using random seed 3 (the seed only makes the pseudo-random sampling reproducible; it does not determine which values are drawn first).
// setup omitted
    val rdd = sc.parallelize(1 to 10)
    val sample1 = rdd.sample(true,0.5,3)
    sample1.collect.foreach(x => print(x + " "))
    sc.stop
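
For contrast, a minimal sketch of sampling without replacement under the same fraction and seed (the exact elements returned depend on the seed and the Spark version):

    // withReplacement = false: each element appears at most once, and roughly
    // fraction * count elements are returned (not an exact count)
    val sample2 = rdd.sample(false, 0.5, 3)
    sample2.collect.foreach(x => print(x + " "))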