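In Spark, stage boundaries are determined by shuffle (wide) dependencies, not by actions: a transformation such as reduceByKey introduces a shuffle, and the shuffle repartitions the data. Spark ships two partitioners for this, HashPartitioner and RangePartitioner. HashPartitioner assigns records to partitions by the hash of their key. When a single key carries a large share of the data, the result is data skew: each partition is processed by one task, so the task with the most data takes the longest and stretches the running time of the whole job. When partition sizes are uneven, RangePartitioner can be used instead. It is built on reservoir sampling, which draws every record with equal probability without knowing the total data size in advance. Note that records with the same key still land in the same partition under either partitioner.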
I. HashPartitioner
/**
 * A [[org.apache.spark.Partitioner]] that implements hash-based partitioning using
 * Java's `Object.hashCode`.
 *
 * Java arrays have hashCodes that are based on the arrays' identities rather than their contents,
 * so attempting to partition an RDD[Array[_]] or RDD[(Array[_], _)] using a HashPartitioner will
 * produce an unexpected or incorrect result.
 */
class HashPartitioner(partitions: Int) extends Partitioner {
  require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")
  def numPartitions: Int = partitions
  def getPartition(key: Any): Int = key match {
    case null => 0
    case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
  }
  override def equals(other: Any): Boolean = other match {
    case h: HashPartitioner =>
      h.numPartitions == numPartitions
    case _ =>
      false
  }
  override def hashCode: Int = numPartitions
}
/* Calculates 'x' modulo 'mod', takes to consideration sign of x,
 * i.e. if 'x' is negative, than 'x' % 'mod' is negative too
 * so function return (x % mod) + mod in that case.
 */
def nonNegativeMod(x: Int, mod: Int): Int = {
  val rawMod = x % mod
  rawMod + (if (rawMod < 0) mod else 0)
}
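HashPartitioner partitions an RDD by its keys. A null key maps to partition 0. For any other key, the partition ID is key.hashCode modulo numPartitions; because the JVM remainder of a negative hash code is negative, numPartitions is added in that case so that the result always falls in [0, numPartitions). Records with the same key therefore always end up in the same partition, which is exactly what makes a hot key cause data skew and slow the job down.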
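To make the skew concrete, here is a minimal plain-Scala sketch (no Spark dependency) that re-implements the same modulo logic and counts how many records each partition receives. The object name HashSkewSketch, the helper partitionSizes, and the sample keys are hypothetical and exist only for illustration.

```scala
object HashSkewSketch {
  // Same logic as Utils.nonNegativeMod: fold a negative remainder back into [0, mod).
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val rawMod = x % mod
    rawMod + (if (rawMod < 0) mod else 0)
  }

  // Mirrors HashPartitioner.getPartition: a null key goes to partition 0.
  def getPartition(key: Any, numPartitions: Int): Int = key match {
    case null => 0
    case _ => nonNegativeMod(key.hashCode, numPartitions)
  }

  // Hypothetical helper: number of records that land in each partition.
  def partitionSizes(keys: Seq[Any], numPartitions: Int): Array[Int] = {
    val sizes = new Array[Int](numPartitions)
    keys.foreach(k => sizes(getPartition(k, numPartitions)) += 1)
    sizes
  }

  def main(args: Array[String]): Unit = {
    // One hot key with 1000 records plus 30 distinct keys with one record each.
    val keys: Seq[Any] = Seq.fill(1000)("hot") ++ (1 to 30).map(i => s"key-$i")
    println(partitionSizes(keys, 4).mkString(", "))
  }
}
```

Whichever partition the hot key hashes to receives at least 1000 of the 1030 records, so one task does almost all of the work regardless of how many partitions are configured.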
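II. RangePartitioner
HashPartitioner can leave partitions with very different amounts of data. RangePartitioner instead tries to keep partition sizes roughly equal by mapping each contiguous range of keys to one partition. The partitions are ordered relative to each other (every key in one partition sorts before every key in the next), but the order of elements within a partition is not guaranteed.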
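1. How reservoir sampling works
Given an array N of length n (too large to load into memory), how can k elements be drawn from it with equal probability to form an array R?
Reservoir sampling proceeds as follows. First, copy the first k elements of N into R. Then scan the remaining elements of N: for the i-th element N[i-1] (i > k), draw a random integer rand uniformly from [0, i); if rand < k, replace R[rand] with N[i-1], otherwise leave R unchanged. Every element of N ends up in R with probability k/n. One of the first k elements is placed in R at the start and only has to survive each later step. At step i it is evicted with probability $\frac{k}{i}\cdot\frac{1}{k}=\frac{1}{i}$, so it remains with probability $\prod_{i=k+1}^{n}\left(1-\frac{1}{i}\right)=\frac{k}{k+1}\cdot\frac{k+1}{k+2}\cdots\frac{n-1}{n}=\frac{k}{n}$. One of the remaining n-k elements, say the i-th, has to be selected when it is visited (probability $\frac{k}{i}$) and then survive every later step, which gives $\frac{k}{i}\prod_{j=i+1}^{n}\left(1-\frac{1}{j}\right)=\frac{k}{i}\cdot\frac{i}{n}=\frac{k}{n}$.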
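The procedure above can be sketched in plain Scala, independent of Spark, together with a small empirical check of the k/n claim. The names ReservoirSketch, sample, and frequencies, as well as the trial count and seed, are hypothetical choices for illustration.

```scala
import scala.reflect.ClassTag
import scala.util.Random

object ReservoirSketch {
  // Draw k items uniformly (without replacement) from an iterator of unknown length.
  def sample[T: ClassTag](input: Iterator[T], k: Int, rand: Random): Array[T] = {
    val reservoir = new Array[T](k)
    var i = 0
    // Fill the reservoir with the first k elements.
    while (i < k && input.hasNext) {
      reservoir(i) = input.next()
      i += 1
    }
    if (i < k) {
      // Fewer than k elements in total: return all of them.
      reservoir.take(i)
    } else {
      var seen = i.toLong
      while (input.hasNext) {
        val item = input.next()
        seen += 1
        // Uniform index in [0, seen); it falls below k with probability k / seen.
        val j = (rand.nextDouble() * seen).toLong
        if (j < k) reservoir(j.toInt) = item
      }
      reservoir
    }
  }

  // Fraction of trials in which each of the n elements 0 until n was sampled.
  def frequencies(n: Int, k: Int, trials: Int, seed: Long): Array[Double] = {
    val rand = new Random(seed)
    val hits = new Array[Int](n)
    for (_ <- 0 until trials) {
      sample((0 until n).iterator, k, rand).foreach(x => hits(x) += 1)
    }
    hits.map(_.toDouble / trials)
  }

  def main(args: Array[String]): Unit = {
    // With n = 10 and k = 3, every frequency should be close to k / n = 0.3.
    println(frequencies(10, 3, 20000, 42L).map(f => f"$f%.3f").mkString(" "))
  }
}
```

The input is consumed exactly once and only k elements are held in memory at any time, which is why the technique suits data whose size is unknown or too large to materialize.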
2. RangePartitioner
// An array of upper bounds for the first (partitions - 1) partitions
private var rangeBounds: Array[K] = {
  if (partitions <= 1) {
    Array.empty
  } else {
    // This is the sample size we need to have roughly balanced output partitions, capped at 1M.
    // Cast to double to avoid overflowing ints or longs
    val sampleSize = math.min(samplePointsPerPartitionHint.toDouble * partitions, 1e6)
    // Assume the input partitions are roughly balanced and over-sample a little bit.
    val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.length).toInt
    val (numItems, sketched) = RangePartitioner.sketch(rdd.map(_._1), sampleSizePerPartition)
    if (numItems == 0L) {
      Array.empty
    } else {
      // If a partition contains much more than the average number of items, we re-sample from it
      // to ensure that enough items are collected from that partition.
      val fraction = math.min(sampleSize / math.max(numItems, 1L), 1.0)
      val candidates = ArrayBuffer.empty[(K, Float)]
      val imbalancedPartitions = mutable.Set.empty[Int]
      sketched.foreach { case (idx, n, sample) =>
        if (fraction * n > sampleSizePerPartition) {
          imbalancedPartitions += idx
        } else {
          // The weight is 1 over the sampling probability.
          val weight = (n.toDouble / sample.length).toFloat
          for (key <- sample) {
            candidates += ((key, weight))
          }
        }
      }
      if (imbalancedPartitions.nonEmpty) {
        // Re-sample imbalanced partitions with the desired sampling probability.
        val imbalanced = new PartitionPruningRDD(rdd.map(_._1), imbalancedPartitions.contains)
        val seed = byteswap32(-rdd.id - 1)
        val reSampled = imbalanced.sample(withReplacement = false, fraction, seed).collect()
        val weight = (1.0 / fraction).toFloat
        candidates ++= reSampled.map(x => (x, weight))
      }
      RangePartitioner.determineBounds(candidates, math.min(partitions, candidates.size))
    }
  }
}
/**
 * Sketches the input RDD via reservoir sampling on each partition.
 *
 * @param rdd the input RDD to sketch
 * @param sampleSizePerPartition max sample size per partition
 * @return (total number of items, an array of (partitionId, number of items, sample))
 */
def sketch[K : ClassTag](
    rdd: RDD[K],
    sampleSizePerPartition: Int): (Long, Array[(Int, Long, Array[K])]) = {
  val shift = rdd.id
  // val classTagK = classTag[K] // to avoid serializing the entire partitioner object
  val sketched = rdd.mapPartitionsWithIndex { (idx, iter) =>
    val seed = byteswap32(idx ^ (shift << 16))
    val (sample, n) = SamplingUtils.reservoirSampleAndCount(
      iter, sampleSizePerPartition, seed)
    Iterator((idx, n, sample))
  }.collect()
  val numItems = sketched.map(_._2).sum
  (numItems, sketched)
}
private[spark] object SamplingUtils {
  /**
   * Reservoir sampling implementation that also returns the input size.
   *
   * @param input input size
   * @param k reservoir size
   * @param seed random seed
   * @return (samples, input size)
   */
  def reservoirSampleAndCount[T: ClassTag](
      input: Iterator[T],
      k: Int,
      seed: Long = Random.nextLong())
    : (Array[T], Long) = {
    val reservoir = new Array[T](k)
    // Put the first k elements in the reservoir.
    var i = 0
    while (i < k && input.hasNext) {
      val item = input.next()
      reservoir(i) = item
      i += 1
    }
    // If we have consumed all the elements, return them. Otherwise do the replacement.
    if (i < k) {
      // If input size < k, trim the array to return only an array of input size.
      val trimReservoir = new Array[T](i)
      System.arraycopy(reservoir, 0, trimReservoir, 0, i)
      (trimReservoir, i)
    } else {
      // If input size > k, continue the sampling process.
      var l = i.toLong
      val rand = new XORShiftRandom(seed)
      while (input.hasNext) {
        val item = input.next()
        l += 1
        // There are k elements in the reservoir, and the l-th element has been
        // consumed. It should be chosen with probability k/l. The expression
        // below is a random long chosen uniformly from [0,l)
        val replacementIndex = (rand.nextDouble() * l).toLong
        if (replacementIndex < k) {
          reservoir(replacementIndex.toInt) = item
        }
      }
      (reservoir, l)
    }
  }
}
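To make the sampling arithmetic in rangeBounds concrete, the following plain-Scala sketch reproduces just the three formulas (sampleSize, sampleSizePerPartition, and the imbalance test) on made-up numbers. The object SampleSizeSketch, its method names, and the per-partition record counts are hypothetical; the formulas themselves are taken from the source above.

```scala
object SampleSizeSketch {
  // Total number of samples to aim for, capped at 1e6.
  def sampleSize(samplePointsPerPartitionHint: Int, partitions: Int): Double =
    math.min(samplePointsPerPartitionHint.toDouble * partitions, 1e6)

  // Reservoir size for each input partition, over-sampled by a factor of 3.
  def sampleSizePerPartition(sampleSize: Double, numInputPartitions: Int): Int =
    math.ceil(3.0 * sampleSize / numInputPartitions).toInt

  // Indices of input partitions whose expected share of the sample exceeds the reservoir size.
  def imbalanced(counts: Seq[Long], sampleSize: Double, perPartition: Int): Seq[Int] = {
    val numItems = counts.sum
    val fraction = math.min(sampleSize / math.max(numItems, 1L), 1.0)
    counts.zipWithIndex.collect { case (n, idx) if fraction * n > perPartition => idx }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical job: 10 output partitions, hint of 20, 4 input partitions.
    val total = sampleSize(20, 10)                    // 200.0
    val perPartition = sampleSizePerPartition(total, 4) // ceil(600 / 4) = 150
    // The last input partition holds 97% of the records.
    val counts = Seq(1000L, 1000L, 1000L, 97000L)
    println(s"sampleSize=$total perPartition=$perPartition " +
      s"imbalanced=${imbalanced(counts, total, perPartition)}")
  }
}
```

In this example fraction is 200 / 100000 = 0.002, so the large partition would be expected to contribute about 194 samples, more than the 150 its reservoir can hold, and it is the one that gets re-sampled.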
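How RangePartitioner partitions data:
1. Compute the overall sample size sampleSize: samplePointsPerPartitionHint (20 by default) samples per output partition, capped at 1e6 samples in total.
2. From sampleSize and the number of input partitions, compute the maximum number of samples per partition: sampleSizePerPartition = ceil(3.0 * sampleSize / rdd.partitions.length).
3. Run reservoir sampling on every input partition with that limit (RangePartitioner.sketch). It returns the total number of records in the RDD and, for each partition, its ID, its record count, and its sample.
4. Partitions holding far more records than average (fraction * n > sampleSizePerPartition) are re-sampled with RDD.sample.
5. From the weighted samples candidates: ArrayBuffer[(K, Float)], compute the array of partition boundaries rangeBounds (RangePartitioner.determineBounds).
6. When a key is looked up, rangeBounds is scanned linearly if it has at most 128 entries; otherwise binary search locates the range the key falls into. For ascending order, the resulting index into rangeBounds is the partition ID.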
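Steps 5 and 6 can be sketched in plain Scala as follows. This is a simplified illustration under stated assumptions, not Spark's actual implementation: keys are fixed to Int, only ascending order is handled, and the object name RangeLookupSketch is hypothetical. The method names determineBounds and getPartition follow the names used in Spark, but their bodies here are reduced versions written for this example.

```scala
import java.util.Arrays
import scala.collection.mutable.ArrayBuffer

object RangeLookupSketch {
  // Step 5 (simplified): sort the weighted samples and cut a boundary each time the
  // cumulative weight crosses the next multiple of (total weight / partitions).
  def determineBounds(candidates: Seq[(Int, Float)], partitions: Int): Array[Int] = {
    val ordered = candidates.sortBy(_._1)
    val step = ordered.map(_._2.toDouble).sum / partitions
    val bounds = ArrayBuffer.empty[Int]
    var cumWeight = 0.0
    var target = step
    var i = 0
    while (i < ordered.size && bounds.size < partitions - 1) {
      val (key, weight) = ordered(i)
      cumWeight += weight
      // Skip duplicate keys so that the bounds stay strictly increasing.
      if (cumWeight >= target && (bounds.isEmpty || key > bounds.last)) {
        bounds += key
        target += step
      }
      i += 1
    }
    bounds.toArray
  }

  // Step 6: linear scan for at most 128 bounds, binary search otherwise.
  def getPartition(key: Int, rangeBounds: Array[Int]): Int = {
    if (rangeBounds.length <= 128) {
      var p = 0
      while (p < rangeBounds.length && key > rangeBounds(p)) p += 1
      p
    } else {
      // Arrays.binarySearch returns the match index, or -(insertion point) - 1 if absent.
      val idx = Arrays.binarySearch(rangeBounds, key)
      if (idx < 0) -idx - 1 else idx
    }
  }

  def main(args: Array[String]): Unit = {
    // Eight equally weighted samples split into four ranges.
    val bounds = determineBounds((1 to 8).map(k => (k, 1.0f)), 4)
    println(bounds.mkString(", "))
    println(Seq(1, 3, 6, 100).map(getPartition(_, bounds)).mkString(", "))
  }
}
```

Each boundary is an upper bound for its partition, so a key equal to a boundary belongs to that partition, and any key larger than the last boundary goes to the final partition, whose index equals rangeBounds.length.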