Some not-so-significant data structures in Spark

Original

tydhot

2020-06-27 23:58

Spark version: 2.4.0

CompactBuffer is an append-only buffer of objects, similar to Scala's ArrayBuffer but optimized for memory use when it holds only a few elements.

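By default, a plain ArrayBuffer allocates a backing array of 16 slots as soon as it is constructed. When the buffer ends up holding only one or two elements, that is wasteful.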

private var element0: T = _
private var element1: T = _

// Number of elements, including our two in the main object
private var curSize = 0

// Array for extra elements
private var otherElements: Array[T] = null

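In CompactBuffer, as long as there are at most two elements, only the fields element0 and element1 are used; they hold the first two elements. The otherElements array is allocated only once a third element is added, and it holds everything from the third element on. This keeps the memory footprint of small object buffers low.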
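The layout described above can be sketched as follows. This is a simplified illustration, not Spark's actual implementation: the class name TinyCompactBuffer, the helper usesOverflowArray, and the initial overflow capacity of 8 are assumptions made for the example.

```scala
import scala.reflect.ClassTag

// Hypothetical, simplified sketch of the idea; not Spark's CompactBuffer source.
class TinyCompactBuffer[T: ClassTag] {
  // The first two elements live directly in fields of the object.
  private var element0: T = _
  private var element1: T = _

  // Number of elements, including the two stored in fields.
  private var curSize = 0

  // Overflow array, allocated lazily when a third element arrives.
  private var otherElements: Array[T] = null

  def size: Int = curSize

  def apply(i: Int): T = {
    if (i < 0 || i >= curSize) throw new IndexOutOfBoundsException(i.toString)
    if (i == 0) element0
    else if (i == 1) element1
    else otherElements(i - 2)
  }

  def +=(value: T): this.type = {
    val pos = curSize
    if (pos == 0) element0 = value
    else if (pos == 1) element1 = value
    else {
      growIfNeeded(pos - 1) // overflow slots needed after this append
      otherElements(pos - 2) = value
    }
    curSize += 1
    this
  }

  // Allocate or double the overflow array (initial capacity 8 is an assumption).
  private def growIfNeeded(needed: Int): Unit = {
    if (otherElements == null) {
      otherElements = new Array[T](math.max(needed, 8))
    } else if (needed > otherElements.length) {
      val bigger = new Array[T](math.max(needed, otherElements.length * 2))
      System.arraycopy(otherElements, 0, bigger, 0, otherElements.length)
      otherElements = bigger
    }
  }

  // Exposed only so the example can show when the array gets allocated.
  def usesOverflowArray: Boolean = otherElements != null
}
```

With this sketch, appending two elements leaves usesOverflowArray false; the array appears only when the third element is appended.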

MedianHeap tracks a group of numbers so that their median can be read quickly.

/**
 * Stores all the numbers less than the current median in a smallerHalf,
 * i.e median is the maximum, at the root.
 */
private[this] var smallerHalf = PriorityQueue.empty[Double](ord)

/**
 * Stores all the numbers greater than the current median in a largerHalf,
 * i.e median is the minimum, at the root.
 */
private[this] var largerHalf = PriorityQueue.empty[Double](ord.reverse)

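MedianHeap keeps two priority queues. smallerHalf holds the numbers below the median, with the largest of them at its head; largerHalf holds the numbers above the median, with the smallest of them at its head.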

def median: Double = {
  if (isEmpty) {
    throw new NoSuchElementException("MedianHeap is empty.")
  }
  if (largerHalf.size == smallerHalf.size) {
    (largerHalf.head + smallerHalf.head) / 2.0
  } else if (largerHalf.size > smallerHalf.size) {
    largerHalf.head
  } else {
    smallerHalf.head
  }
}

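If the two queues are the same size, the median is the average of the two heads, i.e. the largest value of smallerHalf and the smallest value of largerHalf. Otherwise, it is the head of the queue that holds more elements.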

private[this] def rebalance(): Unit = {
  if (largerHalf.size - smallerHalf.size > 1) {
    smallerHalf.enqueue(largerHalf.dequeue())
  }
  if (smallerHalf.size - largerHalf.size > 1) {
    largerHalf.enqueue(smallerHalf.dequeue)
  }
}

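Both queues are PriorityQueue instances, so each one always exposes its extreme element at the head. To rebalance, rebalance() therefore only needs to move the head of the larger queue into the smaller one whenever the sizes differ by more than 1. Assuming it runs after every insertion, a single move is enough to bring the difference back to at most 1.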
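To see the two queues working together, here is a minimal self-contained sketch of the two-heap technique. The class name TwoHeapMedian, the insert method, and the explicit ordering built with Ordering.fromLessThan are assumptions for the example; median and rebalance follow the snippets above.

```scala
import scala.collection.mutable.PriorityQueue

// Hypothetical sketch of the two-heap median technique; not Spark's MedianHeap source.
class TwoHeapMedian {
  // Explicit ordering, chosen here so the sketch does not depend on an implicit one.
  private val ord: Ordering[Double] = Ordering.fromLessThan[Double](_ < _)

  // Max-heap: the head is the largest of the smaller numbers.
  private val smallerHalf = PriorityQueue.empty[Double](ord)
  // Min-heap: the head is the smallest of the larger numbers.
  private val largerHalf = PriorityQueue.empty[Double](ord.reverse)

  def size: Int = smallerHalf.size + largerHalf.size
  def isEmpty: Boolean = size == 0

  // Route the new number to the correct half, then restore the size balance.
  def insert(x: Double): Unit = {
    if (isEmpty || x > median) largerHalf.enqueue(x) else smallerHalf.enqueue(x)
    rebalance()
  }

  def median: Double = {
    if (isEmpty) throw new NoSuchElementException("TwoHeapMedian is empty.")
    if (largerHalf.size == smallerHalf.size) (largerHalf.head + smallerHalf.head) / 2.0
    else if (largerHalf.size > smallerHalf.size) largerHalf.head
    else smallerHalf.head
  }

  // Move one head across whenever the sizes differ by more than 1.
  private def rebalance(): Unit = {
    if (largerHalf.size - smallerHalf.size > 1) smallerHalf.enqueue(largerHalf.dequeue())
    if (smallerHalf.size - largerHalf.size > 1) largerHalf.enqueue(smallerHalf.dequeue())
  }
}
```

Inserting 5, 1 and 3 gives a median of 3.0; inserting 4 afterwards makes both halves the same size, so the median becomes the average of 3 and 4.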

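TimeStampedHashMap is a Map that stores a timestamp alongside each key-value pair, which makes it possible to evict pairs that have not been used for a long time.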

override def += (kv: (A, B)): this.type = {
  kv match { case (a, b) => internalMap.put(a, TimeStampedValue(b, currentTime)) }
  this
}

private[spark] case class TimeStampedValue[V](value: V, timestamp: Long)

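Keys get no special treatment. Each value is wrapped in a case class that holds the value together with a timestamp, so the map records when each pair was last written (the += above stamps it with the current time) in addition to the pair itself. To evict stale pairs, one pass over the map that removes every pair whose timestamp is older than the target time is enough.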
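The eviction pass described above can be sketched like this. The class name ExpiringMap, the injectable clock parameter, and the method names are assumptions for the example, and the sketch uses a plain mutable.HashMap that is not thread-safe; only TimeStampedValue mirrors the snippet above.

```scala
import scala.collection.mutable

// Same shape as the case class shown above.
final case class TimeStampedValue[V](value: V, timestamp: Long)

// Hypothetical sketch; not Spark's TimeStampedHashMap source.
class ExpiringMap[A, B](clock: () => Long = () => System.currentTimeMillis()) {
  private val internalMap = mutable.HashMap.empty[A, TimeStampedValue[B]]

  // Every write stamps the pair with the current time.
  def put(key: A, value: B): Unit =
    internalMap.put(key, TimeStampedValue(value, clock()))

  def get(key: A): Option[B] = internalMap.get(key).map(_.value)

  def size: Int = internalMap.size

  // One pass: collect the keys stamped before threshTime, then remove them.
  def clearOldValues(threshTime: Long): Unit = {
    val stale = internalMap.collect {
      case (k, v) if v.timestamp < threshTime => k
    }.toList
    stale.foreach(k => internalMap.remove(k))
  }
}
```

Passing the clock as a function is a design choice for the example only: it lets a test control time instead of sleeping.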

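BoundedPriorityQueue is a thin wrapper around PriorityQueue with a fixed capacity. Once the queue is full, each new element is compared with the smallest element in the queue and replaces it only if it is larger. As a result, the queue always holds the largest elements seen so far, up to its capacity.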

private def maybeReplaceLowest(a: A): Boolean = {
  val head = underlying.peek()
  if (head != null && ord.gt(a, head)) {
    underlying.poll()
    underlying.offer(a)
  } else {
    false
  }
}
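A minimal sketch of how such a bounded queue keeps the top k elements is shown below. The class name TopK and the methods += and toSortedList are assumptions for the example; maybeReplaceLowest follows the snippet above and relies on java.util.PriorityQueue being a min-heap under the given comparator.

```scala
import java.util.{PriorityQueue => JPriorityQueue}

// Hypothetical sketch of a bounded top-k queue; not Spark's BoundedPriorityQueue source.
class TopK[A](maxSize: Int)(implicit ord: Ordering[A]) {
  require(maxSize > 0, "maxSize must be positive")

  // Min-heap: peek() returns the smallest element currently kept.
  private val underlying = new JPriorityQueue[A](maxSize, ord)

  // Add freely while there is room; afterwards only replace the smallest element.
  def +=(elem: A): this.type = {
    if (underlying.size() < maxSize) underlying.offer(elem)
    else maybeReplaceLowest(elem)
    this
  }

  private def maybeReplaceLowest(a: A): Boolean = {
    val head = underlying.peek()
    if (head != null && ord.gt(a, head)) {
      underlying.poll()
      underlying.offer(a)
    } else {
      false
    }
  }

  // Kept elements, largest first.
  def toSortedList: List[A] = {
    val it = underlying.iterator()
    var out = List.empty[A]
    while (it.hasNext) out = it.next() :: out
    out.sorted(ord.reverse)
  }
}
```

For example, with a capacity of 3, feeding in 5, 1, 9, 3 and 7 leaves 9, 7 and 5 in the queue: 3 evicts 1, and 7 then evicts 3.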
