Below is an annotated walkthrough of the StorageLevel class:
/**
* :: DeveloperApi ::
* Flags for controlling the storage of an RDD. Each StorageLevel records whether to use memory,
* or ExternalBlockStore, whether to drop the RDD to disk if it falls out of memory or
* ExternalBlockStore, whether to keep the data in memory in a serialized format, and whether
* to replicate the RDD partitions on multiple nodes.
*
* The [[org.apache.spark.storage.StorageLevel$]] singleton object contains some static constants
* for commonly useful storage levels. To create your own storage level object, use the
* factory method of the singleton object (`StorageLevel(...)`).
*/
The members of StorageLevel are as follows:
class StorageLevel private(
private var _useDisk: Boolean,
private var _useMemory: Boolean,
private var _useOffHeap: Boolean, // use off-heap storage
private var _deserialized: Boolean,
private var _replication: Int = 1) // the default number of replicas is 1
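The constants defined in the StorageLevel companion object are explained below:
object StorageLevel {
// Do not persist the RDD at all
val NONE = new StorageLevel(false, false, false, false)
// Store the RDD partitions only on the local disk of the node
val DISK_ONLY = new StorageLevel(true, false, false, false)
// Same as DISK_ONLY, plus one identical replica on another node
val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
// Store the RDD partitions as deserialized Java objects in the JVM. If the RDD is too large to fit
// in memory, the partitions that do not fit are not cached and are recomputed whenever they are
// needed. This is the default caching level.
val MEMORY_ONLY = new StorageLevel(false, true, false, true)
// Same as MEMORY_ONLY, plus one identical replica on another node
val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
// Same as MEMORY_ONLY and MEMORY_ONLY_2, but the partitions are kept in serialized form
val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
// Store the RDD partitions as deserialized objects in the JVM. If the RDD is too large to fit in
// memory, the partitions that do not fit are written to disk and read back from there when needed.
val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
// Same as MEMORY_AND_DISK, plus one identical replica on another node
val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
// Same as MEMORY_AND_DISK and MEMORY_AND_DISK_2, but the partitions are kept in serialized form
val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
// Store the serialized RDD partitions in Tachyon (off-heap storage)
val OFF_HEAP = new StorageLevel(false, false, true, false)
// ... (factory methods omitted)
}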
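To make the flag combinations above easier to read, here is a minimal, self-contained sketch. It assumes a hypothetical `Level` case class with the same five fields in the same order as the listing; it is not Spark's `StorageLevel` (whose constructor is private), and the `describe` helper is invented purely for illustration.

```scala
object StorageLevelSketch {
  // Hypothetical stand-in for StorageLevel: same five fields, same order as the listing.
  final case class Level(
      useDisk: Boolean,
      useMemory: Boolean,
      useOffHeap: Boolean,
      deserialized: Boolean,
      replication: Int = 1) {

    // Human-readable summary of what the flags mean (illustrative helper, not a Spark API).
    def describe: String = {
      val places = Seq(
        if (useMemory) Some("memory") else None,
        if (useDisk) Some("disk") else None,
        if (useOffHeap) Some("off-heap") else None
      ).flatten
      if (places.isEmpty) {
        "not stored"
      } else {
        val where = places.mkString("+")
        val format = if (deserialized) "deserialized" else "serialized"
        s"$where, $format, ${replication}x replicated"
      }
    }
  }

  // Flag values copied from the constants in the listing.
  val NONE = Level(false, false, false, false)
  val MEMORY_ONLY = Level(false, true, false, true)
  val MEMORY_AND_DISK_SER_2 = Level(true, true, false, false, 2)

  def main(args: Array[String]): Unit = {
    val samples = Seq(
      "NONE" -> NONE,
      "MEMORY_ONLY" -> MEMORY_ONLY,
      "MEMORY_AND_DISK_SER_2" -> MEMORY_AND_DISK_SER_2)
    samples.foreach { case (name, level) => println(s"$name: ${level.describe}") }
  }
}
```

Reading a constant then comes down to reading its four booleans left to right (disk, memory, off-heap, deserialized) and the optional replication count.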
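Next, we look at how RDD storage is invoked.
The relationship between an RDD and its Blocks is as follows:
An RDD has multiple Partitions, and each Partition corresponds to one Block, so an RDD is made up of multiple Blocks. Each Block has a unique BlockId, and the name of an RDD block follows the rule "rdd_" + rddId + "_" + splitIndex, where splitIndex is the index of the Partition that the Block belongs to.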
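The naming rule can be sketched in a few lines. The `RddBlockName` class and the `blockNames` helper below are hypothetical names used only for this illustration; they assume nothing beyond the "rdd_" + RDD id + "_" + partition index rule described above.

```scala
object BlockNamingSketch {
  // Hypothetical mirror of the naming rule: "rdd_" + rddId + "_" + splitIndex.
  final case class RddBlockName(rddId: Int, splitIndex: Int) {
    def name: String = "rdd_" + rddId + "_" + splitIndex
  }

  // One block name per partition of the RDD.
  def blockNames(rddId: Int, numPartitions: Int): Seq[String] =
    (0 until numPartitions).map(i => RddBlockName(rddId, i).name)

  def main(args: Array[String]): Unit =
    blockNames(rddId = 3, numPartitions = 4).foreach(println)
}
```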
The RDD persistence path is analyzed below:
/**
* Mark this RDD for persisting using the specified level.
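* Note: this.type denotes the type of the current object (this), where this refers to the
* current object. this.type can be used in the type declarations of variables, function
* parameters, and function return values.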
* @param newLevel the target storage level
* @param allowOverride whether to override any existing level with the new one
*/
private def persist(newLevel: StorageLevel, allowOverride: Boolean): this.type = {
// TODO: Handle changes of StorageLevel
if (storageLevel != StorageLevel.NONE && newLevel != storageLevel && !allowOverride) {
throw new UnsupportedOperationException(
"Cannot change storage level of an RDD after it was already assigned a level")
}
// If this is the first time this RDD is marked for persisting, register it
// with the SparkContext for cleanups and accounting. Do this only once.
if (storageLevel == StorageLevel.NONE) {
sc.cleaner.foreach(_.registerRDDForCleanup(this))
sc.persistRDD(this)
}
storageLevel = newLevel
this
}
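The guard at the top of this method is easy to misread, so here is a minimal sketch of its behavior in isolation. `MiniRdd`, the three `Level` objects, and the `registrations` counter are hypothetical stand-ins; the SparkContext registration is reduced to incrementing a counter.

```scala
object PersistGuardSketch {
  // Hypothetical, simplified levels; only their identity matters here.
  sealed trait Level
  case object NONE extends Level
  case object MEMORY_ONLY extends Level
  case object MEMORY_AND_DISK extends Level

  class MiniRdd {
    private var storageLevel: Level = NONE
    var registrations = 0 // counts how often the "register once" branch runs

    def persist(newLevel: Level, allowOverride: Boolean): this.type = {
      // Same condition as in the listing: an already assigned level may not be changed
      // to a different one unless the caller explicitly allows the override.
      if (storageLevel != NONE && newLevel != storageLevel && !allowOverride) {
        throw new UnsupportedOperationException(
          "Cannot change storage level of an RDD after it was already assigned a level")
      }
      if (storageLevel == NONE) registrations += 1 // first-time registration only
      storageLevel = newLevel
      this
    }

    def level: Level = storageLevel
  }

  def main(args: Array[String]): Unit = {
    val rdd = new MiniRdd
    rdd.persist(MEMORY_ONLY, allowOverride = false) // first assignment: accepted
    rdd.persist(MEMORY_ONLY, allowOverride = false) // same level again: accepted
    val changed = scala.util.Try(rdd.persist(MEMORY_AND_DISK, allowOverride = false))
    println(s"change without override failed: ${changed.isFailure}")
    rdd.persist(MEMORY_AND_DISK, allowOverride = true) // explicit override: accepted
    println(s"final level: ${rdd.level}, registrations: ${rdd.registrations}")
  }
}
```

Re-persisting with the same level is a no-op rather than an error, because the guard only fires when the new level differs from the current one.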
/**
* Set this RDD's storage level to persist its values across operations after the first time
* it is computed. This can only be used to assign a new storage level if the RDD does not
* have a storage level set yet. Local checkpointing is an exception.
*/
def persist(newLevel: StorageLevel): this.type = {
if (isLocallyCheckpointed) {
// This means the user previously called localCheckpoint(), which should have already
// marked this RDD for persisting. Here we should override the old storage level with
// one that is explicitly requested by the user (after adapting it to use disk).
persist(LocalRDDCheckpointData.transformStorageLevel(newLevel), allowOverride = true)
} else {
persist(newLevel, allowOverride = false)
}
}
/**
* Persist this RDD with the default storage level (`MEMORY_ONLY`).
* */
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
/**
* Persist this RDD with the default storage level (`MEMORY_ONLY`).
* */
def cache(): this.type = persist()
/**
* Mark the RDD as non-persistent, and remove all blocks for it from memory and disk.
* @param blocking Whether to block until all blocks are deleted.
* @return This RDD.
*/
def unpersist(blocking: Boolean = true): this.type = {
logInfo("Removing RDD " + id + " from persistence list")
sc.unpersistRDD(id, blocking)
storageLevel = StorageLevel.NONE // reset the storage level to NONE
this
}
/**
* Get the RDD's current storage level, or StorageLevel.NONE if none is set.
* */
def getStorageLevel: StorageLevel = storageLevel
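All of the persistence methods above return this.type rather than a fixed class. The following sketch shows what that buys in practice; `Dataset`, `PairDataset`, and `keysOnly` are hypothetical names invented for the example and have no counterpart in Spark.

```scala
object ThisTypeSketch {
  class Dataset {
    private var cached = false
    // Returning this.type (instead of Dataset) preserves the caller's static type.
    def cache(): this.type = { cached = true; this }
    def isCached: Boolean = cached
  }

  class PairDataset extends Dataset {
    // A method that exists only on the subclass.
    def keysOnly: String = "keys"
  }

  def main(args: Array[String]): Unit = {
    val pairs = new PairDataset
    // pairs.cache() has the type of pairs itself, so the subclass method is still visible.
    // If cache() were declared to return Dataset, this line would not compile.
    println(pairs.cache().keysOnly)
    println(pairs.isCached)
  }
}
```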
The RDD's iterator method is analyzed below:
/**
* Entry point of Task execution; the computation starts here.
* Internal method to this RDD; will read from cache if applicable, or otherwise compute it.
* This should ''not'' be called by users directly, but is available for implementors of custom
* subclasses of RDD.
*/
final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
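// If the storage level is not NONE, look for a cached copy first and compute only on a miss
if (storageLevel != StorageLevel.NONE) {
getOrCompute(split, context)
// SparkEnv holds the environment information that a node needs at runtime.
// In older Spark releases a CacheManager called the BlockManager to manage the RDD cache.
// If this RDD partition was computed and cached before, later runs read it straight from
// the BlockManager and return it.
} else {
// Not persisted: read the checkpointed result if a checkpoint exists, otherwise compute it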
computeOrReadCheckpoint(split, context)
}
}
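The decision made by iterator can be modeled without Spark. The sketch below is a simplified model under stated assumptions: `MiniRdd` and its fields are hypothetical, and each method returns a label next to the data so that the chosen path is visible. It is not how Spark represents partitions or checkpoints.

```scala
object IteratorDispatchSketch {
  // Hypothetical partition source: each field stands for one place the data can come from.
  final case class MiniRdd(
      persisted: Boolean,
      cachedBlock: Option[Seq[Int]],
      checkpoint: Option[Seq[Int]],
      compute: () => Seq[Int]) {

    // Models the else-branch: prefer checkpointed data, otherwise recompute.
    def computeOrReadCheckpoint(): (String, Iterator[Int]) =
      checkpoint match {
        case Some(data) => ("checkpoint", data.iterator)
        case None => ("computed", compute().iterator)
      }

    // Models the if-branch: use the cached block when present, else fall back.
    def getOrCompute(): (String, Iterator[Int]) =
      cachedBlock match {
        case Some(data) => ("cache", data.iterator)
        case None => computeOrReadCheckpoint()
      }

    // Same shape as the listing: the storage level decides which path is tried first.
    def iterator(): (String, Iterator[Int]) =
      if (persisted) getOrCompute() else computeOrReadCheckpoint()
  }

  def main(args: Array[String]): Unit = {
    val compute = () => Seq(1, 2, 3)
    val cases = Seq(
      MiniRdd(persisted = true, Some(Seq(9)), None, compute),
      MiniRdd(persisted = false, Some(Seq(9)), Some(Seq(7)), compute),
      MiniRdd(persisted = true, None, None, compute))
    cases.foreach { rdd =>
      val (path, data) = rdd.iterator()
      println(s"$path -> ${data.toList}")
    }
  }
}
```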
The getOrCompute method is as follows:
/** Gets or computes an RDD partition. Used by RDD.iterator() when an RDD is cached.
* */
private[spark] def getOrCompute(partition: Partition, context: TaskContext): Iterator[T] = {
// Derive the BlockId from the RDD id and the partition index
val blockId = RDDBlockId(id, partition.index)
var readCachedBlock = true
// This method is called on executors, so we need to call SparkEnv.get instead of sc.env.
// getOrElseUpdate first tries to read the block by its BlockId and stores it if it is missing;
// it is the entry point for both reading and writing block data.
SparkEnv.get.blockManager.getOrElseUpdate(blockId, storageLevel, elementClassTag,
// If the block does not exist, read the checkpointed result or compute the partition
() => {
readCachedBlock = false
computeOrReadCheckpoint(partition, context)
}) match {
case Left(blockResult) =>
if (readCachedBlock) {
val existingMetrics = context.taskMetrics().inputMetrics
existingMetrics.incBytesRead(blockResult.bytes)
new InterruptibleIterator[T](context, blockResult.data.asInstanceOf[Iterator[T]]) {
override def next(): T = {
existingMetrics.incRecordsRead(1)
delegate.next()
}
}
} else {
new InterruptibleIterator(context, blockResult.data.asInstanceOf[Iterator[T]])
}
case Right(iter) =>
new InterruptibleIterator(context, iter.asInstanceOf[Iterator[T]])
}
}
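Two details of getOrCompute are worth isolating: the readCachedBlock flag that the closure flips on a cache miss, and the iterator wrapper that counts records only when the data really came from the cache. The sketch below models both with a plain mutable map. `InputMetrics`, `counted`, and the map-based cache are hypothetical simplifications, not Spark classes.

```scala
object CountingIteratorSketch {
  import scala.collection.mutable

  // Hypothetical metrics holder standing in for the task's input metrics.
  final class InputMetrics { var recordsRead = 0L }

  // Wraps an iterator and bumps the counter on every next(), like the overridden next() above.
  def counted[T](underlying: Iterator[T], metrics: InputMetrics): Iterator[T] =
    new Iterator[T] {
      def hasNext: Boolean = underlying.hasNext
      def next(): T = {
        metrics.recordsRead += 1
        underlying.next()
      }
    }

  // `compute` runs only on a cache miss; the flag records whether that happened.
  def getOrCompute[T](
      cache: mutable.Map[String, Vector[T]],
      key: String,
      metrics: InputMetrics)(compute: () => Iterator[T]): Iterator[T] = {
    var readCachedBlock = true
    val block = cache.getOrElseUpdate(key, {
      readCachedBlock = false
      compute().toVector
    })
    // Only reads served from the cache are counted as input records.
    if (readCachedBlock) counted(block.iterator, metrics) else block.iterator
  }

  def main(args: Array[String]): Unit = {
    val metrics = new InputMetrics
    val cache = mutable.Map.empty[String, Vector[Int]]
    val first = getOrCompute(cache, "rdd_0_0", metrics)(() => Iterator(1, 2, 3)).toList
    println(s"first read: $first, recordsRead = ${metrics.recordsRead}")
    val second = getOrCompute(cache, "rdd_0_0", metrics)(() => Iterator.empty[Int]).toList
    println(s"second read: $second, recordsRead = ${metrics.recordsRead}")
  }
}
```

The first call computes and stores the block without touching the counter; the second call is served from the cache and counts each record as it is consumed.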
The getOrElseUpdate method is analyzed below:
/**
* This method is the entry point for reading and writing block data in Spark, and it is called on the Executor.
* Retrieve the given block if it exists, otherwise call the provided `makeIterator` method
* to compute the block, persist it, and return its values.
*
* @return either a BlockResult if the block was successfully cached, or an iterator if the block
* could not be cached.
*/
def getOrElseUpdate[T](
blockId: BlockId,
level: StorageLevel,
classTag: ClassTag[T],
makeIterator: () => Iterator[T]): Either[BlockResult, Iterator[T]] = {
// Attempt to read the block from local or remote storage. If it's present, then we don't need
// to go through the local-get-or-put path.
// Entry point for reads: try to fetch the block from local or remote storage
get[T](blockId)(classTag) match {
case Some(block) =>
return Left(block)
case _ =>
// Need to compute the block.
}
// Initially we hold no locks on this block.
// Entry point for writes
doPutIterator(blockId, makeIterator, level, classTag, keepReadLock = true) match {
case None =>
// doPut() didn't hand work back to us, so the block already existed or was successfully
// stored. Therefore, we now hold a read lock on the block.
val blockResult = getLocalValues(blockId).getOrElse {
// Since we held a read lock between the doPut() and get() calls, the block should not
// have been evicted, so get() not returning the block indicates some internal error.
releaseLock(blockId)
throw new SparkException(s"get() failed for block $blockId even though we held a lock")
}
// We already hold a read lock on the block from the doPut() call and getLocalValues()
// acquires the lock again, so we need to call releaseLock() here so that the net number
// of lock acquisitions is 1 (since the caller will only call release() once).
releaseLock(blockId)
Left(blockResult)
case Some(iter) =>
// The put failed, likely because the data was too large to fit in memory and could not be
// dropped to disk. Therefore, we need to pass the input iterator back to the caller so
// that they can decide what to do with the values (e.g. process them without caching).
Right(iter)
}
}
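The Either return type is the key design point of getOrElseUpdate: Left means the block is now stored and can be read from the store, while Right hands the computed values back to the caller because they could not be stored. The sketch below models that contract with a hypothetical `MiniBlockStore` whose only limit is a maximum record count. It materializes the whole iterator before deciding, which is a simplification made for brevity, and it ignores locking and disk entirely.

```scala
object GetOrElseUpdateSketch {
  import scala.collection.mutable

  // Hypothetical bounded store: a block is kept only if it has at most `maxRecords` elements.
  final class MiniBlockStore(maxRecords: Int) {
    private val blocks = mutable.Map.empty[String, Vector[Int]]

    def getOrElseUpdate(
        blockId: String,
        makeIterator: () => Iterator[Int]): Either[Vector[Int], Iterator[Int]] =
      blocks.get(blockId) match {
        case Some(block) =>
          Left(block) // already stored: no computation needed
        case None =>
          val values = makeIterator().toVector
          if (values.size <= maxRecords) {
            blocks(blockId) = values
            Left(values) // stored successfully
          } else {
            Right(values.iterator) // too large to store: return the values uncached
          }
      }

    def contains(blockId: String): Boolean = blocks.contains(blockId)
  }

  def main(args: Array[String]): Unit = {
    val store = new MiniBlockStore(maxRecords = 2)
    println(store.getOrElseUpdate("rdd_0_0", () => Iterator(1, 2)))
    println(store.getOrElseUpdate("rdd_0_0", () => Iterator(9)))
    println(store.getOrElseUpdate("rdd_0_1", () => Iterator(1, 2, 3)).isRight)
    println(store.contains("rdd_0_1"))
  }
}
```

A caller that receives Right can still process the values once; it just cannot expect a later lookup of the same block to succeed.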