Storage Levels and Storage Calls

The StorageLevel class is explained below.


/**
 * :: DeveloperApi ::
 * Flags for controlling the storage of an RDD. Each StorageLevel records whether to use memory,
 * or ExternalBlockStore, whether to drop the RDD to disk if it falls out of memory or
 * ExternalBlockStore, whether to keep the data in memory in a serialized format, and whether
 * to replicate the RDD partitions on multiple nodes.
 *
 * The [[org.apache.spark.storage.StorageLevel$]] singleton object contains some static constants
 * for commonly useful storage levels. To create your own storage level object, use the
 * factory method of the singleton object (`StorageLevel(...)`).
 */

The members of StorageLevel are as follows:

    private var _useDisk: Boolean,
    private var _useMemory: Boolean,
    private var _useOffHeap: Boolean,   // use off-heap (external) storage
    private var _deserialized: Boolean,
    private var _replication: Int = 1)  // the default number of replicas is 1

The constants defined in the StorageLevel companion object are explained below:

object StorageLevel {
  // Do not persist the data at all
  val NONE = new StorageLevel(false, false, false, false)
  // Store the RDD's partitions only on this node's disk
  val DISK_ONLY = new StorageLevel(true, false, false, false)
  // Store the RDD's partitions on disk, with one extra replica on another node
  val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
  // Store the RDD's partitions as deserialized Java objects in the JVM. If the RDD is too large and
  // some partitions do not fit in memory, those partitions are not cached and are recomputed when
  // needed. This is the default caching level.
  val MEMORY_ONLY = new StorageLevel(false, true, false, true)
  // Same as MEMORY_ONLY, with one extra replica on another node
  val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
  val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
  val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
  // Store the RDD's partitions as deserialized objects in the JVM. If the RDD is too large and some
  // partitions do not fit in memory, the overflow partitions are stored on disk and read when needed.
  val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
  // Same as MEMORY_AND_DISK, with one extra replica on another node
  val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
  val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
  val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
  // Store the RDD's partitions serialized in off-heap storage (e.g. Tachyon)
  val OFF_HEAP = new StorageLevel(false, false, true, false)
}
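
The constants above can be used directly, and the `StorageLevel(...)` factory mentioned in the doc comment builds a custom level from the same flags. A minimal sketch (the variable names are illustrative only):

import org.apache.spark.storage.StorageLevel

// A predefined constant: deserialized objects in memory, replicated on two nodes
val twoCopiesInMemory = StorageLevel.MEMORY_ONLY_2

// A custom level built with the factory method of the singleton object:
// StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication)
val diskOnlyThreeReplicas = StorageLevel(true, false, false, false, 3)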

The RDD storage calls are described below.
The relationship between an RDD and its blocks is as follows:
An RDD has multiple partitions, and each partition corresponds to one block, so an RDD maps to multiple blocks. Each block has a unique BlockId, named according to the rule "rdd_" + rddId + "_" + splitIndex, where splitIndex is the index of the partition that the block belongs to.
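
This naming rule is implemented by the RDDBlockId case class in org.apache.spark.storage; a minimal sketch of what it produces (the ids 42 and 3 are made-up examples):

import org.apache.spark.storage.RDDBlockId

// Block id for partition 3 of the RDD whose id is 42
val blockId = RDDBlockId(42, 3)
println(blockId.name)   // prints "rdd_42_3"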

The RDD persistence path is analyzed below:

  /**
   * Mark this RDD for persisting using the specified level.
   *
   * Note: this.type is the type of the current object (this); it is used in type declarations of
   * variables, function parameters and return values, so persist can return the concrete RDD type.
   *
   * @param newLevel the target storage level
   * @param allowOverride whether to override any existing level with the new one
   */
  private def persist(newLevel: StorageLevel, allowOverride: Boolean): this.type = {
    // TODO: Handle changes of StorageLevel
    if (storageLevel != StorageLevel.NONE && newLevel != storageLevel && !allowOverride) {
      throw new UnsupportedOperationException(
        "Cannot change storage level of an RDD after it was already assigned a level")
    }
    // If this is the first time this RDD is marked for persisting, register it
    // with the SparkContext for cleanups and accounting. Do this only once.
    if (storageLevel == StorageLevel.NONE) {
      sc.cleaner.foreach(_.registerRDDForCleanup(this))
      sc.persistRDD(this)
    }
    storageLevel = newLevel
    this
  }

  /**
   * Set this RDD's storage level to persist its values across operations after the first time
   * it is computed. This can only be used to assign a new storage level if the RDD does not
   * have a storage level set yet. Local checkpointing is an exception.
   */
  def persist(newLevel: StorageLevel): this.type = {
    if (isLocallyCheckpointed) {
      // This means the user previously called localCheckpoint(), which should have already
      // marked this RDD for persisting. Here we should override the old storage level with
      // one that is explicitly requested by the user (after adapting it to use disk).
      persist(LocalRDDCheckpointData.transformStorageLevel(newLevel), allowOverride = true)
    } else {
      persist(newLevel, allowOverride = false)
    }
  }

  /**
   * Persist this RDD with the default storage level (`MEMORY_ONLY`).
   */
  def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

  /**
   * Persist this RDD with the default storage level (`MEMORY_ONLY`).
   */
  def cache(): this.type = persist()

  /**
   * Mark the RDD as non-persistent, and remove all blocks for it from memory and disk.
   *
   * @param blocking Whether to block until all blocks are deleted.
   * @return This RDD.
   */
  def unpersist(blocking: Boolean = true): this.type = {
    logInfo("Removing RDD " + id + " from persistence list")
    sc.unpersistRDD(id, blocking)
    storageLevel = StorageLevel.NONE  // reset the storage level to NONE
    this
  }

  /**
   * Get the RDD's current storage level, or StorageLevel.NONE if none is set.
   */
  def getStorageLevel: StorageLevel = storageLevel
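
Taken together, the user-facing calls above behave as in the following sketch (sc is assumed to be an existing SparkContext and the dataset is only an example):

import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 100)

rdd.persist(StorageLevel.MEMORY_AND_DISK)   // mark for persistence; nothing is cached until an action runs
println(rdd.getStorageLevel)                // the level that was just assigned

// Assigning a different level now would hit the check in the private persist():
// rdd.persist(StorageLevel.DISK_ONLY)      // throws UnsupportedOperationException

rdd.unpersist()                             // blocks removed, storage level reset to NONE
rdd.cache()                                 // allowed again; cache() is just persist(MEMORY_ONLY)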

The RDD iterator method is analyzed below:

  /**
   * Internal method to this RDD; will read from cache if applicable, or otherwise compute it.
   * This should ''not'' be called by users directly, but is available for implementors of custom
   * subclasses of RDD.
   *
   * This is the starting point of task execution: computation begins here.
   */
  final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
    if (storageLevel != StorageLevel.NONE) {
      // The storage level is not NONE, so check the cache first and compute only on a miss.
      // SparkEnv holds the runtime environment a node needs; the BlockManager manages the RDD's
      // cached blocks. If this RDD was computed before and its result was cached, subsequent runs
      // read the cached blocks back through the BlockManager.
      getOrCompute(split, context)
    } else {
      // Not cached: read the checkpointed result if a checkpoint exists, otherwise compute.
      computeOrReadCheckpoint(split, context)
    }
  }
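
The effect of this branch is visible from user code: with storage level NONE every action recomputes the partitions, while a persisted RDD computes them once and then reads the cached blocks. A minimal sketch (sc is an existing SparkContext; the accumulator is only there to count evaluations):

import org.apache.spark.storage.StorageLevel

val evaluations = sc.longAccumulator("evaluations")
val data = sc.parallelize(1 to 1000).map { x =>
  evaluations.add(1)   // counts how often the map function actually runs
  x * 2
}

data.count()                            // level is NONE: partitions are computed
data.count()                            // ... and computed again on the second action

data.persist(StorageLevel.MEMORY_ONLY)
data.count()                            // computed once more and cached
data.count()                            // now served from the cache; the counter stops growing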

The getOrCompute method is as follows:

  /**
   * Gets or computes an RDD partition. Used by RDD.iterator() when an RDD is cached.
   */
  private[spark] def getOrCompute(partition: Partition, context: TaskContext): Iterator[T] = {
    // Build the block id from the RDD id and the partition index
    val blockId = RDDBlockId(id, partition.index)
    var readCachedBlock = true
    // This method is called on executors, so we need call SparkEnv.get instead of sc.env
    // getOrElseUpdate first tries to read the block by its BlockId and writes it otherwise;
    // it is the entry point for reading and writing block data.
    SparkEnv.get.blockManager.getOrElseUpdate(blockId, storageLevel, elementClassTag,
      // If the block does not exist, read the checkpointed result or compute the iterator
      () => {
        readCachedBlock = false
        computeOrReadCheckpoint(partition, context)
      }) match {
      case Left(blockResult) =>
        if (readCachedBlock) {
          val existingMetrics = context.taskMetrics().inputMetrics
          existingMetrics.incBytesRead(blockResult.bytes)
          new InterruptibleIterator[T](context, blockResult.data.asInstanceOf[Iterator[T]]) {
            override def next(): T = {
              existingMetrics.incRecordsRead(1)
              delegate.next()
            }
          }
        } else {
          new InterruptibleIterator(context, blockResult.data.asInstanceOf[Iterator[T]])
        }
      case Right(iter) =>
        new InterruptibleIterator(context, iter.asInstanceOf[Iterator[T]])
    }
  }

The getOrElseUpdate method of BlockManager is analyzed below:

  /**
   * Retrieve the given block if it exists, otherwise call the provided `makeIterator` method
   * to compute the block, persist it, and return its values.
   *
   * This is Spark's entry point for reading and writing block data; it is called on executors.
   *
   * @return either a BlockResult if the block was successfully cached, or an iterator if the block
   *         could not be cached.
   */
  def getOrElseUpdate[T](
      blockId: BlockId,
      level: StorageLevel,
      classTag: ClassTag[T],
      makeIterator: () => Iterator[T]): Either[BlockResult, Iterator[T]] = {
    // Attempt to read the block from local or remote storage. If it's present, then we don't need
    // to go through the local-get-or-put path.
    // This is the read path: try to fetch the block locally or from a remote node.
    get[T](blockId)(classTag) match {
      case Some(block) =>
        return Left(block)
      case _ =>
        // Need to compute the block.
    }
    // Initially we hold no locks on this block.
    // This is the write path: compute the block and store it.
    doPutIterator(blockId, makeIterator, level, classTag, keepReadLock = true) match {
      case None =>
        // doPut() didn't hand work back to us, so the block already existed or was successfully
        // stored. Therefore, we now hold a read lock on the block.
        val blockResult = getLocalValues(blockId).getOrElse {
          // Since we held a read lock between the doPut() and get() calls, the block should not
          // have been evicted, so get() not returning the block indicates some internal error.
          releaseLock(blockId)
          throw new SparkException(s"get() failed for block $blockId even though we held a lock")
        }
        // We already hold a read lock on the block from the doPut() call and getLocalValues()
        // acquires the lock again, so we need to call releaseLock() here so that the net number
        // of lock acquisitions is 1 (since the caller will only call release() once).
        releaseLock(blockId)
        Left(blockResult)
      case Some(iter) =>
        // The put failed, likely because the data was too large to fit in memory and could not be
        // dropped to disk. Therefore, we need to pass the input iterator back to the caller so
        // that they can decide what to do with the values (e.g. process them without caching).
        Right(iter)
    }
  }