ByteDance Big Data Interview Question: Walk Through the Source Code of Spark Resource Scheduling and Task Scheduling

At a certain level Spark works much like MapReduce (MR); the difference is that Spark is better suited to iterative workloads. So how do the two differ? Only the big-picture points are listed here; the finer details are analyzed throughout the rest of this article:

  • When memory is sufficient, or the data volume is relatively small, the data passed between Spark RDDs does not need to be written to HDFS
  • Spark can keep an RDD in memory for the next task to use (a minimal sketch follows this list)
  • Spark is best suited to iterative computation; for non-iterative workloads, Spark and MR perform about the same
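
As a quick illustration of the second point, here is a minimal, self-contained sketch (hypothetical input path and logic, standard RDD API) of keeping an RDD in memory across iterations:

import org.apache.spark.{SparkConf, SparkContext}

object IterativeCacheSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("iterative-cache-sketch"))
    // cache() keeps the parsed RDD in memory, so every pass below reuses it
    // instead of re-reading and re-parsing the input from HDFS.
    val points = sc.textFile("hdfs:///tmp/points.txt")   // hypothetical input path
      .map(_.split(",").map(_.toDouble))
      .cache()

    var total = 0.0
    for (i <- 1 to 10) {
      // Each action is a separate job, but every one of them hits the cached RDD.
      total += points.map(_.sum).reduce(_ + _) / i
    }
    println(s"accumulated value after 10 passes: $total")
    sc.stop()
  }
}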

 

This article gives an overview of the whole path from cluster startup to the final task finishing and returning its execution result. Let's start with resource scheduling.

Spark Resource Scheduling: Source Code Analysis

1. Spark Cluster Startup

First start the Spark cluster. Looking at the start-all.sh script under sbin, it starts the Master first:

# Start Master
"${SPARK_HOME}/sbin"/start-master.sh

# Start Workers
"${SPARK_HOME}/sbin"/start-slaves.sh

The sbin/start-master.sh script ends up running the org.apache.spark.deploy.master.Master class, so we follow Master in the source code, starting from its main method:

// Main method
def main(argStrings: Array[String]) {
  Thread.setDefaultUncaughtExceptionHandler(new SparkUncaughtExceptionHandler(
    exitOnUncaughtException = false))
  Utils.initDaemon(log)
  val conf = new SparkConf
  val args = new MasterArguments(argStrings, conf)
  /**
    * Create the RPC environment and Endpoint (RPC: remote procedure call). In Spark, the Driver,
    * Master and Worker each have their own Endpoint, which works like their own mailbox.
    */
  val (rpcEnv, _, _) = startRpcEnvAndEndpoint(args.host, args.port, args.webUiPort, conf)
  rpcEnv.awaitTermination()
}

The main method calls startRpcEnvAndEndpoint to create the RPC environment and the Endpoint (RPC: remote procedure call). More precisely, the RpcEnv is the remote-communication environment that receives and handles messages, and the Master registers itself into it. Master, Driver, Worker and Executor each have their own Endpoint, which acts like a mailbox: anyone who wants to communicate with me must first find my mailbox. When the Master starts, it registers its Endpoint in the RpcEnv so it can receive and handle messages. Step into startRpcEnvAndEndpoint:

/**
  * Create the RPC (Remote Procedure Call) environment.
  * This only prepares the RpcEnv; the roles [Driver, Master, Worker, Executor] are registered with it later.
  */
val rpcEnv = RpcEnv.create(SYSTEM_NAME, host, port, conf, securityMgr)
/**
  * Register the Master with the RpcEnv.
  *
  * rpcEnv.setupEndpoint(name, new Master)
  * The Master created here extends ThreadSafeRpcEndpoint, which ultimately extends the trait RpcEndpoint.
  * What is an Endpoint?
  *  An Endpoint has:
  *     onStart()          : starts the current Endpoint
  *     receive()          : receives messages
  *     receiveAndReply()  : receives a message and replies to it
  *  Every Endpoint also has a reference, so that other Endpoints can send it messages: holding the
  *  other side's RpcEndpointRef is enough to reach its Endpoint.
  *  The masterEndpoint below is the Master's Endpoint reference, an RpcEndpointRef.
  * An RpcEndpointRef has:
  *     send() : sends a message
  *     ask()  : sends a request and waits for the reply
  */
val masterEndpoint: RpcEndpointRef = rpcEnv.setupEndpoint(ENDPOINT_NAME,
  new Master(rpcEnv, rpcEnv.address, webUiPort, securityMgr, conf))
val portsResponse = masterEndpoint.askSync[BoundPortsResponse](BoundPortsRequest)
(rpcEnv, portsResponse.webUIPort, portsResponse.restPort)

This method does two things:

  • Creates the RpcEnv (details are annotated in the code above)
  • Registers the Master with the RpcEnv (details are annotated in the code above)

Step into the create method:

// Create the RPC environment
create(name, host, host, port, conf, securityManager, 0, clientMode)

Step into create again:

// Create the NettyRpc environment
new NettyRpcEnvFactory().create(config)

Here a NettyRpcEnvFactory object is created and its create method is called. Let's see what happens inside create:

/**
  * Create the Netty-based RPC communication environment.
  */
val nettyEnv =
  new NettyRpcEnv(sparkConf, javaSerializerInstance, config.advertiseAddress,
    config.securityManager, config.numUsableCores)

This creates the Netty RPC communication environment, and new NettyRpcEnv performs some initialization of its own:

  • Dispatcher: holds the message queue and dispatches (forwards) messages
  • TransportContext: can create the NettyRpcHandler
private val dispatcher: Dispatcher = new Dispatcher(this, numUsableCores)
private val transportContext = new TransportContext(transportConf,
  new NettyRpcHandler(dispatcher, this, streamManager))

The Dispatcher holds the message queue and dispatches messages. First look at the logic that runs when a Dispatcher is instantiated:

private val threadpool: ThreadPoolExecutor = {
  val availableCores =
    if (numUsableCores > 0) numUsableCores else Runtime.getRuntime.availableProcessors()
  val numThreads = nettyEnv.conf.getInt("spark.rpc.netty.dispatcher.numThreads",
    math.max(2, availableCores))
  val pool = ThreadUtils.newDaemonFixedThreadPool(numThreads, "dispatcher-event-loop")
  for (i <- 0 until numThreads) {
    pool.execute(new MessageLoop)
  }
  pool
}

While the Dispatcher is being instantiated it creates a threadpool, and each thread in the pool runs a MessageLoop:

private class MessageLoop extends Runnable {
  override def run(): Unit = {
    try {
      while (true) {
        try {
          // Keep taking messages off the queue and processing them
          val data: EndpointData = receivers.take()
          if (data == PoisonPill) {
            // Put PoisonPill back so that other MessageLoops can see it.
            receivers.offer(PoisonPill)
            return
          }
          // Call process to handle the message
          data.inbox.process(Dispatcher.this)
        } catch {
          case NonFatal(e) => logError(e.getMessage, e)
        }
      }
    } catch {
      case ie: InterruptedException => // exit
    }
  }
}

receivers.take() inside MessageLoop keeps pulling data from the receivers message queue, which is created when the Dispatcher is initialized. At this point the machinery for receiving messages is up and running:

private val receivers = new LinkedBlockingQueue[EndpointData]

What goes onto that queue are EndpointData objects; when an EndpointData is instantiated it creates an Inbox:

private class EndpointData(
    val name: String,
    val endpoint: RpcEndpoint,
    val ref: NettyRpcEndpointRef) {
  // Wrap the endpoint into an Inbox
  val inbox = new Inbox(ref, endpoint)
}

When the Inbox is instantiated (which happens every time an endpoint is registered), an OnStart case object is put into its messages queue, so by default the OnStart case is always the first one matched:

inbox.synchronized {
  messages.add(OnStart)
}

Back in MessageLoop, the process method is then called to handle each message:

// Call process to handle the message
data.inbox.process(Dispatcher.this)

process finds the case that matches the incoming message and runs the corresponding logic, for example:

case OnStart =>
  // Call the Endpoint's onStart method
  endpoint.onStart()
  if (!endpoint.isInstanceOf[ThreadSafeRpcEndpoint]) {
    inbox.synchronized {
      if (!stopped) {
        enableConcurrent = true
      }
    }
  }
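
Seen end to end, the Dispatcher/Inbox machinery above boils down to a blocking queue plus a pool of message loops. A minimal, self-contained sketch of that pattern (simplified hypothetical names, not Spark's actual classes):

import java.util.concurrent.{Executors, LinkedBlockingQueue}

object MiniDispatcher {
  sealed trait Message
  case object OnStart extends Message
  case class Rpc(body: String) extends Message
  case object PoisonPill extends Message

  // Like Dispatcher.receivers: a blocking queue the loops keep take()-ing from
  private val queue = new LinkedBlockingQueue[Message]()

  // Like Dispatcher.postMessage: put a message on the queue for some loop to pick up
  def post(m: Message): Unit = queue.put(m)

  def main(args: Array[String]): Unit = {
    val pool = Executors.newFixedThreadPool(2)
    for (_ <- 1 to 2) pool.execute(new Runnable {
      override def run(): Unit = {
        while (true) {
          queue.take() match {
            case PoisonPill =>
              queue.offer(PoisonPill) // put it back so the other loop can see it too
              return
            case OnStart   => println("endpoint.onStart()")        // like Inbox.process matching OnStart
            case Rpc(body) => println(s"endpoint.receive($body)")  // like Inbox.process matching an RPC message
          }
        }
      }
    })
    post(OnStart)
    post(Rpc("hello"))
    post(PoisonPill)
    pool.shutdown()
  }
}

Spark's real Dispatcher queues EndpointData objects rather than raw messages, so each loop drains the matching Inbox, but the take-and-process shape is the same.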

Now for the transportContext object. After it is created, transportContext.createServer is called to start the NettyRpc server, and while creating the server, the object that will handle messages is built in createChannelHandler:

server = transportContext.createServer(bindAddress, port, bootstraps)

createServer instantiates a TransportServer, whose constructor calls init inside a try block to initialize it:

try {
  // Run the init method
  init(hostToBind, portToBind);
}

In init, the remote-communication bootstrap calls childHandler, which initializes the network communication pipeline:

// Initialize the network communication pipeline
context.initializePipeline(ch, rpcHandler);

While initializing the pipeline, the channelHandler object that processes messages is created; its job is to create and handle the client's request messages and the server's messages:
// Create the channelHandler that processes messages
TransportChannelHandler channelHandler = createChannelHandler(channel, channelRpcHandler);
private TransportChannelHandler createChannelHandler(Channel channel, RpcHandler rpcHandler) {
  TransportResponseHandler responseHandler = new TransportResponseHandler(channel);
  TransportClient client = new TransportClient(channel, responseHandler);
  TransportRequestHandler requestHandler = new TransportRequestHandler(channel, client,
    rpcHandler, conf.maxChunksBeingTransferred());

  return new TransportChannelHandler(client, responseHandler, requestHandler,
    conf.connectionTimeoutMs(), closeIdleConnections);
}

TransportChannelHandler is built from the three handlers above (responseHandler, client, requestHandler), and it has a channelRead method for reading incoming messages:

@Override
public void channelRead(ChannelHandlerContext ctx, Object request) throws Exception {
  // Decide whether the message is a request or a response
  if (request instanceof RequestMessage) {
    requestHandler.handle((RequestMessage) request);
  } else if (request instanceof ResponseMessage) {
    responseHandler.handle((ResponseMessage) request);
  } else {
    ctx.fireChannelRead(request);
  }
}

Taking requestHandler as an example: handle —> processRpcRequest((RpcRequest) request) eventually reaches rpcHandler.receive, and the rpcHandler here is the NettyRpcHandler, so its receive is what gets called:

try {
  /**
   *  rpcHandler is the NettyRpcHandler that has been passed all the way down;
   *  the receive called here is NettyRpcHandler's receive method.
   */
  rpcHandler.receive(reverseClient, req.body().nioByteBuffer(), new RpcResponseCallback() {
    @Override
    public void onSuccess(ByteBuffer response) {
      respond(new RpcResponse(req.requestId, new NioManagedBuffer(response)));
    }

    @Override
    public void onFailure(Throwable e) {
      respond(new RpcFailure(req.requestId, Throwables.getStackTraceAsString(e)));
    }
  });
}

------------------------------------------------------------------------

override def receive(
  client: TransportClient,
  message: ByteBuffer,
  callback: RpcResponseCallback): Unit = {
  val messageToDispatch = internalReceive(client, message)
  // The dispatcher handles remote messages; everything eventually goes through postMessage
  dispatcher.postRemoteMessage(messageToDispatch, callback)
}

It then calls dispatcher.postRemoteMessage:

def postRemoteMessage(message: RequestMessage, callback: RpcResponseCallback): Unit = {
  val rpcCallContext =
    new RemoteNettyRpcCallContext(nettyEnv, callback, message.senderAddress)
  val rpcMessage = RpcMessage(message.senderAddress, message.content, rpcCallContext)
  postMessage(message.receiver.name, rpcMessage, (e) => callback.onFailure(e))
}

Whether it is a request message or a response message, it ultimately ends up in this postMessage:

private def postMessage(
    endpointName: String,
    message: InboxMessage,
    callbackIfStopped: (Exception) => Unit): Unit = {
  val error = synchronized {
    // Look up the target endpoint (mailbox) by name
    val data = endpoints.get(endpointName)
    if (stopped) {
      Some(new RpcEnvStoppedException())
    } else if (data == null) {
      Some(new SparkException(s"Could not find $endpointName."))
    } else {
      // Put the message into the endpoint's inbox
      data.inbox.post(message)
      // Add the endpoint to the receivers queue
      receivers.offer(data)
      None
    }
  }

postMessage puts the message into the inbox, and the message-processing loops started earlier pick it up and handle it. At this point the RpcEnv is fully up, and next the Master registers itself in the RpcEnv:

val masterEndpoint: RpcEndpointRef = rpcEnv.setupEndpoint(ENDPOINT_NAME,
  new Master(rpcEnv, rpcEnv.address, webUiPort, securityMgr, conf))

The Master has registered its own Endpoint and can now receive and handle messages: the Master has started successfully.

The Worker starts in the same way, and with that the Spark cluster is up.

2. Submitting a Job: Asking the Master to Launch the Driver

Running ./spark-submit… leads to the spark-submit script, and from there to the SparkSubmit main class:

// Main class that runs when a job is submitted
override def main(args: Array[String]): Unit = {
  // Initialize logging if it hasn't been done yet. Keep track of whether logging needs to
  // be reset before the application starts.
  val uninitLog = initializeLogIfNecessary(true, silent = true)
  // Parse the arguments
  val appArgs = new SparkSubmitArguments(args)
  if (appArgs.verbose) {
    // scalastyle:off println
    printStream.println(appArgs)
    // scalastyle:on println
  }
  appArgs.action match {
    // A job submission matches SUBMIT
    case SparkSubmitAction.SUBMIT => submit(appArgs, uninitLog)
    case SparkSubmitAction.KILL => kill(appArgs)
    case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs)
  }
}

Since we are submitting a job, the SparkSubmitAction.SUBMIT case matches and submit is executed.

val (childArgs, childClasspath, sparkConf, childMainClass) = prepareSubmitEnvironment(args)

Inside submit, prepareSubmitEnvironment returns a four-element tuple. Pay particular attention to childMainClass: it is the class that ultimately launches the Driver. We use standalone-cluster mode as the example here.

// Prepare the environment for submitting the job
doPrepareSubmitEnvironment(args, conf)

Step into prepareSubmitEnvironment and find doPrepareSubmitEnvironment, the method that actually prepares the submission environment:

// Legacy (normal) submission path
// In legacy standalone cluster mode, use Client as a wrapper around the user class
childMainClass = STANDALONE_CLUSTER_SUBMIT_CLASS

Since the example is standalone-cluster, we follow the matching case and look at what STANDALONE_CLUSTER_SUBMIT_CLASS stands for:

private[deploy] val STANDALONE_CLUSTER_SUBMIT_CLASS = classOf[ClientApp].getName()

This ClientApp is the piece responsible for getting our Driver launched.

Continuing down the submit method: since method definitions are not executed where they appear, execution eventually reaches doRunMain():

// Run
runMain(childArgs, childClasspath, sparkConf, childMainClass, args.verbose)

runMain is then executed, with childMainClass passed in as a parameter:

// Load the class
mainClass = Utils.classForName(childMainClass)

//------------------------------------------------------------

val app: SparkApplication = if (classOf[SparkApplication].isAssignableFrom(mainClass))

childMainClass is first loaded and assigned to mainClass, and mainClass is then turned into a SparkApplication.

app.start(childArgs.toArray, sparkConf)

Next, start is called. Because childMainClass is of type ClientApp, this is ClientApp's start method:

rpcEnv.setupEndpoint("client", new ClientEndpoint(rpcEnv, driverArgs, masterEndpoints, conf))

This registers the Endpoint that submits the current job with the RpcEnv; as soon as an Endpoint is registered, the onStart method of the new ClientEndpoint is guaranteed to run:

val mainClass = "org.apache.spark.deploy.worker.DriverWrapper"
// Wrap the DriverWrapper class into a Command
val command = new Command(mainClass,
  Seq("{{WORKER_URL}}", "{{USER_JAR}}", driverArgs.mainClass) ++ driverArgs.driverOptions,
  sys.env, classPathEntries, libraryPathEntries, javaOpts)

val driverDescription = new DriverDescription(
  driverArgs.jarUrl,
  driverArgs.memory,
  driverArgs.cores,
  driverArgs.supervise,
  command)
// Ask the Master to launch the Driver; the Master's receiveAndReply method receives this request
asyncSendToMasterAndForwardReply[SubmitDriverResponse](
  RequestSubmitDriver(driverDescription))

Here org.apache.spark.deploy.worker.DriverWrapper is wrapped into a command, the command is wrapped into a driverDescription, and then a request to launch the Driver is sent to the Master, whose receiveAndReply method receives it:

case RequestSubmitDriver(description) =>
  // Check the Master's state
  if (state != RecoveryState.ALIVE) {
    val msg = s"${Utils.BACKUP_STANDALONE_MASTER_PREFIX}: $state. " +
      "Can only accept driver submissions in ALIVE state."
    context.reply(SubmitDriverResponse(self, false, None, msg))
  } else {
    logInfo("Driver submitted " + description.command.mainClass)
    val driver = createDriver(description)
    persistenceEngine.addDriver(driver)
    waitingDrivers += driver
    drivers.add(driver)
    schedule()

    // TODO: It might be good to instead have the submission client poll the master to determine
    //       the current status of the driver. For now it's simply "fire and forget".

    context.reply(SubmitDriverResponse(self, true, Some(driver.id),
      s"Driver successfully submitted as ${driver.id}"))
  }

The Master first checks its own state. If it is acceptable, it uses the description built earlier to create a driver, which is actually a DriverInfo object wrapping the Driver's information. The new DriverInfo is then added to waitingDrivers (private val waitingDrivers = new ArrayBuffer[DriverInfo]) and to drivers, after which execution enters the schedule() method:

/**
  * schedule() is a general-purpose method.
  * It runs when a Driver launch is requested; in that case waitingApps is empty in the final
  * startExecutorsOnWorkers call, so only the Driver is launched.
  * It also runs when an application is submitted; in that case there is no Driver waiting to be
  * launched, and startExecutorsOnWorkers runs directly to allocate resources to the current application.
  */
private def schedule(): Unit = {
  // Check the Master's state
  if (state != RecoveryState.ALIVE) {
    return
  }
  // Drivers take strict precedence over executors; the alive workers are shuffled here
  val shuffledAliveWorkers = Random.shuffle(workers.toSeq.filter(_.state == WorkerState.ALIVE))
  // Number of alive workers
  val numWorkersAlive = shuffledAliveWorkers.size
  var curPos = 0
  for (driver <- waitingDrivers.toList) { // iterate over a copy of waitingDrivers
    // We assign workers to each waiting driver in a round-robin fashion. For each driver, we
    // start from the last worker that was assigned a driver, and continue onwards until we have
    // explored all alive workers.
    var launched = false
    var numWorkersVisited = 0
    while (numWorkersVisited < numWorkersAlive && !launched) {
      // Take the worker at position curPos
      val worker = shuffledAliveWorkers(curPos)
      numWorkersVisited += 1
      if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
        // Launch the Driver here; after it starts, it will request resources for the current application
        launchDriver(worker, driver)
        waitingDrivers -= driver
        launched = true
      }
      // curPos keeps advancing through the workers until one with enough resources is found
      curPos = (curPos + 1) % numWorkersAlive
    }
  }
  startExecutorsOnWorkers()
}

A Worker with enough resources is chosen to launch the Driver, and launchDriver(worker, driver) is called:

private def launchDriver(worker: WorkerInfo, driver: DriverInfo) {
  logInfo("Launching driver " + driver.id + " on worker " + worker.id)
  worker.addDriver(driver)
  driver.worker = Some(worker)
  // Send the Worker a message to launch the Driver; the Worker's receive method keeps matching LaunchDriver
  worker.endpoint.send(LaunchDriver(driver.id, driver.desc))
  driver.state = DriverState.RUNNING
}

The Master sends the Worker a message to launch the Driver, and the Worker's receive method matches the LaunchDriver case:

case LaunchDriver(driverId, driverDesc) =>
  logInfo(s"Asked to launch driver $driverId")
  val driver = new DriverRunner(
    conf,
    driverId,
    workDir,
    sparkHome,
    driverDesc.copy(command = Worker.maybeUpdateSSLSettings(driverDesc.command, conf)),
    self,
    workerUri,
    securityMgr)
  drivers(driverId) = driver
  // Start the Driver: this initializes org.apache.spark.deploy.worker.DriverWrapper and runs its main method
  driver.start()

The Driver being launched here is the val mainClass = "org.apache.spark.deploy.worker.DriverWrapper" mentioned earlier: launching the Driver means launching the DriverWrapper class, which creates a Driver process on the Worker. Starting the Driver initializes org.apache.spark.deploy.worker.DriverWrapper and runs its main method:

// The mainClass below is the application we actually submitted
case workerUrl :: userJar :: mainClass :: extraArgs =>

The mainClass in DriverWrapper's main method is the application we actually submitted:

// Delegate to supplied main class
val clazz = Utils.classForName(mainClass)
// Get the main method of the submitted application
val mainMethod = clazz.getMethod("main", classOf[Array[String]])

/**
  * Invoke the main method of the submitted application.
  * Starting the application first creates the SparkConf and the SparkContext;
  *   the try block in SparkContext (around line 362) creates the TaskScheduler (around line 492).
  */
mainMethod.invoke(null, extraArgs.toArray[String])

3. The Driver Is Up and Registers the Application with the Master

Once the Driver has been started on a Worker, execution reaches the code we wrote on the Driver side, which begins by creating a SparkContext. Find the SparkContext class in the source; remember that when a Scala class is instantiated, everything in its body executes except method definitions. Locate the following code:

/**
  * Start the schedulers; (sched, ts) are the StandaloneSchedulerBackend and TaskSchedulerImpl objects.
  *   master is the spark://node1:7077 URL passed when submitting the job.
  */
  val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
  _schedulerBackend = sched
  _taskScheduler = ts
  _dagScheduler = new DAGScheduler(this)
  _heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)

// start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
// constructor
// start method of the TaskSchedulerImpl object
_taskScheduler.start()

createTaskScheduler returns a tuple, and the SparkContext has already created the DAGScheduler object. Step into createTaskScheduler:

// Standalone submissions always use a "spark://" master URL
case SPARK_REGEX(sparkUrl) =>
  // scheduler is a TaskSchedulerImpl object
  val scheduler = new TaskSchedulerImpl(sc)
  val masterUrls = sparkUrl.split(",").map("spark://" + _)
  /**
    * backend here is of type StandaloneSchedulerBackend
    */
  val backend = new StandaloneSchedulerBackend(scheduler, sc, masterUrls)
  // Call TaskSchedulerImpl's initialize method to set the backend; it will be needed shortly
  scheduler.initialize(backend)
  // Return the StandaloneSchedulerBackend and TaskSchedulerImpl objects
  (backend, scheduler)

Since the job is submitted in Standalone mode, this is the case that matches. A TaskSchedulerImpl object called scheduler is created and passed into StandaloneSchedulerBackend(scheduler, sc, masterUrls), so the returned backend is the scheduler backend for the Standalone deployment as a whole, while scheduler is the TaskScheduler.

/*
  * Set the backend on the TaskScheduler.
  *  The backend here is a StandaloneSchedulerBackend,
  *  and StandaloneSchedulerBackend extends CoarseGrainedSchedulerBackend.
  */
  def initialize(backend: SchedulerBackend) {
    this.backend = backend
    schedulableBuilder = {
      schedulingMode match {
        case SchedulingMode.FIFO =>
          new FIFOSchedulableBuilder(rootPool)
        case SchedulingMode.FAIR =>
          new FairSchedulableBuilder(rootPool, conf)
        case _ =>
          throw new IllegalArgumentException(s"Unsupported $SCHEDULER_MODE_PROPERTY: " +
          s"$schedulingMode")
      }
    }
    schedulableBuilder.buildPools()
  }

scheduler.initialize(backend) simply stores the backend as a field of the scheduler; note that StandaloneSchedulerBackend extends CoarseGrainedSchedulerBackend.

// start method of the TaskSchedulerImpl object
_taskScheduler.start()

ts here calls the start() method of TaskSchedulerImpl:

/**
  * Start the TaskScheduler
  */
override def start() {
  // Start the StandaloneSchedulerBackend
  backend.start()

    if (!isLocal && conf.getBoolean("spark.speculation", false)) {
      logInfo("Starting speculative execution thread")
        speculationScheduler.scheduleWithFixedDelay(new Runnable {
          override def run(): Unit = Utils.tryOrStopSparkContext(sc) {
            checkSpeculatableTasks()
          }
        }, SPECULATION_INTERVAL_MS, SPECULATION_INTERVAL_MS, TimeUnit.MILLISECONDS)
    }
}

Step into backend.start(); since backend is a StandaloneSchedulerBackend, the start() method of StandaloneSchedulerBackend is what runs:

override def start() {
  /**
    * super.start() creates the Driver's communication mailbox, i.e. the Driver's endpoint reference.
    * Later, Executors register themselves back with CoarseGrainedSchedulerBackend, the parent class of
    * StandaloneSchedulerBackend.
    */
  super.start()

Step into super.start(), which registers the Driver-side Endpoint with the RpcEnv:

override def start() {
  val properties = new ArrayBuffer[(String, String)]
  for ((key, value) <- scheduler.sc.conf.getAll) {
    if (key.startsWith("spark.")) {
      properties += ((key, value))
    }
  }

  // TODO (prashant) send conf instead of properties
  /**
    * Create the Driver's Endpoint, i.e. the Driver's mailbox, and register this DriverEndpoint with the RpcEnv.
    * Later, Executors register themselves back with this DriverEndpoint; the Driver's receiveAndReply method
    * keeps listening for and matching the incoming messages.
    */
  driverEndpoint = createDriverEndpointRef(properties)
}

Note that after creating the DriverEndpoint, backend.start() also runs the following code, which registers the application's information with the Master:

val command = Command("org.apache.spark.executor.CoarseGrainedExecutorBackend", args, sc.executorEnvs, classPathEntries ++ testingClassPath, libraryPathEntries, javaOpts)
......
  val appDesc: ApplicationDescription = ApplicationDescription(sc.appName, maxCores,   sc.executorMemory, command, webUrl, sc.eventLogDir, sc.eventLogCodec, coresPerExecutor,    initialExecutorLimit)
  // Description of the application being submitted
  // Wrap appDesc; it is passed into the StandaloneAppClient here
  client = new StandaloneAppClient(sc.env.rpcEnv, masters, appDesc, this, conf)
  // Start the StandaloneAppClient, which then registers the application's information with the Master
  client.start()

Stepping into client.start(), all it does is register an endpoint with the rpcEnv:

def start() {
  // Just launch an rpcEndpoint; it will call back into the listener.
  /**
    *  This just fills in the empty endpoint [AtomicReference];
    *  the key point is that rpcEnv.setupEndpoint creates a ClientEndpoint, and registering an Endpoint
    *  always ends up calling ClientEndpoint's onStart method.
    */
  endpoint.set(rpcEnv.setupEndpoint("AppClient", new ClientEndpoint(rpcEnv)))
}

Then read ClientEndpoint's onStart() method:

// onStart method
override def onStart(): Unit = {
  try {
    // Register the current application's information with the Master
    registerWithMaster(1)
  } catch {
    case e: Exception =>
      logWarning("Failed to connect to master", e)
      markDisconnected()
      stop()
  }
}

registerWithMaster() registers the current application's information with the Master. Since the Master may be highly available, the registration must be sent to every Master; inside we see tryRegisterAllMasters():

private def registerWithMaster(nthRetry: Int) {
  // tryRegisterAllMasters tries to register the application's information with every Master
  registerMasterFutures.set(tryRegisterAllMasters())
  registrationRetryTimer.set(registrationRetryThread.schedule(new Runnable {
    override def run(): Unit = {
      if (registered.get) {
        registerMasterFutures.get.foreach(_.cancel(true))
        registerMasterThreadPool.shutdownNow()
      } else if (nthRetry >= REGISTRATION_RETRIES) {
        markDead("All masters are unresponsive! Giving up.")
      } else {
        registerMasterFutures.get.foreach(_.cancel(true))
        registerWithMaster(nthRetry + 1)
      }
    }
  }, REGISTRATION_TIMEOUT_SECONDS, TimeUnit.SECONDS))
}

Inside tryRegisterAllMasters(), the Master's Endpoint reference is obtained and the application is registered with the Master; the Master's receive method matches the RegisterApplication type. Then look for the RegisterApplication case in Master:

// Register the Application's information with every Master
private def tryRegisterAllMasters(): Array[JFuture[_]] = {
  // Iterate over all Master addresses
  for (masterAddress <- masterRpcAddresses) yield {
    registerMasterThreadPool.submit(new Runnable {
      override def run(): Unit = try {
        if (registered.get) {
          return
        }
        logInfo("Connecting to master " + masterAddress.toSparkURL + "...")
        // Get the Master's mailbox (endpoint reference)
        val masterRef = rpcEnv.setupEndpointRef(masterAddress, Master.ENDPOINT_NAME)
        // Register the application with the Master; the Master's receive method matches RegisterApplication
        masterRef.send(RegisterApplication(appDescription, self))
      } catch {
        case ie: InterruptedException => // Cancelled
        case NonFatal(e) => logWarning(s"Failed to connect to master $masterAddress", e)
      }
    })
  }
}

This is how the Master handles a RegisterApplication message:

// The Application registration submitted from the Driver side
case RegisterApplication(description, driver) =>
  // TODO Prevent repeated registrations from some driver
  // If the Master is in STANDBY state, ignore it and do not accept the job
  if (state == RecoveryState.STANDBY) {
    // ignore, don't send response
  } else {
    logInfo("Registering app " + description.name)
    // Wrap the application information; note that stepping in shows an application uses Int.MaxValue cores by default
    val app = createApplication(description, driver)
    // Register the app; this adds the current application to waitingApps
    registerApplication(app)
    logInfo("Registered app " + description.name + " with ID " + app.id)
    persistenceEngine.addApplication(app)
    driver.send(RegisteredApplication(app.id, self))
    // Finally the general-purpose schedule() method runs again
    schedule()
  }

With that, the Driver's registration of the Application with the Master is complete.

4. The Master Sends Messages to Launch Executors

Before executing schedule(), the Master sends the Driver side (the StandaloneAppClient) a message saying the Application has been registered; then the general-purpose schedule() method runs again:

/**
  * schedule() is a general-purpose method.
  * It runs when a Driver launch is requested; in that case waitingApps is empty in the final
  * startExecutorsOnWorkers call, so only the Driver is launched.
  * It also runs when an application is submitted; in that case there is no Driver waiting to be
  * launched, and startExecutorsOnWorkers runs directly to allocate resources to the current application.
  */
private def schedule(): Unit = {
  // Check the Master's state
  if (state != RecoveryState.ALIVE) {
    return
  }
  // Drivers take strict precedence over executors; the alive workers are shuffled here
  val shuffledAliveWorkers = Random.shuffle(workers.toSeq.filter(_.state == WorkerState.ALIVE))
  // Number of alive workers
  val numWorkersAlive = shuffledAliveWorkers.size
  var curPos = 0
  for (driver <- waitingDrivers.toList) { // iterate over a copy of waitingDrivers
    // We assign workers to each waiting driver in a round-robin fashion. For each driver, we
    // start from the last worker that was assigned a driver, and continue onwards until we have
    // explored all alive workers.
    var launched = false
    var numWorkersVisited = 0
    while (numWorkersVisited < numWorkersAlive && !launched) {
      // Take the worker at position curPos
      val worker = shuffledAliveWorkers(curPos)
      numWorkersVisited += 1
      if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
        // Launch the Driver here; after it starts, it will request resources for the current application
        launchDriver(worker, driver)
        waitingDrivers -= driver
        launched = true
      }
      // curPos keeps advancing through the workers until one with enough resources is found
      curPos = (curPos + 1) % numWorkersAlive
    }
  }
  startExecutorsOnWorkers()
}

We are back in schedule(). Last time we went in through launchDriver(worker, driver); now we come back out and continue into startExecutorsOnWorkers():

private def startExecutorsOnWorkers(): Unit = {
  // Right now this is a very simple FIFO scheduler. We keep trying to fit in the first app
  // in the queue, then the second app, etc.
  // Take the submitted apps from waitingApps
  for (app <- waitingApps) {
    // coresPerExecutor: how many cores one Executor uses for this application; it can be set with --executor-cores and, as shown below, defaults to 1 if unspecified
    val coresPerExecutor = app.desc.coresPerExecutor.getOrElse(1)
    // If the cores left are less than coresPerExecutor, they will not be allocated
    // Check whether the application still needs cores; every time cores are allocated, app.coresLeft is reduced accordingly
    if (app.coresLeft >= coresPerExecutor) {
      // Filter out workers that don't have enough resources to launch an executor
      // Filter out the usable workers
      val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
        .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
          worker.coresFree >= coresPerExecutor)
        .sortBy(_.coresFree).reverse

      /**
        * Decide how many cores each worker contributes and how many Executors to launch on it; note that spreadOutApps is true.
        * The returned assignedCores is how many cores each worker node should give to the current application.
        */
      val assignedCores = scheduleExecutorsOnWorkers(app, usableWorkers, spreadOutApps)

      // Now that we've decided how many cores to allocate on each worker, let's allocate them
      for (pos <- 0 until usableWorkers.length if assignedCores(pos) > 0) {
        // Allocate the worker's resources to Executors
        allocateWorkerResourceToExecutors(
          app, assignedCores(pos), app.desc.coresPerExecutor, usableWorkers(pos))
      }
    }
  }
}

The method to pay attention to here is scheduleExecutorsOnWorkers(app, usableWorkers, spreadOutApps), which returns how many cores are assigned on each Worker; the rest is explained in the comments:

private def scheduleExecutorsOnWorkers(
      app: ApplicationInfo,
      usableWorkers: Array[WorkerInfo],
      spreadOutApps: Boolean): Array[Int] = {
    // How many cores one Executor uses; if --executor-cores was not specified at submit time this is None
    val coresPerExecutor : Option[Int]= app.desc.coresPerExecutor
    // If the number of cores per Executor was not specified at submit time, it defaults to 1
    val minCoresPerExecutor = coresPerExecutor.getOrElse(1)
    // oneExecutorPerWorker is true in this case
    val oneExecutorPerWorker :Boolean= coresPerExecutor.isEmpty
    // By default an Executor uses 1024 MB of memory; this is set in SparkContext (around line 464)
    val memoryPerExecutor = app.desc.memoryPerExecutorMB
    // Number of usable workers
    val numUsable = usableWorkers.length
    // Create two important arrays
    val assignedCores = new Array[Int](numUsable) // Number of cores to give to each worker
    val assignedExecutors = new Array[Int](numUsable) // Number of new executors on each worker
    /**
      * coresToAssign is how many cores to allocate to the Application now: the minimum of app.coresLeft
      * and the total free cores across all workers.
      * If --total-executor-cores was specified when submitting the application, app.coresLeft is that value.
      */
    var coresToAssign = math.min(app.coresLeft, usableWorkers.map(_.coresFree).sum)

    /** Return whether the specified worker can launch an executor for this app. */
    // Whether the worker at this position can still launch an Executor
    def canLaunchExecutor(pos: Int): Boolean = {
      // Are there still at least minCoresPerExecutor cores left to assign?
      val keepScheduling = coresToAssign >= minCoresPerExecutor
      // Does this worker have enough free cores?
      val enoughCores = usableWorkers(pos).coresFree - assignedCores(pos) >= minCoresPerExecutor

      // If we allow multiple executors per worker, then we can always launch new executors.
      // Otherwise, if there is already an executor on this worker, just give it more cores.
      // If assignedExecutors(pos) == 0 is true, then launchingNewExecutor is true
      val launchingNewExecutor = !oneExecutorPerWorker || assignedExecutors(pos) == 0
      // Launching a new Executor
      if (launchingNewExecutor) {
        val assignedMemory = assignedExecutors(pos) * memoryPerExecutor
        // Does this worker have enough free memory?
        val enoughMemory = usableWorkers(pos).memoryFree - assignedMemory >= memoryPerExecutor
        // Safety check: the Executors being assigned plus the application's existing Executors must stay under the application's Executor limit
        val underLimit = assignedExecutors.sum + app.executors.size < app.executorLimit
        keepScheduling && enoughCores && enoughMemory && underLimit
      } else {
        // We're adding cores to an existing executor, so no need
        // to check memory and executor limits
        keepScheduling && enoughCores
      }
    }

    // Keep launching executors until no more workers can accommodate any
    // more executors, or if we have reached this application's limits
//    var freeWorkers = (0 until numUsable).filter(one=>canLaunchExecutor(one))
    var freeWorkers = (0 until numUsable).filter(canLaunchExecutor)
    while (freeWorkers.nonEmpty) {
      freeWorkers.foreach { pos =>
        var keepScheduling = true
        while (keepScheduling && canLaunchExecutor(pos)) {
          coresToAssign -= minCoresPerExecutor
          assignedCores(pos) += minCoresPerExecutor

          // If we are launching one executor per worker, then every iteration assigns 1 core
          // to the executor. Otherwise, every iteration assigns cores to a new executor.
          if (oneExecutorPerWorker) {
            assignedExecutors(pos) = 1
          } else {
            assignedExecutors(pos) += 1
          }

          // Spreading out an application means spreading out its executors across as
          // many workers as possible. If we are not spreading out, then we should keep
          // scheduling executors on this worker until we use all of its resources.
          // Otherwise, just move on to the next worker.
          if (spreadOutApps) {
            keepScheduling = false
          }
        }
      }
      freeWorkers = freeWorkers.filter(canLaunchExecutor)
    }
    // Finally return how many cores are assigned on each Worker
    assignedCores
  }
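
To make the round-robin spreading concrete, here is a small standalone sketch (hypothetical numbers, simplified logic that mirrors the spreadOutApps = true branch above, not the Spark method itself):

object SpreadOutSketch {
  def main(args: Array[String]): Unit = {
    // Assume 3 usable workers, each with 8 free cores, and an application submitted
    // with --total-executor-cores 6 and --executor-cores 2.
    val freeCores = Array(8, 8, 8)
    val minCoresPerExecutor = 2
    var coresToAssign = 6
    val assignedCores = new Array[Int](freeCores.length)

    // spreadOutApps = true: give each worker minCoresPerExecutor in turn,
    // then move on to the next worker, until nothing is left to assign.
    var pos = 0
    while (coresToAssign >= minCoresPerExecutor) {
      if (freeCores(pos) - assignedCores(pos) >= minCoresPerExecutor) {
        assignedCores(pos) += minCoresPerExecutor
        coresToAssign -= minCoresPerExecutor
      }
      pos = (pos + 1) % freeCores.length
    }
    // Prints: assignedCores = [2, 2, 2] -> one 2-core Executor on each worker
    println(s"assignedCores = ${assignedCores.mkString("[", ", ", "]")}")
  }
}

With spreadOutApps = false, the inner loop in the real method would instead keep assigning cores to the same worker until its resources ran out before moving on.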

Once every worker has been assigned its share of cores, the Executors can be launched. Back in startExecutorsOnWorkers (called from Master.schedule()), allocateWorkerResourceToExecutors is invoked with the current application, the number of cores assigned on that worker, the number of cores per Executor, and the usable worker:

// Allocate the worker's resources to Executors
allocateWorkerResourceToExecutors(
  app, assignedCores(pos), app.desc.coresPerExecutor, usableWorkers(pos))

Stepping in, the Master computes how many cores each Executor gets and then goes to the worker to launch the Executors:

private def allocateWorkerResourceToExecutors(
    app: ApplicationInfo,
    assignedCores: Int,
    coresPerExecutor: Option[Int],
    worker: WorkerInfo): Unit = {
  // If the number of cores per executor is specified, we divide the cores assigned
  // to this worker evenly among the executors with no remainder.
  // Otherwise, we launch a single executor that grabs all the assignedCores on this worker.
  val numExecutors = coresPerExecutor.map { assignedCores / _ }.getOrElse(1)
  // How many cores each Executor gets
  val coresToAssign = coresPerExecutor.getOrElse(assignedCores)
  for (i <- 1 to numExecutors) {
    val exec: ExecutorDesc = app.addExecutor(worker, coresToAssign)
    // Launch the Executor on the worker
    launchExecutor(worker, exec)
    app.state = ApplicationState.RUNNING
  }
}

In launchExecutor(worker, exec), the Master gets the Worker's mailbox and sends it a message to launch an Executor with a given number of cores and amount of memory; the Worker's receive method keeps matching the LaunchExecutor type:

// Launch an Executor
private def launchExecutor(worker: WorkerInfo, exec: ExecutorDesc): Unit = {
  logInfo("Launching executor " + exec.fullId + " on worker " + worker.id)
  worker.addExecutor(exec)

  /**
    *  Get the Worker's mailbox and send it a LaunchExecutor message [how many cores and how much memory].
    *  The Worker's receive method keeps matching the LaunchExecutor type.
    */
  worker.endpoint.send(LaunchExecutor(masterUrl,
    exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory))
  exec.application.driver.send(
    ExecutorAdded(exec.id, worker.id, worker.hostPort, exec.cores, exec.memory))
}

In the Worker's handler an ExecutorRunner object is created. The appDesc parameter contains Command("org.apache.spark.executor.CoarseGrainedExecutorBackend", …), whose first argument is the Executor class:

// Create the ExecutorRunner
val manager = new ExecutorRunner(
  appId,
  execId,
  /**
    * appDesc contains Command("org.apache.spark.executor.CoarseGrainedExecutorBackend", .......),
    * whose first argument is the Executor class.
    */
  appDesc.copy(command = Worker.maybeUpdateSSLSettings(appDesc.command, conf)),
  cores_,
  memory_,
  self,
  workerId,
  host,
  webUi.boundPort,
  publicAddress,
  sparkHome,
  executorDir,
  workerUri,
  conf,
  appLocalDirs, ExecutorState.RUNNING)

Next, the Executor is started through the manager; what gets started is the CoarseGrainedExecutorBackend class. In CoarseGrainedExecutorBackend's main method, the Executor registers itself back with the Driver:

// Register the Executor's mailbox; this triggers CoarseGrainedExecutorBackend's onStart method
env.rpcEnv.setupEndpoint("Executor", new CoarseGrainedExecutorBackend(
  env.rpcEnv, driverUrl, executorId, hostname, cores, userClassPath, env))

This calls the onStart() method of CoarseGrainedExecutorBackend:

override def onStart() {
  logInfo("Connecting to driver: " + driverUrl)
  // Get the Driver's reference from the RPC environment and register the Executor back with the Driver
  rpcEnv.asyncSetupEndpointRefByURI(driverUrl).flatMap { ref =>
    // This is a very fast action so we can use "ThreadUtils.sameThread"
    // Got the Driver's reference
    driver = Some(ref)
    /**
      * Register the Executor's information back with the Driver, i.e. with the DriverEndpoint inside the
      * CoarseGrainedSchedulerBackend seen earlier.
      * The DriverEndpoint's receiveAndReply method matches RegisterExecutor.
      */
    ref.ask[Boolean](RegisterExecutor(executorId, self, hostname, cores, extractLogUrls))
  }(ThreadUtils.sameThread).onComplete {
    // This is a very fast action so we can use "ThreadUtils.sameThread"
    case Success(msg) =>
      // Always receive `true`. Just ignore it
    case Failure(e) =>
      exitExecutor(1, s"Cannot register with driver: $driverUrl", e, notifyDriver = false)
  }(ThreadUtils.sameThread)
}

In this method the Driver's reference is obtained from the RPC environment and the Executor registers itself back with the Driver; the ref here points at the DriverEndpoint of CoarseGrainedSchedulerBackend. Next, in CoarseGrainedSchedulerBackend, find the case that matches RegisterExecutor and completes the registration:

/**
  * Get the Executor's mailbox and send a message to the executorRef telling it the Executor has been registered.
  * In CoarseGrainedExecutorBackend the receive method keeps listening for this; once matched, the Executor starts.
  */
executorRef.send(RegisteredExecutor)

The Driver side tells the Executor side that it has been registered, and once this message is matched the Executor is started. Look at the Executor side's case for RegisteredExecutor, which starts the Executor:

// Matches the message from the Driver saying the Executor registration was accepted; now start the Executor
case RegisteredExecutor =>
  logInfo("Successfully registered with driver")
  try {
    // Create the real Executor here; the Executor contains a thread pool used to run tasks
    executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)
  } catch {
    case NonFatal(e) =>
      exitExecutor(1, "Unable to create executor due to " + e.getMessage, e)
  }

Once told that its registration back with the Driver succeeded, the Executor is actually created; the Executor contains a thread pool used to run tasks:

/**
  * The thread pool inside the Executor
  */
private val threadPool = {
  val threadFactory = new ThreadFactoryBuilder()
    .setDaemon(true)
    .setNameFormat("Executor task launch worker-%d")
    .setThreadFactory(new ThreadFactory {
      override def newThread(r: Runnable): Thread =
        // Use UninterruptibleThread to run tasks so that we can allow running codes without being
        // interrupted by `Thread.interrupt()`. Some issues, such as KAFKA-1894, HADOOP-10622,
        // will hang forever if some methods are interrupted.
        new UninterruptibleThread(r, "unused") // thread name will be set by ThreadFactoryBuilder
    })
    .build()
  Executors.newCachedThreadPool(threadFactory).asInstanceOf[ThreadPoolExecutor]
}

At this point the Executor has been created, and Spark's task scheduling begins.

5. Spark Task Scheduling: Source Code Analysis

The resource-scheduling walkthrough has reached the point where the Executor is created and its ThreadPool is in place; from here on, task scheduling takes over. Pick any action operator, count() for example, and follow the source:

def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
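
For reference, a minimal driver program (hypothetical, using only the standard RDD API) whose count() call would go down exactly this path:

import org.apache.spark.{SparkConf, SparkContext}

object CountJobSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("count-job-sketch")
    val sc = new SparkContext(conf)
    // count() is an action: it calls sc.runJob, which hands the job to the DAGScheduler.
    val n = sc.parallelize(1 to 1000, numSlices = 4)
      .filter(_ % 2 == 0)
      .count()
    println(s"even numbers: $n")
    sc.stop()
  }
}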

The sc in count() is the SparkContext object; step into its runJob method:

def runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U): Array[U] = {
  runJob(rdd, func, 0 until rdd.partitions.length)
}

Step into runJob again:

def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: Iterator[T] => U,
    partitions: Seq[Int]): Array[U] = {
  val cleanedFunc = clean(func)
  runJob(rdd, (ctx: TaskContext, it: Iterator[T]) => cleanedFunc(it), partitions)
}

And into runJob once more:

def runJob[T, U: ClassTag](
   rdd: RDD[T],
   func: (TaskContext, Iterator[T]) => U,
   partitions: Seq[Int]): Array[U] = {
   val results = new Array[U](partitions.size)
   runJob[T, U](rdd, func, partitions, (index, res) => results(index) = res)
   results
 }

Stepping into runJob one more time, we find rdd.doCheckpoint(), which walks back up the entire RDD lineage. Following the rdd parameter that was passed in, it is used in the call dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get):

def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    resultHandler: (Int, U) => Unit): Unit = {
  if (stopped.get()) {
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite.shortForm)
  if (conf.getBoolean("spark.logLineage", false)) {
    logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
  }
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
  progressBar.foreach(_.finishAll())
  rdd.doCheckpoint()
}

Inside that method, keep following the rdd parameter: it is passed to submitJob, so step into submitJob and see where rdd is used:

// Submit the job. eventProcessLoop is a DAGSchedulerEventProcessLoop object
eventProcessLoop.post(JobSubmitted(
  jobId, rdd, func2, partitions.toArray, callSite, waiter,
  SerializationUtils.clone(properties)))

rdd is passed into eventProcessLoop.post. Stepping into post, it calls eventQueue.put(event), which puts the submitted job onto a queue:

// Put the submitted job onto the queue
def post(event: E): Unit = {
  eventQueue.put(event)
}

Look at the type of eventQueue:

private val eventQueue: BlockingQueue[E] = new LinkedBlockingDeque[E]()

eventQueue lives in the EventLoop class, which has a run() method that processes the events just posted by eventProcessLoop.post. The processing logic is wrapped in onReceive(event); that method has no logic here (it is abstract in EventLoop), so we check what eventProcessLoop actually is and find it is a DAGSchedulerEventProcessLoop instance:

private[scheduler] val eventProcessLoop = new DAGSchedulerEventProcessLoop(this)

Find onReceive in the DAGSchedulerEventProcessLoop class and look at its logic:

override def onReceive(event: DAGSchedulerEvent): Unit = {
  val timerContext = timer.time()
  try {
    doOnReceive(event)
  } finally {
    timerContext.stop()
  }
}

doOnReceive(event) runs the handleJobSubmitted method:

// Run the handleJobSubmitted method
dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)

handleJobSubmitted calls submitStage(finalStage), which recursively finds the Stages along the wide/narrow dependencies and submits them:

// Recursively find the stages
private def submitStage(stage: Stage) {
  val jobId = activeJobForStage(stage)
  if (jobId.isDefined) {
    logDebug("submitStage(" + stage + ")")
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      val missing = getMissingParentStages(stage).sortBy(_.id)
      logDebug("missing: " + missing)
      if (missing.isEmpty) {
        logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
        // Eventually submitMissingTasks runs
        submitMissingTasks(stage, jobId.get)
      } else {
        for (parent <- missing) {
          submitStage(parent)
        }
        waitingStages += stage
      }
    }
  } else {
    abortStage(stage, "No active job for stage " + stage.id, None)
  }
}

Once the Stages have been worked out, submitMissingTasks(stage, jobId.get) splits each stage into tasks to be sent to the Executors, and submits them as a TaskSet:

// Submit the tasks as a TaskSet
taskScheduler.submitTasks(new TaskSet(
  tasks.toArray, stage.id, stage.latestInfo.attemptNumber, jobId, properties))

Submitting the tasks eventually calls backend.reviveOffers(), which is a method on the CoarseGrainedSchedulerBackend object:

backend.reviveOffers()

This submits the tasks to the Driver; the DriverEndpoint in the same class has a receive method that picks up the message:

override def reviveOffers() {
  // Submit the tasks to the Driver; the DriverEndpoint in this class has a receive method to pick them up
  driverEndpoint.send(ReviveOffers)
}

Which leads to makeOffers():

case ReviveOffers =>
    makeOffers()

Inside makeOffers(), launchTasks(taskDescs) is executed, which goes to the Executors to launch the tasks:

private def makeOffers() {
  // Make sure no executor is killed while some task is launching on it
  val taskDescs = CoarseGrainedSchedulerBackend.this.synchronized {
    // Filter out executors under killing
    val activeExecutors = executorDataMap.filterKeys(executorIsAlive)
    val workOffers = activeExecutors.map {
      case (id, executorData) =>
        new WorkerOffer(id, executorData.executorHost, executorData.freeCores)
    }.toIndexedSeq
    scheduler.resourceOffers(workOffers)
  }
  if (!taskDescs.isEmpty) {
    // Launch the tasks on the Executors
    launchTasks(taskDescs)
  }
}

launchTasks(taskDescs) sends the tasks to the Executors for execution:

executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask)))

In CoarseGrainedExecutorBackend, the receive method matches LaunchTask:

// Launch a Task
case LaunchTask(data) =>
  if (executor == null) {
    exitExecutor(1, "Received LaunchTask command but executor was null")
  } else {
    val taskDesc = TaskDescription.decode(data.value)
    logInfo("Got assigned task " + taskDesc.taskId)
    // The Executor launches the Task
    executor.launchTask(this, taskDesc)
  }

The Executor wraps the task and runs it in its thread pool:

// Wrap the task and execute it in the thread pool
def launchTask(context: ExecutorBackend, taskDescription: TaskDescription): Unit = {
  val tr = new TaskRunner(context, taskDescription)
  runningTasks.put(taskDescription.taskId, tr)
  threadPool.execute(tr)
}

With that, task scheduling is complete.

6. Summary

1. The Spark resource scheduling flow

The flow in detail:

  1. After the cluster starts, the Worker nodes report their resources to the Master, so the Master knows the resource situation of the whole cluster.
  2. When a Spark Application is submitted, a DAG is built from the dependencies between its RDDs. After submission, Spark creates two objects on the Driver side: the DAGScheduler and the TaskScheduler.
  3. The DAGScheduler is the high-level scheduler. Its main job is to cut the DAG into Stages along the wide/narrow dependencies between RDDs and submit those Stages to the TaskScheduler as TaskSets (the TaskScheduler is the low-level scheduler; a TaskSet is simply a collection wrapping the individual tasks, i.e. the parallel tasks of a stage).
  4. The TaskScheduler iterates over the TaskSet and sends each task to an Executor on a compute node to run (in practice, into the Executor's ThreadPool).
  5. The Executor's thread pool reports each task's status back to the TaskScheduler. When a task fails, the TaskScheduler retries it by resending it to an Executor, 3 times by default. If it still fails after 3 retries, the stage containing that task fails; stage failures are retried by the DAGScheduler, which resends the TaskSet to the TaskScheduler, 4 times by default. If the stage still fails after 4 retries, the job fails, and with it the Application. So by default a task can be attempted 3 * 4 = 12 times.
  6. Besides failed tasks, the TaskScheduler also retries straggling (slow) tasks, i.e. tasks running much slower than the others: it launches a new task that runs the same logic as the slow one, and whichever finishes first has its result taken. This is Spark's speculative execution mechanism; it is off by default and is configured through the spark.speculation property (see the sketch after the notes below).

Note:

1. For ETL-style jobs that write into a database, turn speculative execution off so that no duplicate records end up in the database.

2. When the data is skewed, enabling speculation may cause tasks running the same logic to be restarted over and over, and the job may never finish.
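
A minimal sketch of how these speculation settings are typically toggled (the property names are the standard Spark ones; the values shown are only examples):

import org.apache.spark.SparkConf

// Speculative execution is off by default; enable it explicitly when stragglers are a problem,
// and leave it off for ETL jobs that must not write duplicate records.
val conf = new SparkConf()
  .setAppName("speculation-example")
  .set("spark.speculation", "true")              // turn speculative execution on
  .set("spark.speculation.interval", "100ms")    // how often to check for stragglers
  .set("spark.speculation.multiplier", "1.5")    // how many times slower than the median counts as a straggler
  .set("spark.speculation.quantile", "0.75")     // fraction of tasks that must finish before checking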

2. Coarse-grained vs. fine-grained resource allocation

Coarse-grained resource allocation (Spark):

All resources are requested before the Application runs; only after the resources have been acquired does task scheduling start, and the resources are released only after all tasks have finished.

Advantage: since all resources are acquired before the Application runs, each task simply uses them without having to request resources itself. Tasks start faster, so stages finish faster, jobs finish faster, and the application as a whole runs faster.

Disadvantage: the resources are not released until the last task has finished, so the cluster's resources cannot be fully utilized. (A sketch of how this up-front request is usually bounded follows below.)
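
Since this up-front request is exactly what the Master's schedule()/startExecutorsOnWorkers code above serves, a typical way to bound it from the application side is through configuration (a sketch; the property names are the standard standalone-mode ones, the values are only examples):

import org.apache.spark.SparkConf

// Without spark.cores.max, a standalone application asks for Int.MaxValue cores,
// as seen in Master.createApplication above.
val conf = new SparkConf()
  .setAppName("coarse-grained-example")
  .set("spark.cores.max", "6")          // --total-executor-cores: cap on total cores for the app
  .set("spark.executor.cores", "2")     // --executor-cores: cores per Executor (coresPerExecutor)
  .set("spark.executor.memory", "2g")   // --executor-memory: memory per Executor (memoryPerExecutorMB)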

Fine-grained resource allocation (MapReduce):

The Application does not request resources up front; it just starts running, and each task in the job requests its own resources before it runs and releases them as soon as it finishes.

Advantage: cluster resources can be fully utilized.

Disadvantage: because each task requests its own resources, task startup is slower, and the Application as a whole runs correspondingly slower.
