Background
The earlier post in this series, Spark源碼分析之-scheduler模塊, noted that Spark handles resource management and scheduling the same way Hadoop YARN does: an outer resource manager plus a task scheduler inside each application, and it analyzed the task-scheduling module within a Spark application. This post looks at Spark's outer resource manager, the deploy module, to see how Spark coordinates resource scheduling and management across applications.
Spark originally left resource management to Mesos. To let more people use Spark, including users who have never touched Mesos, the Spark developers added the Standalone deployment mode, i.e. the deploy module. The deploy module is therefore only relevant to deployments that do not use Mesos for resource management.
Overall architecture of the deploy module
The deploy module consists of three submodules: master, worker and client. Each of them extends Actor, and they communicate with one another via actor messages.
- Master: accepts and manages worker registrations, accepts applications submitted by clients, schedules the waiting applications (FIFO) and assigns them to workers.
- Worker: registers itself with the master, sets up the process environment for an application according to what the master sends, and launches StandaloneExecutorBackend.
- Client: registers the application with the master and monitors it. When a user creates a SparkContext, a SparkDeploySchedulerBackend is instantiated, and instantiating SparkDeploySchedulerBackend starts the client. The startup parameters and application information are handed to the client, which then asks the master to register the application and to launch StandaloneExecutorBackend on the slave nodes (a sketch of this wiring follows the list).
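A minimal sketch of that client-side wiring, under the assumption that it happens when SparkDeploySchedulerBackend starts; this is not the verbatim Spark code, and the exact constructor arguments of Command, ApplicationDescription and Client are simplified:
// Schematic: describe the application, then hand the description to a Client,
// which registers it with the master on our behalf.
// The command each worker will run for this application (it launches StandaloneExecutorBackend);
// the argument list is abridged here.
val command = Command("spark.executor.StandaloneExecutorBackend", executorArgs, sc.executorEnvs)
// What the application asks for: a name, a core count, memory per slave, the command above
// and the Spark home to run it from.
val appDesc = new ApplicationDescription(appName, maxCores, executorMemory, command, sparkHome)
// The client lives inside the driver and serves only this one application;
// listener is the SparkDeploySchedulerBackend itself, which receives the callbacks.
val client = new Client(sc.env.actorSystem, masterUrl, appDesc, listener)
client.start()   // spawns the ClientActor shown in the submission walk-through below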
Here is the class diagram of the deploy module:
Messages exchanged in the deploy module
The deploy module is not complicated and does not contain much code; most of it deals with passing and handling messages between the submodules, so the main messages exchanged between them are listed here (a simplified sketch of their shapes follows the list):
- client to master
  - RegisterApplication (registers an application with the master)
- master to client
  - RegisteredApplication (reply to the application registration, sent back to the client)
  - ExecutorAdded (tells the client that a worker has launched an executor environment; sent to the client right after LaunchExecutor is sent to the worker)
  - ExecutorUpdated (tells the client that an executor's state has changed, e.g. it finished or exited abnormally; sent to the client after the worker reports ExecutorStateChanged to the master)
- master to worker
  - LaunchExecutor (asks the worker to launch an executor environment)
  - RegisteredWorker (reply to the worker's registration with the master)
  - RegisterWorkerFailed (reply when the worker's registration with the master fails)
  - KillExecutor (asks the worker to stop an executor environment)
- worker to master
  - RegisterWorker (registers the worker with the master)
  - Heartbeat (periodic heartbeat sent to the master)
  - ExecutorStateChanged (reports an executor state change to the master)
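To make the protocol easier to follow, here is a simplified sketch of these messages as case classes. The fields are abridged to the ones that appear in the handlers quoted below; ApplicationDescription stands in for the real Spark class, executor state is shown as a plain String instead of the ExecutorState enumeration, and the fields marked as assumed may differ from the actual Spark source:
// Simplified sketch of the deploy-module messages (not the verbatim Spark definitions)

// client -> master
case class RegisterApplication(appDescription: ApplicationDescription)

// master -> client
case class RegisteredApplication(appId: String)
case class ExecutorAdded(execId: Int, workerId: String, host: String, cores: Int, memory: Int)
case class ExecutorUpdated(execId: Int, state: String, message: Option[String], exitStatus: Option[Int])

// master -> worker
case class LaunchExecutor(appId: String, execId: Int, appDesc: ApplicationDescription, cores: Int, memory: Int, sparkHome: String)
case class RegisteredWorker(masterWebUiUrl: String)   // fields assumed
case class RegisterWorkerFailed(message: String)      // fields assumed
case class KillExecutor(appId: String, execId: Int)

// worker -> master
case class RegisterWorker(id: String, host: String, port: Int, cores: Int, memory: Int)   // abridged
case class Heartbeat(workerId: String)
case class ExecutorStateChanged(appId: String, execId: Int, state: String, message: Option[String], exitStatus: Option[Int])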
A closer look at the deploy module code
The deploy module is simpler than the scheduler module, so we will not analyze its code in great detail; we only walk through how an application is submitted and how it ends.
Application submission by the client
The client is created and started by SparkDeploySchedulerBackend, so a client is embedded in each application and serves only that application. When the client starts, it first registers the application with the master:
def start() {
  // Just launch an actor; it will call back into the listener.
  actor = actorSystem.actorOf(Props(new ClientActor))
}

start() merely spawns the ClientActor; the registration itself happens in the actor's preStart() hook:

override def preStart() {
  logInfo("Connecting to master " + masterUrl)
  try {
    master = context.actorFor(Master.toAkkaUrl(masterUrl))
    masterAddress = master.path.address
    master ! RegisterApplication(appDescription) // register the application with the master
    context.system.eventStream.subscribe(self, classOf[RemoteClientLifeCycleEvent])
    context.watch(master) // Doesn't work with remote actors, but useful for testing
  } catch {
    case e: Exception =>
      logError("Failed to connect to master", e)
      markDisconnected()
      context.stop(self)
  }
}
When the master receives the RegisterApplication request, it adds the application to the waiting queue, where it waits to be scheduled:
case RegisterApplication(description) => {
  logInfo("Registering app " + description.name)
  val app = addApplication(description, sender)
  logInfo("Registered app " + description.name + " with ID " + app.id)
  waitingApps += app
  context.watch(sender) // This doesn't work with remote actors but helps for testing
  sender ! RegisteredApplication(app.id)
  schedule()
}
The master calls schedule() after every such operation so that waiting applications are scheduled promptly.
Earlier we said the deploy module is a resource-management module, so what resources does it manage, and in what unit are they scheduled? In the current version of Spark, CPU cores are the yardstick: every submitted application declares how many cores it needs, and the master handles all application requests in FIFO order. A waiting application is scheduled once enough cores are free to satisfy its demand; otherwise it keeps waiting, although if the master can only grant part of the requested resources it will still start the application with what it has. This is what schedule() implements (a small worked example follows the code):
def schedule() {
  if (spreadOutApps) {
    for (app <- waitingApps if app.coresLeft > 0) {
      val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
        .filter(canUse(app, _)).sortBy(_.coresFree).reverse
      val numUsable = usableWorkers.length
      val assigned = new Array[Int](numUsable) // Number of cores to give on each node
      var toAssign = math.min(app.coresLeft, usableWorkers.map(_.coresFree).sum)
      var pos = 0
      while (toAssign > 0) {
        if (usableWorkers(pos).coresFree - assigned(pos) > 0) {
          toAssign -= 1
          assigned(pos) += 1
        }
        pos = (pos + 1) % numUsable
      }
      // Now that we've decided how many cores to give on each node, let's actually give them
      for (pos <- 0 until numUsable) {
        if (assigned(pos) > 0) {
          val exec = app.addExecutor(usableWorkers(pos), assigned(pos))
          launchExecutor(usableWorkers(pos), exec, app.desc.sparkHome)
          app.state = ApplicationState.RUNNING
        }
      }
    }
  } else {
    // Pack each app into as few nodes as possible until we've assigned all its cores
    for (worker <- workers if worker.coresFree > 0 && worker.state == WorkerState.ALIVE) {
      for (app <- waitingApps if app.coresLeft > 0) {
        if (canUse(app, worker)) {
          val coresToUse = math.min(worker.coresFree, app.coresLeft)
          if (coresToUse > 0) {
            val exec = app.addExecutor(worker, coresToUse)
            launchExecutor(worker, exec, app.desc.sparkHome)
            app.state = ApplicationState.RUNNING
          }
        }
      }
    }
  }
}
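To make the spread-out branch concrete, here is a tiny, self-contained simulation of that round-robin loop; the worker core counts and the 5-core demand are made-up numbers for illustration only:

object SpreadOutDemo extends App {
  // Free cores on three hypothetical ALIVE workers, already sorted descending as schedule() does
  val coresFree = Array(4, 3, 1)
  val assigned  = new Array[Int](coresFree.length)   // cores granted on each worker
  var toAssign  = math.min(5, coresFree.sum)         // the app still wants 5 cores

  // Hand out one core at a time, cycling over the workers and skipping any that are exhausted
  var pos = 0
  while (toAssign > 0) {
    if (coresFree(pos) - assigned(pos) > 0) {
      toAssign -= 1
      assigned(pos) += 1
    }
    pos = (pos + 1) % coresFree.length
  }

  println(assigned.mkString(", "))   // prints "2, 2, 1": the cores are spread across the workers
}

With spreadOutApps disabled, the same 5 cores would instead be packed onto as few workers as possible: 4 on the first worker and 1 on the second.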
Once an application has been scheduled, launchExecutor() is called to send the launch request to the worker and, at the same time, report the new executor to the client:
def launchExecutor(worker: WorkerInfo, exec: ExecutorInfo, sparkHome: String) {
  worker.addExecutor(exec)
  worker.actor ! LaunchExecutor(exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory, sparkHome)
  exec.application.driver ! ExecutorAdded(exec.id, worker.id, worker.host, exec.cores, exec.memory)
}
At this point the client-master interaction has handed off to the master-worker interaction, and the worker has to set up the application's launch environment:
case LaunchExecutor(appId, execId, appDesc, cores_, memory_, execSparkHome_) =>
  // Create an ExecutorRunner that will prepare the environment and fork the executor process
  val manager = new ExecutorRunner(
    appId, execId, appDesc, cores_, memory_, self, workerId, ip, new File(execSparkHome_), workDir)
  executors(appId + "/" + execId) = manager
  manager.start()
  // Account for the resources this executor occupies on the worker
  coresUsed += cores_
  memoryUsed += memory_
  // Tell the master the executor environment is now running
  master ! ExecutorStateChanged(appId, execId, ExecutorState.RUNNING, None, None)
On receiving the LaunchExecutor message, the worker creates an ExecutorRunner instance and at the same time reports to the master that the executor environment is up.
During startup, ExecutorRunner creates a thread, prepares the environment and launches a new process:
def start() {
  workerThread = new Thread("ExecutorRunner for " + fullId) {
    override def run() { fetchAndRunExecutor() }
  }
  workerThread.start()

  // Shutdown hook that kills actors on shutdown.
  ...
}

def fetchAndRunExecutor() {
  try {
    // Create the executor's working directory
    val executorDir = new File(workDir, appId + "/" + execId)
    if (!executorDir.mkdirs()) {
      throw new IOException("Failed to create directory " + executorDir)
    }

    // Launch the process
    val command = buildCommandSeq()
    val builder = new ProcessBuilder(command: _*).directory(executorDir)
    val env = builder.environment()
    for ((key, value) <- appDesc.command.environment) {
      env.put(key, value)
    }
    env.put("SPARK_MEM", memory.toString + "m")
    // In case we are running this from within the Spark Shell, avoid creating a "scala"
    // parent process for the executor command
    env.put("SPARK_LAUNCH_WITH_SCALA", "0")
    process = builder.start()

    // Redirect its stdout and stderr to files
    redirectStream(process.getInputStream, new File(executorDir, "stdout"))
    redirectStream(process.getErrorStream, new File(executorDir, "stderr"))

    // Wait for it to exit; this is actually a bad thing if it happens, because we expect to run
    // long-lived processes only. However, in the future, we might restart the executor a few
    // times on the same machine.
    val exitCode = process.waitFor()
    val message = "Command exited with code " + exitCode
    worker ! ExecutorStateChanged(appId, execId, ExecutorState.FAILED, Some(message),
      Some(exitCode))
  } catch {
    case interrupted: InterruptedException =>
      logInfo("Runner thread for executor " + fullId + " interrupted")
    case e: Exception => {
      logError("Error running executor", e)
      if (process != null) {
        process.destroy()
      }
      val message = e.getClass + ": " + e.getMessage
      worker ! ExecutorStateChanged(appId, execId, ExecutorState.FAILED, Some(message), None)
    }
  }
}
After the ExecutorRunner has started, the worker reports ExecutorStateChanged to the master, and the master re-packs the message as ExecutorUpdated and sends it to the client.
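On the master side, that re-packing is handled roughly as sketched below; this is a simplified fragment, with the executor lookup and the bookkeeping for finished executors abridged rather than copied from the real Master code:

case ExecutorStateChanged(appId, execId, state, message, exitStatus) =>
  // Map the (appId, execId) pair back to the ExecutorInfo the master tracks for this application
  idToApp.get(appId).flatMap(_.executors.get(execId)) match {
    case Some(exec) =>
      exec.state = state
      // Re-pack the worker's report as ExecutorUpdated and forward it to the client (the driver actor)
      exec.application.driver ! ExecutorUpdated(execId, state, message, exitStatus)
      // A finished or failed executor frees its worker's cores and memory and may trigger a re-schedule
      ...
    case None =>
      logWarning("Got status update for unknown executor " + appId + "/" + execId)
  }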
With that, the application submission process is essentially complete. It is not complicated; it mostly comes down to message passing.
The end of an application
An application can end for many reasons (normal completion, abnormal exit, and so on); let us now walk through the whole shutdown flow.
The end of an application usually also ends its client, and the master notices the client's termination through the actor system (it watches the driver actor and subscribes to remote lifecycle events).
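Schematically, that detection looks something like the fragment below. This is a sketch rather than the actual Master code: the addressToApp map and the exact set of lifecycle events handled here are assumptions, based on the client subscribing to RemoteClientLifeCycleEvent and the master calling context.watch(sender) when the application registered:

// Sketch only: the driver (client) actor has terminated or its connection dropped;
// find the application it belonged to and remove it.
case Terminated(actor) =>
  actorToApp.get(actor).foreach(removeApplication)
case RemoteClientDisconnected(transport, address) =>
  addressToApp.get(address).foreach(removeApplication)   // addressToApp is assumed here

Once the dead client has been mapped back to its application, removeApplication() does the actual cleanup: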
def removeApplication(app: ApplicationInfo) {
  if (apps.contains(app)) {
    logInfo("Removing app " + app.id)
    apps -= app
    idToApp -= app.id
    actorToApp -= app.driver
    addressToWorker -= app.driver.path.address
    completedApps += app // Remember it in our history
    waitingApps -= app
    for (exec <- app.executors.values) {
      exec.worker.removeExecutor(exec)
      exec.worker.actor ! KillExecutor(exec.application.id, exec.id)
    }
    app.markFinished(ApplicationState.FINISHED) // TODO: Mark it as FAILED if it failed
    schedule()
  }
}
removeApplication() first removes the application from the data structures the master maintains, then notifies every worker that hosts one of its executors, asking it to KillExecutor. When a worker receives KillExecutor, it calls the kill() function of the corresponding ExecutorRunner:
case KillExecutor(appId, execId) =>
  val fullId = appId + "/" + execId
  executors.get(fullId) match {
    case Some(executor) =>
      logInfo("Asked to kill executor " + fullId)
      executor.kill()
    case None =>
      logInfo("Asked to kill unknown executor " + fullId)
  }
Inside ExecutorRunner, kill() stops the monitoring thread, kills the process that thread launched, and reports ExecutorStateChanged to the worker:
def kill() {
  if (workerThread != null) {
    workerThread.interrupt()
    workerThread = null
    if (process != null) {
      logInfo("Killing process!")
      process.destroy()
      process.waitFor()
    }
    worker ! ExecutorStateChanged(appId, execId, ExecutorState.KILLED, None, None)
    Runtime.getRuntime.removeShutdownHook(shutdownHook)
  }
}
Ending an application thus cleans up everything about it on both the master and the workers. That completes the shutdown flow; we have not dug into the many error-handling branches, but that does not affect the main line.
End
This concludes the analysis of the deploy module for now. The module is relatively simple, with no particularly complex logic; as noted at the beginning, it is a way to let users whose clusters do not run Mesos still use Spark.
Admittedly it still looks somewhat rudimentary. For instance, does the FIFO application scheduling make small applications wait a long time behind large ones, and is there a better policy? Could resources be measured by more than just CPU count? In practice some applications are disk intensive and others are network intensive, so even with spare CPU, scheduling another application may not help much.
Overall, as a simple alternative to Mesos, the deploy module plays a positive role in making Spark more widely adoptable.