The logic after a job is submitted but before it executes:
On the client side:
1. The spark-submit script submits the job, which ends up calling the main method of our own submitted class via reflection.
2. Our own code then executes new SparkContext, which:
2.1. creates the ActorSystem;
2.2. creates TaskSchedulerImpl, the class that dispatches tasks;
2.3. creates SparkDeploySchedulerBackend, which handles task scheduling;
2.4. creates DAGScheduler, which splits the job into stages and starts a thread draining a blocking task queue;
2.5. creates the ClientActor, which registers the application (job jar) info with the Master;
2.6. creates the DriverActor.
On the Master side:
1. The Master receives the application's registration info and stores it.
2. Through worker registration and heartbeats, it knows how many resources the cluster has.
3. It compares the resources the application needs against what the cluster offers and allocates accordingly (either spread out across many workers or consolidated on a few).
4. It notifies the workers to launch executors.
On the Worker side:
1. Launch the executor.
2. The executor interacts with the DriverActor, listening for and receiving tasks (see the sketch below).
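Before walking through the source, here is a minimal, self-contained sketch of the client-to-Master registration handshake using classic Akka actors. Everything in it (RegisterApp, MasterActor, the field names) is invented for illustration; Spark's real messages and actors carry far more state.

import akka.actor.{Actor, ActorRef, ActorSystem, Props}

case class RegisterApp(name: String, coresNeeded: Int)
case class RegisteredApp(appId: String)

// Plays the role of the Master: records the app and acks the registration.
class MasterActor extends Actor {
  private var apps = Map.empty[String, RegisterApp]
  def receive: Receive = {
    case app @ RegisterApp(name, _) =>
      val appId = s"app-$name-${apps.size}"
      apps += appId -> app
      sender() ! RegisteredApp(appId)
  }
}

// Plays the role of the ClientActor: registers the app as soon as it starts.
class ClientActor(master: ActorRef) extends Actor {
  override def preStart(): Unit = master ! RegisterApp("wordcount", coresNeeded = 4)
  def receive: Receive = {
    case RegisteredApp(appId) => println(s"registered as $appId")
  }
}

object Demo extends App {
  val system = ActorSystem("sketch")
  val master = system.actorOf(Props[MasterActor], "master")
  system.actorOf(Props(new ClientActor(master)), "client")
}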
Scheduling after the job is submitted to the cluster: spark-submit
The spark-submit script launches the SparkSubmit class, which runs SparkSubmit's main method.
Let's take a look at that main method:
def main(args: Array[String]): Unit = {
  val appArgs = new SparkSubmitArguments(args)
  if (appArgs.verbose) {
    printStream.println(appArgs)
  }
  appArgs.action match {
    case SparkSubmitAction.SUBMIT => submit(appArgs)
    case SparkSubmitAction.KILL => kill(appArgs)
    case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs)
  }
}
The action matches SUBMIT, so let's look at the submit method:
private[spark] def submit(args: SparkSubmitArguments): Unit = {
  val (childArgs, childClasspath, sysProps, childMainClass) = prepareSubmitEnvironment(args)

  def doRunMain(): Unit = {
    if (args.proxyUser != null) {
      val proxyUser = UserGroupInformation.createProxyUser(args.proxyUser,
        UserGroupInformation.getCurrentUser())
      try {
        proxyUser.doAs(new PrivilegedExceptionAction[Unit]() {
          override def run(): Unit = {
            runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
          }
        })
      } catch {
        case e: Exception =>
          // Hadoop's AuthorizationException suppresses the exception's stack trace, which
          // makes the message printed to the output by the JVM not very helpful. Instead,
          // detect exceptions with empty stack traces here, and treat them differently.
          if (e.getStackTrace().length == 0) {
            printStream.println(s"ERROR: ${e.getClass().getName()}: ${e.getMessage()}")
            exitFn()
          } else {
            throw e
          }
      }
    } else {
      runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
    }
  }

  // In standalone cluster mode, there are two submission gateways:
  //   (1) The traditional Akka gateway using o.a.s.deploy.Client as a wrapper
  //   (2) The new REST-based gateway introduced in Spark 1.3
  // The latter is the default behavior as of Spark 1.3, but Spark submit will fail over
  // to use the legacy gateway if the master endpoint turns out to be not a REST server.
  if (args.isStandaloneCluster && args.useRest) {
    try {
      printStream.println("Running Spark using the REST application submission protocol.")
      doRunMain()
    } catch {
      // Fail over to use the legacy submission gateway
      case e: SubmitRestConnectionException =>
        printWarning(s"Master endpoint ${args.master} was not a REST server. " +
          "Falling back to legacy submission gateway instead.")
        args.useRest = false
        submit(args)
    }
  // In all other modes, just run the main class as prepared
  } else {
    doRunMain()
  }
}
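Note the fail-over pattern here: try the REST gateway first, and if the master endpoint turns out not to be a REST server, flip useRest and resubmit through the legacy Akka gateway. A stripped-down sketch of that pattern, with a generic ConnectException standing in for SubmitRestConnectionException:

import java.net.ConnectException

// Generic retry-with-fallback: try the preferred gateway, and on a
// connection failure retry once through the legacy path.
def submitWithFallback(useRest: Boolean)(run: Boolean => Unit): Unit = {
  if (useRest) {
    try run(true)
    catch {
      case _: ConnectException =>
        // The endpoint was not a REST server; fall back to the legacy gateway.
        submitWithFallback(useRest = false)(run)
    }
  } else {
    run(false)
  }
}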
Both branches of submit end up in doRunMain, which delegates to runMain:
private def runMain(
    childArgs: Seq[String],
    childClasspath: Seq[String],
    sysProps: Map[String, String],
    childMainClass: String,
    verbose: Boolean): Unit = {
  if (verbose) {
    printStream.println(s"Main class:\n$childMainClass")
    printStream.println(s"Arguments:\n${childArgs.mkString("\n")}")
    printStream.println(s"System properties:\n${sysProps.mkString("\n")}")
    printStream.println(s"Classpath elements:\n${childClasspath.mkString("\n")}")
    printStream.println("\n")
  }

  val loader =
    if (sysProps.getOrElse("spark.driver.userClassPathFirst", "false").toBoolean) {
      new ChildFirstURLClassLoader(new Array[URL](0),
        Thread.currentThread.getContextClassLoader)
    } else {
      new MutableURLClassLoader(new Array[URL](0),
        Thread.currentThread.getContextClassLoader)
    }
  Thread.currentThread.setContextClassLoader(loader)

  for (jar <- childClasspath) {
    addJarToClasspath(jar, loader)
  }

  for ((key, value) <- sysProps) {
    System.setProperty(key, value)
  }

  var mainClass: Class[_] = null
  try {
    mainClass = Class.forName(childMainClass, true, loader)
  } catch {
    case e: ClassNotFoundException =>
      e.printStackTrace(printStream)
      if (childMainClass.contains("thriftserver")) {
        printStream.println(s"Failed to load main class $childMainClass.")
        printStream.println("You need to build Spark with -Phive and -Phive-thriftserver.")
      }
      System.exit(CLASS_NOT_FOUND_EXIT_STATUS)
  }

  // SPARK-4170
  if (classOf[scala.App].isAssignableFrom(mainClass)) {
    printWarning("Subclasses of scala.App may not work correctly. Use a main() method instead.")
  }

  val mainMethod = mainClass.getMethod("main", new Array[String](0).getClass)
  if (!Modifier.isStatic(mainMethod.getModifiers)) {
    throw new IllegalStateException("The main method in the given main class must be static")
  }

  def findCause(t: Throwable): Throwable = t match {
    case e: UndeclaredThrowableException =>
      if (e.getCause() != null) findCause(e.getCause()) else e
    case e: InvocationTargetException =>
      if (e.getCause() != null) findCause(e.getCause()) else e
    case e: Throwable =>
      e
  }

  try {
    mainMethod.invoke(null, childArgs.toArray)
  } catch {
    case t: Throwable =>
      throw findCause(t)
  }
}
From the code above we can see that runMain obtains the main class of our program via reflection and then executes our main method with invoke.
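To see the core of this in isolation, here is a minimal sketch of the same reflection pattern (the class name com.example.WordCount in the usage comment is a placeholder):

import java.lang.reflect.Modifier

// Load a class by name with the current context classloader and invoke its
// static main(String[]) method, which is the shape SparkSubmit relies on.
def invokeMain(className: String, args: Array[String]): Unit = {
  val loader = Thread.currentThread.getContextClassLoader
  val mainClass = Class.forName(className, true, loader)
  val mainMethod = mainClass.getMethod("main", classOf[Array[String]])
  require(Modifier.isStatic(mainMethod.getModifiers), "main method must be static")
  mainMethod.invoke(null, args)  // receiver is null because main is static
}

// invokeMain("com.example.WordCount", Array("hdfs://.../in", "hdfs://.../out"))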
Now let's step into our own WordCount program:
object WordCount {
  def main(args: Array[String]): Unit = {
    // Create the conf and set the application name and run mode.
    // local[2] means running locally with two threads, producing two result files:
    // val conf = new SparkConf().setAppName("wordcount").setMaster("local[2]")
    // Submit to the cluster for execution:
    val conf = new SparkConf().setAppName("wordcount")
    // Create the SparkContext
    val sc = new SparkContext(conf)
    // Start the computation:
    // textFile reads the input data from HDFS
    val file: RDD[String] = sc.textFile(args(0))
    // Flatten: split each line into individual words
    val words: RDD[String] = file.flatMap(_.split(" "))
    val tuple: RDD[(String, Int)] = words.map((_, 1))
    val result: RDD[(String, Int)] = tuple.reduceByKey(_ + _)
    val resultBy: RDD[(String, Int)] = result.sortBy(_._2, false)
    // Print the result
    // resultBy.foreach(println)
    resultBy.saveAsTextFile(args(1))
  }
}
SparkConf mainly wraps configuration settings; we will skip its details for now and look instead at how SparkContext initializes.
SparkContext is very important: it is the entry point for submitting Spark jobs. During construction it:
1. calls createSparkEnv to create the SparkEnv, which in turn creates the ActorSystem;
2. creates the TaskScheduler;
3. creates the DAGScheduler;
4. calls taskScheduler.start().
Let's look at the key code snippets in SparkContext:
private[spark] def createSparkEnv(
    conf: SparkConf,
    isLocal: Boolean,
    listenerBus: LiveListenerBus): SparkEnv = {
  SparkEnv.createDriverEnv(conf, isLocal, listenerBus)
}

private[spark] val env = createSparkEnv(conf, isLocal, listenerBus)
SparkEnv.set(env)
This creates the SparkEnv, which is used to create the ActorSystem:
/**
 * Create a SparkEnv for the driver.
 */
private[spark] def createDriverEnv(
    conf: SparkConf,
    isLocal: Boolean,
    listenerBus: LiveListenerBus,
    mockOutputCommitCoordinator: Option[OutputCommitCoordinator] = None): SparkEnv = {
  assert(conf.contains("spark.driver.host"), "spark.driver.host is not set on the driver!")
  assert(conf.contains("spark.driver.port"), "spark.driver.port is not set on the driver!")
  val hostname = conf.get("spark.driver.host")
  val port = conf.get("spark.driver.port").toInt
  create(
    conf,
    SparkContext.DRIVER_IDENTIFIER,
    hostname,
    port,
    isDriver = true,
    isLocal = isLocal,
    listenerBus = listenerBus,
    mockOutputCommitCoordinator = mockOutputCommitCoordinator
  )
}
/**
 * Helper method to create a SparkEnv for a driver or an executor.
 */
private def create(
    conf: SparkConf,
    executorId: String,
    hostname: String,
    port: Int,
    isDriver: Boolean,
    isLocal: Boolean,
    listenerBus: LiveListenerBus = null,
    numUsableCores: Int = 0,
    mockOutputCommitCoordinator: Option[OutputCommitCoordinator] = None): SparkEnv = {

  // Listener bus is only used on the driver
  if (isDriver) {
    assert(listenerBus != null, "Attempted to create driver SparkEnv with null listener bus!")
  }

  val securityManager = new SecurityManager(conf)

  // Create the ActorSystem for Akka and get the port it binds to.
  val (actorSystem, boundPort) = {
    val actorSystemName = if (isDriver) driverActorSystemName else executorActorSystemName
    AkkaUtils.createActorSystem(actorSystemName, hostname, port, conf, securityManager)
  }
  // ... (the rest of the method, which wires up the remaining SparkEnv components, is omitted)
The code above ultimately creates and starts the ActorSystem through SparkEnv.createDriverEnv(conf, isLocal, listenerBus).
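As a hedged sketch of what a helper like AkkaUtils.createActorSystem does under the hood: build a remoting-enabled ActorSystem bound to a given host and port. The config keys below are the Akka 2.3-era netty transport settings that Spark 1.x shipped with; the real helper additionally handles bind-retry and security configuration:

import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

def createActorSystem(name: String, host: String, port: Int): ActorSystem = {
  // Remoting configuration: bind the actor system to host:port so that
  // remote actors (e.g. on the Master) can send it messages.
  val akkaConf = ConfigFactory.parseString(
    s"""
       |akka.actor.provider = "akka.remote.RemoteActorRefProvider"
       |akka.remote.netty.tcp.hostname = "$host"
       |akka.remote.netty.tcp.port = $port
       |""".stripMargin)
  ActorSystem(name, akkaConf)
}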
Continuing with the code further down in SparkContext:
// Create and start the scheduler
private[spark] var (schedulerBackend, taskScheduler) =
  SparkContext.createTaskScheduler(this, master)
private val heartbeatReceiver = env.actorSystem.actorOf(
  Props(new HeartbeatReceiver(taskScheduler)), "HeartbeatReceiver")
@volatile private[spark] var dagScheduler: DAGScheduler = _
try {
  dagScheduler = new DAGScheduler(this)
} catch {
  case e: Exception => {
    try {
      stop()
    } finally {
      throw new SparkException("Error while constructing DAGScheduler", e)
    }
  }
}
The code above creates the TaskScheduler, which submits tasks, and the DAGScheduler, which splits the application into stages of tasks (see the event-loop sketch below).
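As an aside on step 2.4 from the overview: the DAGScheduler does its work on a dedicated daemon thread that drains a blocking event queue (Spark implements this pattern in an internal event loop). Here is a toy version of that structure; EventLoop and its members are hypothetical names, not Spark's API:

import java.util.concurrent.LinkedBlockingQueue

class EventLoop[E](name: String)(handle: E => Unit) {
  private val queue = new LinkedBlockingQueue[E]()
  private val thread = new Thread(name) {
    override def run(): Unit =
      try {
        while (true) handle(queue.take())  // take() blocks until an event arrives
      } catch {
        case _: InterruptedException =>    // stop() was called; exit quietly
      }
  }
  thread.setDaemon(true)
  def post(event: E): Unit = queue.put(event)
  def start(): Unit = thread.start()
  def stop(): Unit = thread.interrupt()
}

// Usage: val loop = new EventLoop[String]("dag-events")(e => println(s"got $e"))
//        loop.start(); loop.post("JobSubmitted")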
Now step into createTaskScheduler to see how the scheduler is built:
/**
 * Create a task scheduler based on a given master URL.
 * Return a 2-tuple of the scheduler backend and the task scheduler.
 */
private def createTaskScheduler(
    sc: SparkContext,
    master: String): (SchedulerBackend, TaskScheduler) = {
  // Regular expression used for local[N] and local[*] master formats
  val LOCAL_N_REGEX = """local\[([0-9]+|\*)\]""".r
  // Regular expression for local[N, maxRetries], used in tests with failing tasks
  val LOCAL_N_FAILURES_REGEX = """local\[([0-9]+|\*)\s*,\s*([0-9]+)\]""".r
  // Regular expression for simulating a Spark cluster of [N, cores, memory] locally
  val LOCAL_CLUSTER_REGEX = """local-cluster\[\s*([0-9]+)\s*,\s*([0-9]+)\s*,\s*([0-9]+)\s*]""".r
  // Regular expression for connecting to Spark deploy clusters
  val SPARK_REGEX = """spark://(.*)""".r
  // Regular expression for connection to Mesos cluster by mesos:// or zk:// url
  val MESOS_REGEX = """(mesos|zk)://.*""".r
  // Regular expression for connection to Simr cluster
  val SIMR_REGEX = """simr://(.*)""".r

  // When running locally, don't try to re-execute tasks on failure.
  val MAX_LOCAL_TASK_FAILURES = 1

  master match {
    case "local" =>
      val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
      val backend = new LocalBackend(scheduler, 1)
      scheduler.initialize(backend)
      (backend, scheduler)

    case LOCAL_N_REGEX(threads) =>
      def localCpuCount = Runtime.getRuntime.availableProcessors()
      // local[*] estimates the number of cores on the machine; local[N] uses exactly N threads.
      val threadCount = if (threads == "*") localCpuCount else threads.toInt
      if (threadCount <= 0) {
        throw new SparkException(s"Asked to run locally with $threadCount threads")
      }
      val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
      val backend = new LocalBackend(scheduler, threadCount)
      scheduler.initialize(backend)
      (backend, scheduler)

    case LOCAL_N_FAILURES_REGEX(threads, maxFailures) =>
      def localCpuCount = Runtime.getRuntime.availableProcessors()
      // local[*, M] means the number of cores on the computer with M failures
      // local[N, M] means exactly N threads with M failures
      val threadCount = if (threads == "*") localCpuCount else threads.toInt
      val scheduler = new TaskSchedulerImpl(sc, maxFailures.toInt, isLocal = true)
      val backend = new LocalBackend(scheduler, threadCount)
      scheduler.initialize(backend)
      (backend, scheduler)

    case SPARK_REGEX(sparkUrl) =>
      val scheduler = new TaskSchedulerImpl(sc)
      val masterUrls = sparkUrl.split(",").map("spark://" + _)
      val backend = new SparkDeploySchedulerBackend(scheduler, sc, masterUrls)
      scheduler.initialize(backend)
      (backend, scheduler)
The code above matches on the master URL to decide how to run: each branch first creates a TaskSchedulerImpl, then a matching backend (LocalBackend for local modes, SparkDeploySchedulerBackend for a standalone cluster), and finally calls initialize. The toy sketch below isolates the URL-dispatch pattern.
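This is a standalone, purely illustrative version of the same dispatch: match the master string against regexes and decide how to run.

val LOCAL_N_REGEX = """local\[([0-9]+|\*)\]""".r
val SPARK_REGEX = """spark://(.*)""".r

def describeMaster(master: String): String = master match {
  case "local"           => "local mode, single thread"
  case LOCAL_N_REGEX(n)  => s"local mode, $n threads"
  case SPARK_REGEX(urls) => s"standalone cluster at $urls"
  case other             => s"unsupported master URL: $other"
}

// describeMaster("local[4]")           // "local mode, 4 threads"
// describeMaster("spark://node1:7077") // "standalone cluster at node1:7077"

Back in the standalone branch, the final step was scheduler.initialize(backend); here is that method: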
def initialize(backend: SchedulerBackend) {
  this.backend = backend
  // temporarily set rootPool name to empty
  rootPool = new Pool("", schedulingMode, 0, 0)
  schedulableBuilder = {
    schedulingMode match {
      case SchedulingMode.FIFO =>
        new FIFOSchedulableBuilder(rootPool)
      case SchedulingMode.FAIR =>
        new FairSchedulableBuilder(rootPool, conf)
    }
  }
  schedulableBuilder.buildPools()
}
The code above selects a resource scheduling policy: FIFO (first in, first out) or FAIR scheduling.
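To make the difference concrete, here is a toy model of the two policies (not Spark's actual SchedulingAlgorithm; Job and its fields are invented): FIFO simply honors submission order via a priority, while fair scheduling favors schedulables running below their minimum share.

// Toy schedulable: priority encodes submission order; minShare is the
// number of tasks the job is entitled to under fair scheduling.
case class Job(id: Int, priority: Int, runningTasks: Int, minShare: Int)

// FIFO: lower priority value (submitted earlier) runs first; ties break by id.
val fifoOrdering: Ordering[Job] = Ordering.by(j => (j.priority, j.id))

// Fair (simplified): jobs starved below their minimum share come first,
// then jobs ordered by how saturated they are relative to that share.
val fairOrdering: Ordering[Job] = Ordering.by { j =>
  val needy = if (j.runningTasks < j.minShare) 0 else 1
  (needy, j.runningTasks.toDouble / math.max(j.minShare, 1))
}

// Example: with job a (priority 1, saturated) and job b (priority 2, starved),
// fifoOrdering puts a first while fairOrdering puts b first.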