Spark version: 2.0.0
1. Introduction
For quickly testing a project's functionality, spark-shell is usually enough; but for production-grade projects we first package the application and then use the spark-submit script to deploy the Spark job.
./bin/spark-submit \
--class com.example.spark.Test \
--master yarn \
--deploy-mode client \
/home/hadoop/data/test.jar
Although the spark-submit command is used all the time, few people understand what actually happens underneath. Let's take a look at what goes on inside.
2. The spark-submit script
Reading the spark-submit script, it is easy to see that it ultimately executes the main method of org.apache.spark.deploy.SparkSubmit, so that is where we start.
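For reference, in Spark 2.x the tail of bin/spark-submit simply forwards every argument to spark-class with SparkSubmit as the main class. A minimal runnable sketch of that forwarding behavior (the forward function is hypothetical, standing in for the real exec line):

```shell
# Hypothetical stand-in for the last line of bin/spark-submit, which in
# Spark 2.x reads roughly:
#   exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"
SPARK_SUBMIT_CLASS="org.apache.spark.deploy.SparkSubmit"

forward() {
  # Echo the would-be command instead of exec-ing it, so the sketch
  # is runnable anywhere without a Spark installation
  echo "spark-class $SPARK_SUBMIT_CLASS $*"
}

forward --master yarn --class com.example.spark.Test /home/hadoop/data/test.jar
```

So every option you pass on the command line ends up as an argument to SparkSubmit.main.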
def main(args: Array[String]): Unit = {
  // Parse the user-supplied arguments
  val appArgs = new SparkSubmitArguments(args)
  // Print details when verbose is enabled
  if (appArgs.verbose) {
    // scalastyle:off println
    printStream.println(appArgs)
    // scalastyle:on println
  }
  // Three kinds of actions are supported; SparkSubmitAction.SUBMIT is by far
  // the most common, the other two are rarely used
  appArgs.action match {
    // Submit the application (the core path)
    case SparkSubmitAction.SUBMIT => submit(appArgs)
    // Only supported for Standalone and Mesos cluster mode
    case SparkSubmitAction.KILL => kill(appArgs)
    // Only supported for Standalone and Mesos cluster mode
    case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs)
  }
}
The submit method is the entry point for submitting the application. It validates the arguments and supplements them where needed, resolves user arguments, system configuration, and cross-language compatibility, and handles execution permissions.
private def submit(args: SparkSubmitArguments): Unit = {
  // Resolve user arguments, system configuration, and cross-language compatibility
  val (childArgs, childClasspath, sysProps, childMainClass) = prepareSubmitEnvironment(args)
  // Handle execution permissions; every path ends up calling
  // runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
  def doRunMain(): Unit = {
    if (args.proxyUser != null) {
      val proxyUser = UserGroupInformation.createProxyUser(args.proxyUser,
        UserGroupInformation.getCurrentUser())
      try {
        proxyUser.doAs(new PrivilegedExceptionAction[Unit]() {
          override def run(): Unit = {
            runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
          }
        })
      } catch {
        case e: Exception =>
          if (e.getStackTrace().length == 0) {
            printStream.println(s"ERROR: ${e.getClass().getName()}: ${e.getMessage()}")
            exitFn(1)
          } else {
            throw e
          }
      }
    } else {
      runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
    }
  }
  // Log which submission gateway is used
  if (args.isStandaloneCluster && args.useRest) {
    try {
      printStream.println("Running Spark using the REST application submission protocol.")
      doRunMain()
    } catch {
      // Fall back to the legacy gateway if the master is not a REST server
      case e: SubmitRestConnectionException =>
        printWarning(s"Master endpoint ${args.master} was not a REST server. " +
          "Falling back to legacy submission gateway instead.")
        args.useRest = false
        submit(args)
    }
  } else {
    doRunMain()
  }
}
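The proxy-user branch above wraps runMain in a java.security.PrivilegedExceptionAction so that Hadoop's UserGroupInformation.doAs can execute it under another identity. A minimal, Hadoop-free sketch of that wrapping (runAs is hypothetical; the real code hands the action to proxyUser.doAs):

```scala
import java.security.PrivilegedExceptionAction

object ProxyRunSketch {
  // Wrap a block of work the same way submit() wraps runMain.
  // In SparkSubmit the wrapped action is passed to proxyUser.doAs;
  // here we invoke run() directly to show only the shape of the wrapper.
  def runAs[T](body: => T): T = {
    val action = new PrivilegedExceptionAction[T] {
      override def run(): T = body
    }
    action.run()
  }
}
```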
There is actually not much essential code above beyond the final call to runMain, but prepareSubmitEnvironment deserves a brief introduction. If you have read the source, you will have noticed that this method is very long; most of it resolves argument conflicts and multi-language issues, and only a small part needs attention, so we will skim through it.
// 1. The four values it produces
private[deploy] def prepareSubmitEnvironment(args: SparkSubmitArguments)
    : (Seq[String], Seq[String], Map[String, String], String) = {
  // Arguments for the class that will be run (childMainClass)
  val childArgs = new ArrayBuffer[String]()
  // Extra classpath entries to attach
  val childClasspath = new ArrayBuffer[String]()
  // System properties derived from the parsed/default arguments
  // (computed here, not read back from the existing system properties)
  val sysProps = new HashMap[String, String]()
  // The main class to run. It is not necessarily the one given via --class:
  // in cluster deploy mode the specified main class is wrapped and a
  // different class is launched instead
  var childMainClass = ""
  // 2. Choosing the main class
  if (deployMode == CLIENT) {
    // In client mode the main class is exactly the user-specified class
    childMainClass = args.mainClass
....
  // In yarn-cluster mode, org.apache.spark.deploy.yarn.Client wraps the
  // user-specified main class
  if (isYarnCluster) {
    childMainClass = "org.apache.spark.deploy.yarn.Client"
    if (args.isPython) {
      childArgs += ("--primary-py-file", args.primaryResource)
      childArgs += ("--class", "org.apache.spark.deploy.PythonRunner")
    } else if (args.isR) {
      val mainFile = new Path(args.primaryResource).getName
      childArgs += ("--primary-r-file", mainFile)
      childArgs += ("--class", "org.apache.spark.deploy.RRunner")
    } else {
      if (args.primaryResource != SparkLauncher.NO_RESOURCE) {
        childArgs += ("--jar", args.primaryResource)
      }
      childArgs += ("--class", args.mainClass)
    }
    if (args.childArgs != null) {
      args.childArgs.foreach { arg => childArgs += ("--arg", arg) }
    }
  }
......
  // The other cluster modes also change the main class; see the source for details
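The effect of this selection can be summed up in a small sketch (simplified, not the real SparkSubmit logic; only the client and yarn-cluster cases shown above are modeled):

```scala
object MainClassSelection {
  // In client mode the user's --class runs directly; in yarn-cluster mode
  // org.apache.spark.deploy.yarn.Client is launched instead and the user's
  // class is passed along as a --class argument for the Client to start.
  def childMainClass(deployMode: String, userMainClass: String): String =
    deployMode match {
      case "client"       => userMainClass
      case "yarn-cluster" => "org.apache.spark.deploy.yarn.Client"
      case other          => sys.error(s"mode not modeled in this sketch: $other")
    }
}
```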
Having covered how the arguments are parsed and supplemented, we now return to the runMain method.
private def runMain(
    childArgs: Seq[String],
    childClasspath: Seq[String],
    sysProps: Map[String, String],
    childMainClass: String,
    verbose: Boolean): Unit = {
.....
  // Choose the class loader
  val loader =
    if (sysProps.getOrElse("spark.driver.userClassPathFirst", "false").toBoolean) {
      new ChildFirstURLClassLoader(new Array[URL](0),
        Thread.currentThread.getContextClassLoader)
    } else {
      new MutableURLClassLoader(new Array[URL](0),
        Thread.currentThread.getContextClassLoader)
    }
  Thread.currentThread.setContextClassLoader(loader)
  for (jar <- childClasspath) {
    addJarToClasspath(jar, loader)
  }
  // Update the system properties
  for ((key, value) <- sysProps) {
    System.setProperty(key, value)
  }
  var mainClass: Class[_] = null
  try {
    // Look up the main class
    mainClass = Utils.classForName(childMainClass)
  } catch {
    case e: ClassNotFoundException =>
      e.printStackTrace(printStream)
      if (childMainClass.contains("thriftserver")) {
        printStream.println(s"Failed to load main class $childMainClass.")
        printStream.println("You need to build Spark with -Phive and -Phive-thriftserver.")
      }
      System.exit(CLASS_NOT_FOUND_EXIT_STATUS)
    case e: NoClassDefFoundError =>
      e.printStackTrace(printStream)
      if (e.getMessage.contains("org/apache/hadoop/hive")) {
        printStream.println(s"Failed to load hive class.")
        printStream.println("You need to build Spark with -Phive and -Phive-thriftserver.")
      }
      System.exit(CLASS_NOT_FOUND_EXIT_STATUS)
  }
  // SPARK-4170
  if (classOf[scala.App].isAssignableFrom(mainClass)) {
    printWarning("Subclasses of scala.App may not work correctly. Use a main() method instead.")
  }
  // Find the main method of the main class
  val mainMethod = mainClass.getMethod("main", new Array[String](0).getClass)
  if (!Modifier.isStatic(mainMethod.getModifiers)) {
    throw new IllegalStateException("The main method in the given main class must be static")
  }
  // Error handling for running the main class
  @tailrec
  def findCause(t: Throwable): Throwable = t match {
    case e: UndeclaredThrowableException =>
      if (e.getCause() != null) findCause(e.getCause()) else e
    case e: InvocationTargetException =>
      if (e.getCause() != null) findCause(e.getCause()) else e
    case e: Throwable =>
      e
  }
  try {
    // Invoke the main method of the main class with the child arguments
    mainMethod.invoke(null, childArgs.toArray)
  } catch {
    case t: Throwable =>
      findCause(t) match {
        case SparkUserAppException(exitCode) =>
          System.exit(exitCode)
        case t: Throwable =>
          throw t
      }
  }
}
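Because the user's main is invoked through reflection, its exceptions arrive wrapped in InvocationTargetException (or UndeclaredThrowableException when running as a proxy user); findCause unwraps them before the exit code is decided. A self-contained sketch of that unwrapping:

```scala
import java.lang.reflect.{InvocationTargetException, UndeclaredThrowableException}
import scala.annotation.tailrec

object CauseUnwrap {
  // Walk the cause chain through the reflection wrappers until the
  // original Throwable thrown by the user's main method is reached.
  @tailrec
  def findCause(t: Throwable): Throwable = t match {
    case e: UndeclaredThrowableException if e.getCause != null => findCause(e.getCause)
    case e: InvocationTargetException if e.getCause != null => findCause(e.getCause)
    case e => e
  }
}
```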
To summarize, runMain performs the following steps:
- choose the class loader
- add the extra classpath entries
- update the system properties
- run the (possibly rewritten) main class
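The last step, running the rewritten main class, boils down to plain Java reflection. A self-contained sketch (Launched is a hypothetical target class standing in for the user's application; Spark uses Utils.classForName instead of Class.forName):

```scala
import java.lang.reflect.Modifier

// Hypothetical target class standing in for the user's application;
// it just records the arguments it was launched with.
object Launched {
  var seen: Array[String] = Array.empty
  def main(args: Array[String]): Unit = { seen = args }
}

object ReflectiveLaunch {
  // Look up the class by name, find its static main(Array[String]) method,
  // check that it really is static, and invoke it with the child arguments.
  def launch(className: String, args: Array[String]): Unit = {
    val mainClass = Class.forName(className)
    val mainMethod = mainClass.getMethod("main", classOf[Array[String]])
    if (!Modifier.isStatic(mainMethod.getModifiers)) {
      throw new IllegalStateException("The main method in the given main class must be static")
    }
    mainMethod.invoke(null, args)
  }
}
```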