Spark on YARN: submitting and running jobs remotely from IDEA (not debugging)

1. Things to watch out for

1.1 A cluster built on CentOS can hit the "is running beyond virtual memory limits" error

Current usage: xx MB of xxGB physical memory used; xx GB of xx GB virtual memory used. 

Fix:

# Add the following property to yarn-site.xml
        <property>
                <name>yarn.nodemanager.vmem-check-enabled</name>
                <value>false</value>
        </property>

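1.2 When IDEA on Linux connects to a cluster running in Docker, the host and the containers may be able to ping each other, yet the firewall can still stop the cluster from reaching the host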

19/01/21 16:44:16 INFO Client: Application report for application_1548058747747_0006 (state: ACCEPTED)

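If the job keeps printing this line, turn off the firewall on the host. In client mode the driver runs on the host, so the cluster has to be able to connect back to it.

1.3 The host cannot find the cluster and keeps trying 0.0.0.0:8032 (this setting is important)

This happens because the resource directory has not been marked as a resources root, so yarn-site.xml is not on the classpath and the client falls back to the default ResourceManager address 0.0.0.0:8032. Fix:
Right-click the resource directory and choose Mark Directory as >> Resources Root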
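
To confirm that the directory is really being treated as a resources root, a small check like the one below can be run from IDEA before submitting anything. It is only a sketch: the object name ClasspathCheck and the helper locate are made up for illustration, and it simply asks the JVM whether the three config files from the resource directory are visible on the classpath.

```scala
object ClasspathCheck {
  // Hypothetical helper: returns the URL of a classpath resource, or None if it is not visible.
  def locate(name: String): Option[String] =
    Option(Thread.currentThread().getContextClassLoader.getResource(name)).map(_.toString)

  def main(args: Array[String]): Unit = {
    // The three files placed under src/main/resource in this project.
    val files = Seq("core-site.xml", "hdfs-site.xml", "yarn-site.xml")
    files.foreach { f =>
      locate(f) match {
        case Some(url) => println(s"$f found at $url")
        case None      => println(s"$f is NOT on the classpath; check Mark Directory as >> Resources Root")
      }
    }
  }
}
```

If yarn-site.xml is reported as missing, the 0.0.0.0:8032 symptom described above is the likely outcome.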

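2. Final file layout (src part)

Create a new project in IDEA and build it with sbt. Any sbt version works; choose Scala 2.11.8. My cluster has no separate Scala installation, so it uses the Scala bundled with spark-2.3.1-bin-hadoop2.7, which is version 2.11.8. The src directory looks like this:

# Right-click resource and choose Mark Directory as >> Resources Root, or set it in Project Structure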
src
├── main
│   ├── resource
│   │   ├── core-site.xml
│   │   ├── hdfs-site.xml
│   │   └── yarn-site.xml
│   └── scala
│       ├── SparkPI.scala
│       └── WordCount.scala
└── test
    └── scala

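2.1 Example: submitting WordCount

This code alone will not run. The cluster side also has to be prepared: 1) make the Spark jars available to the cluster, 2) package the project with sbt.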

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object WordCount {

  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    System.setProperty("user.name", "root")

    val conf = new SparkConf().setAppName("WordCount").setMaster("yarn")
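      .set("spark.submit.deployMode", "client")  // "deploy-mode" is a spark-submit flag, not a conf key
      .set("spark.yarn.jars", "hdfs:/user/root/jars/*")  // Spark jars on the cluster, uploaded by you
      .setJars(List("/home/lee/IdeaProjects/test/target/scala-2.11/test_2.11-0.1.jar")) // the jar produced by sbt package
      .setIfMissing("spark.driver.host", "192.168.1.9") // set this to your own IP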

    val sc = new SparkContext(conf)

    val rdd = sc.textFile("hdfs:/input/README.txt")
    val count = rdd.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_+_)
    count.collect().foreach(println)
  }
}

2.2 Dependencies

# Add the following to build.sbt
// https://mvnrepository.com/artifact/org.apache.spark/spark-yarn
libraryDependencies += "org.apache.spark" %% "spark-yarn" % "2.3.1"
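
For orientation, a complete build.sbt might look like the sketch below. The name and version values are not given in the original; they are inferred from the jar name test_2.11-0.1.jar used in setJars, so treat them as assumptions and adjust them to your own project.

```scala
// build.sbt (sketch). name and version are assumptions inferred from test_2.11-0.1.jar.
name := "test"
version := "0.1"

// Must match the Scala version bundled with spark-2.3.1-bin-hadoop2.7.
scalaVersion := "2.11.8"

// https://mvnrepository.com/artifact/org.apache.spark/spark-yarn
libraryDependencies += "org.apache.spark" %% "spark-yarn" % "2.3.1"
```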

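3. Steps

3.1 Set up the jars

Note the line .set("spark.yarn.jars", "hdfs:/user/root/jars/*") in the WordCount conf. Because the Spark jars have not been added locally, the job uses the jars stored on the cluster directly. They have to be uploaded from inside the cluster.

# In a Docker environment, you can use the following commands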
docker exec spark-master /opt/module/hadoop/bin/hdfs dfs -mkdir /input
docker exec spark-master /opt/module/hadoop/bin/hdfs dfs -mkdir /user
docker exec spark-master /opt/module/hadoop/bin/hdfs dfs -mkdir /user/root
docker exec spark-master /opt/module/hadoop/bin/hdfs dfs -mkdir /user/root/jars
docker exec spark-master /opt/module/hadoop/bin/hdfs dfs -put /opt/module/spark/jars/* /user/root/jars
docker exec spark-master /opt/module/hadoop/bin/hdfs dfs -put /opt/module/hadoop/README.txt /input
# /opt/module/hadoop/ is your own Hadoop directory
# /opt/module/spark/ is your own Spark directory
# On the cluster itself, if the environment is already set up, you can simply run
hdfs dfs -mkdir /input
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/root
hdfs dfs -mkdir /user/root/jars
hdfs dfs -put  your_spark_path/jars/* /user/root/jars
hdfs dfs -put /opt/module/hadoop/README.txt /input

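If you would rather not keep the jars under /user/root, you can pick another directory, as long as you change the path in WordCount to match.

3.2 Use local jars instead (choose either this or 3.1)

If you would rather not upload the Spark jars to the cluster, you can copy them into the project instead.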

ls /opt/module/spark
bin  conf  data  examples  jars  kubernetes  LICENSE  licenses  logs  NOTICE  python  R  README.md  RELEASE  sbin  work  yarn

Yes, this is the jars folder under SPARK_HOME. Copy it into the project; your_project/jars should end up containing the following:

activation-1.1.1.jar                         hadoop-yarn-client-2.7.3.jar               metrics-graphite-3.1.5.jar 
......
zstd-jni-1.3.2-2.jar
hadoop-yarn-api-2.7.3.jar                    metrics-core-3.1.5.jar

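Choose File >> Project Structure >> Modules, open the Dependencies tab below the Name field, click the + at the top right of that pane, choose 1. JARs or directories, and select your_project/jars in the dialog that pops up.

3.3 Package

Open the sbt shell at the bottom of IDEA
First enter clean
Then enter package
If you package the project some other way, change the path passed to setJars in the conf accordingly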
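
The path given to setJars follows sbt's default naming for the package task: target/scala-<binary version>/<name>_<binary version>-<version>.jar. The sketch below rebuilds that path from the project settings so it is easier to see what to change. The object SbtJarPath and its methods are hypothetical helpers, and it assumes the project name is already lowercase with no special characters, which sbt would otherwise normalize.

```scala
object SbtJarPath {
  // "2.11.8" -> "2.11": sbt uses the Scala binary version in directory and jar names.
  def binaryVersion(scalaVersion: String): String =
    scalaVersion.split('.').take(2).mkString(".")

  // Default output location of `sbt package` for a simple single-module project.
  def packagedJar(projectDir: String, name: String, scalaVersion: String, version: String): String = {
    val bin = binaryVersion(scalaVersion)
    s"$projectDir/target/scala-$bin/${name}_$bin-$version.jar"
  }

  def main(args: Array[String]): Unit = {
    // Values taken from the WordCount example; adjust them to your own project.
    val jar = packagedJar("/home/lee/IdeaProjects/test", "test", "2.11.8", "0.1")
    println(jar)
    println(s"exists: ${new java.io.File(jar).exists}")
  }
}
```

If the printed path does not exist after running package, the name, version, or scalaVersion in build.sbt probably differs from what the setJars call expects.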

4. Run

19/01/21 16:44:41 INFO YarnScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool 
19/01/21 16:44:41 INFO DAGScheduler: ResultStage 1 (collect at WordCount.scala:20) finished in 0.827 s
19/01/21 16:44:41 INFO DAGScheduler: Job 0 finished: collect at WordCount.scala:20, took 6.945556 s
(under,1)
(this,3)
(distribution,2)
(Technology,1)
(country,1)
(is,1)
(Jetty,1)
(currently,1)
(permitted.,1)
(check,1)
(have,1)
(Security,1)
(U.S.,1)
(with,1)
(BIS,1)
(This,1)
(mortbay.org.,1)
((ECCN),1)
(using,2)
(security,1)
(Department,1)
(export,1)
(reside,1)
(any,1)
(algorithms.,1)
(from,1)
(re-export,2)
(has,1)
(SSL,1)
(Industry,1)
(Administration,1)
(details,1)
(provides,1)
(http://hadoop.apache.org/core/,1)
(country's,1)
(Unrestricted,1)
(740.13),1)
(policies,1)
(country,,1)
(concerning,1)
(uses,1)
(Apache,1)
(possession,,2)
(information,2)
(our,2)
(as,1)
(,18)
(Bureau,1)
(wiki,,1)
(please,2)
(form,1)
(information.,1)
(ENC,1)
(Export,2)
(included,1)
(asymmetric,1)
(Commodity,1)
(Software,2)
(For,1)
(it,1)
(The,4)
(about,1)
(visit,1)
(website,1)
(<http://www.wassenaar.org/>,1)
(performing,1)
(Section,1)
(on,2)
((see,1)
(http://wiki.apache.org/hadoop/,1)
(classified,1)
(following,1)
(in,1)
(object,1)
(cryptographic,3)
(which,2)
(See,1)
(encryption,3)
(Number,1)
(and/or,1)
(software,2)
(for,3)
((BIS),,1)
(makes,1)
(at:,2)
(manner,1)
(Core,1)
(latest,1)
(your,1)
(may,1)
(the,8)
(Exception,1)
(includes,2)
(restrictions,1)
(import,,2)
(project,1)
(you,1)
(use,,2)
(another,1)
(if,1)
(or,2)
(Commerce,,1)
(source,1)
(software.,2)
(laws,,1)
(BEFORE,1)
(Hadoop,,1)
(License,1)
(written,1)
(code,1)
(Regulations,,1)
(software,,2)
(more,2)
(software:,1)
(see,1)
(regulations,1)
(of,5)
(libraries,1)
(by,1)
(exception,1)
(Control,1)
(code.,1)
(eligible,1)
(both,1)
(to,2)
(Foundation,1)
(Government,1)
(functions,1)
(and,6)
(5D002.C.1,,1)
((TSU),1)
(Hadoop,1)
19/01/21 16:44:42 INFO SparkContext: Invoking stop() from shutdown hook
19/01/21 16:44:42 INFO SparkUI: Stopped Spark web UI at http://192.168.1.9:4040
19/01/21 16:44:42 INFO YarnClientSchedulerBackend: Interrupting monitor thread
19/01/21 16:44:42 INFO YarnClientSchedulerBackend: Shutting down all executors
19/01/21 16:44:42 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
19/01/21 16:44:42 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices
(serviceOption=None,
 services=List(),
 started=false)
19/01/21 16:44:42 INFO YarnClientSchedulerBackend: Stopped
19/01/21 16:44:42 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/01/21 16:44:42 INFO MemoryStore: MemoryStore cleared
19/01/21 16:44:42 INFO BlockManager: BlockManager stopped
19/01/21 16:44:42 INFO BlockManagerMaster: BlockManagerMaster stopped
19/01/21 16:44:42 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/01/21 16:44:42 INFO SparkContext: Successfully stopped SparkContext
19/01/21 16:44:42 INFO ShutdownHookManager: Shutdown hook called
19/01/21 16:44:42 INFO ShutdownHookManager: Deleting directory /tmp/spark-88c6c289-4d49-4035-96d7-19ba6410ef8a