1 Start Spark
(1) Start Hadoop
Processes on the master node after a successful start:
Processes on the slave nodes:
(2) Start Spark (mind the path)
After a successful start:
Processes on slave1 and slave2:
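For reference, the start sequence on the master typically looks like the sketch below. This assumes HADOOP_HOME and SPARK_HOME are set; Spark's start-all.sh is invoked by its full path because Hadoop also ships a script of the same name (hence "mind the path" above):

$HADOOP_HOME/sbin/start-dfs.sh    # starts NameNode/SecondaryNameNode and the DataNodes
$HADOOP_HOME/sbin/start-yarn.sh   # starts ResourceManager and the NodeManagers
$SPARK_HOME/sbin/start-all.sh     # starts the Spark Master and the Workers
jps                               # check that the expected daemons are up on each node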
2 Add SparkPi.scala to the project
3 IDEA automatically picks up the Scala files under src/org/apache/spark/examples
Note that the paths below must match:
4 Set the execution environment
Why do we need to set an execution environment?
First, we run the SparkPi program directly (right-click it and choose Run).
You can see it errors out:
The cause is that the program cannot find a master to run against, so we need to configure Spark's execution environment.
Depending on the cluster mode, Spark's master URL falls into one of the following categories (a sketch of setting the master in code follows the list):
local: run locally with a single thread
local[K]: run locally with K threads (using K cores)
local[*]: run locally with as many threads as there are available cores
spark://HOST:PORT: connect to the given Spark standalone cluster master; the port must be specified.
mesos://HOST:PORT: connect to the given Mesos cluster; the port must be specified.
yarn-client: client mode, connect to a YARN cluster; HADOOP_CONF_DIR must be configured.
yarn-cluster: cluster mode, connect to a YARN cluster; HADOOP_CONF_DIR must be configured.
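Besides the -Dspark.master VM option used below, the master URL can also be set programmatically through SparkConf. A minimal sketch, assuming a throwaway app name (MasterUrlExample) and local[*] as the URL; swap in any of the values above:

import org.apache.spark.{SparkConf, SparkContext}

object MasterUrlExample {
  def main(args: Array[String]): Unit = {
    // setMaster supplies the same value -Dspark.master would;
    // use local[*] while debugging, spark://HOST:PORT against a cluster
    val conf = new SparkConf()
      .setAppName("MasterUrlExample")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 100).reduce(_ + _)) // sanity check: prints 5050
    sc.stop()
  }
}

A value set in code takes precedence over the system property, so leave setMaster out if you want to control the master from the run configuration instead.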
Now let's configure the Spark execution environment.
In the SparkPi drop-down menu, select "Edit Configurations".
4.1 Running in a distributed environment (master)
Add the following VM option to the run configuration so the program connects to the standalone master:
-Dspark.master=spark://192.168.189.130:7077
Running the program again now fails with the following error:
Cause: the application jar was never shipped to the Spark workers, so the workers cannot find the classes being invoked.
Fix: package the program into a jar and submit that jar to the Spark cluster (JavaSparkContext.addJar in Java, or SparkConf.setJars in Scala); the Spark master then distributes the jar to every worker.
Package the program:
Because Scala and Spark are installed on every machine, the Scala and Spark libraries can be excluded from the artifact.
After the build completes, the project jar is generated.
Load the jar in the program:
Change:
val conf = new SparkConf().setAppName("Spark Pi")
to:
val conf = new SparkConf().setAppName("SparkPi").setJars(List("/root/IdeaProjects/SparkExampleWorkspace/out/artifacts/SparkExampleWorkspace_jar/SparkExampleWorkspace.jar"))
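Putting the pieces together, the driver setup now looks roughly like the sketch below, based on the stock SparkPi example; the master IP and jar path are the ones used in this walkthrough and must match your own environment:

import scala.math.random
import org.apache.spark.{SparkConf, SparkContext}

object SparkPi {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SparkPi")
      // redundant if -Dspark.master is already set in the run configuration
      .setMaster("spark://192.168.189.130:7077")
      // ship the project jar so the workers can load the task closures
      .setJars(List("/root/IdeaProjects/SparkExampleWorkspace/out/artifacts/SparkExampleWorkspace_jar/SparkExampleWorkspace.jar"))
    val spark = new SparkContext(conf)
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = 100000 * slices
    // Monte Carlo estimate of pi: sample points in the unit square,
    // count those that fall inside the unit circle
    val count = spark.parallelize(1 to n, slices).map { _ =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x * x + y * y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
    spark.stop()
  }
}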
Run the program again and it produces the result:
5 Common errors
5.1 Exception in thread "main" java.lang.NoSuchMethodError: scala.collection.immutable.HashSet$.empty()Lscala/collection/immutable/HashSet;
Cause: the Scala version in use is too new for the Spark build; downgrade the project's Scala version to 2.10.x. (Spark releases of this era are compiled against Scala 2.10, and Scala minor versions are not binary compatible, so a 2.11 project trips over changed collection internals such as HashSet.)
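If the project is built with sbt, pinning the version looks like the sketch below; 2.10.4 stands in for whichever 2.10.x release matches your Spark build, and the spark-core version 1.0.0 is an assumption to adapt to your cluster:

// build.sbt
scalaVersion := "2.10.4"

// %% appends the _2.10 artifact suffix, which must agree with scalaVersion
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"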
5.2 Running a Spark Java API program via Run As > Java Application
The code first:
1. /*
2.  * Licensed to the Apache Software Foundation (ASF) under one or more
3.  * contributor license agreements. See the NOTICE file distributed with
4.  * this work for additional information regarding copyright ownership.
5.  * The ASF licenses this file to You under the Apache License, Version 2.0
6.  * (the "License"); you may not use this file except in compliance with
7.  * the License. You may obtain a copy of the License at
8.  *
9.  *     http://www.apache.org/licenses/LICENSE-2.0
10.  *
11.  * Unless required by applicable law or agreed to in writing, software
12.  * distributed under the License is distributed on an "AS IS" BASIS,
13.  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14.  * See the License for the specific language governing permissions and
15.  * limitations under the License.
16.  */
17.
18.
19. import java.util.Arrays;
20. import java.util.regex.Pattern;
21.
22. import org.apache.spark.api.java.JavaPairRDD;
23. import org.apache.spark.api.java.JavaRDD;
24. import org.apache.spark.api.java.JavaSparkContext;
25. import org.apache.spark.api.java.function.FlatMapFunction;
26. import org.apache.spark.api.java.function.Function2;
27. import org.apache.spark.api.java.function.PairFunction;
28.
29. import scala.Tuple2;
30.
31. public final class JavaWordCount {
32.   private static final Pattern SPACE = Pattern.compile(" ");
33.
34.   public static void main(String[] args) throws Exception {
35.
36.     if (args.length < 3) {
37.       System.err.println("Usage: JavaWordCount <master> <file> <output>");
38.       System.exit(1);
39.     }
40.
41.     JavaSparkContext ctx = new JavaSparkContext(args[0], "JavaWordCount",
42.         System.getenv("SPARK_HOME"), JavaSparkContext.jarOfClass(JavaWordCount.class));
43.     ctx.addJar("/home/hadoop/Desktop/JavaSparkT.jar"); // ship the application jar to the workers
44.     JavaRDD<String> lines = ctx.textFile(args[1], 1);
45.
46.     JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
47.       @Override
48.       public Iterable<String> call(String s) {
49.         return Arrays.asList(SPACE.split(s));
50.       }
51.     });
52.     // note: on Spark 1.x and later this overload is words.mapToPair(...)
53.     JavaPairRDD<String, Integer> ones = words.map(new PairFunction<String, String, Integer>() {
54.       @Override
55.       public Tuple2<String, Integer> call(String s) {
56.         return new Tuple2<String, Integer>(s, 1);
57.       }
58.     });
59.
60.     JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
61.       @Override
62.       public Integer call(Integer i1, Integer i2) {
63.         return i1 + i2;
64.       }
65.     });
66.     counts.saveAsTextFile(args[2]);
67.     // alternatively, collect the counts to the driver and print them:
68.     /*List<Tuple2<String, Integer>> output = counts.collect();
69.     for (Tuple2<?,?> tuple : output) {
70.       System.out.println(tuple._1() + ": " + tuple._2());
71.     }*/
72.     System.exit(0);
73.   }
74. }
This is one of Spark's bundled examples. Previously the code could only be packaged into a jar and run through spark-class under Spark's bin directory, which makes it hard to integrate Spark programs into an existing system, so I wanted to run it through an ordinary Java method call instead. After some experimentation, and with my advisor's guidance, the error messages made clear that the jar was never submitted to the Spark workers, so the workers could not find the invoked classes and reported the following error:
14/07/07 10:26:10 INFO TaskSetManager: Serialized task 1.0:0 as 2194 bytes in 104 ms
14/07/07 10:26:11 WARN TaskSetManager: Lost TID 0 (task 1.0:0)
14/07/07 10:26:11 WARN TaskSetManager: Loss was due to java.lang.ClassNotFoundException
java.lang.ClassNotFoundException: JavaWordCount$1
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:270)
    at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:37)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
Solution: package the program into a jar, then call JavaSparkContext's addJar method to submit the jar to the Spark cluster; the Spark master will distribute it to every worker.
The code is the ctx.addJar call already shown at line 43 of the listing above.
With it in place, the java.lang.ClassNotFoundException: JavaWordCount$1 error no longer occurs at run time.
Run it with the following arguments (args[0] is the master URL, args[1] the input file, args[2] the output path):
spark://localhost:7077 hdfs://localhost:9000/input/test.txt hdfs://localhost:9000/input/result.txt
The Eclipse console then shows the following log:
14/07/08 16:03:06 INFO Utils: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/07/08 16:03:06 WARN Utils: Your hostname, localhost resolves to a loopback address: 127.0.0.1; using 192.168.200.233 instead (on interface eth0)
14/07/08 16:03:06 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
14/07/08 16:03:07 INFO Slf4jLogger: Slf4jLogger started
14/07/08 16:03:07 INFO Remoting: Starting remoting
14/07/08 16:03:07 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@192.168.200.233:52469]
14/07/08 16:03:07 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@192.168.200.233:52469]
14/07/08 16:03:07 INFO SparkEnv: Registering BlockManagerMaster
14/07/08 16:03:07 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140708160307-0a89
14/07/08 16:03:07 INFO MemoryStore: MemoryStore started with capacity 484.2 MB.
14/07/08 16:03:08 INFO ConnectionManager: Bound socket to port 47731 with id = ConnectionManagerId(192.168.200.233,47731)
14/07/08 16:03:08 INFO BlockManagerMaster: Trying to register BlockManager
14/07/08 16:03:08 INFO BlockManagerMasterActor$BlockManagerInfo: Registering block manager 192.168.200.233:47731 with 484.2 MB RAM
14/07/08 16:03:08 INFO BlockManagerMaster: Registered BlockManager
14/07/08 16:03:08 INFO HttpServer: Starting HTTP Server
14/07/08 16:03:08 INFO HttpBroadcast: Broadcast server started at http://192.168.200.233:58077
14/07/08 16:03:08 INFO SparkEnv: Registering MapOutputTracker
14/07/08 16:03:08 INFO HttpFileServer: HTTP File server directory is /tmp/spark-86439c44-9a36-4bda-b8c7-063c5c2e15b2
14/07/08 16:03:08 INFO HttpServer: Starting HTTP Server
14/07/08 16:03:08 INFO SparkUI: Started Spark Web UI at http://192.168.200.233:4040
14/07/08 16:03:08 INFO AppClient$ClientActor: Connecting to master spark://localhost:7077...
14/07/08 16:03:09 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20140708160309-0000
14/07/08 16:03:09 INFO AppClient$ClientActor: Executor added: app-20140708160309-0000/0 on worker-20140708160246-localhost-34775 (localhost:34775) with 4 cores
14/07/08 16:03:09 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140708160309-0000/0 on hostPort localhost:34775 with 4 cores, 512.0 MB RAM
14/07/08 16:03:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/07/08 16:03:09 INFO AppClient$ClientActor: Executor updated: app-20140708160309-0000/0 is now RUNNING
14/07/08 16:03:10 INFO SparkContext: Added JAR /home/hadoop/Desktop/JavaSparkT.jar at http://192.168.200.233:52827/jars/JavaSparkT.jar with timestamp 1404806590353
14/07/08 16:03:10 INFO MemoryStore: ensureFreeSpace(138763) called with curMem=0, maxMem=507720499
14/07/08 16:03:10 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 135.5 KB, free 484.1 MB)
14/07/08 16:03:12 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@localhost:42090/user/Executor#-1434031133] with ID 0
14/07/08 16:03:13 INFO BlockManagerMasterActor$BlockManagerInfo: Registering block manager localhost:56831 with 294.9 MB RAM
14/07/08 16:03:13 INFO FileInputFormat: Total input paths to process : 1
14/07/08 16:03:13 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
14/07/08 16:03:13 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
14/07/08 16:03:13 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
14/07/08 16:03:13 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
14/07/08 16:03:13 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
14/07/08 16:03:13 INFO SparkContext: Starting job: saveAsTextFile at JavaWordCount.java:66
14/07/08 16:03:13 INFO DAGScheduler: Registering RDD 4 (reduceByKey at JavaWordCount.java:60)
14/07/08 16:03:13 INFO DAGScheduler: Got job 0 (saveAsTextFile at JavaWordCount.java:66) with 1 output partitions (allowLocal=false)
14/07/08 16:03:13 INFO DAGScheduler: Final stage: Stage 0 (saveAsTextFile at JavaWordCount.java:66)
14/07/08 16:03:13 INFO DAGScheduler: Parents of final stage: List(Stage 1)
14/07/08 16:03:13 INFO DAGScheduler: Missing parents: List(Stage 1)
14/07/08 16:03:13 INFO DAGScheduler: Submitting Stage 1 (MapPartitionsRDD[4] at reduceByKey at JavaWordCount.java:60), which has no missing parents
14/07/08 16:03:13 INFO DAGScheduler: Submitting 1 missing tasks from Stage 1 (MapPartitionsRDD[4] at reduceByKey at JavaWordCount.java:60)
14/07/08 16:03:13 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
14/07/08 16:03:13 INFO TaskSetManager: Starting task 1.0:0 as TID 0 on executor 0: localhost (PROCESS_LOCAL)
14/07/08 16:03:13 INFO TaskSetManager: Serialized task 1.0:0 as 2252 bytes in 39 ms
14/07/08 16:03:17 INFO TaskSetManager: Finished TID 0 in 3310 ms on localhost (progress: 1/1)
14/07/08 16:03:17 INFO DAGScheduler: Completed ShuffleMapTask(1, 0)
14/07/08 16:03:17 INFO DAGScheduler: Stage 1 (reduceByKey at JavaWordCount.java:60) finished in 3.319 s
14/07/08 16:03:17 INFO DAGScheduler: looking for newly runnable stages
14/07/08 16:03:17 INFO DAGScheduler: running: Set()
14/07/08 16:03:17 INFO DAGScheduler: waiting: Set(Stage 0)
14/07/08 16:03:17 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
14/07/08 16:03:17 INFO DAGScheduler: failed: Set()
14/07/08 16:03:17 INFO DAGScheduler: Missing parents for Stage 0: List()
14/07/08 16:03:17 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[7] at saveAsTextFile at JavaWordCount.java:66), which is now runnable
14/07/08 16:03:17 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (MappedRDD[7] at saveAsTextFile at JavaWordCount.java:66)
14/07/08 16:03:17 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
14/07/08 16:03:17 INFO TaskSetManager: Starting task 0.0:0 as TID 1 on executor 0: localhost (PROCESS_LOCAL)
14/07/08 16:03:17 INFO TaskSetManager: Serialized task 0.0:0 as 11717 bytes in 0 ms
14/07/08 16:03:17 INFO MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to spark@localhost:37990
14/07/08 16:03:17 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 127 bytes
14/07/08 16:03:18 INFO DAGScheduler: Completed ResultTask(0, 0)
14/07/08 16:03:18 INFO TaskSetManager: Finished TID 1 in 1074 ms on localhost (progress: 1/1)
14/07/08 16:03:18 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
14/07/08 16:03:18 INFO DAGScheduler: Stage 0 (saveAsTextFile at JavaWordCount.java:66) finished in 1.076 s
14/07/08 16:03:18 INFO SparkContext: Job finished: saveAsTextFile at JavaWordCount.java:66, took 4.719158065 s
The program output is as follows:
[hadoop@localhost sbin]$ hadoop fs -ls hdfs://localhost:9000/input/result.txt
14/07/08 16:04:22 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   3 hadoop supergroup          0 2014-07-08 16:03 hdfs://localhost:9000/input/result.txt/_SUCCESS
-rw-r--r--   3 hadoop supergroup         56 2014-07-08 16:03 hdfs://localhost:9000/input/result.txt/part-00000
[hadoop@localhost sbin]$ hadoop fs -cat hdfs://localhost:9000/input/result.txt/part-00000
14/07/08 16:04:44 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
(caozw,1)
(hello,3)
(hadoop,1)
(2.2.0,1)
(world,1)
[hadoop@localhost sbin]$