Spark from Zero to Development (4): Word Count in Three Environments

Implementation 1: spark-shell

Mainly used for testing: before deploying to the cluster, you can run the flow against a small set of test data to make sure it works end to end.

1.1 Upload the file to HDFS

First, upload the text file to the /spark directory on HDFS.
File contents:

[root@s166 fantj]# cat spark.txt 
What is “version control”, and why should you care? Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later. For the examples in this book, you will use software source code as the files being version controlled, though in reality you can do this with nearly any type of file on a computer.

If you are a graphic or web designer and want to keep every version of an image or layout (which you would most certainly want to), a Version Control System (VCS) is a very wise thing to use. It allows you to revert selected files back to a previous state, revert the entire project back to a previous state, compare changes over time, see who last modified something that might be causing a problem, who introduced an issue and when, and more. Using a VCS also generally means that if you screw things up or lose files, you can easily recover. In addition, you get all this for very little overhead.

Local Version Control Systems
Many people’s version-control method of choice is to copy files into another directory (perhaps a time-stamped directory, if they’re clever). This approach is very common because it is so simple, but it is also incredibly error prone. It is easy to forget which directory you’re in and accidentally write to the wrong file or copy over files you don’t mean to.

To deal with this issue, programmers long ago developed local VCSs that had a simple database that kept all the changes to files under revision control.
[root@s166 fantj]# vim spark.txt
[root@s166 fantj]# hadoop fs -mkdir -p /spark
[root@s166 fantj]# hadoop fs -put spark.txt /spark
[root@s166 fantj]# hadoop fs -ls -R /spark
-rw-r--r--   3 root supergroup       1527 2018-07-30 23:12 /spark/spark.txt

1.2 Start the shell

[root@s166 fantj]# spark-shell 
18/07/31 04:53:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/07/31 04:53:55 INFO spark.SecurityManager: Changing view acls to: root
18/07/31 04:53:55 INFO spark.SecurityManager: Changing modify acls to: root
18/07/31 04:53:55 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
18/07/31 04:53:58 INFO spark.HttpServer: Starting HTTP Server
18/07/31 04:53:59 INFO server.Server: jetty-8.y.z-SNAPSHOT
18/07/31 04:53:59 INFO server.AbstractConnector: Started [email protected]:36422
18/07/31 04:53:59 INFO util.Utils: Successfully started service 'HTTP class server' on port 36422.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.1
      /_/

...
...
18/07/31 04:57:28 INFO session.SessionState: Created HDFS directory: /tmp/hive/root/6814e7a5-b896-49ac-bcd8-0b94e1a4b165/_tmp_space.db
18/07/31 04:57:30 INFO repl.SparkILoop: Created sql context (with Hive support)..
SQL context available as sqlContext.

scala> 

1.3 Run the Scala program

sc.textFile("/spark/spark.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).saveAsTextFile("/spark/out")

sc is the SparkContext object, the entry point for submitting Spark programs.
textFile("/spark/spark.txt") reads the data from HDFS.
flatMap(_.split(" ")) maps each line to words and flattens the result.
map((_,1)) turns each word into a (word, 1) tuple.
reduceByKey(_+_) reduces by key, adding up the values.
saveAsTextFile("/spark/out") writes the result back to HDFS.
scala> sc.textFile("/spark/spark.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).saveAsTextFile("/spark/out")
18/07/31 04:59:40 INFO storage.MemoryStore: ensureFreeSpace(57160) called with curMem=0, maxMem=560497950
18/07/31 04:59:40 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 55.8 KB, free 534.5 MB)
18/07/31 04:59:44 INFO storage.MemoryStore: ensureFreeSpace(17347) called with curMem=57160, maxMem=560497950
18/07/31 04:59:44 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 16.9 KB, free 534.5 MB)
18/07/31 04:59:44 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:37951 (size: 16.9 KB, free: 534.5 MB)
18/07/31 04:59:44 INFO spark.SparkContext: Created broadcast 0 from textFile at <console>:22
18/07/31 04:59:49 INFO mapred.FileInputFormat: Total input paths to process : 1
18/07/31 04:59:53 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
18/07/31 04:59:53 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
18/07/31 04:59:53 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
18/07/31 04:59:53 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
18/07/31 04:59:53 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
18/07/31 05:00:51 INFO executor.Executor: Finished task 0.0 in stage 1.0 (TID 1). 1165 bytes result sent to driver
18/07/31 05:00:51 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 4730 ms on localhost (1/1)
18/07/31 05:00:51 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
18/07/31 05:00:51 INFO scheduler.DAGScheduler: ResultStage 1 (saveAsTextFile at <console>:22) finished in 4.733 s
18/07/31 05:00:52 INFO scheduler.DAGScheduler: Job 0 finished: saveAsTextFile at <console>:22, took 15.399221 s

1.4 Check the result

[root@s166 ~]# hadoop fs -cat /spark/out/p*
(simple,,1)
(nearly,1)
(For,1)
(back,2)
(this,4)
(under,1)
(it,2)
(means,1)
(introduced,1)
(revision,1)
(when,,1)
...
...
(To,1)
((which,1)
...
(prone.,1)
(an,2)
(time,,1)
(things,1)
(they’re,1)
...
(might,1)
(would,1)
(issue,,1)
(state,,2)
(Systems,1)
(System,1)
(write,1)
(being,1)
(programmers,1)

Implementation 2: Local execution with Java

Mainly used for temporary, ad-hoc processing.

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.fantj</groupId>
    <artifactId>bigdata</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>jar</packaging>

    <name>bigdata</name>
    <description>Demo project for Spring Boot</description>

    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.0.4.RELEASE</version>
        <relativePath/> <!-- lookup parent from repository -->
    </parent>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
        <java.version>1.8</java.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter</artifactId>
        </dependency>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>

        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
        </dependency>
        <!--<dependency>-->
            <!--<groupId>org.apache.hive</groupId>-->
            <!--<artifactId>hive-jdbc</artifactId>-->
            <!--<version>2.1.0</version>-->
        <!--</dependency>-->
        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
        <dependency>
            <groupId>ch.cern.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.5.1</version>
            <classifier>sources</classifier>
            <type>java-source</type>
        </dependency>
        <dependency>
            <groupId>ai.h2o</groupId>
            <artifactId>sparkling-water-core_2.10</artifactId>
            <version>1.6.1</version>
            <type>pom</type>
        </dependency>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
            <!-- exclude the embedded Tomcat -->
            <exclusions>
                <exclusion>
                    <groupId>org.springframework.boot</groupId>
                    <artifactId>spring-boot-starter-tomcat</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <!-- add the servlet dependency -->
        <dependency>
            <groupId>javax.servlet</groupId>
            <artifactId>javax.servlet-api</artifactId>
            <version>3.1.0</version>
            <scope>provided</scope>
        </dependency>
    </dependencies>
    <build>
        <finalName>wordcount</finalName>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-jar-plugin</artifactId>
                <version>2.4</version>
                <configuration>
                    <archive>
                        <manifest>
                            <mainClass>com.fantj.bigdata.WordCountCluster</mainClass>
                        </manifest>
                    </archive>
                </configuration>
            </plugin>
        </plugins>
    </build>


</project>

I set this project up with Spring Boot, so the embedded Tomcat has to be excluded in the pom: we do not need a servlet container to run a plain Java job. In the end the project is packaged as a jar whose manifest records the main class, and that is what gets submitted; no container is involved. Beyond that, just add the Hadoop and Spark related dependencies. If you have no Maven background, learn Maven first.
WordCountLocal

package com.fantj.bigdata;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;
import scala.Tuple2;

import java.util.Arrays;

/**
 * Created by Fant.J.
 */
public class WordCountLocal {
    public static void main(String[] args) {
        /**
         * Create the SparkConf object and set the application's configuration.
         * setMaster sets the URL of the Spark cluster master node that the application should connect to;
         * "local" means run locally.
         */
        SparkConf conf = new SparkConf().setAppName("WordCountLocal").setMaster("local");
        /**
         * Create the JavaSparkContext object (the most important object).
         */
        JavaSparkContext sc = new JavaSparkContext(conf);
        /**
         * Create an initial RDD from the input source (an HDFS file, a local file, and so on).
         * This is a local test, so a local file is used here.
         * The textFile() method creates an RDD from a file-based input source.
         * The idea of an RDD: for an HDFS or local file, each element of the created RDD corresponds to one line of the file.
         */
        JavaRDD<String> lines = sc.textFile("C:\\Users\\84407\\Desktop\\spark.txt");
        /**
         * Apply transformation operations to the initial RDD.
         * These are usually expressed by creating functions and passing them to RDD operators such as map and flatMap.
         * If the function is simple, it is written as an anonymous inner class of the function interface;
         * if it is more complex, it gets a class of its own that implements that interface.
         * Here, each line is split into individual words.
         * FlatMapFunction has two type parameters, the input type and the output type.
         * The flatMap operator simply splits one RDD element into one or more elements.
         */
        JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public Iterable<String> call(String s) throws Exception {
                // split the line into words
                return Arrays.asList(s.split(" "));
            }
        });
        /**
         * Next, map each word to the form (word, 1) so that the occurrences of each word can be accumulated.
         * mapToPair turns each element into a Tuple2 of the form (v1, v2).
         * If you remember Scala tuples: yes, Tuple2 here is the Scala type that holds two values.
         * The mapToPair operator works together with PairFunction: the first type parameter is the input type,
         * and the second and third type parameters are the types of the first and second values of the output Tuple2.
         * Likewise, the two type parameters of JavaPairRDD are the types of the tuple's first and second values.
         */
        JavaPairRDD<String,Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(String word) throws Exception {
                return new Tuple2<String,Integer>(word,1);
            }
        });
        /**
         * Then count how often each word occurs, using the word as the key.
         * The reduceByKey operator runs a reduce over all values that share the same key.
         * For example, if the JavaPairRDD contains the elements (hello,1), (hello,1), (hello,1),
         * the reduce combines the first value with the second, then combines that result with the third:
         * for "hello" this means 1 + 1 = 2, then 2 + 1 = 3.
         * The elements of the returned JavaPairRDD are again tuples, where the first value is the key
         * and the second value is its reduced value, i.e. the count.
         */
        JavaPairRDD<String,Integer> wordCounts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1+v2;
            }
        });
        /**
         * At this point the word counts have been computed.
         * However, flatMap, mapToPair and reduceByKey are all transformation operations,
         * and a Spark application cannot consist of transformations alone,
         * so foreach is used here as the action that triggers execution.
         */
        wordCounts.foreach(new VoidFunction<Tuple2<String, Integer>>() {
            @Override
            public void call(Tuple2<String, Integer> wordCount) throws Exception {
                System.out.println(wordCount._1 + "appeared "+ wordCount._2 );
            }
        });
    }
}
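
One small thing the example leaves out is shutting the context down. It is not in the original code, but as an optional addition you can put the following at the very end of main():

        // optional addition (not in the original code): release the context once the job is done
        sc.close();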

As you can see, the code is full of anonymous inner classes. We can replace them with lambdas to make the code more concise.

    public static void main(String[] args) {

        SparkConf conf = new SparkConf().setAppName("WordCountLocal").setMaster("local");

        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("C:\\Users\\84407\\Desktop\\spark.txt");

        JavaRDD<String> words = lines.flatMap((FlatMapFunction<String, String>) s -> {
            // split the line into words
            return Arrays.asList(s.split(" "));
        });

        JavaPairRDD<String, Integer> pairs = words.mapToPair((PairFunction<String, String, Integer>) word -> new Tuple2<>(word, 1));

        JavaPairRDD<String, Integer> wordCounts = pairs.reduceByKey((Integer v1, Integer v2) -> {
            return v1 + v2;
        });

        wordCounts.foreach((Tuple2<String, Integer> wordCount) -> {
            System.out.println(wordCount._1 + "appeared " + wordCount._2);
        });
    }

Then run the main method.

The console prints:
Systemsappeared 1
examplesappeared 1
withappeared 2
inappeared 3
specificappeared 1
versionsappeared 1
recallappeared 1
copyappeared 2
Inappeared 1
VCSsappeared 1
controlled,appeared 1
Whatappeared 1
directory,appeared 1
Manyappeared 1
setappeared 1
loseappeared 1
...
...
systemappeared 1
Systemappeared 1
writeappeared 1
beingappeared 1
programmersappeared 1

Implementation 3: Cluster execution

This is the most common way to run Spark, mainly for offline batch processing of large data sets stored on HDFS.

Preparation:
Before doing this, spark.txt has to be uploaded to HDFS (as in section 1.1).

3.1 Modify the code

To run on the cluster, the code needs to change in two places:

        SparkConf conf = new SparkConf().setAppName("WordCountCluster");

        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("hdfs://s166/spark/spark.txt");

setAppName is kept consistent with the Java class name, and the input path is changed to the HDFS path of the file.
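
The post only shows the lines that change. As a rough sketch, and assuming the rest of the class mirrors the lambda version of WordCountLocal above (the original WordCountCluster source is not shown here), the full class would look something like this:

package com.fantj.bigdata;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

import java.util.Arrays;

/**
 * Sketch of the cluster variant: the same logic as WordCountLocal,
 * but without setMaster (the master is supplied when the job is submitted)
 * and reading the input from HDFS instead of the local file system.
 */
public class WordCountCluster {
    public static void main(String[] args) {
        // no setMaster here: spark-submit decides where the job runs
        SparkConf conf = new SparkConf().setAppName("WordCountCluster");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // read the input from HDFS instead of a local path
        JavaRDD<String> lines = sc.textFile("hdfs://s166/spark/spark.txt");

        JavaRDD<String> words = lines.flatMap((FlatMapFunction<String, String>) s -> Arrays.asList(s.split(" ")));

        JavaPairRDD<String, Integer> pairs =
                words.mapToPair((PairFunction<String, String, Integer>) word -> new Tuple2<>(word, 1));

        JavaPairRDD<String, Integer> wordCounts = pairs.reduceByKey((v1, v2) -> v1 + v2);

        // the action that triggers the job; it shows up in the log below as "foreach at WordCountCluster.java"
        wordCounts.foreach((Tuple2<String, Integer> wordCount) ->
                System.out.println(wordCount._1 + "appeared " + wordCount._2));

        sc.close();
    }
}

The class name matches the mainClass configured in the pom and the --class argument in the script below.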

3.2 Package with Maven

Package the Java project from the second implementation into a jar (for example with mvn clean package), copy it to the cluster, and run it through a script.

3.3 Upload to the cluster
3.4 Write the launch script wordcount.sh
[root@s166 fantj]# cat wordcount.sh 

/home/fantj/spark/bin/spark-submit \
--class com.fantj.bigdata.WordCountCluster \
--num-executors 1 \
--driver-memory 100m \
--executor-cores 1 \
/home/fantj/worldcount.jar

3.5 Run the script

./wordcount.sh

[root@s166 fantj]# ./wordcount.sh 
18/07/31 09:43:49 INFO spark.SparkContext: Running Spark version 1.5.1
18/07/31 09:43:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/07/31 09:43:52 INFO spark.SecurityManager: Changing view acls to: root
18/07/31 09:43:52 INFO spark.SecurityManager: Changing modify acls to: root
18/07/31 09:43:52 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
18/07/31 09:43:54 INFO slf4j.Slf4jLogger: Slf4jLogger started
18/07/31 09:43:54 INFO Remoting: Starting remoting
18/07/31 09:43:55 INFO util.Utils: Successfully started service 'sparkDriver' on port 41710.
18/07/31 09:43:55 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:41710]
18/07/31 09:43:55 INFO spark.SparkEnv: Registering MapOutputTracker
18/07/31 09:43:55 INFO spark.SparkEnv: Registering BlockManagerMaster
18/07/31 09:43:55 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-96c433c9-8f43-40fa-ba4f-1dc888608140
18/07/31 09:43:55 INFO storage.MemoryStore: MemoryStore started with capacity 52.2 MB
18/07/31 09:43:55 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-5d613c5d-e9c3-416f-8d8b-d87bc5e03e02/httpd-3609b712-55f4-4140-9e05-2ecee834b18c
18/07/31 09:43:55 INFO spark.HttpServer: Starting HTTP Server
..
...
18/07/31 09:44:12 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 4 ms
simple,appeared 1
nearlyappeared 1
Forappeared 1
backappeared 2
thisappeared 4
underappeared 1
itappeared 2
meansappeared 1
introducedappeared 1
revisionappeared 1
when,appeared 1
previousappeared 2
realityappeared 1
typeappeared 1
developedappeared 1
Localappeared 1
simpleappeared 1
...
causingappeared 1
changesappeared 3
andappeared 5
designerappeared 1
approachappeared 1
modifiedappeared 1
systemappeared 1
Systemappeared 1
writeappeared 1
beingappeared 1
programmersappeared 1
18/07/31 09:44:12 INFO executor.Executor: Finished task 0.0 in stage 1.0 (TID 1). 1165 bytes result sent to driver
18/07/31 09:44:12 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 200 ms on localhost (1/1)
18/07/31 09:44:12 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
18/07/31 09:44:12 INFO scheduler.DAGScheduler: ResultStage 1 (foreach at WordCountCluster.java:44) finished in 0.209 s
18/07/31 09:44:12 INFO scheduler.DAGScheduler: Job 0 finished: foreach at WordCountCluster.java:44, took 2.938418 s
18/07/31 09:44:12 INFO spark.SparkContext: Invoking stop() from shutdown hook
...
..
18/07/31 09:44:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remoting shut down.