Hadoop (WordCount hands-on example)

Requirement:

Count the total number of occurrences of each word in a given text file and output the result

Input data

zhou zhi xiong
duan xing yu
zhou xiong xiong

Expected output data

zhou 2
zhi 1
xiong 3
duan 1
xing 1
yu 1

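Requirement analysis

Following the MapReduce programming conventions, write a Mapper, a Reducer, and a Driver

Mapper

Convert the text content passed in by the MapTask to a String

Split the line into words on spaces

Emit each word as <word, 1>

Reducer

Sum the counts for each key

Output the total count for each key

Driver (fixed boilerplate)

Get the configuration and obtain a Job instance

Set the jar by class, so that Hadoop can locate the jar to run

Associate the Mapper and Reducer classes

Specify the key/value types of the Mapper output

Specify the key/value types of the final output

Specify the directory containing the job's input files

Specify the directory for the job's output

Submit the job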
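The Mapper and Reducer steps listed above can be tried out without a cluster. The following is a minimal plain-Java sketch of the same data flow (map, group by key, reduce) run on the sample input; it uses no Hadoop classes, and the class name WordCountSketch and its method names are made up for illustration. The TreeMap is an assumption chosen to mimic the fact that a reducer receives its keys in sorted order.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the WordCount data flow; no Hadoop classes are used.
class WordCountSketch {

    // "Map" step: emit one (word, 1) pair for every word in the line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split(" ")) {
            pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // "Shuffle" step: collect all values that share a key.
    // The TreeMap keeps the keys sorted, as a reducer would see them.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // "Reduce" step: sum the values collected for a single key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs the three steps in order over all input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(pairs).entrySet()) {
            totals.put(entry.getKey(), reduce(entry.getValue()));
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("zhou zhi xiong", "duan xing yu", "zhou xiong xiong");
        wordCount(input).forEach((word, total) -> System.out.println(word + "\t" + total));
    }
}
```

Running it on the three sample lines gives the same counts as the expected output, listed in sorted key order (duan, xing, xiong, yu, zhi, zhou) rather than in order of first appearance.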

Environment setup

Create a Maven project

Click Next

Click Next, then click Finish

Add the following dependencies to pom.xml

<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>RELEASE</version>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-core</artifactId>
<version>2.8.2</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.7.2</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.7.2</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>2.7.2</version>
</dependency>
<dependency>
<groupId>jdk.tools</groupId>
<artifactId>jdk.tools</artifactId>
<version>1.8</version>
<scope>system</scope>
<systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
</dependency>
</dependencies>

In the project's src/main/resources directory, create a new file named "log4j.properties" with the following content

log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spring.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n

Under src/main/java, create a package named com.redhat.wordcount

In the com.redhat.wordcount package, create three classes: Wcmapper, WcReducer, and WcDriver

Writing the program

Write the Mapper class

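package com.redhat.wordcount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Extends the Mapper base class
public class Wcmapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // LongWritable (input key): byte offset of the line within the file
    // Text (input value): the content of the line
    // Text (output key): the word, since we emit pairs of the form (word, 1)
    // IntWritable (output value): the count 1

    // Reused across map() calls instead of allocating new objects for every word
    private final Text word = new Text();
    private final IntWritable one = new IntWritable(1);

    // Override the parent's map method (Control+O in IDEA generates the override)
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // context is how the task hands its output back to the MapReduce framework

        // Get the current line
        String line = value.toString();

        // Split the line on spaces
        String[] words = line.split(" ");

        // Hand each word to the framework as <word, 1> (IDEA live template: iter)
        for (String word : words) {
            this.word.set(word);
            context.write(this.word, this.one);
        }
    }
}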
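The Mapper splits on a single space, which is fine for the sample input. On files such as README.txt, consecutive spaces, tabs, or blank lines make split(" ") return empty strings, and those would be counted as words too. The plain-Java sketch below compares the two behaviours; the class TokenizeSketch and its method names are hypothetical, and splitting on "\\s+" is one possible alternative, not something the original Mapper does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Compares the Mapper's split(" ") with a whitespace-tolerant alternative.
class TokenizeSketch {

    // Same call as in the Mapper: splits at every single space character.
    static String[] splitOnSingleSpace(String line) {
        return line.split(" ");
    }

    // Alternative: trim the line, split on runs of whitespace, skip blank lines.
    static List<String> splitOnWhitespace(String line) {
        List<String> words = new ArrayList<>();
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return words; // a blank line contributes no words
        }
        words.addAll(Arrays.asList(trimmed.split("\\s+")));
        return words;
    }

    public static void main(String[] args) {
        String line = "zhou  zhi\txiong"; // two spaces, then a tab
        System.out.println(Arrays.toString(splitOnSingleSpace(line)));
        System.out.println(splitOnWhitespace(line));
    }
}
```

If the second variant were used in the Mapper, the loop body would stay the same; only the way the words array is produced would change.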

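Write the Reducer

package com.redhat.wordcount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Extends the Reducer base class
public class WcReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Input types match the Mapper output <word, 1>; output types are <word, n>

    // Reused across reduce() calls to hold the total for the current key
    private final IntWritable total = new IntWritable();

    // Override the parent's reduce method
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Add up all the counts for this key
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }

        // Wrap the result and hand it to the framework via context
        total.set(sum);
        context.write(key, total);
    }
}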

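Write the Driver class

package com.redhat.wordcount;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WcDriver {
    // The Driver configures the map and reduce tasks; it always follows the same pattern

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // 1. Get a Job instance (the Job object represents the whole MapReduce run)
        Job job = Job.getInstance(new Configuration());

        // 2. Tell Hadoop which jar to use by pointing at a class contained in it
        job.setJarByClass(WcDriver.class);

        // 3. Set the Mapper and Reducer classes
        job.setMapperClass(Wcmapper.class);
        job.setReducerClass(WcReducer.class);

        // 4. Set the Mapper and final output types; generic type parameters are erased
        //    at runtime, so the framework cannot work them out by itself
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // 5. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0])); // args[0]: first program argument
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // args[1]: second program argument

        // 6. Submit the job and wait for it to finish
        boolean b = job.waitForCompletion(true);
        // The boolean tells whether the job succeeded or failed
        System.exit(b ? 0 : 1);
    }
}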
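The driver reads args[0] and args[1] without checking them, so starting it with no program arguments ends in an ArrayIndexOutOfBoundsException. A small guard, sketched here in plain Java, could turn that into a readable usage message. The class DriverArgsSketch, the method requirePaths, and the wording of the message are all hypothetical and not part of the original driver.

```java
// Validates the two program arguments the driver expects.
class DriverArgsSketch {

    // Returns {inputPath, outputPath}, or throws with a usage message
    // when the argument count is wrong.
    static String[] requirePaths(String[] args) {
        if (args == null || args.length != 2) {
            throw new IllegalArgumentException("Usage: WcDriver <input path> <output path>");
        }
        return new String[] {args[0], args[1]};
    }

    public static void main(String[] args) {
        try {
            String[] paths = requirePaths(args);
            System.out.println("input=" + paths[0] + ", output=" + paths[1]);
        } catch (IllegalArgumentException e) {
            // Print the usage hint instead of a stack trace
            System.err.println(e.getMessage());
        }
    }
}
```

In the real driver such a check would sit at the top of main, before the Job is created.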

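Local test

Running the driver as-is fails, because no input and output paths have been passed as program arguments

Click Edit Configurations in the top-right corner of IDEA

Set the program arguments so that the input path is the input folder on drive C and the output path is the output folder on drive C

Run again; the wordcount program now completes successfully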
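One thing to watch when running the job a second time: Hadoop refuses to start a job whose output directory already exists, so the output folder has to be removed between local runs. This can be done by hand, or with a small helper like the plain-Java sketch below. The class CleanOutputDirSketch and its methods are hypothetical, and the demo works on a temporary directory rather than on the real output folder.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Removes a local output directory left over from a previous run.
class CleanOutputDirSketch {

    // Recursively deletes dir if it exists; returns true if something was deleted.
    static boolean deleteIfExists(Path dir) {
        if (!Files.exists(dir)) {
            return false;
        }
        try {
            List<Path> paths;
            try (Stream<Path> walk = Files.walk(dir)) {
                // Reverse order puts children before their parent directory
                paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
            }
            for (Path path : paths) {
                Files.delete(path);
            }
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Creates a throwaway directory containing one file, standing in for an old output folder.
    static Path createDemoOutputDir() {
        try {
            Path dir = Files.createTempDirectory("wordcount-output");
            Files.createFile(dir.resolve("part-r-00000"));
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dir = createDemoOutputDir();
        System.out.println("deleted: " + deleteIfExists(dir));
        System.out.println("still exists: " + Files.exists(dir));
    }
}
```

The helper only touches the local file system, so it fits the local test; for output on HDFS, the directory would have to be removed with HDFS tooling instead.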

The first two .crc files in the output folder are checksum files

Open the part-r-00000 file to see the computed result

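Testing on the cluster

In the right sidebar of IDEA, go to Maven Projects –> mapreduce –> Lifecycle and double-click package

This generates a target folder; copy the jar file inside it to the Linux Hadoop cluster and rename it to 1.jar

Upload the local test file README.txt to HDFS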

[redhat@hadoop102 hadoop-2.7.2]$ hadoop fs -put README.txt /

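Run 1.jar to count the words in README.txt

Copy the reference of the driver class to get its fully qualified name, com.redhat.wordcount.WcDriver; this is the program's entry point

Run the wordcount program on Hadoop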

[redhat@hadoop102 hadoop-2.7.2]$ hadoop jar 1.jar com.redhat.wordcount.WcDriver /README.txt /wordcount

View the word count results for README.txt