Hadoop Ecosystem Learning Path (2): How to Write an MR Job and Run and Test It

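I have been too busy lately to write any blog posts. First, I have been mentoring a colleague whose foundations are a bit weaker than mine, teaching him mybatis, spring, springMVC, html, css, js and jquery one by one. While teaching, I found quite a few things I had never fully understood myself, so learning something well is not easy. I have also joined my company's big data project, where I learned how to write an MR job and the basics of hdfs, hbase, hive, impala and zookeeper. Today I will share how to write an MR job, and cover the other tools one by one in later posts. There is far too much in the big data world to use all of it, let alone in depth, so I will keep working at it. On top of that I am taking driving lessons at weekends, which is exhausting, but it is also good training.
Enough chatter, let's get to the point.
First, an outline of what this MR does. It is essentially an ETL process: data is extracted, transformed, and loaded into a destination. The destination can be HDFS (where the next MR picks it up), the HBase data store, or a Hive or Impala database; Hive and Impala data can also be synchronized with each other. The real job pulls files from FTP, stores them directly in HDFS, and then an MR writes the processed data back to HDFS for another MR to consume. To keep things simple, here I replace the FTP step with reading directly from HDFS. How to pull files from FTP straight into HDFS will be covered in a later post.
I will go through the following steps: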

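1. Preparing the files and the Maven environment

I manage dependencies with Maven, and all Hadoop-related packages are pulled in as dependency entries. The pom.xml is as follows: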

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.qiyongkang</groupId>
  <artifactId>mr-demo</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <name>mr-demo</name>
  <description>mr-demo</description>
  <packaging>jar</packaging>

  <repositories>
      <!-- Note: this uses Cloudera's Maven repository -->
      <repository>
        <id>cloudera</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
      </repository>  
  </repositories>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <!-- Hadoop version -->
    <hadoop.version>2.3.0-cdh5.0.0</hadoop.version>
    <!-- HBase version -->
    <hbase.version>0.96.1.1-cdh5.0.0</hbase.version>
    <!-- Hive version -->
    <hive.version>0.12.0-cdh5.0.0</hive.version>
    <!-- JUnit version -->
    <junit.version>4.8.1</junit.version>
  </properties>

  <dependencies>
      <!-- Hadoop dependencies -->
      <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-mapreduce-client-core</artifactId>
          <version>${hadoop.version}</version>
          <exclusions>
              <exclusion>
                  <artifactId>jdk.tools</artifactId>
                  <groupId>jdk.tools</groupId>
              </exclusion>
          </exclusions>
      </dependency>

      <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-common</artifactId>
          <version>${hadoop.version}</version>
      </dependency>

      <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-hdfs</artifactId>
          <version>${hadoop.version}</version>
      </dependency>

      <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-client</artifactId>
          <version>${hadoop.version}</version>
          <exclusions>
              <exclusion>
                  <artifactId>mockito-all</artifactId>
                  <groupId>org.mockito</groupId>
              </exclusion>
          </exclusions>
      </dependency>

      <!-- MRUnit dependency -->
      <dependency>
          <groupId>org.apache.mrunit</groupId>
          <artifactId>mrunit</artifactId>
          <version>0.9.0-incubating</version>
          <classifier>hadoop2</classifier>
      </dependency>

      <!-- JUnit dependency -->
      <dependency>
          <groupId>junit</groupId>
          <artifactId>junit</artifactId>
          <version>${junit.version}</version>
          <scope>test</scope>
      </dependency>
  </dependencies>

  <build>
    <!-- Builds the executable jar without bundling the dependencies; just run the package phase -->
    <plugins>
      <plugin>
       <groupId>org.apache.maven.plugins</groupId>
       <artifactId>maven-jar-plugin</artifactId>
       <version>2.4</version>
       <configuration>
         <archive>
            <manifest>
              <addClasspath>false</addClasspath>
              <classpathPrefix>lib/</classpathPrefix>
              <mainClass>org.qiyongkang.mr.parsetofivele.ParseDataToFileElementMR</mainClass>
            </manifest>
          </archive>
       </configuration>
      </plugin>

      <!-- Bundles all dependency jars into a single jar, so nothing has to be added to the Hadoop runtime environment -->
      <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-assembly-plugin</artifactId>
          <version>2.3</version>
          <configuration>
              <descriptorRefs>
                  <descriptorRef>jar-with-dependencies</descriptorRef>
              </descriptorRefs>
              <archive>
                <manifest>
                    <addClasspath>false</addClasspath>
                    <mainClass>org.qiyongkang.mr.parsetofivele.ParseDataToFileElementMR</mainClass>
                </manifest>
              </archive>
          </configuration>
          <executions>
              <execution>
                  <id>make-assembly</id>
                  <phase>package</phase>
                  <goals>
                      <goal>single</goal>
                  </goals>
              </execution>
          </executions>
      </plugin>

      <!-- Copies the dependency jars -->
      <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-dependency-plugin</artifactId>
          <executions>
              <execution>
                  <id>copy-dependencies</id>
                  <phase>package</phase>
                  <goals>
                      <goal>copy-dependencies</goal>
                  </goals>
                  <configuration>
                      <outputDirectory>${project.build.directory}/lib</outputDirectory>
                      <overWriteReleases>false</overWriteReleases>
                      <overWriteSnapshots>false</overWriteSnapshots>
                      <overWriteIfNewer>true</overWriteIfNewer>
                  </configuration>
              </execution>
          </executions>
      </plugin>
    </plugins>
  </build>
</project>

Next, we prepare a data file in the following format:

202.102.224.68|53|61.158.148.103|17872|22640|p.tencentmind.com|A|A_125.39.213.86|20160308100839.993|0|r
202.102.224.68|53|61.158.152.97|20366|27048|api.k.sohu.com|A|A_123.126.104.116;A_123.126.104.119;A_123.126.104.114;A_123.126.104.117;A_123.126.104.118;A_123.126.104.120;A_123.126.104.115;A_123.126.104.122|20160308100839.993|0|r
115.60.53.151|7582|202.102.224.68|53|33946|cip4.e1977.com|A||20160308100839.993|0|q
182.119.224.59|14731|202.102.224.68|53|31185|s.jpush.cn|A||20160308100839.993|0|q
202.102.224.68|53|182.118.77.145|22420|19278|file32.mafengwo.net|A|A_182.118.77.145|20160308100839.993|0|r
202.102.224.68|53|115.60.14.138|22929|31604|mmbiz.qpic.cn|A|A_42.236.95.35;A_42.236.95.36;A_42.236.95.34;A_182.118.63.200;A_182.118.63.196;A_42.236.95.33;A_42.236.95.37|20160308100839.993|0|r
115.60.109.162|3760|202.102.224.68|53|8920|a.root-servers.net|A||20160308100839.993|0|q

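Each line is delimited by | and ends with r or q. Our MR keeps only the lines ending in r and takes only a few columns from each of them. Three of those columns form the key, which is emitted with a count and becomes the reducer input; the final result is written to HDFS. This strips out most of the useless data and greatly reduces the file size.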
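The filtering and key-building rule described above can be tried out in plain Java before any Hadoop code is involved. This is only a minimal sketch: the class name `RecordKeySketch`, the method `toKey`, and the assumption that a well-formed line has exactly 11 fields (true for every sample line shown) are mine, not part of the original job.

```java
import java.util.Optional;

// Hypothetical helper, not part of the original project.
final class RecordKeySketch {
    // Assumption: every well-formed record has 11 pipe-separated fields.
    private static final int FIELD_COUNT = 11;

    /** Returns the tab-joined key for a response ("r") record, or empty otherwise. */
    static Optional<String> toKey(String line) {
        // The -1 limit keeps empty fields, so field positions stay stable.
        String[] f = line.split("\\|", -1);
        if (f.length != FIELD_COUNT || !"r".equals(f[10])) {
            return Optional.empty();
        }
        // Domain name (field 5) plus the IPs in fields 0 and 2.
        return Optional.of(f[5] + "\t" + f[0] + "\t" + f[2]);
    }

    public static void main(String[] args) {
        String response = "202.102.224.68|53|182.118.77.145|22420|19278|file32.mafengwo.net|A|A_182.118.77.145|20160308100839.993|0|r";
        String query = "115.60.53.151|7582|202.102.224.68|53|33946|cip4.e1977.com|A||20160308100839.993|0|q";
        System.out.println(toKey(response).orElse("(skipped)"));
        System.out.println(toKey(query).orElse("(skipped)"));
    }
}
```

The response line yields a key and the query line is skipped, which is the same decision the mapper has to make for every input line.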
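I prepared a .txt file, testData.txt, with a size of 1.9 (the unit is not stated), and placed it on the server next to the job jar, which is the package we will run on YARN later.
Then run: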

su hdfs

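This switches to the hdfs user. The ecosystem environment used here is the one set up with CM in the previous post, and CM creates an hdfs user for HDFS, so we have to use that user for HDFS operations.
Run the following command to upload the file to the /test/input directory on HDFS (if the directory does not exist yet, create it first with hadoop fs -mkdir -p /test/input):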

hadoop fs -put testData.txt /test/input

Running hadoop fs -ls /test/input confirms that the upload to HDFS succeeded.

2. Writing the Mapper class

The Mapper class, ParseDataToFileElementMapper:

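public static class ParseDataToFileElementMapper extends Mapper<Object, Text, Text, IntWritable> {

        // Every response record contributes a count of 1 to its key.
        private static final IntWritable ONE = new IntWritable(1);
        // Reused across map() calls to avoid allocating a new Text per record.
        private final Text mapKey = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // The -1 limit keeps empty fields, so field positions stay stable.
            String[] values = value.toString().split("\\|", -1);

            // Skip malformed lines and keep only responses (last field "r").
            if (values.length < 11 || !"r".equals(values[10])) {
                return;
            }

            // Key: domain name (field 5) and the IPs in fields 0 and 2, tab-separated.
            mapKey.set(values[5] + "\t" + values[0] + "\t" + values[2]);
            context.write(mapKey, ONE);
        }

    }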

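Since there is not much code, I wrote the Mapper and Reducer as nested classes; feel free to move them into their own files.

3. Writing the Reducer class

The Reducer class, ParseDataToFileElementReducer: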

public static class ParseDataToFileElementReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private Text reduceKey = new Text();
        private IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Reducer<Text, IntWritable, Text, IntWritable>.Context context)
                throws IOException, InterruptedException {
            // Count how many times the same key occurs.
            // Key layout from the mapper: domain \t field-0 IP \t field-2 IP
            int sum = 0;
            for (IntWritable val : values) {
              sum += val.get();
            }
            this.result.set(sum);
            this.reduceKey.set("1.1-1.1" + "\t" + key.toString());

            context.write(this.reduceKey, this.result);
        }

    }

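The mapper reads and parses the text data line by line. During the shuffle, each key is hashed to pick a partition, so all records with the same key end up at the same reducer. The reducer then counts the records for each key and writes the result to HDFS.

4. Calling the MR from the main function

The main class, ParseDataToFileElementMR: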
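To make the hashing step concrete, here is a minimal sketch of the arithmetic that Hadoop's default HashPartitioner applies to a key's hash code. The class and method names are hypothetical, and `String.hashCode()` is only a stand-in: the real job hashes a `Text` key, which produces different numbers.

```java
// Hypothetical illustration, not part of the original project.
final class PartitionSketch {
    /** Maps a key hash to a reducer index in the range [0, numReduceTasks). */
    static int partitionFor(int keyHash, int numReduceTasks) {
        // Masking off the sign bit keeps the result non-negative for negative hash codes.
        return (keyHash & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String key = "a.root-servers.net\t202.102.224.68\t115.60.109.162";
        // Stand-in hash: Hadoop would call Text.hashCode() instead.
        int hash = key.hashCode();
        System.out.println("partition with 10 reducers: " + partitionFor(hash, 10));
        System.out.println("partition with 1 reducer:   " + partitionFor(hash, 1));
    }
}
```

Because the partition depends only on the key's hash, identical keys always land on the same reducer, and with a single reducer every key maps to partition 0.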

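package org.qiyongkang.mr.parsetofivele;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class ParseDataToFileElementMR {

    // The nested ParseDataToFileElementMapper and ParseDataToFileElementReducer
    // classes from sections 2 and 3 go here.

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        // Strip generic Hadoop options (-D, -files, ...) and keep <in> <out>.
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
          System.err.println("Usage: ParseDataToFileElementMR <in> <out>");
          System.exit(2);
        }
        Job job = Job.getInstance(conf, "ParseDataToFileElementMR");
        job.setJarByClass(ParseDataToFileElementMR.class);
        // Mapper
        job.setMapperClass(ParseDataToFileElementMapper.class);

        // Combiner: left disabled. The reducer rewrites the key (it adds a prefix),
        // so it cannot be reused as a combiner as-is.
//        job.setCombinerClass(ParseDataToFileElementReducer.class);

        // Reducer: a single reduce task, so the job writes a single output file.
        job.setReducerClass(ParseDataToFileElementReducer.class);
        job.setNumReduceTasks(1);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

        // Compress the reduce output as .gz
        FileOutputFormat.setCompressOutput(job, true);  // enable output compression
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class); // compression codec

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}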

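Here we set the number of reducers to 1 and have the output compressed as gzip (.gz).

5. Writing MRUnit tests

Next, we use MRUnit to test the MR. The required jar dependencies were already given in the pom file in step 1, so here is the test code directly; it runs just like any JUnit test: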

package org.qiyongkang.mr.parsetofivele;

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mapreduce.MapReduceDriver;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Before;
import org.junit.Test;
import org.qiyongkang.mr.parsetofivele.ParseDataToFileElementMR.ParseDataToFileElementMapper;
import org.qiyongkang.mr.parsetofivele.ParseDataToFileElementMR.ParseDataToFileElementReducer;

/**
 * MRUnit tests for the mapper and reducer of ParseDataToFileElementMR.
 * Date: 2016-03-15
 *
 * @author qiyongkang
 * @since JDK 1.6
 */
public class ParseDataToFileElementMRTest {

    MapDriver<Object, Text, Text, IntWritable> mapDriver;
    ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver;
    MapReduceDriver<Object, Text, Text, IntWritable, Text, IntWritable> mapReduceDriver;

    @Before
    public void setUp() throws Exception {
        ParseDataToFileElementMapper mapper = new ParseDataToFileElementMapper();
        ParseDataToFileElementReducer reducer = new ParseDataToFileElementReducer();
        mapDriver = MapDriver.newMapDriver(mapper);
        reduceDriver = ReduceDriver.newReduceDriver(reducer);
        mapReduceDriver = MapReduceDriver.newMapReduceDriver(mapper, reducer);
    }

    @Test
    public void testMapper() {
        mapDriver.withInput(new Object(), new Text(
                "202.102.224.68|53|115.60.109.162|3760|8920|a.root-servers.net|A|A_198.41.0.4|20160308100839.993|0|r"));
        mapDriver.withOutput(new Text("a.root-servers.net\t202.102.224.68\t115.60.109.162"), new IntWritable(1));
        mapDriver.runTest();
    }

    @Test
    public void testReducer() {
        List<IntWritable> values = new ArrayList<IntWritable>();
        values.add(new IntWritable(1));
        values.add(new IntWritable(1));
        reduceDriver.withInput(new Text("a.root-servers.net\t202.102.224.68\t115.60.109.162"), values);
        reduceDriver.withOutput(new Text("1.1-1.1\ta.root-servers.net\t202.102.224.68\t115.60.109.162"),
                new IntWritable(2));
        reduceDriver.runTest();
    }

    @Test
    public void testMapReducer() {
        // One response line in, so the whole map -> reduce pipeline should count it once.
        mapReduceDriver.withInput(new Object(), new Text(
                "202.102.224.68|53|115.60.109.162|3760|8920|a.root-servers.net|A|A_198.41.0.4|20160308100839.993|0|r"));
        mapReduceDriver.withOutput(new Text("1.1-1.1\ta.root-servers.net\t202.102.224.68\t115.60.109.162"), new IntWritable(1));
        mapReduceDriver.runTest();
    }

}

Here we can test against single lines of the file, because the mapper reads the file line by line anyway, much like a BufferedReader.
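When you want to sanity-check the counting logic over several lines without MRUnit or a cluster, the map, group and reduce steps can be imitated in memory. The sketch below is only an approximation under my own assumptions: the class name `LocalCountSketch` is hypothetical, and a sorted map stands in for the shuffle.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory imitation of the job, not part of the original project.
final class LocalCountSketch {
    static Map<String, Integer> run(List<String> lines) {
        // A TreeMap keeps keys sorted, loosely mirroring the sorted order a reducer sees.
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String[] f = line.split("\\|", -1);
            // "Map" step: drop malformed lines and everything that is not a response.
            if (f.length < 11 || !"r".equals(f[10])) {
                continue;
            }
            String key = f[5] + "\t" + f[0] + "\t" + f[2];
            // "Reduce" step: sum the ones per key and add the same prefix as the reducer.
            counts.merge("1.1-1.1\t" + key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "202.102.224.68|53|182.118.77.145|22420|19278|file32.mafengwo.net|A|A_182.118.77.145|20160308100839.993|0|r",
                "202.102.224.68|53|182.118.77.145|22420|19278|file32.mafengwo.net|A|A_182.118.77.145|20160308100839.993|0|r",
                "115.60.53.151|7582|202.102.224.68|53|33946|cip4.e1977.com|A||20160308100839.993|0|q");
        for (Map.Entry<String, Integer> e : run(lines).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

This does not replace the MRUnit tests, since it never touches the real Mapper and Reducer classes; it is just a quick way to see what output to expect for a handful of input lines.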

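6. Packaging

I package the project with the Maven plugins already configured in the pom file. To avoid copying dependency jars into the Hadoop environment, we use the jar-with-dependencies packaging style. Opening mr-demo-0.0.1-SNAPSHOT-jar-with-dependencies.jar in a decompiler shows the dependency classes bundled inside it.

The class containing the main function is also specified; see the pom file.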

7. Running on YARN (MR2)

The MR is finished, so we can now run it on YARN. Hadoop 1.x uses MR1, whereas YARN ships with MR2; I will cover the differences between MR1 and MR2 in a later post.
Now run it:

yarn jar mr-demo-0.0.1-SNAPSHOT-jar-with-dependencies.jar /test/input /test/output

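Our input file here is plain .txt; compressed input and other formats are also supported and will be covered later.
Next, check the output directory on HDFS, for example with hadoop fs -ls /test/output.

Since only one reducer was specified, there is only one output file.
We can get this file to the local disk and decompress it to look at the result.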
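As an alternative to decompressing the file by hand, the downloaded .gz output can be read directly with the JDK's gzip stream. This is a minimal sketch; the class name is hypothetical, and the default file name `part-r-00000.gz` is an assumption about what the single reducer's output file is called, so pass the real path as an argument if it differs.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

// Hypothetical helper for inspecting a local copy of the job output.
final class GzipOutputReaderSketch {
    /** Reads all text lines from a gzip-compressed file. */
    static List<String> readLines(Path gzFile) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(gzFile)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Assumed default name; override with the first command-line argument.
        Path file = Paths.get(args.length > 0 ? args[0] : "part-r-00000.gz");
        for (String line : readLines(file)) {
            // Each line is the tab-separated key followed by a tab and the count.
            System.out.println(line);
        }
    }
}
```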

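8. Viewing the results and logs

Visit http://massdata8:19888/jobhistory (19888 is the default port of the JobHistory Server) to see the logs of the MR run.

You can also run yarn application -list to see the jobs that are currently running.

That is all for writing an MR job. I hope it helps those who are just starting out with Hadoop. You can also look at the MR examples that ship with Hadoop to learn how a basic MR job is written.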
