Hadoop MapReduce Tips

Mar 19th, 2013

While writing MapReduce programs with Hadoop I ran into a few problems. I solved them one by one by searching Google and combining what I found with my own understanding of Hadoop.

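Custom Writable

Hadoop places requirements on the key and value types used in MapReduce: put simply, they must support Hadoop's serialization (and a key type must additionally implement WritableComparable so that it can be sorted). For serialization performance, Hadoop ships serializable counterparts for the common Java primitive types, such as IntWritable and LongWritable, and provides Text for String. These built-in types are not enough for real business needs, though. In that case you define your own Writable class, so that it can be used in a job while also expressing the business logic.

Suppose I have already scraped book data from Douban, including each book's title and the tags defined by readers, and stored it as JSON in text files. Now I want to extract the parts I care about, for example the tag list of a given book together with the number of times each tag was applied. This data can serve as vectors that feed later analysis. The map step reads the JSON file and yields each book's title along with the information for a single tag. For the map output value I want my own type, BookTag, which holds only the tag's name and its count: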

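import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class BookTag implements Writable {
    private String name;
    private int count;

    // Hadoop needs a no-argument constructor to create instances before readFields().
    public BookTag() {
        name = "";   // avoid handing a null string to Text.writeString()
        count = 0;
    }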

    public BookTag(String name, int count) {
        this.name = name;
        this.count = count;
    }

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        if (dataOutput != null) {
            Text.writeString(dataOutput, name);
            dataOutput.writeInt(count);
        }
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        if (dataInput != null) {
            name = Text.readString(dataInput);
            count = dataInput.readInt();
        }
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public int getCount() {
        return count;
    }

    public void setCount(int count) {
        this.count = count;
    }

    @Override
    public String toString() {
        return "BookTag{" +
                "name='" + name + '\'' +
                ", count=" + count +
                '}';
    }
}

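Note that in write() and readFields(), String is handled quite differently from int, long, and similar types: the primitives go through methods such as writeInt() and readInt() on DataOutput and DataInput, whereas the string has to go through the static helpers Text.writeString() and Text.readString().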
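The write()/readFields() pairing can be tried without a cluster. The following is a minimal plain-JDK sketch, not the BookTag class itself: SimpleTag and roundTrip are hypothetical names, and writeUTF()/readUTF() stand in for Text.writeString()/Text.readString() so that the example has no Hadoop dependency. It shows the one rule that matters: fields are read back in exactly the order they were written.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class TagRoundTrip {
    // Hypothetical stand-in for BookTag, using only java.io types.
    static final class SimpleTag {
        String name = "";
        int count;

        // Fields are written in a fixed order...
        void write(DataOutput out) throws IOException {
            out.writeUTF(name);   // assumption: stands in for Text.writeString()
            out.writeInt(count);
        }

        // ...and must be read back in the same order.
        void readFields(DataInput in) throws IOException {
            name = in.readUTF();  // assumption: stands in for Text.readString()
            count = in.readInt();
        }
    }

    // Serializes a tag to bytes and deserializes it into a fresh instance.
    static SimpleTag roundTrip(String name, int count) {
        try {
            SimpleTag original = new SimpleTag();
            original.name = name;
            original.count = count;

            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));

            SimpleTag copy = new SimpleTag();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        SimpleTag copy = roundTrip("hadoop", 3);
        System.out.println(copy.name + " " + copy.count); // prints "hadoop 3"
    }
}
```

Swapping the two reads in readFields() would make the round trip fail, which is a quick way to convince yourself that the order is part of the format.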

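For each book, the map output may contain repeated BookTag entries (that is, entries with the same tag name), whereas I need the total count per tag to use as the vector for analysis. So the reduce input can be <Text, Iterable<BookTag>>, but the output should merge the entries that share a tag. For that I introduced a BookTags class that keeps a Map of BookTag internally; it also has to implement Writable. Because BookTags contains a collection, its implementation is slightly different: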

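import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.Writable;

public class BookTags implements Writable {
    private Map<String, BookTag> tags = new HashMap<String, BookTag>();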

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeInt(tags.size());
        for (BookTag tag : tags.values()) {
            tag.write(dataOutput);
        }
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        // Hadoop may reuse this instance, so drop entries left over from a previous read.
        tags.clear();
        int size = dataInput.readInt();
        for (int i = 0; i < size; i++) {
            BookTag tag = new BookTag();
            tag.readFields(dataInput);
            tags.put(tag.getName(), tag);
        }
    }

    public void add(BookTag tag) {
        String tagName = tag.getName();
        if (tags.containsKey(tagName)) {
            BookTag bookTag = tags.get(tagName);
            bookTag.setCount(bookTag.getCount() + tag.getCount());
        } else {
            tags.put(tagName, tag);
        }
    }

    @Override
    public String toString() {
        StringBuilder resultTags = new StringBuilder();
        for (BookTag tag : tags.values()) {
            resultTags.append(tag.toString());
            resultTags.append("|");
        }
        return resultTags.toString();
    }
}

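In fact, for a custom Writable that nests a collection like this, the nested type also implements Writable, so the outer class can simply call the nested type's write() and readFields(). The only difference is that the size of the collection has to be written to the DataOutput first, so that the reader knows how many elements to read back. This is essentially the Composite pattern.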
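The size-prefix idea can be sketched with plain java.io streams. This is an illustration under assumptions, not the BookTags class: TagCounts, toBytes and fromBytes are hypothetical names, and the map holds plain counts rather than BookTag objects. The sketch also clears the map at the start of readFields(), a design choice that matters whenever the same instance is deserialized into more than once.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

class TagMapRoundTrip {
    // Hypothetical stand-in for BookTags: tag name -> count.
    static final class TagCounts {
        final Map<String, Integer> counts = new LinkedHashMap<>();

        void write(DataOutput out) throws IOException {
            out.writeInt(counts.size());          // size prefix comes first
            for (Map.Entry<String, Integer> entry : counts.entrySet()) {
                out.writeUTF(entry.getKey());
                out.writeInt(entry.getValue());
            }
        }

        void readFields(DataInput in) throws IOException {
            counts.clear();                       // the instance may be reused
            int size = in.readInt();              // tells us how many entries follow
            for (int i = 0; i < size; i++) {
                String name = in.readUTF();
                int count = in.readInt();
                counts.put(name, count);
            }
        }
    }

    static byte[] toBytes(TagCounts tags) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            tags.write(new DataOutputStream(bytes));
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static void fromBytes(TagCounts target, byte[] data) {
        try {
            target.readFields(new DataInputStream(new ByteArrayInputStream(data)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        TagCounts first = new TagCounts();
        first.counts.put("novel", 2);
        first.counts.put("history", 5);
        TagCounts second = new TagCounts();
        second.counts.put("poetry", 1);

        // Deserialize two different records into the same object, one after the other.
        TagCounts reused = new TagCounts();
        fromBytes(reused, toBytes(first));
        fromBytes(reused, toBytes(second));
        System.out.println(reused.counts); // prints "{poetry=1}"
    }
}
```

Without the clear() call, the second read would leave the entries of the first record in the map, so the reused object would report three tags instead of one.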

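The strange behaviour of Iterable

In reduce() I need to iterate over the incoming Iterable in order to accumulate the counts of repeated tags. While iterating I noticed something odd: the tag information I ended up with for each book had all become identical. Debugging the reduce job showed that every time the iteration moved on to the next element, the newest value overwrote the objects obtained earlier, because they were all the same object. Searching Google, I found that this is a known quirk of Hadoop: the next() method of the Iterable's iterator always returns the same object, which Hadoop reuses and refills with each value. The fix is to create a new object for each element while iterating and store that in our collection, as shown on line 5 of the code below: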

    public static class BookReduce extends Reducer<Text, BookTag, Text, BookTags> {
        public void reduce(Text key, Iterable<BookTag> values, Context context) throws IOException, InterruptedException {
            BookTags bookTags = new BookTags();
            for (BookTag tag : values) {
                bookTags.add(new BookTag(tag.getName(), tag.getCount()));
            }
            context.write(key, bookTags);
        }
    }
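The effect is easy to reproduce outside Hadoop. The sketch below is a plain-Java illustration under assumptions: Tag and reusingIterable are hypothetical, and the iterable merely imitates the object reuse described above rather than using Hadoop's own classes. Keeping the references yields three identical entries, while copying each element preserves the values.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class ReusedValueDemo {
    // Hypothetical mutable value, standing in for BookTag.
    static final class Tag {
        String name;
        int count;

        Tag(String name, int count) {
            this.name = name;
            this.count = count;
        }
    }

    // Assumption for illustration: an Iterable whose iterator refills ONE shared
    // Tag on every next() call instead of allocating a new object.
    static Iterable<Tag> reusingIterable(final String[] names, final int[] counts) {
        final Tag shared = new Tag("", 0);
        return () -> new Iterator<Tag>() {
            private int index = 0;

            @Override
            public boolean hasNext() {
                return index < names.length;
            }

            @Override
            public Tag next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                shared.name = names[index];
                shared.count = counts[index];
                index++;
                return shared;
            }
        };
    }

    // Keeps the references handed out by the iterator (the buggy variant).
    static List<String> collectReferences(Iterable<Tag> values) {
        List<Tag> kept = new ArrayList<>();
        for (Tag tag : values) {
            kept.add(tag);
        }
        return names(kept);
    }

    // Copies each element before storing it (the fix).
    static List<String> collectCopies(Iterable<Tag> values) {
        List<Tag> kept = new ArrayList<>();
        for (Tag tag : values) {
            kept.add(new Tag(tag.name, tag.count));
        }
        return names(kept);
    }

    static List<String> names(List<Tag> tags) {
        List<String> result = new ArrayList<>();
        for (Tag tag : tags) {
            result.add(tag.name);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] names = {"novel", "history", "poetry"};
        int[] counts = {1, 2, 3};
        System.out.println(collectReferences(reusingIterable(names, counts))); // [poetry, poetry, poetry]
        System.out.println(collectCopies(reusingIterable(names, counts)));     // [novel, history, poetry]
    }
}
```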

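One lesson from this is that debugging helps you locate problems quickly when writing MapReduce programs. To debug, create an input folder under the project root, put the source data files into it, and then pass that folder in the program arguments of the debug configuration.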

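How to unit test

We can also write unit tests for a MapReduce job. Besides mocking with Mockito, I think MRUnit does a better job of verifying MapReduce tasks. MRUnit provides a driver for each of map and reduce, namely MapDriver and ReduceDriver. In a test case we only need to give the driver its input and its expected output and then call the driver's runTest() method to check whether the task behaves as expected, where the expectation is stated in terms of the output. Taking WordCounter as an example, the unit tests look like this: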

public class WordCounterTest {
    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;
    private ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver;

    @Before
    public void setUp() {
        WordCounter.Map tokenizerMapper = new WordCounter.Map();
        WordCounter.Reduce reducer = new WordCounter.Reduce();
        mapDriver = MapDriver.newMapDriver(tokenizerMapper);
        reduceDriver = ReduceDriver.newReduceDriver(reducer);
    }

    @Test
    public void should_execute_tokenizer_map_job() throws IOException {
        mapDriver.withInput(new LongWritable(12), new Text("I am Bruce Bruce"));
        mapDriver.withOutput(new Text("I"), new IntWritable(1));
        mapDriver.withOutput(new Text("am"), new IntWritable(1));
        mapDriver.withOutput(new Text("Bruce"), new IntWritable(1));
        mapDriver.withOutput(new Text("Bruce"), new IntWritable(1));
        mapDriver.runTest();
    }

    @Test
    public void should_execute_reduce_job() throws IOException {
        List<IntWritable> values = new ArrayList<IntWritable>();
        values.add(new IntWritable(1));
        values.add(new IntWritable(3));

        reduceDriver.withInput(new Text("Bruce"), values);
        reduceDriver.withOutput(new Text("Bruce"), new IntWritable(4));
        reduceDriver.runTest();
    }
}

Chaining Job

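With the ChainMapper and ChainReducer that Hadoop provides, it is fairly easy to chain several map or reduce steps together. For example, we can split WordCounter into two map tasks, Tokenizer and Upper Case, and run the reduce at the end. Unfortunately, ChainMapper and ChainReducer do not seem to support the new API: the map and reduce classes being chained must derive from MapReduceBase and implement the corresponding Mapper or Reducer interface. (Note: the code below comes almost entirely from a Stack Overflow post.)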

public class ChainWordCounter extends Configured implements Tool {
    public static class Tokenizer extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    public static class UpperCaser extends MapReduceBase implements Mapper<Text, IntWritable, Text, IntWritable> {
        public void map(Text key, IntWritable count, OutputCollector<Text, IntWritable> collector, Reporter reporter) throws IOException {
            collector.collect(new Text(key.toString().toUpperCase()), count);
        }
    }

    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> collector, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }

            result.set(sum);
            collector.collect(key, result);
        }
    }

    public int run(String[] args) throws Exception {
        JobConf jobConf = new JobConf(getConf(), ChainWordCounter.class);
        FileInputFormat.setInputPaths(jobConf, new Path(args[0]));
        Path outputDir = new Path(args[1]);
        FileOutputFormat.setOutputPath(jobConf, outputDir);
        outputDir.getFileSystem(getConf()).delete(outputDir, true);

        JobConf tokenizerMapConf = new JobConf(false);
        ChainMapper.addMapper(jobConf, Tokenizer.class, LongWritable.class, Text.class, Text.class, IntWritable.class, true, tokenizerMapConf);

        JobConf upperCaserMapConf = new JobConf(false);
        ChainMapper.addMapper(jobConf, UpperCaser.class, Text.class, IntWritable.class, Text.class, IntWritable.class, true, upperCaserMapConf);

        JobConf reduceConf = new JobConf(false);
        ChainReducer.setReducer(jobConf, Reduce.class, Text.class, IntWritable.class, Text.class, IntWritable.class, true, reduceConf);

        JobClient.runJob(jobConf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int ret = ToolRunner.run(new Configuration(), new ChainWordCounter(), args);
        System.exit(ret);
    }
}
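To see what the chain computes, here is a minimal plain-Java sketch of the same data flow with no Hadoop involved. ChainSketch and its methods are hypothetical names, and the three stages are ordinary method calls rather than tasks, so this only illustrates how the output of one stage becomes the input of the next.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

class ChainSketch {
    // Stage 1 (Tokenizer): split a line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> tokenize(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(new SimpleEntry<String, Integer>(tokenizer.nextToken(), 1));
        }
        return pairs;
    }

    // Stage 2 (UpperCaser): upper-case every key, keep the count unchanged.
    static List<Map.Entry<String, Integer>> upperCase(List<Map.Entry<String, Integer>> pairs) {
        List<Map.Entry<String, Integer>> result = new ArrayList<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            result.add(new SimpleEntry<String, Integer>(pair.getKey().toUpperCase(), pair.getValue()));
        }
        return result;
    }

    // Stage 3 (Reduce): sum the counts per key; TreeMap keeps keys sorted.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    // The chain: each stage consumes the output of the previous one.
    static Map<String, Integer> run(String line) {
        return reduce(upperCase(tokenize(line)));
    }

    public static void main(String[] args) {
        System.out.println(run("I am Bruce bruce")); // prints "{AM=1, BRUCE=2, I=1}"
    }
}
```

Because the upper-casing step runs before the reduce, "Bruce" and "bruce" end up under the same key, which is the point of placing it in the map chain.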

I do not know when ChainMapper and ChainReducer will properly support the new API.
