Why the Iterable passed to Hadoop's reduce function can only be traversed once

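Someone asked me this question a while ago: why is it that after I iterate over the Iterable once in the reduce phase, the data is gone when I iterate over it again? Some people might answer without thinking: an Iterable can only be traversed once, in one direction, and that is all there is to it. But is that really the case?

Let the code speak: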

package com.test;

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class T {
    public static void main(String[] args) {
        // Any object that implements Iterable can be used in a for-each loop.
        // The only method Iterable requires you to implement is iterator().
        // Iterable lives in java.lang and is extended by Collection.
        /*public interface Iterable<T> {
            Iterator<T> iterator();
        }*/
        Iterable<String> iter = new Iterable<String>() {
            public Iterator<String> iterator() {
                // A fresh list, and therefore a fresh iterator, on every call.
                List<String> l = new ArrayList<String>();
                l.add("aa");
                l.add("bb");
                l.add("cc");
                return l.iterator();
            }
        };
        // Traverse the same Iterable twice.
        for (int count : new int[] {1, 2}) {
            for (String item : iter) {
                System.out.println(item);
            }
            System.out.println("---------->> " + count + " END.");
        }
    }
}

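As expected, the values of the Iterable are printed in full twice. So an ordinary Iterable can be traversed more than once. What, then, makes the Iterable in the reduce phase traversable only once?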

Let's first look at a test job.

Test data:

a 3
a 4
b 50
b 60
a 70
b 8
a 9

Test code:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class TestIterable {

    // Splits each "key value" input line and emits (key, value).
    public static class M1 extends Mapper<Object, Text, Text, Text> {
        private Text oKey = new Text();
        private Text oVal = new Text();
        String[] lineArr;

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            lineArr = value.toString().split(" ");
            oKey.set(lineArr[0]);
            oVal.set(lineArr[1]);
            context.write(oKey, oVal);
        }
    }

    public static class R1 extends Reducer<Text, Text, Text, Text> {
        List<String> valList = new ArrayList<String>();
        List<Text> textList = new ArrayList<Text>();
        String strAdd;

        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException,
                InterruptedException {
            valList.clear();
            textList.clear();
            strAdd = "";
            for (Text val : values) {
                valList.add(val.toString()); // stores a copy of the contents
                textList.add(val);           // stores the reference itself
            }

            // Pitfall 1: why is every printed value the last one?
            for (Text text : textList) {
                strAdd += text.toString() + ", ";
            }
            System.out.println(key.toString() + "\t" + strAdd);
            System.out.println(".......................");

            // What if I use the copied strings instead? Is that right?
            strAdd = "";
            for (String val : valList) {
                strAdd += val + ", ";
            }
            System.out.println(key.toString() + "\t" + strAdd);
            System.out.println("----------------------");

            // Pitfall 2: why does the second traversal yield nothing?
            valList.clear();
            strAdd = "";
            for (Text val : values) {
                valList.add(val.toString());
            }
            for (String val : valList) {
                strAdd += val + ", ";
            }
            System.out.println(key.toString() + "\t" + strAdd);
            System.out.println(">>>>>>>>>>>>>>>>>>>>>>");
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapred.job.queue.name", "regular");
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: TestIterable <in> <out>");
            System.exit(2);
        }
        System.out.println("------------------------");
        Job job = new Job(conf, "TestIterable");
        job.setJarByClass(TestIterable.class);
        job.setMapperClass(M1.class);
        job.setReducerClass(R1.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Input and output paths; any existing output directory is deleted first.
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileSystem.get(conf).delete(new Path(otherArgs[1]), true);
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The output in the Eclipse console is:

a   9, 9, 9, 9,
.......................
a   3, 4, 70, 9,
----------------------
a
>>>>>>>>>>>>>>>>>>>>>>
b   8, 8, 8,
.......................
b   50, 60, 8,
----------------------
b
>>>>>>>>>>>>>>>>>>>>>>

Pitfall 1: object reuse

The javadoc of the reduce method already describes this problem:

The framework calls this method for each <key, (list of values)> pair in the grouped inputs. Output values must be of the same type as input values. Input keys must not be altered. The framework will reuse the key and value objects that are passed into the reduce, therefore the application should clone the objects they want to keep a copy of.

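In other words, although the reduce method is called many times, and the values loop runs many times within each call, there are only two objects involved, one for the key and one for the value, and the framework keeps reusing them. So if you want to keep a key or a value, you have to either extract its contents or clone it into a new object (for example Text store = new Text(value) or String a = value.toString()); you cannot simply store the reference. The reference points to the same object from start to finish, so if you store it directly, every stored entry ends up pointing at the last input record, and the final result is wrong. That is exactly why textList printed "9, 9, 9, 9," above while valList printed the correct values.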
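The effect can be reproduced without Hadoop. The following is only a minimal sketch: Holder is a hypothetical stand-in for a mutable, reused object such as Text, and ReuseDemo, keepReferences and keepCopies are names made up for this illustration, not part of any Hadoop API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReuseDemo {
    // Hypothetical stand-in for a mutable value object such as Text.
    public static final class Holder {
        private String value = "";
        public void set(String v) { value = v; }
        @Override public String toString() { return value; }
    }

    // Stores the reused reference: every list slot is the same object.
    public static List<String> keepReferences(List<String> input) {
        Holder reused = new Holder();
        List<Holder> kept = new ArrayList<>();
        for (String s : input) {
            reused.set(s);     // overwrite the single instance in place
            kept.add(reused);  // same reference added again
        }
        List<String> out = new ArrayList<>();
        for (Holder h : kept) {
            out.add(h.toString());
        }
        return out;
    }

    // Stores a copy of the contents at each step instead.
    public static List<String> keepCopies(List<String> input) {
        Holder reused = new Holder();
        List<String> kept = new ArrayList<>();
        for (String s : input) {
            reused.set(s);
            kept.add(reused.toString()); // snapshot of the current contents
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("3", "4", "70", "9");
        System.out.println(keepReferences(input)); // [9, 9, 9, 9]
        System.out.println(keepCopies(input));     // [3, 4, 70, 9]
    }
}
```

In the reducer itself, the equivalent of keepCopies is textList.add(new Text(val)) or valList.add(val.toString()), as mentioned above.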

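At this point it may suddenly click: this is close to a question interviewers love to ask fresh graduates. String is immutable, so why can strings be concatenated? And why is StringBuilder recommended over String for concatenation? If you are not sure how to answer, I suggest reading "A deeper look at the differences between String, StringBuffer and StringBuilder": http://my.oschina.net/leejun2005/blog/102377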

Pitfall 2: the second traversal is empty. See http://stackoverflow.com/questions/6111248/iterate-twice-on-values

The Iterator you receive from that Iterable's iterator() method is special. The values may not all be in memory; Hadoop may be streaming them from disk. They aren't really backed by a Collection, so it's nontrivial to allow multiple iterations.
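If the values really do need to be traversed twice, one common workaround is to copy them into a list during the first pass and then iterate over the list. The sketch below imitates the situation in plain Java. SinglePass is a hypothetical class that always hands back the same underlying iterator; it only loosely mimics a stream of values and is not how Hadoop is actually implemented. drain is likewise a helper invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class SinglePassDemo {
    // Hypothetical Iterable that returns the same iterator on every call,
    // so the data can be consumed only once.
    public static final class SinglePass<T> implements Iterable<T> {
        private final Iterator<T> source;
        public SinglePass(Iterator<T> source) { this.source = source; }
        @Override public Iterator<T> iterator() { return source; }
    }

    // Copies whatever is still available from the Iterable into a list.
    public static <T> List<T> drain(Iterable<T> values) {
        List<T> out = new ArrayList<>();
        for (T v : values) {
            out.add(v);
        }
        return out;
    }

    public static void main(String[] args) {
        Iterable<String> values =
                new SinglePass<>(Arrays.asList("50", "60", "8").iterator());
        List<String> cached = drain(values);  // first pass consumes the stream
        System.out.println(cached);           // [50, 60, 8]
        System.out.println(drain(values));    // [] because nothing is left
        System.out.println(drain(cached));    // [50, 60, 8], the list is re-iterable
    }
}
```

Two caveats apply in a real reducer. Because of pitfall 1, the cached elements must be copies (new Text(val) or val.toString()), not the reused reference. Caching also holds every value for a key in memory, so it is only reasonable when the value list for a single key is known to be small.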
