Job Application Notes - Big Data Development (1): MapReduce Programming

MapReduce

This article follows the example code from the book MapReduce Design Patterns.

Chapter 1: Design Patterns

Reader

Converts the input data into key-value pairs; typically the key is the position of the record within the stored data block and the value is the record data itself.
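A minimal sketch of what that means for the mapper, assuming Hadoop's default TextInputFormat (the class name OffsetLineMapper is hypothetical): the reader hands each map() call the byte offset of a line as the key and the line itself as the value.

// imports omitted, as in the full job below
public static class OffsetLineMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // offset: position of this line within the input file (the key from the reader)
        // line:   the record itself (the value from the reader); the real work starts here
    }
}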

Map

A user-defined function.
key - what the data will be grouped on.
value - the information pertinent to the analysis in the reducer.

Combiner

A local, map-side reducer, used to keep the map task's intermediate data from ballooning. Example: in WordCount, instead of sending ("hello", 1) three times, the combiner sends ("hello", 3) once.
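As a sketch, such a combiner for WordCount could look like the following (the class name WordCountCombiner is hypothetical; the full job below simply reuses its reducer as the combiner, which is valid because summing is associative and commutative):

// imports omitted, as in the full job below
public static class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable partialSum = new IntWritable();

    @Override
    public void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();          // e.g. three ("hello", 1) pairs become one ("hello", 3)
        }
        partialSum.set(sum);
        context.write(word, partialSum); // only the partial count leaves this map task
    }
}
// registered on the job with: job.setCombinerClass(WordCountCombiner.class);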

Partitioner

Takes the key-value pairs emitted after the combiner and assigns each key to a reduce node by a modulo over the key's hash, key.hashCode() % (number of reducers), so the load is spread evenly across the reducers. As an optimization, custom partitioning or sorting logic can be plugged in.
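Written out as a custom partitioner, the default behaviour looks roughly like this (the class name WordPartitioner is hypothetical; it mirrors Hadoop's built-in HashPartitioner, with the sign bit masked so the modulo never goes negative):

// imports omitted, as in the full job below
public static class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // same key -> same reducer; different keys spread evenly across reducers
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
// registered on the job with: job.setPartitionerClass(WordPartitioner.class);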

Everything above runs on the map nodes.


Everything from here on runs on the reduce nodes.

Shuffle and Sort

The framework pulls the key-value pairs down from the map nodes and then sorts them by key. This phase runs automatically; the only piece the user can customize is the comparator object used by the sort.
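For example, the sort order can be flipped by registering a comparator on the job (a sketch; the class name ReverseTextComparator is hypothetical):

// imports omitted, as in the full job below
public static class ReverseTextComparator extends WritableComparator {
    protected ReverseTextComparator() {
        super(Text.class, true); // 'true' lets the parent instantiate keys for compare()
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        return -((Text) a).compareTo((Text) b); // negate to sort keys in descending order
    }
}
// registered on the job with: job.setSortComparatorClass(ReverseTextComparator.class);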

Reduce

The core user-defined function. Depending on the task, the common designs roughly fall into:
-Summarization patterns
-Filter patterns
-Data organization patterns
-Join patterns
-Metapatterns

Output

Extracts the final values and writes them, with their keys, out as the job's output.


WordCount: basic MapReduce implementation

// all imports omitted
public class CommentWordCount {

    public static class SOWordCountMapper extends
            Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {

            // Parse the input string into a nice map
            Map<String, String> parsed = MRDPUtils.transformXmlToMap(value
                    .toString());

            // Grab the "Text" field, since that is what we are counting over
            String txt = parsed.get("Text");

            // .get will return null if the key is not there
            if (txt == null) {
                // skip this record
                return;
            }

            // Unescape the HTML because the SO data is escaped.
            txt = StringEscapeUtils.unescapeHtml(txt.toLowerCase());

            // Remove some annoying punctuation
            txt = txt.replaceAll("'", ""); // remove single quotes (e.g., can't)
            txt = txt.replaceAll("[^a-zA-Z]", " "); // replace the rest with a
                                                    // space

            // Tokenize the string, then send the tokens away
            StringTokenizer itr = new StringTokenizer(txt);
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer extends
            Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }

            result.set(sum);
            context.write(key, result);

        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args)
                .getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: CommentWordCount <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "StackOverflow Comment Word Count");
        job.setJarByClass(CommentWordCount.class);
        job.setMapperClass(SOWordCountMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The generic parameters on the parent Mapper class, <Object, Text, Text, IntWritable>, are the input key, input value, output key, and output value types, respectively.
Likewise, the Reducer's generic parameters, <Text, IntWritable, Text, IntWritable>, are its input key, input value, output key, and output value types.


WordCount: Spark implementation

package org.apache.spark.examples;

import scala.Tuple2;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public final class JavaWordCount {
  private static final Pattern SPACE = Pattern.compile(" ");

  public static void main(String[] args) throws Exception {

    if (args.length < 1) {
      System.err.println("Usage: JavaWordCount <file>");
      System.exit(1);
    }

    SparkConf sparkConf = new SparkConf().setAppName("JavaWordCount");
    JavaSparkContext ctx = new JavaSparkContext(sparkConf);
    JavaRDD<String> lines = ctx.textFile(args[0], 1);

    JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
      @Override
      public Iterable<String> call(String s) {
        return Arrays.asList(SPACE.split(s));
      }
    });

    JavaPairRDD<String, Integer> ones = words.mapToPair(new PairFunction<String, String, Integer>() {
      @Override
      public Tuple2<String, Integer> call(String s) {
        return new Tuple2<String, Integer>(s, 1);
      }
    });

    JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
      @Override
      public Integer call(Integer i1, Integer i2) {
        return i1 + i2;
      }
    });

    List<Tuple2<String, Integer>> output = counts.collect();
    for (Tuple2<?,?> tuple : output) {
      System.out.println(tuple._1() + ": " + tuple._2());
    }
    ctx.stop();
  }
}

WordCount: SparkCL implementation

    JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
      @Override
      public Iterable<String> call(String s) {
        return Arrays.asList(SPACE.split(s));
      }
    });

    JavaPairRDD<String, Integer> ones = words.mapToPair(new PairFunction<String, String, Integer>() {
      @Override
      public Tuple2<String, Integer> call(String s) {
        return new Tuple2<String, Integer>(s, 1);
      }
    });

The part above is identical to the plain Spark version, but the reduce step is changed to run through a kernel.

JavaRDD<Tuple2<String, Integer>> countTuples = SparkUtil.genSparkCL(ones.groupByKey()).mapCL(kernel);

Below is the instantiation of the kernel.

SparkKernel<Tuple2<String, Iterable<Integer>>, Tuple2<String, Integer>> kernel = new SparkKernel<Tuple2<String, Iterable<Integer>>, Tuple2<String, Integer>>() 
    {
        // data
        int []dataArray;
        int []sumArray;

        // minimum amount of data before using accelerator
        // note this should be significantly large, but kept small for the purpose of the demo
        final int MinDataSizeForAcceleration = 10;

        @Override
        public void mapParameters(Tuple2<String, Iterable<Integer>> data)
        {
            dataArray = SparkUtil.intArrayFromIterator(data._2.iterator());
            // decide whether to execute the kernel or not
            if(dataArray.length<MinDataSizeForAcceleration)
            {
                setShouldExecute(false);
                return;
            }
            else
                setShouldExecute(true);
            //////////////////////////////////////////////
            // !!! temp hack -> handle a case where size is not divisible by two. Needs more work...
            //////////////////////////////////////////////
            if(dataArray.length%2!=0)
            {
                int []tempArray = dataArray.clone();
                dataArray = new int[dataArray.length+1];
                for(int i=0;i<tempArray.length; i++)
                    dataArray[i] = tempArray[i];
            }
            int dataLength = dataArray.length;
            sumArray = new int[dataLength];
            setRange(Range.create(dataLength));
            buffer_$local$ = new int[getRange().getLocalSize(0)];
        }

        //@Local symbol does not seem to be working yet in aparapi 
        // we use $local$ convention instead
        // define local memory type to improve performance. For more info on local memory ->
        // https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/local.html
        int[] buffer_$local$;

        @Override
        public void run() 
        {
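            // Parallel reduction: each in-range work-item adds one pair of adjacent input
            // elements into local memory; after the barrier, work-item 0 of every work-group
            // folds the local buffer into a single partial sum in sumArray[localGroupIndex].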

            int gid = getGlobalId();
            int lid = getLocalId();
            int localSize = getLocalSize();
            int localGroupIndex = gid / localSize;

            final int upperGlobalIndexBound = getGlobalSize() - 1; 
            final int maxValidLocalIndex=localSize>>1;

            int baseGlobalIndex = 2 * localSize * localGroupIndex + lid;

            if(baseGlobalIndex<upperGlobalIndexBound)
                buffer_$local$[lid] = dataArray[baseGlobalIndex] + dataArray[baseGlobalIndex + 1];

            localBarrier();

            if(lid==0)
            {
              for(int i=0;i<maxValidLocalIndex;i++)
                sumArray[localGroupIndex] += buffer_$local$[i];
            }
        }

        @Override // this is effectively the reduce step
        public Tuple2<String, Integer> mapReturnValue(Tuple2<String, Iterable<Integer>> data) 
        {
            int sum = 0;
            // if kernel was executed 
            if(shouldExecute())
            {
              for(int i=0;i<dataArray.length/getRange().getLocalSize(0);i++)
                sum += sumArray[i];
            }
            // kernel was not executed, not enough data, so perform a CPU simple aggregation
            else
            {
                Iterator<Integer> itr = data._2.iterator();
                while(itr.hasNext())
                   sum+=itr.next();
            }

            return  new Tuple2<String, Integer>(data._1,sum);
        }
    };