MapReduce
This note follows the [example code] in MapReduce Design Patterns.
Chapter 1: Design Patterns
Reader
Converts the input data into key-value pairs. Typically the key is the position (byte offset) at which the record is stored and the value is the record itself.
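A minimal, self-contained illustration of this stage, assuming the default line-oriented text input; the class name ReaderIllustration and the sample file contents are made up for this sketch:
import java.util.LinkedHashMap;
import java.util.Map;

public class ReaderIllustration {
    public static void main(String[] args) {
        // Two lines of sample input; the default reader would hand the mapper
        // one (byte offset, line) pair per line.
        String file = "hello world\nhello mapreduce\n";
        Map<Long, String> pairs = new LinkedHashMap<>();
        long offset = 0;
        for (String line : file.split("\n")) {
            pairs.put(offset, line);        // key = offset of the line, value = the line
            offset += line.length() + 1;    // + 1 for the newline character
        }
        System.out.println(pairs);          // prints {0=hello world, 12=hello mapreduce}
    }
}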
Map
A user-defined function.
The key is what the data will be grouped on;
the value is the information pertinent to the analysis in the reducer (see SOWordCountMapper in the word count listing below).
Combiner
A local reducer that runs on the map node to shrink the map-side output before it spills and is shuffled. Example: in word count, instead of sending ("hello", 1) three times, the combiner sends ("hello", 3) once.
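A minimal combiner sketch, assuming the word count types used below; the class name WordCountCombiner is made up here, and the full listing below simply reuses IntSumReducer as its combiner instead:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Runs on the map node; register with job.setCombinerClass(WordCountCombiner.class).
public class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int partialSum = 0;
        for (IntWritable v : values) {
            partialSum += v.get();          // ("hello",1) x 3 collapses to ("hello",3)
        }
        context.write(key, new IntWritable(partialSum));
    }
}
Reusing the reducer as the combiner is only safe because summation is associative and commutative.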
Partitioner
Takes the key-value pairs produced after the combiner and distributes them evenly across the reduce nodes by taking the key's hash modulo the number of reducers: key.hashCode() % (number of reducers). As an optimization, a sorting scheme can be built into the partitioning.
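A minimal partitioner sketch, assuming the word count types below; the class name WordHashPartitioner is made up, and it simply spells out the key.hashCode() % (number of reducers) rule:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Register with job.setPartitionerClass(WordHashPartitioner.class).
public class WordHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the result is never negative, then take the
        // hash modulo the number of reduce tasks to pick a reducer.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}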
All of the stages above run on the map node.
Everything from here on runs on the reduce node.
Shuffle and Sort
The key-value pairs are pulled and downloaded from the map nodes, then sorted by key.
The only piece the user can customize is the comparator object used for sorting; everything else happens automatically.
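A minimal comparator sketch, assuming Text keys; the class name DescendingTextComparator is made up, and it just flips the default key ordering:
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Register with job.setSortComparatorClass(DescendingTextComparator.class).
public class DescendingTextComparator extends WritableComparator {
    protected DescendingTextComparator() {
        super(Text.class, true);            // true -> instantiate keys for compare()
    }

    @SuppressWarnings("rawtypes")
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return -super.compare(a, b);        // negate so keys arrive in descending order
    }
}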
Reduce
The user-defined core function. Depending on the task, its design patterns roughly fall into the following categories (a filtering example is sketched after the list):
-Summarization patterns
-Filtering patterns
-Data organization patterns
-Join patterns
-Metapatterns
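As referenced above, a minimal sketch of the filtering pattern, assuming a map-only job (reducer count set to 0 with job.setNumReduceTasks(0)); the class name GrepMapper and the configuration key grep.pattern are made up for this sketch:
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class GrepMapper extends Mapper<Object, Text, NullWritable, Text> {
    private String pattern;                 // regular expression read from the job config

    @Override
    protected void setup(Context context) {
        pattern = context.getConfiguration().get("grep.pattern", ".*");
    }

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.toString().matches(pattern)) {
            context.write(NullWritable.get(), value);   // pass matching records through unchanged
        }
    }
}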
Output
Writes the final key-value pairs to the output (for word count, each output line is a word and its count).
Word count: basic MapReduce implementation
// all imports omitted
public class CommentWordCount {
public static class SOWordCountMapper extends
Mapper<Object, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
// Parse the input string into a nice map
Map<String, String> parsed = MRDPUtils.transformXmlToMap(value
.toString());
// Grab the "Text" field, since that is what we are counting over
String txt = parsed.get("Text");
// .get will return null if the key is not there
if (txt == null) {
// skip this record
return;
}
// Unescape the HTML because the SO data is escaped.
txt = StringEscapeUtils.unescapeHtml(txt.toLowerCase());
// Remove some annoying punctuation
txt = txt.replaceAll("'", ""); // remove single quotes (e.g., can't)
txt = txt.replaceAll("[^a-zA-Z]", " "); // replace the rest with a
// space
// Tokenize the string, then send the tokens away
StringTokenizer itr = new StringTokenizer(txt);
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
public static class IntSumReducer extends
Reducer<Text, IntWritable, Text, IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args)
.getRemainingArgs();
if (otherArgs.length != 2) {
System.err.println("Usage: CommentWordCount <in> <out>");
System.exit(2);
}
Job job = new Job(conf, "StackOverflow Comment Word Count");
job.setJarByClass(CommentWordCount.class);
job.setMapperClass(SOWordCountMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
The generic parameters of the parent Mapper class, (Object, Text, Text, IntWritable) here, correspond to input-key, input-value, output-key, and output-value respectively.
Likewise, the Reducer's generic parameters, (Text, IntWritable, Text, IntWritable), also correspond to input-key, input-value, output-key, and output-value.
Word count: Spark implementation
package org.apache.spark.examples;
import scala.Tuple2;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
public final class JavaWordCount {
private static final Pattern SPACE = Pattern.compile(" ");
public static void main(String[] args) throws Exception {
if (args.length < 1) {
System.err.println("Usage: JavaWordCount <file>");
System.exit(1);
}
SparkConf sparkConf = new SparkConf().setAppName("JavaWordCount");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
JavaRDD<String> lines = ctx.textFile(args[0], 1);
JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
@Override
public Iterable<String> call(String s) {
return Arrays.asList(SPACE.split(s));
}
});
JavaPairRDD<String, Integer> ones = words.mapToPair(new PairFunction<String, String, Integer>() {
@Override
public Tuple2<String, Integer> call(String s) {
return new Tuple2<String, Integer>(s, 1);
}
});
JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
@Override
public Integer call(Integer i1, Integer i2) {
return i1 + i2;
}
});
List<Tuple2<String, Integer>> output = counts.collect();
for (Tuple2<?,?> tuple : output) {
System.out.println(tuple._1() + ": " + tuple._2());
}
ctx.stop();
}
}
Word count: SparkCL implementation
JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
@Override
public Iterable<String> call(String s) {
return Arrays.asList(SPACE.split(s));
}
});
JavaPairRDD<String, Integer> ones = words.mapToPair(new PairFunction<String, String, Integer>() {
@Override
public Tuple2<String, Integer> call(String s) {
return new Tuple2<String, Integer>(s, 1);
}
});
The part above is identical to the Spark version, but the reduce step is changed to run through a kernel:
JavaRDD<Tuple2<String, Integer>> countTuples = SparkUtil.genSparkCL(ones.groupByKey()).mapCL(kernel);
Below is the instantiation of the kernel:
SparkKernel<Tuple2<String, Iterable<Integer>>, Tuple2<String, Integer>> kernel = new SparkKernel<Tuple2<String, Iterable<Integer>>, Tuple2<String, Integer>>()
{
// data
int []dataArray;
int []sumArray;
// minimum amount of data before using accelerator
// note this should be significantly large, but kept small for the purpose of the demo
final int MinDataSizeForAcceleration = 10;
@Override
public void mapParameters(Tuple2<String, Iterable<Integer>> data)
{
dataArray = SparkUtil.intArrayFromIterator(data._2.iterator());
// decide if to execute the kernel or not
if(dataArray.length<MinDataSizeForAcceleration)
{
setShouldExecute(false);
return;
}
else
setShouldExecute(true);
//////////////////////////////////////////////
// !!! temp hack -> handle a case where size is not divisible by two. Needs more work...
//////////////////////////////////////////////
if(dataArray.length%2!=0)
{
int []tempArray = dataArray.clone();
dataArray = new int[dataArray.length+1];
for(int i=0;i<tempArray.length; i++)
dataArray[i] = tempArray[i];
}
int dataLength = dataArray.length;
sumArray = new int[dataLength];
setRange(Range.create(dataLength));
buffer_$local$ = new int[getRange().getLocalSize(0)];
}
//@Local symbol does not seem to be working yet in aparapi
// we use $local$ convention instead
// define local memory type to improve performance. For more info on local memory ->
// https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/local.html
int[] buffer_$local$;
@Override
public void run()
{
int gid = getGlobalId();
int lid = getLocalId();
int localSize = getLocalSize();
int localGroupIndex = gid / localSize;
final int upperGlobalIndexBound = getGlobalSize() - 1;
final int maxValidLocalIndex=localSize>>1;
int baseGlobalIndex = 2 * localSize * localGroupIndex + lid;
if(baseGlobalIndex<upperGlobalIndexBound)
buffer_$local$[lid] = dataArray[baseGlobalIndex] + dataArray[baseGlobalIndex + 1];
localBarrier();
if(lid==0)
{
for(int i=0;i<maxValidLocalIndex;i++)
sumArray[localGroupIndex] += buffer_$local$[i];
}
}
@Override//actually this is reduce
public Tuple2<String, Integer> mapReturnValue(Tuple2<String, Iterable<Integer>> data)
{
int sum = 0;
// if kernel was executed
if(shouldExecute())
{
for(int i=0;i<dataArray.length/getRange().getLocalSize(0);i++)
sum += sumArray[i];
}
// kernel was not executed, not enough data, so perform a CPU simple aggregation
else
{
Iterator<Integer> itr = data._2.iterator();
while(itr.hasNext())
sum+=itr.next();
}
return new Tuple2<String, Integer>(data._1,sum);
}
};