1、Common elements
select fields: carried as the Map output value. Reduce does no processing; it iterates over the group and emits every element.
2、order by — global sort
- order by: use the sort field as the Map output key; MapReduce partitions and sorts by key automatically.
- Global: a single Reduce task (the default is 1 Reduce).
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String line = value.toString();
    String[] words = line.split("\001");
    // 1) extract the order-by field as the key
    // 2) select *: the whole line (all fields) as the value
    context.write(new Text(words[0]), new Text(line));
}
protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    // emit every element in the group
    for (Text line : values) {
        context.write(key, line);
    }
}
Output:
"P172E10-SN-01""20150130172902677003014654832192XASJ1"
"P172E10-SN-01""20150130203618099002901408624547XASJ1"
"P172E10-SN-01""20150130203726033002901408624247XASJ1"
"P172E10-SN-01""20150130203851342002901408624846XASJ1"
"P172E10-SN-01""20150130204711874002901408624669XASJ1"
"P839N31""20150131152252599003014654832409XASJ1"
"P839N31""20150131173954764003014654832951XASJ1"
"P839N31""20150131200923817000040229744422XASJ1"
"P839N31""20150131204500648000040229744318XASJ1"
"P172E10-SN-01""20150131205316663000040229744636XASJ1"
"P839N31""20150131205316663000040229744636XASJ1"
"MOBILENAME""TESTRECORDID"
"MOBILENAME""TESTRECORDID"
3、distribute by + order by — per-partition (local) sort
- distribute by: a custom Partitioner class that partitions by the distribute key.
public static class UdfPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text val, int numPartitions) {
        int partition = 0;
        String mname = val.toString().split("\001")[0];
        if (mname.equals("\"P172E10-SN-01\"")) partition = 1;
        if (mname.equals("\"P839N31\"")) partition = 2;
        return partition;
    }
}
- The job specifies the partitioner class:
job.setPartitionerClass(UdfPartitioner.class);
- The job specifies the reducer count, which must cover the number of distinct distribute-key values.
job.setNumReduceTasks(3); // more reducers than distribute-key values is fine; fewer is not — the partitioner would return an index with no receiving reducer, causing an error
- Open question: how can the job learn the partition count on its own? With two MR jobs?
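One possible answer to the question above: a preliminary pass (in practice a first MR job, or a quick scan) collects the distinct distribute keys and assigns each an index; the map's size becomes the value passed to job.setNumReduceTasks(), and the key→index map drives the partitioner. A minimal plain-Java sketch of the idea (hypothetical data, no Hadoop dependency):

```java
import java.util.*;

public class PartitionIndex {
    // Build key -> partition index from the distinct distribute-key values.
    // In a real pipeline this map would come from a first job's output,
    // and index.size() would be passed to job.setNumReduceTasks().
    public static Map<String, Integer> buildIndex(List<String> lines) {
        TreeSet<String> keys = new TreeSet<>();
        for (String line : lines) {
            keys.add(line.split("\001")[0]);   // distribute key = first field
        }
        Map<String, Integer> index = new LinkedHashMap<>();
        int i = 0;
        for (String k : keys) index.put(k, i++);
        return index;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "\"P839N31\"\001rec1", "\"P172E10-SN-01\"\001rec2", "\"P839N31\"\001rec3");
        Map<String, Integer> idx = buildIndex(lines);
        System.out.println(idx.size() + " partitions: " + idx);
    }
}
```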
C:\Windows\System32>hdfs dfs -cat /outputOrder095143/part-r-00000
"TESTRECORDID" "MOBILENAME""TESTRECORDID"
"TESTRECORDID" "MOBILENAME""TESTRECORDID"
C:\Windows\System32>hdfs dfs -cat /outputOrder095143/part-r-00001
"20150130172902677003014654832192XASJ1" "P172E10-SN-01""20150130172902677003014654832192XASJ1"
"20150130203618099002901408624547XASJ1" "P172E10-SN-01""20150130203618099002901408624547XASJ1"
"20150130203726033002901408624247XASJ1" "P172E10-SN-01""20150130203726033002901408624247XASJ1"
"20150130203851342002901408624846XASJ1" "P172E10-SN-01""20150130203851342002901408624846XASJ1"
"20150130204711874002901408624669XASJ1" "P172E10-SN-01""20150130204711874002901408624669XASJ1"
"20150131205316663000040229744636XASJ1" "P172E10-SN-01""20150131205316663000040229744636XASJ1"
4、distinct
- Use the distinct key as the map key; the value can be an empty NullWritable.
public static class WCLocalMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] words = line.split("\001");
        // distinct key as the map key; NullWritable as the value
        context.write(new Text(words[0]), NullWritable.get());
    }
}
- In reduce, simply emit the key; the value can again be NullWritable.
public static class WCLocalReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        // emit the key directly
        context.write(key, NullWritable.get());
    }
}
- The Mapper/Reducer generic input/output types must match the output classes configured on the Job:
job.setMapOutputValueClass(NullWritable.class);
job.setOutputValueClass(NullWritable.class);
5、count(distinct)
Method 1: give every record the same key, so a single reduce sees them all.
Drawback: no parallelism, low efficiency.
public static class WCLocalMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] words = line.split(",");
        // every map output shares the same key
        context.write(new Text("key"), new Text(words[6]));
    }
}
public static class WCLocalReducer extends Reducer<Text, Text, NullWritable, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // deduplicate with a HashSet
        HashSet<String> hs = new HashSet<String>();
        for (Text line : values) {
            hs.add(line.toString());
        }
        // emit the number of elements in the HashSet
        context.write(NullWritable.get(), new IntWritable(hs.size()));
    }
}
Method 2: two jobs — first output the distinct values, then count them.
- distinct: use the distinct key as the map key; the value can be an empty NullWritable.
- count: use the first job's output directory as the second job's input directory; a single reduce counts the values arriving under its key.
- Chaining the two jobs:
if (job.waitForCompletion(true)) {
    System.exit(job2.waitForCompletion(true) ? 0 : 1);
}
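The two-job pipeline can be sketched in plain Java (no Hadoop; hypothetical "\001"-delimited field layout): phase 1 mimics the distinct job — the distinct key collapses duplicates via the shuffle sort; phase 2 mimics the count job — one reducer counts the phase-1 records.

```java
import java.util.*;

public class TwoPhaseCountDistinct {
    // Phase 1: emulate the distinct job — each key survives once,
    // the way reduce collapses duplicate map-output keys.
    public static Set<String> distinctPhase(List<String> lines) {
        Set<String> out = new TreeSet<>();            // shuffle sort + dedup
        for (String line : lines) out.add(line.split("\001")[0]);
        return out;
    }

    // Phase 2: emulate the count job — one reducer counts phase-1's output records.
    public static int countPhase(Set<String> distinctKeys) {
        return distinctKeys.size();
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a\001x", "b\001y", "a\001z");
        System.out.println(countPhase(distinctPhase(lines)));   // 2
    }
}
```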
6、reducer-join
- job: add the different input sources with FileInputFormat.addInputPath()
FileInputFormat.addInputPath(job, new Path("hdfs://localhost:9000/user/hive/warehouse/testdb.db/testtab_small/*"));
FileInputFormat.addInputPath(job, new Path("hdfs://localhost:9000/user/hive/warehouse/testdb.db/testtab_small2/*"));
- mapper: branch on the file path; emit the shared join key (the column position may differ per table) with a source tag prepended to the value, so rows from both tables mix under one key
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String line = value.toString();
    String[] words = line.split("\001");
    /* Rows from the two tables get an A/B prefix (decided by file path) and share the same key. */
    String fpath = ((FileSplit) context.getInputSplit()).getPath().toString();
    // test "testtab_small2" first: "testtab_small" is a substring of it
    if (fpath.indexOf("testtab_small2") > 0) {
        context.write(new Text(words[1]), new Text("B#".concat(words[0])));
    } else if (fpath.indexOf("testtab_small") > 0) {
        context.write(new Text(words[1]), new Text("A#".concat(words[0])));
    }
}
- reducer: split the values into two Lists by source tag, then emit the Cartesian product of the two Lists.
protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    // two lists, one per source table
    ArrayList<String> listA = new ArrayList<String>();
    ArrayList<String> listB = new ArrayList<String>();
    for (Text line : values) {
        if (line.toString().indexOf("A#") > -1) {
            listA.add(line.toString().substring(2));
        }
        if (line.toString().indexOf("B#") > -1) {
            listB.add(line.toString().substring(2));
        }
    }
    // Cartesian product of the two lists
    for (String strA : listA) {
        for (String strB : listB) {
            context.write(key, new Text(strA.concat("\001").concat(strB)));
        }
    }
}
- ps: everything inside one reduce() call shares the same key, so joining the matching A and B rows is exactly a per-key Cartesian product.
7、mapper-join
- When to use: one side of the join is a small file that fits in each node's memory.
- Uses Hadoop's distributed cache: job.addCacheFile(new URI()) pulls the small file from HDFS to the task node, where it is read into memory.
job.addCacheFile(new URI("hdfs://localhost:9000/app/MDic.txt"));
// equivalent to the hadoop command-line option -files, which adds files to the distributed cache to be copied to task nodes
- Uses Mapper.setup() to read the file:
private Map<String, String> mDic = new HashMap<String, String>();

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    // the cached file appears in the task's working directory under its own name
    BufferedReader bReader = new BufferedReader(new InputStreamReader(new FileInputStream("MDic.txt")));
    List<String> list = IOUtils.readLines(bReader);
    for (String string : list) {
        String[] dicWords = string.split("\t");
        mDic.put(dicWords[0], dicWords[1]);
    }
}
// setup() runs before a task's map calls, so it can perform initialization.
- The join is assembled directly in the mapper; no reducer is needed.
context.write(new Text(words[0]+"\t"+mDic.get(words[0])+"\t"+words[1]),NullWritable.get());
8、Subqueries
Two jobs, analogous to count(distinct) method 2: the first job materializes the subquery, the second reads its output directory.
9、count(distinct cola) group by colb
select dealid, count(distinct uid) num from order group by dealid;
When there is only one distinct field (and ignoring map-side hash aggregation), combine the group-by field and the distinct field into the map output key; the MapReduce sort then orders records by that composite, while the group-by field alone serves as the reduce grouping key. In the reduce, remembering only the last key seen (LastKey) is enough to deduplicate.
The principle: MR itself filters the dealid+uid combinations down to unique, sorted values, so emitting dealid:uid in the reducer yields the grouped, deduplicated pairs. That alone does not yet produce the count — the reducer must additionally count, e.g. incrementing per dealid whenever the uid changes.
ps: My first instinct was to use the group-by field as the map output key and collect the distinct values inside the reducer — but that forces the reducer to hold a set in memory, so it is less efficient than the approach above.
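The composite-key + LastKey idea can be sketched in plain Java (hypothetical dealid/uid pairs, no Hadoop): sort (dealid, uid) pairs — which the shuffle does for free — then stream through them, counting a uid only when it differs from the previous one for the same dealid.

```java
import java.util.*;

public class CountDistinctGroupBy {
    // Composite key (dealid, uid) sorted, then one streaming pass:
    // a uid is counted only when it differs from the LastKey seen for that dealid.
    public static Map<String, Integer> countDistinctUid(List<String[]> pairs) {
        // Emulate the shuffle: sort by (dealid, uid).
        List<String[]> sorted = new ArrayList<>(pairs);
        sorted.sort((a, b) -> {
            int c = a[0].compareTo(b[0]);
            return c != 0 ? c : a[1].compareTo(b[1]);
        });

        Map<String, Integer> counts = new LinkedHashMap<>();
        String lastDeal = null, lastUid = null;          // the "LastKey"
        for (String[] p : sorted) {
            boolean newGroup = !p[0].equals(lastDeal);
            if (newGroup || !p[1].equals(lastUid)) {
                counts.merge(p[0], 1, Integer::sum);     // new distinct uid for this dealid
            }
            lastDeal = p[0];
            lastUid = p[1];
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> pairs = Arrays.asList(
            new String[]{"d1", "u1"}, new String[]{"d1", "u1"},
            new String[]{"d1", "u2"}, new String[]{"d2", "u1"});
        System.out.println(countDistinctUid(pairs));     // {d1=2, d2=1}
    }
}
```

Note that no set is held in memory: only the last (dealid, uid) pair, which is what makes the technique cheaper than reducer-side HashSet deduplication.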