1、Common elements
select fields: carried as the Map output value. Reduce does no processing; it iterates over the group and emits every element.
2、order by — global sort
- order by: use the sort field as the Map output key; MapReduce partitions and sorts by key automatically.
- Global: a single Reduce task (the default is 1 Reduce).
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String line = value.toString();
    String[] words = line.split("\001");
    // 1) extract the order-by field as the key
    // 2) select *: the whole line (all fields) as the value
    context.write(new Text(words[0]), new Text(line));
}
protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    // emit every element in the group
    for (Text line : values) {
        context.write(key, line);
    }
}
Output:
"P172E10-SN-01""20150130172902677003014654832192XASJ1"
"P172E10-SN-01""20150130203618099002901408624547XASJ1"
"P172E10-SN-01""20150130203726033002901408624247XASJ1"
"P172E10-SN-01""20150130203851342002901408624846XASJ1"
"P172E10-SN-01""20150130204711874002901408624669XASJ1"
"P839N31""20150131152252599003014654832409XASJ1"
"P839N31""20150131173954764003014654832951XASJ1"
"P839N31""20150131200923817000040229744422XASJ1"
"P839N31""20150131204500648000040229744318XASJ1"
"P172E10-SN-01""20150131205316663000040229744636XASJ1"
"P839N31""20150131205316663000040229744636XASJ1"
"MOBILENAME""TESTRECORDID"
"MOBILENAME""TESTRECORDID"
3、distribute by + order by — per-partition (local) sort
- distribute by: a custom Partitioner class that partitions by the distribute key.
public static class UdfPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text val, int numPartitions) {
        int partition = 0;
        String mname = val.toString().split("\001")[0];
        if (mname.equals("\"P172E10-SN-01\"")) partition = 1;
        if (mname.equals("\"P839N31\"")) partition = 2;
        return partition;
    }
}
- The job specifies the partitioner class:
job.setPartitionerClass(UdfPartitioner.class);
- The job specifies the reducer count, which must cover the number of distinct distribute-key values.
job.setNumReduceTasks(3); // more reducers than distribute-key values is fine; fewer is not — the partitioner would return an index with no receiving reducer, causing an error
- Open question: how can the job learn the partition count on its own? With two MR jobs?
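One possible answer to the question above: a preliminary pass (in practice a first MR job, or a quick scan) collects the distinct distribute keys and assigns each an index; the map's size becomes the value passed to job.setNumReduceTasks(), and the key→index map drives the partitioner. A minimal plain-Java sketch of the idea (hypothetical data, no Hadoop dependency):

```java
import java.util.*;

public class PartitionIndex {
    // Build key -> partition index from the distinct distribute-key values.
    // In a real pipeline this map would come from a first job's output,
    // and index.size() would be passed to job.setNumReduceTasks().
    public static Map<String, Integer> buildIndex(List<String> lines) {
        TreeSet<String> keys = new TreeSet<>();
        for (String line : lines) {
            keys.add(line.split("\001")[0]);   // distribute key = first field
        }
        Map<String, Integer> index = new LinkedHashMap<>();
        int i = 0;
        for (String k : keys) index.put(k, i++);
        return index;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "\"P839N31\"\001rec1", "\"P172E10-SN-01\"\001rec2", "\"P839N31\"\001rec3");
        Map<String, Integer> idx = buildIndex(lines);
        System.out.println(idx.size() + " partitions: " + idx);
    }
}
```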
C:\Windows\System32>hdfs dfs -cat /outputOrder095143/part-r-00000
"TESTRECORDID" "MOBILENAME""TESTRECORDID"
"TESTRECORDID" "MOBILENAME""TESTRECORDID"
C:\Windows\System32>hdfs dfs -cat /outputOrder095143/part-r-00001
"20150130172902677003014654832192XASJ1" "P172E10-SN-01""20150130172902677003014654832192XASJ1"
"20150130203618099002901408624547XASJ1" "P172E10-SN-01""20150130203618099002901408624547XASJ1"
"20150130203726033002901408624247XASJ1" "P172E10-SN-01""20150130203726033002901408624247XASJ1"
"20150130203851342002901408624846XASJ1" "P172E10-SN-01""20150130203851342002901408624846XASJ1"
"20150130204711874002901408624669XASJ1" "P172E10-SN-01""20150130204711874002901408624669XASJ1"
"20150131205316663000040229744636XASJ1" "P172E10-SN-01""20150131205316663000040229744636XASJ1"
4、distinct
- Use the distinct key as the map key; the value can be an empty NullWritable.
public static class WCLocalMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] words = line.split("\001");
        // distinct key as the map key; NullWritable as the value
        context.write(new Text(words[0]), NullWritable.get());
    }
}
- In reduce, simply emit the key; the value can again be NullWritable.
public static class WCLocalReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        // emit the key directly
        context.write(key, NullWritable.get());
    }
}
- The Mapper/Reducer generic input/output types must match the output classes configured on the Job:
job.setMapOutputValueClass(NullWritable.class);
job.setOutputValueClass(NullWritable.class);
5、count(distinct)
Method 1: give every record the same key, so a single reduce sees them all.
Drawback: no parallelism, low efficiency.
public static class WCLocalMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] words = line.split(",");
        // every map output shares the same key
        context.write(new Text("key"), new Text(words[6]));
    }
}
public static class WCLocalReducer extends Reducer<Text, Text, NullWritable, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // deduplicate with a HashSet
        HashSet<String> hs = new HashSet<String>();
        for (Text line : values) {
            hs.add(line.toString());
        }
        // emit the number of elements in the HashSet
        context.write(NullWritable.get(), new IntWritable(hs.size()));
    }
}
Method 2: two jobs — first output the distinct values, then count them.
- distinct: use the distinct key as the map key; the value can be an empty NullWritable.
- count: use the first job's output directory as the second job's input directory; a single reduce counts the values arriving under its key.
- Chaining the two jobs:
if (job.waitForCompletion(true)) {
    System.exit(job2.waitForCompletion(true) ? 0 : 1);
}
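The two-job pipeline can be sketched in plain Java (no Hadoop; hypothetical "\001"-delimited field layout): phase 1 mimics the distinct job — the distinct key collapses duplicates via the shuffle sort; phase 2 mimics the count job — one reducer counts the phase-1 records.

```java
import java.util.*;

public class TwoPhaseCountDistinct {
    // Phase 1: emulate the distinct job — each key survives once,
    // the way reduce collapses duplicate map-output keys.
    public static Set<String> distinctPhase(List<String> lines) {
        Set<String> out = new TreeSet<>();            // shuffle sort + dedup
        for (String line : lines) out.add(line.split("\001")[0]);
        return out;
    }

    // Phase 2: emulate the count job — one reducer counts phase-1's output records.
    public static int countPhase(Set<String> distinctKeys) {
        return distinctKeys.size();
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a\001x", "b\001y", "a\001z");
        System.out.println(countPhase(distinctPhase(lines)));   // 2
    }
}
```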
6、reducer-join
- job: add the different input sources with FileInputFormat.addInputPath()
FileInputFormat.addInputPath(job, new Path("hdfs://localhost:9000/user/hive/warehouse/testdb.db/testtab_small/*"));
FileInputFormat.addInputPath(job, new Path("hdfs://localhost:9000/user/hive/warehouse/testdb.db/testtab_small2/*"));
- mapper: branch on the file path; emit the shared join key (the column position may differ per table) with a source tag prepended to the value, so rows from both tables mix under one key
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String line = value.toString();
    String[] words = line.split("\001");
    /* Rows from the two tables get an A/B prefix (decided by file path) and share the same key. */
    String fpath = ((FileSplit) context.getInputSplit()).getPath().toString();
    // test "testtab_small2" first: "testtab_small" is a substring of it
    if (fpath.indexOf("testtab_small2") > 0) {
        context.write(new Text(words[1]), new Text("B#".concat(words[0])));
    } else if (fpath.indexOf("testtab_small") > 0) {
        context.write(new Text(words[1]), new Text("A#".concat(words[0])));
    }
}
- reducer: split the values into two Lists by source tag, then emit the Cartesian product of the two Lists.
protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    // two lists, one per source table
    ArrayList<String> listA = new ArrayList<String>();
    ArrayList<String> listB = new ArrayList<String>();
    for (Text line : values) {
        if (line.toString().indexOf("A#") > -1) {
            listA.add(line.toString().substring(2));
        }
        if (line.toString().indexOf("B#") > -1) {
            listB.add(line.toString().substring(2));
        }
    }
    // Cartesian product of the two lists
    for (String strA : listA) {
        for (String strB : listB) {
            context.write(key, new Text(strA.concat("\001").concat(strB)));
        }
    }
}
- ps: everything inside one reduce() call shares the same key, so joining the matching A and B rows is exactly a per-key Cartesian product.
7、mapper-join
- When to use: one side of the join is a small file that fits in each node's memory.
- Uses Hadoop's distributed cache: job.addCacheFile(new URI()) pulls the small file from HDFS to the task node, where it is read into memory.
job.addCacheFile(new URI("hdfs://localhost:9000/app/MDic.txt"));
// equivalent to the hadoop command-line option -files, which adds files to the distributed cache to be copied to task nodes
- Uses Mapper.setup() to read the file:
private Map<String, String> mDic = new HashMap<String, String>();

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    // the cached file appears in the task's working directory under its own name
    BufferedReader bReader = new BufferedReader(new InputStreamReader(new FileInputStream("MDic.txt")));
    List<String> list = IOUtils.readLines(bReader);
    for (String string : list) {
        String[] dicWords = string.split("\t");
        mDic.put(dicWords[0], dicWords[1]);
    }
}
// setup() runs before a task's map calls, so it can perform initialization.
- The join is assembled directly in the mapper; no reducer is needed.
context.write(new Text(words[0]+"\t"+mDic.get(words[0])+"\t"+words[1]),NullWritable.get());
8、Subqueries
Two jobs, analogous to count(distinct) method 2: the first job materializes the subquery, the second reads its output directory.
9、count(distinct cola) group by colb
select dealid, count(distinct uid) num from order group by dealid;
When there is only one distinct field (and ignoring map-side hash aggregation), combine the group-by field and the distinct field into the map output key; the MapReduce sort then orders records by that composite, while the group-by field alone serves as the reduce grouping key. In the reduce, remembering only the last key seen (LastKey) is enough to deduplicate.
The principle: MR itself filters the dealid+uid combinations down to unique, sorted values, so emitting dealid:uid in the reducer yields the grouped, deduplicated pairs. That alone does not yet produce the count — the reducer must additionally count, e.g. incrementing per dealid whenever the uid changes.
ps: My first instinct was to use the group-by field as the map output key and collect the distinct values inside the reducer — but that forces the reducer to hold a set in memory, so it is less efficient than the approach above.
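The composite-key + LastKey idea can be sketched in plain Java (hypothetical dealid/uid pairs, no Hadoop): sort (dealid, uid) pairs — which the shuffle does for free — then stream through them, counting a uid only when it differs from the previous one for the same dealid.

```java
import java.util.*;

public class CountDistinctGroupBy {
    // Composite key (dealid, uid) sorted, then one streaming pass:
    // a uid is counted only when it differs from the LastKey seen for that dealid.
    public static Map<String, Integer> countDistinctUid(List<String[]> pairs) {
        // Emulate the shuffle: sort by (dealid, uid).
        List<String[]> sorted = new ArrayList<>(pairs);
        sorted.sort((a, b) -> {
            int c = a[0].compareTo(b[0]);
            return c != 0 ? c : a[1].compareTo(b[1]);
        });

        Map<String, Integer> counts = new LinkedHashMap<>();
        String lastDeal = null, lastUid = null;          // the "LastKey"
        for (String[] p : sorted) {
            boolean newGroup = !p[0].equals(lastDeal);
            if (newGroup || !p[1].equals(lastUid)) {
                counts.merge(p[0], 1, Integer::sum);     // new distinct uid for this dealid
            }
            lastDeal = p[0];
            lastUid = p[1];
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> pairs = Arrays.asList(
            new String[]{"d1", "u1"}, new String[]{"d1", "u1"},
            new String[]{"d1", "u2"}, new String[]{"d2", "u1"});
        System.out.println(countDistinctUid(pairs));     // {d1=2, d2=1}
    }
}
```

Note that no set is held in memory: only the last (dealid, uid) pair, which is what makes the technique cheaper than reducer-side HashSet deduplication.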