MapReduce

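I. Overview:
MapReduce is the distributed computing framework in Hadoop. As the name suggests, a computation is split into two main phases: the Map phase and the Reduce phase.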

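Map task:
1. Read the input file and parse it into key/value pairs, one pair per line of input. The map function is called once for each pair.
2. Apply your own logic to the input key and value and emit new key/value pairs.
3. Partition the output key/value pairs.
4. Within each partition, sort the data by key (dictionary order by default) and group it: the values that share a key are collected into one set.
5. (Optional) Reduce the grouped data locally (combine).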

Note: in MapReduce a Mapper can exist on its own (a map-only job), but a Reducer cannot exist on its own.

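Reduce task:
1. The output of the map tasks is copied over the network to different nodes according to partition. The map tasks do not push data to the reducers; each reducer actively fetches its own data. The number of reduce tasks must be >= the number of partitions.
2. Merge and sort the output fetched from the map tasks. Then apply your own logic in the reduce function to the input key and values and emit new key/value pairs.
3. Save the reduce output to files.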
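
The map, sort/group, and reduce steps above can be sketched in plain Java without any Hadoop classes. This is only an in-memory illustration of the data flow, not how Hadoop implements it: the class name MiniWordCount and its methods are hypothetical, and partitioning and the network copy are left out.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MiniWordCount {

    // Map step: one call per input line, emitting a (word, 1) pair for each word.
    static List<Map.Entry<String, Long>> map(String line) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (String word : line.split(" ")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1L));
            }
        }
        return out;
    }

    // Sort and group step: a TreeMap keeps the keys sorted and collects the values of each key.
    static TreeMap<String, List<Long>> group(List<Map.Entry<String, Long>> pairs) {
        TreeMap<String, List<Long>> grouped = new TreeMap<>();
        for (Map.Entry<String, Long> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce step: one call per key, summing the grouped values.
    static long reduce(List<Long> values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three steps in order over all input lines.
    static TreeMap<String, Long> run(List<String> lines) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        TreeMap<String, Long> result = new TreeMap<>();
        group(pairs).forEach((key, values) -> result.put(key, reduce(values)));
        return result;
    }
}
```

Calling MiniWordCount.run with the lines "a b a" and "b c" groups the pairs as a=[1, 1], b=[1, 1], c=[1] before the reduce step sums each list.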

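MapReduce execution flow
1. run job: the client submits an MR jar to the JobClient (submitted with: hadoop jar …).
   (1) Collect the job environment information, such as the component classes and the input/output key/value types, and check that it is valid.
   (2) Check that the input and output paths are valid.
2. The JobClient communicates over RPC with the ResourceManager (the JobTracker in classic MapReduce 1, whose terminology the remaining steps use), which returns a location for the jar (on HDFS) and a jobId. The jobId is globally unique and identifies the job.
3. The client writes the jar to HDFS (path = the HDFS location + jobId).
4. Submit the job (the job description, not the jar: the jobId, the location of the jar, the configuration, and so on).
5. The JobTracker initializes the job.
6. Read the files to be processed on HDFS and compute the input splits; each split corresponds to one MapperTask. Note: a split is an object that stores only the descriptive information about its data; a block is the actual file block (data block), which stores the real file data.
7. The TaskTracker picks up tasks (the task descriptions) through the heartbeat mechanism. By default the split size equals the block size, so in practice splits and blocks usually coincide. Once a task has been picked up, it should satisfy the data locality policy.
8. Download the required jar, configuration files, and so on. The idea behind this: move the computation/logic, not the data.
9. The TaskTracker starts a Java child process to run the actual task (MapperTask or ReducerTask).
10. Write the result to HDFS.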

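In general, the size described by a split matches the block size. By convention, the JobTracker runs on the same node as the NameNode, and a TaskTracker runs on each DataNode.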
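
The relationship between split size and block size can be sketched as follows. This is a simplified model based on my understanding of how FileInputFormat chooses splits; the class name SplitMath, the parameter names, and the 10% tolerance for the last split are assumptions to verify against the Hadoop version in use.

```java
final class SplitMath {

    // Split size as max(minSize, min(maxSize, blockSize)).
    // With the usual defaults (small minSize, very large maxSize) this equals blockSize.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Number of splits for one file, which is also the number of map tasks for it.
    // Assumption: the last split may be up to 10% larger than splitSize.
    static int countSplits(long fileLength, long splitSize) {
        final double slop = 1.1;
        int splits = 0;
        long remaining = fileLength;
        while ((double) remaining / splitSize > slop) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining > 0) {
            splits++;
        }
        return splits;
    }
}
```

Under these assumptions, a 300 MB file with a 128 MB split size yields three splits (128 MB, 128 MB, 44 MB), while a 140 MB file stays a single split because it is within 10% of the split size.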

Three classes need to be created: Mapper, Reducer, and Driver.
Case 1: count how many times each word occurs in a file
Mapper:

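import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

	// Called once per input line: ikey is the byte offset of the line, ivalue is its text.
	@Override
	public void map(LongWritable ikey, Text ivalue, Context context) throws IOException, InterruptedException {

		String line = ivalue.toString();
		String[] arr = line.split(" ");
		for (String str : arr) {
			// Emit (word, 1) for every word on the line.
			context.write(new Text(str), new LongWritable(1));
		}
	}
}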

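Reducer:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

	// Called once per word: values holds every count emitted for that word.
	@Override
	public void reduce(Text _key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {

		long sum = 0;
		for (LongWritable val : values) {
			sum += val.get();
		}
		context.write(_key, new LongWritable(sum));
	}
}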

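Driver:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		Job job = Job.getInstance(conf, "JobName");
		job.setJarByClass(WordCountDriver.class);
		job.setMapperClass(WordCountMapper.class);
		job.setReducerClass(WordCountReducer.class);

		// If the mapper's output types are the same as the reducer's, setting them once is enough.
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(LongWritable.class);

		FileInputFormat.setInputPaths(job, new Path("hdfs://192.168.60.132:9000/mr/words.txt"));
		FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.60.132:9000/result2"));

		// Block until the job finishes and report success or failure through the exit code.
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
}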

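Serialization/deserialization mechanism
For objects of a custom class to be transferred within Hadoop, the class must implement the Writable interface, which handles serialization and deserialization.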

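import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class Flow implements Writable {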
	private String phone;
	private String city;
	private String name;
	private int flow;

	public String getPhone() {
		return phone;
	}

	public void setPhone(String phone) {
		this.phone = phone;
	}

	public String getCity() {
		return city;
	}

	public void setCity(String city) {
		this.city = city;
	}

	public String getName() {
		return name;
	}

	public void setName(String name) {
		this.name = name;
	}

	public int getFlow() {
		return flow;
	}

	public void setFlow(int flow) {
		this.flow = flow;
	}

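	// Deserialization
	@Override
	public void readFields(DataInput in) throws IOException {
		// Read the fields back one by one, in the same order they were serialized.
		this.phone = in.readUTF();
		this.city = in.readUTF();
		this.name = in.readUTF();
		this.flow = in.readInt();
	}

	// Serialization
	@Override
	public void write(DataOutput out) throws IOException {
		// Write the fields out one by one, in a fixed order.
		// writeUTF throws NullPointerException on null, so every String field must be set.
		out.writeUTF(phone);
		out.writeUTF(city);
		out.writeUTF(name);
		out.writeInt(flow);
	}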
}
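
The write/readFields contract can be tried out without a cluster, since DataOutput and DataInput belong to java.io. The sketch below uses a hypothetical stand-in class, FlowRecord, that has the same two methods but does not implement Hadoop's Writable; the roundTrip helper is likewise only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in with the same two methods that Writable requires.
final class FlowRecord {
    String phone;
    String city;
    String name;
    int flow;

    // Serialization: write the fields in a fixed order.
    void write(DataOutput out) throws IOException {
        out.writeUTF(phone);
        out.writeUTF(city);
        out.writeUTF(name);
        out.writeInt(flow);
    }

    // Deserialization: read the fields back in exactly the same order.
    void readFields(DataInput in) throws IOException {
        this.phone = in.readUTF();
        this.city = in.readUTF();
        this.name = in.readUTF();
        this.flow = in.readInt();
    }

    // Serialize to a byte array, then rebuild a fresh object from those bytes.
    static FlowRecord roundTrip(FlowRecord src) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            src.write(new DataOutputStream(buf));
            FlowRecord copy = new FlowRecord();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If readFields read the fields in a different order than write produced them, the copy would come back with wrong or corrupted values, which is why the order has to match.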

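Partitioning - Partitioner

Partitioning is an important step of the shuffle. It distributes the map output to different reduce tasks according to a rule, so that the job produces one output per partition.
Partitioner is the base class of all partitioners; a custom partitioner must extend it. HashPartitioner is the default partitioner in MapReduce.
It computes: which reducer = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. Note: by default the number of reduce tasks is 1.
The partitioning rule that ships with MR often does not meet our needs. To get a specific result, you can define your own partitioning rule.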
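
The default formula can be checked in plain Java. The sketch below uses a hypothetical class name, HashPartitionSketch, and shows why the hash code is masked with Integer.MAX_VALUE: the mask clears the sign bit, so the result is never negative even for negative hash codes.

```java
final class HashPartitionSketch {

    // (hashCode & Integer.MAX_VALUE) % numReduceTasks, as in the formula above.
    static int partitionForHash(int hashCode, int numReduceTasks) {
        return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Convenience overload for any key object.
    static int partitionForKey(Object key, int numReduceTasks) {
        return partitionForHash(key.hashCode(), numReduceTasks);
    }
}
```

Without the mask, a negative hash code would give a negative remainder in Java (for example -1 % 3 is -1), which is not a valid partition index. With one reduce task every key maps to partition 0.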

Case: partition by city, and compute the traffic generated by each person in each city
Mapper:

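import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FlowMapper extends Mapper<LongWritable, Text, Text, Flow> {

	@Override
	public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

		// Each line holds four space-separated fields: phone, city, name, flow.
		String line = value.toString();
		String[] arr = line.split(" ");

		Flow f = new Flow();
		f.setPhone(arr[0]);
		f.setCity(arr[1]);
		f.setName(arr[2]);
		f.setFlow(Integer.parseInt(arr[3]));

		// Key by phone number so that all records of one person reach the same reduce call.
		context.write(new Text(f.getPhone()), f);
	}
}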

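Specifying the partitions:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FlowPartitioner extends Partitioner<Text, Flow> {

	@Override
	public int getPartition(Text key, Flow value, int numPartitions) {

		String city = value.getCity();

		// Partition 0 is "bj", partition 1 is "sh", partition 2 is every other city,
		// so the driver must configure at least 3 reduce tasks.
		if ("bj".equals(city)) {
			return 0;
		} else if ("sh".equals(city)) {
			return 1;
		} else {
			return 2;
		}
	}
}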

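Reducer:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class FlowReducer extends Reducer<Text, Flow, Text, IntWritable> {

	// Called once per phone number: sums the traffic of all records for that phone.
	@Override
	public void reduce(Text key, Iterable<Flow> values, Context context) throws IOException, InterruptedException {

		int sum = 0;

		for (Flow val : values) {
			sum += val.getFlow();
		}
		context.write(key, new IntWritable(sum));
	}
}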

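Driver:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FlowDriver {

	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		Job job = Job.getInstance(conf, "JobName");
		job.setJarByClass(FlowDriver.class);
		job.setMapperClass(FlowMapper.class);
		job.setReducerClass(FlowReducer.class);

		// The map output types differ from the final output types, so both are set.
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(Flow.class);

		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);

		// Specify the partitioner.
		job.setPartitionerClass(FlowPartitioner.class);
		// Set the number of reduce tasks to match the number of partitions.
		job.setNumReduceTasks(3);

		FileInputFormat.setInputPaths(job, new Path("hdfs://192.168.60.132:9000/mr/flow.txt"));
		FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.60.132:9000/fpresult"));

		// Block until the job finishes and report success or failure through the exit code.
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
}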

Combiner:
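
A combiner performs the optional map-side reduction listed as step 5 of the map task: it pre-aggregates the output of one map task so that less data has to be copied to the reducers. The plain-Java sketch below only illustrates that idea for word counting; the class name CombinerSketch and its methods are hypothetical and are not Hadoop API.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class CombinerSketch {

    // Combine step: sum the counts per word within the output of a single map task.
    static Map<String, Long> combine(List<String> wordsFromOneMapTask) {
        Map<String, Long> partial = new TreeMap<>();
        for (String word : wordsFromOneMapTask) {
            partial.merge(word, 1L, Long::sum);
        }
        return partial;
    }

    // Reduce step: merge the partial sums that arrive from all map tasks.
    static Map<String, Long> reduce(List<Map<String, Long>> partials) {
        Map<String, Long> total = new TreeMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((word, count) -> total.merge(word, count, Long::sum));
        }
        return total;
    }
}
```

In a real job the combiner is registered on the Job with setCombinerClass. For a sum such as word count, the reducer class can usually double as the combiner, because addition gives the same result whether it is applied once or in stages; an operation like averaging would not be safe to reuse this way.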

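Sorting
To sort, the class must implement the WritableComparable<?> interface, and the object to be sorted must be used as the mapper's output key.

Implementing the WritableComparable<?> interface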

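import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class Profit implements WritableComparable<Profit> {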

	private String name;
	private int profit;

	public String getName() {
		return name;
	}

	public void setName(String name) {
		this.name = name;
	}

	public int getProfit() {
		return profit;
	}

	public void setProfit(int profit) {
		this.profit = profit;
	}

	@Override
	public void write(DataOutput out) throws IOException {
		out.writeUTF(name);
		out.writeInt(profit);
	}

	@Override
	public void readFields(DataInput in) throws IOException {
		this.name = in.readUTF();
		this.profit = in.readInt();

	}

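	// If the results need to be sorted, put the sort rule in this method.
	// Integer.compare avoids the overflow that this.profit - o.profit can cause.
	@Override
	public int compareTo(Profit o) {
		return Integer.compare(this.profit, o.profit);
	}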
}
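
A comparison written as a subtraction of two int values can overflow when the operands are far apart, which flips the sign of the result and breaks the sort order. The sketch below, with the hypothetical class name CompareSketch, contrasts it with Integer.compare.

```java
final class CompareSketch {

    // Subtraction-based comparison: overflows when a - b does not fit in an int.
    static int bySubtraction(int a, int b) {
        return a - b;
    }

    // Overflow-safe comparison: returns a negative, zero, or positive value.
    static int byCompare(int a, int b) {
        return Integer.compare(a, b);
    }
}
```

For example, comparing Integer.MAX_VALUE with -1 by subtraction wraps around to a negative number and so claims that the larger value is the smaller one, while Integer.compare returns a positive result.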

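Mapper: the object to be sorted must be the output key

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SortMapper extends Mapper<LongWritable, Text, Profit, NullWritable> {

	@Override
	public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

		// Each line holds two tab-separated fields: name and profit.
		String line = value.toString();
		String[] arr = line.split("\t");

		Profit p = new Profit();
		p.setName(arr[0]);
		p.setProfit(Integer.parseInt(arr[1]));

		// The Profit object is the key, so the framework sorts by compareTo; no value is needed.
		context.write(p, NullWritable.get());
	}
}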

Reducer and Driver omitted…
