Required output format: person ID, name, address
1. Raw Data
Person ID  Name  Address ID
1 張三 1
2 李四 2
3 王五 1
4 趙六 3
5 馬七 3
The second dataset holds the address information:
Address ID  Address Name
1 北京
2 上海
3 廣州
2. Processing Notes
The example here is deliberately simple, and the dataset is tiny, just a handful of rows you can scan by eye; in practice, of course, the volume may run to hundreds of thousands, millions, or even hundreds of millions of rows.
The goal is equally simple: join the person records with the address records, expanding each person's address ID into the corresponding address name.
In Hadoop deployments, much of the data is stored as plain text, and the files for a dataset are typically gathered under one directory for processing, so that is the layout we adopt here as well.
For a MapReduce program, the central task is to recast the work as a map part and a reduce part.
We can store both address and person records in the same data structure, using a flag field to mark whether a given instance holds address or person information.
After the map phase, the address ID serves as the key, so all address and person records sharing the same address ID are gathered into one key -> value list and delivered to a single reduce call.
In the reduce phase, because the key is an address ID, exactly one element of the value list is an address record and the rest are person records; once that address record is found, its address name is the address for every person record in the list.
That settles the join algorithm; all that remains is the implementation, so let's go.
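Before wiring this into Hadoop, the grouping-and-resolving logic above can be sketched in plain Java with no Hadoop dependency. The class name JoinSketch and the inline sample rows are illustrative, not part of the actual job:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class JoinSketch {

    // Joins person rows ("userNo\tuserName\tcityNo") with address rows
    // ("cityNo\tcityName"), returning "userNo,userName,cityName" lines.
    public static List<String> join(List<String> persons, List<String> addresses) {
        // "Map phase": group person rows by address ID (the join key).
        Map<String, List<String[]>> groups = new HashMap<>();
        for (String line : persons) {
            String[] arr = line.split("\t");
            groups.computeIfAbsent(arr[2], k -> new ArrayList<>()).add(arr);
        }
        Map<String, String> cityNames = new HashMap<>();
        for (String line : addresses) {
            String[] arr = line.split("\t");
            cityNames.put(arr[0], arr[1]);
        }
        // "Reduce phase": within each group the single address record
        // supplies the city name for every person record.
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String[]>> e : groups.entrySet()) {
            String cityName = cityNames.get(e.getKey());
            if (cityName == null) continue; // no matching address: inner join drops the group
            for (String[] p : e.getValue()) {
                out.add(p[0] + "," + p[1] + "," + cityName);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> persons = List.of("1\t張三\t1", "2\t李四\t2", "3\t王五\t1");
        List<String> addresses = List.of("1\t北京", "2\t上海");
        join(persons, addresses).forEach(System.out::println);
    }
}
```

The real job below distributes exactly this: the map phase does the tagging and keying, and the shuffle replaces the in-memory HashMap.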
3. The Intermediate Bean
The data structure mentioned above, which holds both person and address records, is one of the program's key data carriers. Let's look at it first:
package cn.edu.bjut.jointwo;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class Member implements WritableComparable<Member> {

    private String userNo = "";
    private String userName = "";
    private String cityNo = "";
    private String cityName = "";
    private int flag = 0; // 0 = person record, 1 = address record

    public Member() {
        super();
    }

    public Member(String userNo, String userName, String cityNo,
            String cityName, int flag) {
        super();
        this.userNo = userNo;
        this.userName = userName;
        this.cityNo = cityNo;
        this.cityName = cityName;
        this.flag = flag;
    }

    // Copy constructor: the reducer must copy values out of Hadoop's
    // reused value object, so a full field-by-field copy is essential.
    public Member(Member m) {
        super();
        this.userNo = m.getUserNo();
        this.userName = m.getUserName();
        this.cityNo = m.getCityNo();
        this.cityName = m.getCityName();
        this.flag = m.getFlag();
    }

    public void write(DataOutput out) throws IOException {
        out.writeUTF(getUserNo());
        out.writeUTF(getUserName());
        out.writeUTF(getCityNo());
        out.writeUTF(getCityName());
        out.writeInt(getFlag());
    }

    // Fields must be read back in exactly the order write() emitted them.
    public void readFields(DataInput in) throws IOException {
        this.userNo = in.readUTF();
        this.userName = in.readUTF();
        this.cityNo = in.readUTF();
        this.cityName = in.readUTF();
        this.flag = in.readInt();
    }

    // The shuffle key is the Text city ID, so Member instances are never
    // sorted against each other; a constant satisfies the interface.
    public int compareTo(Member o) {
        return 0;
    }

    @Override
    public String toString() {
        return "userNo=" + userNo + ", userName=" + userName
                + ", cityNo=" + cityNo + ", cityName=" + cityName;
    }

    public String getUserNo() {
        return userNo;
    }

    public void setUserNo(String userNo) {
        this.userNo = userNo;
    }

    public String getUserName() {
        return userName;
    }

    public void setUserName(String userName) {
        this.userName = userName;
    }

    public String getCityNo() {
        return cityNo;
    }

    public void setCityNo(String cityNo) {
        this.cityNo = cityNo;
    }

    public String getCityName() {
        return cityName;
    }

    public void setCityName(String cityName) {
        this.cityName = cityName;
    }

    public int getFlag() {
        return flag;
    }

    public void setFlag(int flag) {
        this.flag = flag;
    }
}
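The write()/readFields() pair above is the whole serialization contract: fields go over the wire in a fixed order and come back in the same order. This can be exercised without Hadoop using the plain java.io stream classes that back DataOutput/DataInput; the class name WritableRoundTrip is mine, for illustration only:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

public class WritableRoundTrip {

    // Writes the same five fields Member.write() emits, then reads the
    // four string fields back in the same order, as readFields() would.
    public static String[] roundTrip(String userNo, String userName,
                                     String cityNo, String cityName) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutput out = new DataOutputStream(buf);
        out.writeUTF(userNo);
        out.writeUTF(userName);
        out.writeUTF(cityNo);
        out.writeUTF(cityName);
        out.writeInt(0); // the flag field

        DataInput in = new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
        return new String[] { in.readUTF(), in.readUTF(), in.readUTF(), in.readUTF() };
    }

    public static void main(String[] args) throws IOException {
        String[] fields = roundTrip("1", "張三", "1", "北京");
        System.out.println(String.join(",", fields));
    }
}
```

If the read order ever drifts from the write order, the deserialized fields silently land in the wrong slots, which is why the two methods must be maintained together.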
4. The Mapper
package cn.edu.bjut.jointwo;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class JoinMapper extends Mapper<LongWritable, Text, Text, Member> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] arr = line.split("\t");
        if (arr.length >= 3) {
            // Three or more columns: a person record (userNo, userName, cityNo).
            Member m = new Member();
            m.setUserNo(arr[0]);
            m.setUserName(arr[1]);
            m.setCityNo(arr[2]);
            m.setFlag(0);
            context.write(new Text(m.getCityNo()), m);
        } else if (arr.length == 2) {
            // Exactly two columns: an address record (cityNo, cityName).
            Member m = new Member();
            m.setCityNo(arr[0]);
            m.setCityName(arr[1]);
            m.setFlag(1);
            context.write(new Text(m.getCityNo()), m);
        }
        // Lines with fewer than two columns are malformed and skipped.
    }
}
5. The Reducer
package cn.edu.bjut.jointwo;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class JoinReducer extends Reducer<Text, Member, Text, NullWritable> {

    @Override
    protected void reduce(Text key, Iterable<Member> values, Context context)
            throws IOException, InterruptedException {
        Member m = null;
        List<Member> list = new ArrayList<Member>();
        // Hadoop reuses a single Member instance across this iteration,
        // so every record we keep must be copied via the copy constructor.
        for (Member member : values) {
            if (0 == member.getFlag()) {
                list.add(new Member(member));
            } else {
                m = new Member(member);
            }
        }
        // m is the one address record for this key; fill its city name
        // into every person record and emit the joined rows.
        if (null != m) {
            for (Member member : list) {
                member.setCityName(m.getCityName());
                context.write(new Text(member.toString()), NullWritable.get());
            }
        }
    }
}
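The copies above are not optional: Hadoop's runtime hands the reducer one Writable instance and mutates it in place on each step of the values iterator, so storing the reference itself would leave the list full of aliases to the last value. The following standalone sketch (class and field names are mine) reproduces that aliasing pitfall with a reused mutable holder:

```java
import java.util.ArrayList;
import java.util.List;

public class ReuseDemo {

    static class Holder {
        String value;
        Holder(String v) { value = v; }
    }

    // Simulates Hadoop's iterator: ONE shared object is rewritten per step.
    // Storing the reference directly means every list entry aliases it.
    public static List<Holder> collectWithoutCopy(List<String> inputs) {
        Holder shared = new Holder("");
        List<Holder> out = new ArrayList<>();
        for (String s : inputs) {
            shared.value = s;
            out.add(shared); // bug: no copy
        }
        return out;
    }

    // The fix used by JoinReducer: copy each value before keeping it.
    public static List<Holder> collectWithCopy(List<String> inputs) {
        Holder shared = new Holder("");
        List<Holder> out = new ArrayList<>();
        for (String s : inputs) {
            shared.value = s;
            out.add(new Holder(shared.value)); // defensive copy
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> in = List.of("a", "b", "c");
        // Without copying, every stored element shows the LAST value.
        System.out.println(collectWithoutCopy(in).get(0).value); // "c"
        System.out.println(collectWithCopy(in).get(0).value);    // "a"
    }
}
```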
6. The Main Program
package cn.edu.bjut.jointwo;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MainJob {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "jointwo");
        job.setJarByClass(MainJob.class);

        job.setMapperClass(JoinMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Member.class);

        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));

        // Delete a leftover output directory; otherwise the job refuses to start.
        Path outPathDir = new Path(args[1]);
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(outPathDir)) {
            fs.delete(outPathDir, true);
        }
        FileOutputFormat.setOutputPath(job, outPathDir);

        job.waitForCompletion(true);
    }
}
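Assuming the classes are packaged as a jar named jointwo.jar and the two input files live in one HDFS directory (every path and file name below is illustrative, not prescribed by the code), a run might look like:

```shell
# Put both input files in one directory; the mapper tells person and
# address lines apart by their column count.
hdfs dfs -mkdir -p /user/demo/join_in
hdfs dfs -put person.txt address.txt /user/demo/join_in

# args[0] = input directory, args[1] = output directory
# (the driver deletes the output directory itself if it already exists)
hadoop jar jointwo.jar cn.edu.bjut.jointwo.MainJob /user/demo/join_in /user/demo/join_out

# Inspect the joined rows
hdfs dfs -cat /user/demo/join_out/part-r-*
```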