1. Using Spark to bulk load data into HBase
Before using bulk load, we wrapped the data as an RDD of Tuple2<ImmutableBytesWritable, Put> and wrote it to HBase with saveAsNewAPIHadoopDataset. This was painfully slow: after more than two hours, roughly 400 GB of data still had not finished writing, so we switched to bulk load instead.
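For reference, the slow path looked roughly like the sketch below. This is a reconstruction of the assumed shape, not the original code; "my_table", the "info" family and the "col1" qualifier are placeholders, and `ds` / `hbaseConf` are the same objects used in the utility further down.
hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, "my_table");      // hypothetical table name
Job putJob = Job.getInstance(hbaseConf);
putJob.setOutputFormatClass(TableOutputFormat.class);

JavaPairRDD<ImmutableBytesWritable, Put> putRdd = ds.toJavaRDD().mapToPair(row -> {
    byte[] rowkey = Bytes.toBytes(row.getString(0));            // assumes column 0 is the rowkey
    Put put = new Put(rowkey);
    put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("col1"), // placeholder family/qualifier
            Bytes.toBytes(row.getString(1)));
    return new Tuple2<>(new ImmutableBytesWritable(rowkey), put);
});
// Every record goes through the RegionServer write path; this is the step that
// ran for 2h+ on ~400 GB before we switched to bulk load.
putRdd.saveAsNewAPIHadoopDataset(putJob.getConfiguration());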
Before testing, most of the material I could find online was written in Scala and only handled a single column. In real production jobs there are usually multiple column families and multiple columns, and there are quite a few pitfalls along the way.
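The utility below also depends on a small ConfigUtils helper that the original post does not show. A minimal sketch of the shape it is assumed to have is given here; every accessor returns a (configuration key, value) pair, and the keys, hosts and family names are placeholders.
import scala.Tuple2;

// Assumed shape of ConfigUtils (not part of the original post).
public class ConfigUtils {
    public static Tuple2<String, String> getHbaseZK() {
        return new Tuple2<>("hbase.zookeeper.quorum", "zk1,zk2,zk3");          // placeholder hosts
    }
    public static Tuple2<String, String> getHbaseZKPort() {
        return new Tuple2<>("hbase.zookeeper.property.clientPort", "2181");
    }
    public static Tuple2<String, String> getFamilyInfo() {
        return new Tuple2<>("hbase.family.info", "info");                      // placeholder family name
    }
    public static Tuple2<String, String> getFamilyMain() {
        return new Tuple2<>("hbase.family.main", "main");                      // placeholder family name
    }
}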
Here is the code:
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ClusterConnection;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.HRegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.PairFlatMapFunction;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

public class HbaseSparkUtils {

    private static Configuration hbaseConf;

    static {
        hbaseConf = HBaseConfiguration.create();
        // ConfigUtils is the project's own helper (see the sketch above); each accessor
        // returns a (configuration key, value) pair.
        hbaseConf.set(ConfigUtils.getHbaseZK()._1(), ConfigUtils.getHbaseZK()._2());
        hbaseConf.set(ConfigUtils.getHbaseZKPort()._1(), ConfigUtils.getHbaseZKPort()._2());
    }

    public static void saveHDFSHbaseHFile(SparkSession spark,          // Spark session
                                          Dataset<Row> ds,             // dataset to load
                                          String table_name,           // HBase table name
                                          Integer rowKeyIndex,         // index of the rowkey column in the dataset
                                          String fields) throws Exception { // comma-separated column list of the dataset

        // Raise the per-region/per-family HFile limit, otherwise the final load step
        // fails when a region ends up with more HFiles than the default allows.
        hbaseConf.setInt("hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily", 1024);
        hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, table_name);

        Job job = Job.getInstance(hbaseConf);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(KeyValue.class);
        job.setOutputFormatClass(HFileOutputFormat2.class);

        Connection conn = ConnectionFactory.createConnection(hbaseConf);
        TableName tableName = TableName.valueOf(table_name);
        HRegionLocator regionLocator = new HRegionLocator(tableName, (ClusterConnection) conn);
        Table realTable = ((ClusterConnection) conn).getTable(tableName);
        // Pull compression, bloom filter and region-boundary settings from the target table.
        HFileOutputFormat2.configureIncrementalLoad(job, realTable, regionLocator);

        JavaRDD<Row> javaRDD = ds.toJavaRDD();
        JavaPairRDD<ImmutableBytesWritable, KeyValue> javaPairRDD =
                javaRDD.mapToPair(new PairFunction<Row, ImmutableBytesWritable, List<Tuple2<ImmutableBytesWritable, KeyValue>>>() {
                    @Override
                    public Tuple2<ImmutableBytesWritable, List<Tuple2<ImmutableBytesWritable, KeyValue>>> call(Row row) throws Exception {
                        List<Tuple2<ImmutableBytesWritable, KeyValue>> tps = new ArrayList<>();
                        String rowkey = row.getString(rowKeyIndex);
                        ImmutableBytesWritable writable = new ImmutableBytesWritable(Bytes.toBytes(rowkey));

                        // Sort the columns: the KeyValues of a row must be emitted in sorted
                        // (family, qualifier) order, otherwise HFileOutputFormat2 reports an error.
                        // The incoming `fields` list is expected to already be in that order.
                        ArrayList<Tuple2<Integer, String>> tuple2s = new ArrayList<>();
                        String[] columns = fields.split(",");
                        for (int i = 0; i < columns.length; i++) {
                            tuple2s.add(new Tuple2<Integer, String>(i, columns[i]));
                        }

                        // First pass: ordinary columns go into the "info" column family.
                        for (Tuple2<Integer, String> t : tuple2s) {
                            String[] fieldNames = row.schema().fieldNames();
                            // Do not store the rowkey field as a column as well.
                            if (t._2().equals(fieldNames[rowKeyIndex])) {
                                System.out.println(String.format("%s == %s continue", t._2(), fieldNames[rowKeyIndex]));
                                continue;
                            }
                            // The "main" column belongs to the other family and is handled below.
                            if ("main".equals(t._2())) {
                                continue;
                            }
                            String value = getRowValue(row, t._1(), tuple2s.size());
                            KeyValue kv = new KeyValue(Bytes.toBytes(rowkey),
                                    Bytes.toBytes(ConfigUtils.getFamilyInfo()._2()),
                                    Bytes.toBytes(t._2()), Bytes.toBytes(value));
                            tps.add(new Tuple2<>(writable, kv));
                        }

                        // Second pass: the "main" column goes into its own column family.
                        for (Tuple2<Integer, String> t : tuple2s) {
                            if ("main".equals(t._2())) {
                                String value = getRowValue(row, t._1(), tuple2s.size());
                                KeyValue kv = new KeyValue(Bytes.toBytes(rowkey),
                                        Bytes.toBytes(ConfigUtils.getFamilyMain()._2()),
                                        Bytes.toBytes(t._2()), Bytes.toBytes(value));
                                tps.add(new Tuple2<>(writable, kv));
                                break;
                            }
                        }
                        return new Tuple2<>(writable, tps);
                    }
                    // The data must be sorted by rowkey before the HFiles are written. sortByKey()
                    // is expensive; no better alternative has been found so far.
                }).sortByKey().flatMapToPair(new PairFlatMapFunction<Tuple2<ImmutableBytesWritable, List<Tuple2<ImmutableBytesWritable, KeyValue>>>,
                        ImmutableBytesWritable, KeyValue>() {
                    @Override
                    public Iterator<Tuple2<ImmutableBytesWritable, KeyValue>> call(Tuple2<ImmutableBytesWritable,
                            List<Tuple2<ImmutableBytesWritable, KeyValue>>> tuple2s) throws Exception {
                        return tuple2s._2().iterator();
                    }
                });

        // Temporary HDFS directory for the generated HFiles.
        String temp = "/tmp/bulkload/" + table_name + "_" + System.currentTimeMillis();
        javaPairRDD.saveAsNewAPIHadoopFile(temp, ImmutableBytesWritable.class,
                KeyValue.class, HFileOutputFormat2.class, job.getConfiguration());

        // Hand the generated HFiles over to the region servers.
        LoadIncrementalHFiles loader = new LoadIncrementalHFiles(hbaseConf);
        Admin admin = conn.getAdmin();
        loader.doBulkLoad(new Path(temp), admin, realTable, regionLocator);
    }
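    // getRowValue is referenced above but was not included in the original post.
    // Minimal sketch of the assumed behaviour: read column `index` from the Row and
    // return it as a non-null String, so Bytes.toBytes() never receives null.
    private static String getRowValue(Row row, int index, int totalColumns) {
        if (index >= totalColumns || row.isNullAt(index)) {
            return "";
        }
        Object value = row.get(index);
        return value == null ? "" : value.toString();
    }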
}
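A sketch of how the utility might be called from a Spark driver. The input path, table name, rowkey index and field list below are placeholders; the target table is assumed to already exist with both column families, and the field list must match the Dataset schema.
SparkSession spark = SparkSession.builder().appName("hbase-bulkload").getOrCreate();
// Hypothetical source data; column 0 (order_id) is used as the rowkey.
Dataset<Row> ds = spark.read().parquet("/data/source/orders");
HbaseSparkUtils.saveHDFSHbaseHFile(spark, ds, "ns:orders", 0, "order_id,amount,status,main");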
2. Some exceptions encountered along the way
Source code analysis: