Multi-character delimiters in Hive

Original

猿天霸 Tuan

2018-11-30 16:00

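By default, Hive supports only single-character field delimiters, and the default delimiter is \001 (Ctrl-A). You can also specify the delimiter when you create a table, for example: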

create table user(name string, password string) row format delimited fields terminated by '\t';

This sets the field delimiter for the table.
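To see why a single-character delimiter falls short for data separated by `::`, here is a minimal sketch in plain Java (no Hive dependencies). The class `SingleCharSplitDemo` and its method are hypothetical names used only for illustration, and the sketch assumes that only one delimiter character is honored when a row is split.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class; it is not part of Hive.
final class SingleCharSplitDemo {

    // Splits a line on exactly one delimiter character, keeping empty fields.
    static String[] splitOnChar(String line, char delimiter) {
        String quoted = Pattern.quote(String.valueOf(delimiter));
        return line.split(quoted, -1);
    }

    public static void main(String[] args) {
        // Default Hive delimiter \001 (Ctrl-A): two clean fields.
        System.out.println(splitOnChar("alice\u0001secret", '\u0001').length);
        // Tab delimiter, as in the DDL above: two clean fields.
        System.out.println(splitOnChar("alice\tsecret", '\t').length);
        // Data separated by "::" but split on ':' gets a spurious empty field.
        System.out.println(Arrays.toString(splitOnChar("alice::secret", ':')));
    }
}
```

The last call yields three fields instead of two, which is the problem the approaches described next are meant to solve.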

To support a multi-character delimiter, you can use either of the following approaches:

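1. Write a custom InputFormat and override the next method of its RecordReader. This method is called when data is read. Output can use a different delimiter as well, and the approach is the same: write a custom OutputFormat and override the write method of its RecordWriter. Note that the custom OutputFormat must implement the HiveOutputFormat interface; the HiveIgnoreKeyTextOutputFormat class is a useful reference.

Hive's InputFormat/OutputFormat are very similar to Hadoop's InputFormat/OutputFormat. The InputFormat formats or converts the input data before handing it to Hive, and the OutputFormat reformats Hive's output into the target format before writing it to a file. Customizing the format this way is fairly low-level. Once the classes are written, package them into a jar and place it in the lib folder of the Hive installation directory.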

// RecordReader.next: reads one line and collapses the "::" delimiter to ":".
public synchronized boolean next(LongWritable key, Text value)
        throws IOException {
    while (pos < end) {
        key.set(pos);
        int newSize = lineReader.readLine(value, maxLineLength,
                Math.max((int) Math.min(Integer.MAX_VALUE, end - pos), maxLineLength));
        if (newSize == 0) return false;  // nothing left to read
        // Note: toLowerCase() also lowercases the data itself.
        String str = value.toString().toLowerCase().replaceAll("::", ":");
        value.set(str);
        pos += newSize;
        if (newSize < maxLineLength) return true;  // otherwise skip the overlong line
    }
    return false;
}

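// RecordWriter.write: expands ":" back to "::" before writing each row.
@Override
public void write(Writable r) throws IOException {
    if (r instanceof Text) {
        Text tr = (Text) r;
        // Note: toLowerCase() also lowercases the data itself.
        String strReplace = tr.toString().toLowerCase().replace(":", "::");
        Text txtReplace = new Text(strReplace);
        // getBytes() returns the backing array; only the first getLength() bytes are valid.
        outStream.write(txtReplace.getBytes(), 0, txtReplace.getLength());
        outStream.write(finalRowSeparator);
    } else {
        BytesWritable bw = (BytesWritable) r;
        // getBytes()/getLength() replace the deprecated get()/getSize().
        outStream.write(bw.getBytes(), 0, bw.getLength());
        outStream.write(finalRowSeparator);
    }
}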
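The string handling in the next and write methods above can be tried without a Hadoop cluster. The following is a minimal sketch in plain Java; `DelimiterRoundTrip`, `toSingle`, and `toDouble` are hypothetical names, and the sketch leaves out the toLowerCase() call so that only the delimiter conversion is visible.

```java
// Hypothetical demo class mirroring the delimiter conversion in next() and write().
final class DelimiterRoundTrip {

    // Input side: collapse the two-character delimiter "::" to ":".
    static String toSingle(String line) {
        return line.replace("::", ":");
    }

    // Output side: expand ":" back to "::".
    static String toDouble(String line) {
        return line.replace(":", "::");
    }

    public static void main(String[] args) {
        String raw = "alice::secret";
        String forHive = toSingle(raw);      // "alice:secret"
        String written = toDouble(forHive);  // "alice::secret"
        System.out.println(forHive + " -> " + written);

        // Caveat: a lone ':' inside a field becomes a delimiter after conversion.
        System.out.println(toSingle("alice::pa:ss").split(":", -1).length);
    }
}
```

The second print shows a limitation of this approach: because the table is still split on a single `:`, a field value that itself contains `:` is broken into extra columns.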

Restart the Hive shell so that the new jar is loaded, then create the table as follows:

create table user(username string,password string)   
row format delimited  
fields terminated by ':'   
stored as
INPUTFORMAT 'org.platform.utils.bigdata.hive.CustomInputFormat'  
OUTPUTFORMAT 'org.platform.utils.bigdata.hive.CustomOutputFormat';

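2. Use a SerDe (serializer/deserializer) to format the data as it is serialized and deserialized. This approach is somewhat more complex and gives less control over the data. It uses regular expressions to match and process each row, which also affects performance. Its advantage is that you can define custom table properties through SERDEPROPERTIES, and the SerDe can use those properties to customize its behavior further. For example: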

create table user(username string,password string,nickname string) 
row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' 
with serdeproperties (
'input.regex'='([^:]*)::([^:]*)::([^:]*)',
'output.format.string'='%1$s %2$s %3$s') 
stored as textfile;
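To check what the input.regex and output.format.string values above do to a row, the same pattern and format string can be run through java.util.regex and String.format. This is only a sketch of the idea, not the RegexSerDe implementation: `RegexSerDeSketch` and its methods are hypothetical names, and returning null for a non-matching line is an assumption made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; it only imitates the regex and format-string handling.
final class RegexSerDeSketch {

    // Same pattern as 'input.regex' in the DDL above: three columns separated by "::".
    private static final Pattern INPUT = Pattern.compile("([^:]*)::([^:]*)::([^:]*)");

    // Same value as 'output.format.string', with the third specifier written as %3$s.
    private static final String OUTPUT_FORMAT = "%1$s %2$s %3$s";

    // Returns one string per capture group, or null when the whole line does not match
    // (assumption for this sketch).
    static String[] deserialize(String line) {
        Matcher m = INPUT.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] columns = new String[m.groupCount()];
        for (int i = 0; i < columns.length; i++) {
            columns[i] = m.group(i + 1);
        }
        return columns;
    }

    // Formats the columns with the output format string.
    static String serialize(String[] columns) {
        return String.format(OUTPUT_FORMAT, (Object[]) columns);
    }

    public static void main(String[] args) {
        String[] columns = deserialize("alice::secret::Ali");
        System.out.println(serialize(columns));
    }
}
```

Note that with this format string the serialized row is separated by spaces, not by `::`, so the written form differs from the form that input.regex reads.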
