Removing Double Quotes from CSV Fields in Hive

The data is a CSV file. A sample follows:


"InvoiceID","PayerAccountId","LinkedAccountId","RecordType","ProductName","RateId","SubscriptionId","PricingPlanId","UsageType","Operation","AvailabilityZone","ReservedInstance","ItemDescription","UsageStartDate","UsageEndDate","UsageQuantity","Rate","Cost"
"Estimated","xxxxxxxxxxxx","xxxxxxxxxxxx","LineItem","Amazon Simple Queue Service","16850885","1846142824","1292565","CNN1-Requests-Tier1","GetQueueAttributes","","N","First 1,000,000 Amazon SQS Requests per month are free","2019-01-01 00:00:00","2019-01-01 01:00:00","60.0000000000","0.0000000000","0.0000000000"
"Estimated","xxxxxxxxxxxx","xxxxxxxxxxxx","LineItem","Amazon Simple Queue Service","16850885","1846142824","1292565","CNN1-Requests-Tier1","GetQueueUrl","","N","First 1,000,000 Amazon SQS Requests per month are free","2019-01-01 00:00:00","2019-01-01 01:00:00","180.0000000000","0.0000000000","0.0000000000"


The first line holds the field names; every line after it is a data record.

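If you create the Hive table as shown below to analyze this data, every field value keeps its surrounding double quotes, which makes analysis awkward.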

hive> CREATE EXTERNAL TABLE IF NOT EXISTS
 aws_bill (
 InvoiceID string,
 PayerAccountId string,
 LinkedAccountId string,
 RecordType string,
 ProductName string,
 RateId string,
 SubscriptionId string,
 PricingPlanId string,
 UsageType string,
 Operation string,
 AvailabilityZone string,
 ReservedInstance string,
 ItemDescription string,
 UsageStartDate string,
 UsageEndDate string,
 UsageQuantity double,
 Rate double,
 Cost double
)
 ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
 LOCATION 's3://feichashao-hadoop/bill/';

hive> select * from aws_bill;
"Estimated","xxxxxxxxxxxx","xxxxxxxxxxxx","LineItem","Amazon Simple Queue Service","16850885","1846142824","1292565","CNN1-Requests-Tier1","GetQueueAttributes","","N","First 1,000,000 Amazon SQS Requests per month are free","2019-01-01 00:00:00","2019-01-01 01:00:00",NULL,NULL,NULL

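The desired result is field values without the double quotes. Note that the three double columns also come back as NULL, because a quoted value such as "60.0000000000" cannot be parsed as a number.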
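As a stopgap on the existing aws_bill table, the quotes can be stripped at query time. The query below is only a sketch: it assumes the selected columns contain no embedded commas, and it uses the built-in regexp_replace function to drop one leading and one trailing quote.

```sql
-- Strip one leading and one trailing double quote from a few string columns.
-- Assumption: these columns never contain a comma inside the quotes.
SELECT
  regexp_replace(InvoiceID,   '^"|"$', '') AS InvoiceID,
  regexp_replace(ProductName, '^"|"$', '') AS ProductName,
  regexp_replace(Operation,   '^"|"$', '') AS Operation
FROM aws_bill
LIMIT 10;
```

This does not repair the columns declared as double, which are already NULL when the row is read. It also does not help a field such as ItemDescription, whose value "First 1,000,000 Amazon SQS Requests per month are free" is split at its embedded commas by FIELDS TERMINATED BY ','.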

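The DELIMITED row format has no option for stripping quotes from fields. Instead, the table can be created with the CSV SerDe [2] (OpenCSVSerde, bundled with Hive since 0.14), which parses quoted fields properly.
Create the table as follows: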

hive> CREATE EXTERNAL TABLE IF NOT EXISTS
 aws_bill_serde (
 InvoiceID string,
 PayerAccountId string,
    [... omitted ...]
 Cost double)
 ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
 LOCATION 's3://feichashao-hadoop/bill/';

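With the table created this way, select * from aws_bill_serde returns values without double quotes. Be aware that OpenCSVSerde treats every column as string regardless of the declared type, so a column such as Cost is reported as string and must be cast before doing arithmetic on it.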
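A somewhat fuller variant might spell out the SerDe properties, skip the header line, and expose typed columns through a view. The sketch below makes several assumptions: the table name aws_bill_csv and the view name aws_bill_typed are hypothetical, the property values shown are simply the OpenCSVSerde defaults written out explicitly, and each file under the location is assumed to start with exactly one header line.

```sql
-- Hypothetical table name. All columns are declared as string because
-- OpenCSVSerde exposes every column as string anyway.
CREATE EXTERNAL TABLE IF NOT EXISTS aws_bill_csv (
  InvoiceID string,
  PayerAccountId string,
  LinkedAccountId string,
  RecordType string,
  ProductName string,
  RateId string,
  SubscriptionId string,
  PricingPlanId string,
  UsageType string,
  Operation string,
  AvailabilityZone string,
  ReservedInstance string,
  ItemDescription string,
  UsageStartDate string,
  UsageEndDate string,
  UsageQuantity string,
  Rate string,
  Cost string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",    -- field delimiter
  "quoteChar"     = "\"",   -- quotes are removed; commas inside them are kept
  "escapeChar"    = "\\"
)
STORED AS TEXTFILE
LOCATION 's3://feichashao-hadoop/bill/'
-- Assumption: each file begins with one header line that should be skipped.
TBLPROPERTIES ("skip.header.line.count" = "1");

-- Hypothetical view that casts the numeric columns back to double.
CREATE VIEW IF NOT EXISTS aws_bill_typed AS
SELECT
  InvoiceID, PayerAccountId, LinkedAccountId, RecordType, ProductName,
  RateId, SubscriptionId, PricingPlanId, UsageType, Operation,
  AvailabilityZone, ReservedInstance, ItemDescription,
  UsageStartDate, UsageEndDate,
  CAST(UsageQuantity AS double) AS UsageQuantity,
  CAST(Rate AS double)          AS Rate,
  CAST(Cost AS double)          AS Cost
FROM aws_bill_csv;
```

Queries that need numeric values, for example a sum of Cost grouped by ProductName, would then read from aws_bill_typed rather than from the raw table.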
