Hive: query errors caused by fields containing malformed JSON

1. Problem

The Hive query fails with the following error:

Diagnostic Messages for this Task:
[2020-04-02 05:32:04,360] {bash_operator.py:110} INFO - Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row [Error getting row data with exception java.lang.ClassCastException: java.lang.String cannot be cast to org.openx.data.jsonserde.json.JSONObject
[2020-04-02 05:32:04,360] {bash_operator.py:110} INFO - 	at org.openx.data.jsonserde.objectinspector.JsonMapObjectInspector.getMap(JsonMapObjectInspector.java:40)
[2020-04-02 05:32:04,361] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.serde2.SerDeUtils.buildJSONString(SerDeUtils.java:318)
[2020-04-02 05:32:04,361] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.serde2.SerDeUtils.buildJSONString(SerDeUtils.java:354)
[2020-04-02 05:32:04,361] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.serde2.SerDeUtils.getJSONString(SerDeUtils.java:198)
[2020-04-02 05:32:04,361] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.serde2.SerDeUtils.getJSONString(SerDeUtils.java:184)
[2020-04-02 05:32:04,361] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.exec.MapOperator.toErrorMessage(MapOperator.java:544)
[2020-04-02 05:32:04,361] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:513)
[2020-04-02 05:32:04,361] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:163)
[2020-04-02 05:32:04,362] {bash_operator.py:110} INFO - 	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
[2020-04-02 05:32:04,362] {bash_operator.py:110} INFO - 	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
[2020-04-02 05:32:04,362] {bash_operator.py:110} INFO - 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
[2020-04-02 05:32:04,362] {bash_operator.py:110} INFO - 	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
[2020-04-02 05:32:04,362] {bash_operator.py:110} INFO - 	at java.security.AccessController.doPrivileged(Native Method)
[2020-04-02 05:32:04,363] {bash_operator.py:110} INFO - 	at javax.security.auth.Subject.doAs(Subject.java:422)
[2020-04-02 05:32:04,363] {bash_operator.py:110} INFO - 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
[2020-04-02 05:32:04,363] {bash_operator.py:110} INFO - 	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
[2020-04-02 05:32:04,363] {bash_operator.py:110} INFO -  ]
[2020-04-02 05:32:04,363] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:172)
[2020-04-02 05:32:04,364] {bash_operator.py:110} INFO - 	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
[2020-04-02 05:32:04,364] {bash_operator.py:110} INFO - 	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
[2020-04-02 05:32:04,364] {bash_operator.py:110} INFO - 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
[2020-04-02 05:32:04,364] {bash_operator.py:110} INFO - 	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
[2020-04-02 05:32:04,364] {bash_operator.py:110} INFO - 	at java.security.AccessController.doPrivileged(Native Method)
[2020-04-02 05:32:04,364] {bash_operator.py:110} INFO - 	at javax.security.auth.Subject.doAs(Subject.java:422)
[2020-04-02 05:32:04,364] {bash_operator.py:110} INFO - 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
[2020-04-02 05:32:04,364] {bash_operator.py:110} INFO - 	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
[2020-04-02 05:32:04,364] {bash_operator.py:110} INFO - Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row [Error getting row data with exception java.lang.ClassCastException: java.lang.String cannot be cast to org.openx.data.jsonserde.json.JSONObject
[2020-04-02 05:32:04,365] {bash_operator.py:110} INFO - 	at org.openx.data.jsonserde.objectinspector.JsonMapObjectInspector.getMap(JsonMapObjectInspector.java:40)
[2020-04-02 05:32:04,365] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.serde2.SerDeUtils.buildJSONString(SerDeUtils.java:318)
[2020-04-02 05:32:04,365] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.serde2.SerDeUtils.buildJSONString(SerDeUtils.java:354)
[2020-04-02 05:32:04,365] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.serde2.SerDeUtils.getJSONString(SerDeUtils.java:198)
[2020-04-02 05:32:04,365] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.serde2.SerDeUtils.getJSONString(SerDeUtils.java:184)
[2020-04-02 05:32:04,365] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.exec.MapOperator.toErrorMessage(MapOperator.java:544)
[2020-04-02 05:32:04,365] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:513)
[2020-04-02 05:32:04,365] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:163)
[2020-04-02 05:32:04,365] {bash_operator.py:110} INFO - 	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
[2020-04-02 05:32:04,366] {bash_operator.py:110} INFO - 	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
[2020-04-02 05:32:04,366] {bash_operator.py:110} INFO - 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
[2020-04-02 05:32:04,366] {bash_operator.py:110} INFO - 	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
[2020-04-02 05:32:04,366] {bash_operator.py:110} INFO - 	at java.security.AccessController.doPrivileged(Native Method)
[2020-04-02 05:32:04,366] {bash_operator.py:110} INFO - 	at javax.security.auth.Subject.doAs(Subject.java:422)
[2020-04-02 05:32:04,366] {bash_operator.py:110} INFO - 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
[2020-04-02 05:32:04,366] {bash_operator.py:110} INFO - 	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
[2020-04-02 05:32:04,366] {bash_operator.py:110} INFO -  ]
[2020-04-02 05:32:04,366] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:518)
[2020-04-02 05:32:04,367] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:163)
[2020-04-02 05:32:04,367] {bash_operator.py:110} INFO - 	... 8 more
[2020-04-02 05:32:04,367] {bash_operator.py:110} INFO - Caused by: java.lang.RuntimeException: Parquet record is malformed: java.lang.String cannot be cast to org.openx.data.jsonserde.json.JSONObject
[2020-04-02 05:32:04,367] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:64)
[2020-04-02 05:32:04,367] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:59)
[2020-04-02 05:32:04,367] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:31)
[2020-04-02 05:32:04,367] {bash_operator.py:110} INFO - 	at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:121)
[2020-04-02 05:32:04,367] {bash_operator.py:110} INFO - 	at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:123)
[2020-04-02 05:32:04,367] {bash_operator.py:110} INFO - 	at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:42)
[2020-04-02 05:32:04,368] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:111)
[2020-04-02 05:32:04,368] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:124)
[2020-04-02 05:32:04,368] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:753)
[2020-04-02 05:32:04,368] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
[2020-04-02 05:32:04,368] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:88)
[2020-04-02 05:32:04,368] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
[2020-04-02 05:32:04,368] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:97)
[2020-04-02 05:32:04,368] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:162)
[2020-04-02 05:32:04,368] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:508)
[2020-04-02 05:32:04,368] {bash_operator.py:110} INFO - 	... 9 more
[2020-04-02 05:32:04,369] {bash_operator.py:110} INFO - Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to org.openx.data.jsonserde.json.JSONObject
[2020-04-02 05:32:04,369] {bash_operator.py:110} INFO - 	at org.openx.data.jsonserde.objectinspector.JsonMapObjectInspector.getMap(JsonMapObjectInspector.java:40)
[2020-04-02 05:32:04,369] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeMap(DataWritableWriter.java:211)
[2020-04-02 05:32:04,369] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeValue(DataWritableWriter.java:116)
[2020-04-02 05:32:04,369] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeGroupFields(DataWritableWriter.java:89)
[2020-04-02 05:32:04,369] {bash_operator.py:110} INFO - 	at org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:60)
[2020-04-02 05:32:04,369] {bash_operator.py:110} INFO - 	... 23 more
[2020-04-02 05:32:04,369] {bash_operator.py:110} INFO - 
[2020-04-02 05:32:04,369] {bash_operator.py:110} INFO - 
[2020-04-02 05:32:04,391] {bash_operator.py:110} INFO - FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask

2. Tracking down the problem

The error occurs when querying the table dw.rec_click_jv, an external table that reads JSON-formatted HDFS files directly. Its DDL:

CREATE EXTERNAL TABLE IF NOT EXISTS dw.rec_click_jv${env_suffix}(
  u_timestamp STRING COMMENT 'timestamp in UTC+8',
  bucketlist array<STRING> COMMENT 'AB-test strategy names',
  ext map<STRING,STRING> COMMENT 'extension fields',
  algInfo map<STRING,STRING> COMMENT 'video distribution strategy',
  u_bigger_json STRING COMMENT 'the final full JSON'
)
COMMENT 'recommendation click log'
PARTITIONED BY(dt STRING)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ("ignore.malformed.json" = "true")
LOCATION '/rec/click_jv${env_suffix}/';

Further digging showed that the error only appears when the algInfo field is queried. The field is declared as map<STRING,STRING>, and whenever a row's algInfo value is not a valid JSON object the exception below is thrown. Note that "ignore.malformed.json" does not catch this case: the line as a whole still parses as valid JSON, so the SerDe accepts the row, and the failure only surfaces when JsonMapObjectInspector tries to cast the string value to a JSONObject (here, while the row is being written out as Parquet):

java.lang.RuntimeException: Parquet record is malformed: java.lang.String cannot be cast to org.openx.data.jsonserde.json.JSONObject
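
For illustration, a hypothetical log line of the following shape would reproduce it; every field except algInfo matches the schema, but algInfo carries a plain string instead of a JSON object:

{"u_timestamp":"2020-04-01 12:00:00","bucketlist":["exp_a"],"ext":{},"algInfo":"not a json object","u_bigger_json":"{}"}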

3. Solution

Changing the data type of the algInfo column in dw.rec_click_jv to STRING resolves the query failure. Downstream tables, however, still need this field as map<STRING,STRING>; the Hive built-in function str_to_map can convert the STRING value into a map when writing into them.
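
A minimal sketch of the type change (assuming an in-place ALTER is acceptable here; JsonSerDe applies the schema at read time, so the existing files need no rewrite):

ALTER TABLE dw.rec_click_jv CHANGE algInfo algInfo STRING COMMENT 'video distribution strategy (raw JSON string)';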

The downstream table definition:

CREATE EXTERNAL TABLE IF NOT EXISTS edw.rec_video_click${env_suffix}(
    -- request info
    u_timestamp TIMESTAMP COMMENT 'timestamp in UTC+8',
    u_bucketlist array<STRING> COMMENT 'AB-test strategy names',
    u_ext map<STRING,STRING> COMMENT 'extension fields',
    u_algInfo map<STRING,STRING> COMMENT 'video distribution strategy'
)
COMMENT 'EDW - app recommended-video clicks'
PARTITIONED BY(dt STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
STORED AS PARQUET
LOCATION '/dw/edw/rec_video_click${env_suffix}'
TBLPROPERTIES ('parquet.compress'='SNAPPY')
;

The str_to_map function is defined as follows:

str_to_map(text, delimiter1, delimiter2)
Splits text into key-value pairs using two delimiters:
delimiter1 separates the text into K-V pairs, and delimiter2 separates each key from its value.
delimiter1 defaults to ',' and delimiter2 defaults to '='.
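
A quick illustration (both delimiters here are plain, regex-safe characters):

SELECT str_to_map('k1:v1,k2:v2', ',', ':');
-- result: {"k1":"v1","k2":"v2"}
SELECT str_to_map('a=1,b=2');
-- with the default delimiters: {"a":"1","b":"2"}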

Note that before calling str_to_map, the JSON punctuation in the string has to be rewritten: the '","' separators between pairs, the '":"' separators between keys and values, and the enclosing '{"' and '"}'. This is necessary because values (see ruletags below) can themselves contain commas, which would otherwise be mistaken for pair separators:

Original data:
{"reasons":"","ruletags":"(264,first_cat),(vitemcf,retrieve)","lasthitrule":"-1","vfactors":"vitemcf,vyoutubednn"}

The correct way to do it:

INSERT OVERWRITE TABLE edw.rec_video_click PARTITION(dt='2020-04-01')
SELECT
    ...
    if(algInfo is not null,
       str_to_map(
           regexp_replace(regexp_replace(regexp_replace(regexp_replace(
               algInfo, '\",\"', '@@@@@'), '\":\"', '#####'), '\\{\"', ''), '\"\\}', ''),
           '@@@@@', '#####'),
       null)
FROM dw.rec_click_jv
WHERE dt='2020-04-01';
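
As a sanity check, the same replacement chain can be run standalone against the sample record above (a sketch; the string literal is just the example row from earlier):

SELECT str_to_map(
    regexp_replace(regexp_replace(regexp_replace(regexp_replace(
        '{"reasons":"","ruletags":"(264,first_cat),(vitemcf,retrieve)","lasthitrule":"-1","vfactors":"vitemcf,vyoutubednn"}',
        '\",\"', '@@@@@'),  -- rewrite the "," separators between pairs
        '\":\"', '#####'),  -- rewrite the ":" separators between keys and values
        '\\{\"', ''),       -- strip the leading {"
        '\"\\}', ''),       -- strip the trailing "}
    '@@@@@', '#####');
-- expected: reasons -> '', ruletags -> '(264,first_cat),(vitemcf,retrieve)',
--           lasthitrule -> '-1', vfactors -> 'vitemcf,vyoutubednn'

The comma inside '(264,first_cat),(vitemcf,retrieve)' survives intact, which is exactly why str_to_map's default delimiters cannot be applied to the raw string.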