Data Lake Analytics + OSS數據文件格式處理大全

0. 前言

Data Lake Analytics是Serverless化的雲上交互式查詢分析服務。用戶可以使用標準的SQL語句,對存儲在OSS、TableStore上的數據無需移動,直接進行查詢分析。

目前該產品已經正式登陸阿里雲,歡迎大家申請試用,體驗更便捷的數據分析服務。
請參考https://help.aliyun.com/document_detail/70386.html 進行產品開通服務申請。

在上一篇教程中,我們介紹了如何分析CSV格式的TPC-H數據集。除了純文本文件(例如,CSV,TSV等),用戶存儲在OSS上的其他格式的數據文件,也可以使用Data Lake Analytics進行查詢分析,包括ORC, PARQUET, JSON, RCFILE, AVRO甚至ESRI規範的地理JSON數據,還可以用正則表達式匹配的文件等。

本文詳細介紹如何根據存儲在OSS上的文件格式使用Data Lake Analytics (下文簡稱 DLA)進行分析。DLA內置了各種處理文件數據的SerDe(Serialize/Deserilize的簡稱,目的是用於序列化和反序列化)實現,用戶無需自己編寫程序,基本上能選用DLA中的一款或多款SerDe來匹配您OSS上的數據文件格式。如果還不能滿足您特殊文件格式的處理需求,請聯繫我們,儘快爲您實現。

1. 存儲格式與SerDe

用戶可以依據存儲在OSS上的數據文件進行建表,通過STORED AS 指定數據文件的格式。
例如,

CREATE EXTERNAL TABLE nation (
    N_NATIONKEY INT, 
    N_NAME STRING, 
    N_REGIONKEY INT, 
    N_COMMENT STRING
) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' 
STORED AS TEXTFILE 
LOCATION 'oss://test-bucket-julian-1/tpch_100m/nation';

建表成功後可以使用SHOW CREATE TABLE語句查看原始建表語句。

mysql> show create table nation;
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Result                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| CREATE EXTERNAL TABLE `nation`(
  `n_nationkey` int,
  `n_name` string,
  `n_regionkey` int,
  `n_comment` string)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '|'
STORED AS `TEXTFILE`
LOCATION
  'oss://test-bucket-julian-1/tpch_100m/nation'|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (1.81 sec)

下表中列出了目前DLA已經支持的文件格式,當針對下列格式的文件建表時,可以直接使用STORED AS,DLA會選擇合適的SERDE/INPUTFORMAT/OUTPUTFORMAT。

在指定了STORED AS 的同時,還可以根據具體文件的特點,指定SerDe (用於解析數據文件並映射到DLA表),特殊的列分隔符等。
後面的部分會做進一步的講解。

2. 示例

2.1 CSV文件

CSV文件,本質上還是純文本文件,可以使用STORED AS TEXTFILE。
列與列之間以逗號分隔,可以通過ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 表示。

普通CSV文件

例如,數據文件oss://bucket-for-testing/oss/text/cities/city.csv的內容爲

Beijing,China,010
ShangHai,China,021
Tianjin,China,022

建表語句可以爲

CREATE EXTERNAL TABLE city (
    city STRING, 
    country STRING, 
    code INT
) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
STORED AS TEXTFILE 
LOCATION 'oss://bucket-for-testing/oss/text/cities';

使用OpenCSVSerde__處理引號__引用的字段

OpenCSVSerde在使用時需要注意以下幾點:

  1. 用戶可以爲行的字段指定字段分隔符、字段內容引用符號和轉義字符,例如:WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "`", "escapeChar" = "" );
  2. 不支持字段內嵌入的行分割符;
  3. 所有字段定義STRING類型;
  4. 其他數據類型的處理,可以在SQL中使用函數進行轉換。
    例如,
CREATE EXTERNAL TABLE test_csv_opencsvserde (
  id STRING,
  name STRING,
  location STRING,
  create_date STRING,
  create_timestamp STRING,
  longitude STRING,
  latitude STRING
) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties(
"separatorChar"=",",
"quoteChar"="\"",
"escapeChar"="\\"
)
STORED AS TEXTFILE LOCATION 'oss://test-bucket-julian-1/test_csv_serde_1';

自定義分隔符

需要自定義列分隔符(FIELDS TERMINATED BY),轉義字符(ESCAPED BY),行結束符(LINES TERMINATED BY)。
需要在建表語句中指定

ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    ESCAPED BY '\\'
    LINES TERMINATED BY '\n'

忽略CSV文件中的HEADER

在csv文件中,有時會帶有HEADER信息,需要在數據讀取時忽略掉這些內容。這時需要在建表語句中定義skip.header.line.count。

例如,數據文件oss://my-bucket/datasets/tpch/nation_csv/nation_header.tbl的內容如下:

N_NATIONKEY|N_NAME|N_REGIONKEY|N_COMMENT
0|ALGERIA|0| haggle. carefully final deposits detect slyly agai|
1|ARGENTINA|1|al foxes promise slyly according to the regular accounts. bold requests alon|
2|BRAZIL|1|y alongside of the pending deposits. carefully special packages are about the ironic forges. slyly special |
3|CANADA|1|eas hang ironic, silent packages. slyly regular packages are furiously over the tithes. fluffily bold|
4|EGYPT|4|y above the carefully unusual theodolites. final dugouts are quickly across the furiously regular d|
5|ETHIOPIA|0|ven packages wake quickly. regu|

相應的建表語句爲:

CREATE EXTERNAL TABLE nation_header (
    N_NATIONKEY INT, 
    N_NAME STRING, 
    N_REGIONKEY INT, 
    N_COMMENT STRING
) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' 
STORED AS TEXTFILE 
LOCATION 'oss://my-bucket/datasets/tpch/nation_csv/nation_header.tbl'
TBLPROPERTIES ("skip.header.line.count"="1");

skip.header.line.count的取值x和數據文件的實際行數n有如下關係:

  • 當x<=0時,DLA在讀取文件時,不會過濾掉任何信息,即全部讀取;
  • 當0
  • 當x>=n時,DLA在讀取文件時,會過濾掉所有的文件內容。

2.2 TSV文件

與CSV文件類似,TSV格式的文件也是純文本文件,列與列之間的分隔符爲Tab。

例如,數據文件oss://bucket-for-testing/oss/text/cities/city.tsv的內容爲

Beijing    China    010
ShangHai    China    021
Tianjin    China    022

建表語句可以爲

CREATE EXTERNAL TABLE city (
    city STRING, 
    country STRING, 
    code INT
) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
STORED AS TEXTFILE 
LOCATION 'oss://bucket-for-testing/oss/text/cities';

2.3 多字符數據字段分割符文件

假設您的數據字段的分隔符包含多個字符,可採用如下示例建表語句,其中每行的數據字段分割符爲“||”,可以替換爲您具體的分割符字符串。

ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
with serdeproperties(
"field.delim"="||"
)

示例:

CREATE EXTERNAL TABLE test_csv_multidelimit (
  id STRING,
  name STRING,
  location STRING,
  create_date STRING,
  create_timestamp STRING,
  longitude STRING,
  latitude STRING
) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
with serdeproperties(
"field.delim"="||"
)
STORED AS TEXTFILE LOCATION 'oss://bucket-for-testing/oss/text/cities/';

2.4 JSON文件

DLA可以處理的JSON文件通常以純文本的格式存儲,在建表時除了要指定STORED AS TEXTFILE, 還要定義SERDE。
在JSON文件中,每行必須是一個完整的JSON對象。
例如,下面的文件格式是不被接受的

{"id": 123, "name": "jack", 
"c3": "2001-02-03 12:34:56"}
{"id": 456, "name": "rose", "c3": "1906-04-18 05:12:00"}
{"id": 789, "name": "tom", "c3": "2001-02-03 12:34:56"}
{"id": 234, "name": "alice", "c3": "1906-04-18 05:12:00"}

需要改寫成:

{"id": 123, "name": "jack", "c3": "2001-02-03 12:34:56"}
{"id": 456, "name": "rose", "c3": "1906-04-18 05:12:00"}
{"id": 789, "name": "tom", "c3": "2001-02-03 12:34:56"}
{"id": 234, "name": "alice", "c3": "1906-04-18 05:12:00"}

不含嵌套的JSON數據

建表語句可以寫

CREATE EXTERNAL TABLE t1 (id int, name string, c3 timestamp)
STORED AS JSON
LOCATION 'oss://path/to/t1/directory';
含有嵌套的JSON文件

使用struct和array結構定義嵌套的JSON數據。
例如,用戶原始數據(注意:無論是否嵌套,一條完整的JSON數據都只能放在一行上,才能被Data Lake Analytics處理):

{       "DocId": "Alibaba",         "User_1": {             "Id": 1234,             "Username": "bob1234",          "Name": "Bob",          "ShippingAddress": {                    "Address1": "969 Wenyi West St.",                     "Address2": null,                       "City": "Hangzhou",                      "Province": "Zhejiang"           },              "Orders": [{                            "ItemId": 6789,                                 "OrderDate": "11/11/2017"                       },                      {                               "ItemId": 4352,                                 "OrderDate": "12/12/2017"                       }               ]       } }

使用在線JSON格式化工具格式化後,數據內容如下:

{
    "DocId": "Alibaba", 
    "User_1": {
        "Id": 1234, 
        "Username": "bob1234", 
        "Name": "Bob", 
        "ShippingAddress": {
            "Address1": "969 Wenyi West St.", 
            "Address2": null, 
            "City": "Hangzhou", 
            "Province": "Zhejiang"
        }, 
        "Orders": [
            {
                "ItemId": 6789, 
                "OrderDate": "11/11/2017"
            }, 
            {
                "ItemId": 4352, 
                "OrderDate": "12/12/2017"
            }
        ]
    }
}

則建表語句可以寫成如下(注意:LOCATION中指定的路徑必須是JSON數據文件所在的目錄,該目錄下的所有JSON文件都能被識別爲該表的數據):

CREATE EXTERNAL TABLE json_table_1 (
    docid string,
    user_1 struct<
            id:INT,
            username:string,
            name:string,
            shippingaddress:struct<
                            address1:string,
                            address2:string,
                            city:string,
                            province:string
                            >,
            orders:array<
                    struct<
                        itemid:INT,
                        orderdate:string
                    >
            >
    >
)
STORED AS JSON
LOCATION 'oss://xxx/test/json/hcatalog_serde/table_1/';

對該表進行查詢:

select * from json_table_1;

+---------+----------------------------------------------------------------------------------------------------------------+
| docid   | user_1                                                                                                         |
+---------+----------------------------------------------------------------------------------------------------------------+
| Alibaba | [1234, bob1234, Bob, [969 Wenyi West St., null, Hangzhou, Zhejiang], [[6789, 11/11/2017], [4352, 12/12/2017]]] |
+---------+----------------------------------------------------------------------------------------------------------------+

對於struct定義的嵌套結構,可以通過“.”進行層次對象引用,對於array定義的數組結構,可以通過“[數組下標]”(注意:數組下標從1開始)進行對象引用。

select DocId,
       User_1.Id,
       User_1.ShippingAddress.Address1,
       User_1.Orders[1].ItemId
from json_table_1
where User_1.Username = 'bob1234'
  and User_1.Orders[2].OrderDate = '12/12/2017';

+---------+------+--------------------+-------+
| DocId   | id   | address1           | _col3 |
+---------+------+--------------------+-------+
| Alibaba | 1234 | 969 Wenyi West St. |  6789 |
+---------+------+--------------------+-------+

使用JSON函數處理數據

例如,把“value_string”的嵌套JSON值作爲字符串存儲:

{"data_key":"com.taobao.vipserver.domains.meta.biz.alibaba.com","ts":1524550275112,"value_string":"{\"appName\":\"\",\"apps\":[],\"checksum\":\"50fa0540b430904ee78dff07c7350e1c\",\"clusterMap\":{\"DEFAULT\":{\"defCkport\":80,\"defIPPort\":80,\"healthCheckTask\":null,\"healthChecker\":{\"checkCode\":200,\"curlHost\":\"\",\"curlPath\":\"/status.taobao\",\"type\":\"HTTP\"},\"name\":\"DEFAULT\",\"nodegroup\":\"\",\"sitegroup\":\"\",\"submask\":\"0.0.0.0/0\",\"syncConfig\":{\"appName\":\"trade-ma\",\"nodegroup\":\"tradema\",\"pubLevel\":\"publish\",\"role\":\"\",\"site\":\"\"},\"useIPPort4Check\":true}},\"disabledSites\":[],\"enableArmoryUnit\":false,\"enableClientBeat\":false,\"enableHealthCheck\":true,\"enabled\":true,\"envAndSites\":\"\",\"invalidThreshold\":0.6,\"ipDeleteTimeout\":1800000,\"lastModifiedMillis\":1524550275107,\"localSiteCall\":true,\"localSiteThreshold\":0.8,\"name\":\"biz.alibaba.com\",\"nodegroup\":\"\",\"owners\":[\"junlan.zx\",\"張三\",\"李四\",\"cui.yuanc\"],\"protectThreshold\":0,\"requireSameEnv\":false,\"resetWeight\":false,\"symmetricCallType\":null,\"symmetricType\":\"warehouse\",\"tagName\":\"ipGroup\",\"tenantId\":\"\",\"tenants\":[],\"token\":\"1cf0ec0c771321bb4177182757a67fb0\",\"useSpecifiedURL\":false}"}

使用在線JSON格式化工具格式化後,數據內容如下:

{
    "data_key": "com.taobao.vipserver.domains.meta.biz.alibaba.com", 
    "ts": 1524550275112, 
    "value_string": "{\"appName\":\"\",\"apps\":[],\"checksum\":\"50fa0540b430904ee78dff07c7350e1c\",\"clusterMap\":{\"DEFAULT\":{\"defCkport\":80,\"defIPPort\":80,\"healthCheckTask\":null,\"healthChecker\":{\"checkCode\":200,\"curlHost\":\"\",\"curlPath\":\"/status.taobao\",\"type\":\"HTTP\"},\"name\":\"DEFAULT\",\"nodegroup\":\"\",\"sitegroup\":\"\",\"submask\":\"0.0.0.0/0\",\"syncConfig\":{\"appName\":\"trade-ma\",\"nodegroup\":\"tradema\",\"pubLevel\":\"publish\",\"role\":\"\",\"site\":\"\"},\"useIPPort4Check\":true}},\"disabledSites\":[],\"enableArmoryUnit\":false,\"enableClientBeat\":false,\"enableHealthCheck\":true,\"enabled\":true,\"envAndSites\":\"\",\"invalidThreshold\":0.6,\"ipDeleteTimeout\":1800000,\"lastModifiedMillis\":1524550275107,\"localSiteCall\":true,\"localSiteThreshold\":0.8,\"name\":\"biz.alibaba.com\",\"nodegroup\":\"\",\"owners\":[\"junlan.zx\",\"張三\",\"李四\",\"cui.yuanc\"],\"protectThreshold\":0,\"requireSameEnv\":false,\"resetWeight\":false,\"symmetricCallType\":null,\"symmetricType\":\"warehouse\",\"tagName\":\"ipGroup\",\"tenantId\":\"\",\"tenants\":[],\"token\":\"1cf0ec0c771321bb4177182757a67fb0\",\"useSpecifiedURL\":false}"
}

建表語句爲

CREATE external TABLE json_table_2 (
   data_key string,
   ts bigint,
   value_string string
)
STORED AS JSON
LOCATION 'oss://xxx/test/json/hcatalog_serde/table_2/';

表建好後,可進行查詢:

select * from json_table_2;

+---------------------------------------------------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| data_key                                          | ts            | value_string                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
+---------------------------------------------------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| com.taobao.vipserver.domains.meta.biz.alibaba.com | 1524550275112 | {"appName":"","apps":[],"checksum":"50fa0540b430904ee78dff07c7350e1c","clusterMap":{"DEFAULT":{"defCkport":80,"defIPPort":80,"healthCheckTask":null,"healthChecker":{"checkCode":200,"curlHost":"","curlPath":"/status.taobao","type":"HTTP"},"name":"DEFAULT","nodegroup":"","sitegroup":"","submask":"0.0.0.0/0","syncConfig":{"appName":"trade-ma","nodegroup":"tradema","pubLevel":"publish","role":"","site":""},"useIPPort4Check":true}},"disabledSites":[],"enableArmoryUnit":false,"enableClientBeat":false,"enableHealthCheck":true,"enabled":true,"envAndSites":"","invalidThreshold":0.6,"ipDeleteTimeout":1800000,"lastModifiedMillis":1524550275107,"localSiteCall":true,"localSiteThreshold":0.8,"name":"biz.alibaba.com","nodegroup":"","owners":["junlan.zx","張三","李四","cui.yuanc"],"protectThreshold":0,"requireSameEnv":false,"resetWeight":false,"symmetricCallType":null,"symmetricType":"warehouse","tagName":"ipGroup","tenantId":"","tenants":[],"token":"1cf0ec0c771321bb4177182757a67fb0","useSpecifiedURL":false}       |
+---------------------------------------------------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

下面SQL示例json_parse,json_extract_scalar,json_extract等常用JSON函數的使用方式:

mysql> select json_extract_scalar(json_parse(value), '$.owners[1]') from json_table_2;

+--------+
| _col0  |
+--------+
| 張三    |
+--------+

mysql> select json_extract_scalar(json_obj.json_col, '$.DEFAULT.submask') 
from (
  select json_extract(json_parse(value), '$.clusterMap') as json_col from json_table_2
) json_obj
where json_extract_scalar(json_obj.json_col, '$.DEFAULT.healthChecker.curlPath') = '/status.taobao';

+-----------+
| _col0     |
+-----------+
| 0.0.0.0/0 |
+-----------+

mysql> with json_obj as (select json_extract(json_parse(value), '$.clusterMap') as json_col from json_table_2)
select json_extract_scalar(json_obj.json_col, '$.DEFAULT.submask')
from json_obj 
where json_extract_scalar(json_obj.json_col, '$.DEFAULT.healthChecker.curlPath') = '/status.taobao';

+-----------+
| _col0     |
+-----------+
| 0.0.0.0/0 |
+-----------+

2.5 ORC文件

Optimized Row Columnar(ORC)是Apache開源項目Hive支持的一種優化的列存儲文件格式。與CSV文件相比,不僅可以節省存儲空間,還可以得到更好的查詢性能。

對於ORC文件,只需要在建表時指定 STORED AS ORC。
例如,

CREATE EXTERNAL TABLE orders_orc_date (
    O_ORDERKEY INT, 
    O_CUSTKEY INT, 
    O_ORDERSTATUS STRING, 
    O_TOTALPRICE DOUBLE, 
    O_ORDERDATE DATE, 
    O_ORDERPRIORITY STRING, 
    O_CLERK STRING, 
    O_SHIPPRIORITY INT, 
    O_COMMENT STRING
) 
STORED AS ORC 
LOCATION 'oss://bucket-for-testing/datasets/tpch/1x/orc_date/orders_orc';

2.6 PARQUET文件

Parquet是Apache開源項目Hadoop支持的一種列存儲的文件格式。
使用DLA建表時,需要指定STORED AS PARQUET即可。
例如,

CREATE EXTERNAL TABLE orders_parquet_date (
    O_ORDERKEY INT, 
    O_CUSTKEY INT, 
    O_ORDERSTATUS STRING, 
    O_TOTALPRICE DOUBLE, 
    O_ORDERDATE DATE, 
    O_ORDERPRIORITY STRING, 
    O_CLERK STRING, 
    O_SHIPPRIORITY INT, 
    O_COMMENT STRING
) 
STORED AS PARQUET 
LOCATION 'oss://bucket-for-testing/datasets/tpch/1x/parquet_date/orders_parquet';

2.7 RCFILE文件

Record Columnar File (RCFile), 列存儲文件,可以有效地將關係型表結構存儲在分佈式系統中,並且可以被高效地讀取和處理。
DLA在建表時,需要指定STORED AS RCFILE。
例如,

CREATE EXTERNAL TABLE lineitem_rcfile_date (
    L_ORDERKEY INT, 
    L_PARTKEY INT, 
    L_SUPPKEY INT, 
    L_LINENUMBER INT, 
    L_QUANTITY DOUBLE, 
    L_EXTENDEDPRICE DOUBLE, 
    L_DISCOUNT DOUBLE, 
    L_TAX DOUBLE, 
    L_RETURNFLAG STRING, 
    L_LINESTATUS STRING, 
    L_SHIPDATE DATE, 
    L_COMMITDATE DATE, 
    L_RECEIPTDATE DATE, 
    L_SHIPINSTRUCT STRING, 
    L_SHIPMODE STRING, 
    L_COMMENT STRING
) 
STORED AS RCFILE
LOCATION 'oss://bucke-for-testing/datasets/tpch/1x/rcfile_date/lineitem_rcfile'

2.8 AVRO文件

DLA針對AVRO文件建表時,需要指定STORED AS AVRO,並且定義的字段需要符合AVRO文件的schema。

如果不確定可以通過使用Avro提供的工具,獲得schema,並根據schema建表。
Apache Avro官網下載avro-tools-.jar到本地,執行下面的命令獲得Avro文件的schema:

java -jar avro-tools-1.8.2.jar getschema /path/to/your/doctors.avro
{
  "type" : "record",
  "name" : "doctors",
  "namespace" : "testing.hive.avro.serde",
  "fields" : [ {
    "name" : "number",
    "type" : "int",
    "doc" : "Order of playing the role"
  }, {
    "name" : "first_name",
    "type" : "string",
    "doc" : "first name of actor playing role"
  }, {
    "name" : "last_name",
    "type" : "string",
    "doc" : "last name of actor playing role"
  } ]
}

建表語句如下,其中fields中的name對應表中的列名,type需要參考本文檔中的表格轉成hive支持的類型

CREATE EXTERNAL TABLE doctors(
number int,
first_name string,
last_name string)
STORED AS AVRO
LOCATION 'oss://mybucket-for-testing/directory/to/doctors';

大多數情況下,Avro的類型可以直接轉換成Hive中對應的類型。如果該類型在Hive不支持,則會轉換成接近的類型。具體請參照下表:

2.9 可以用正則表達式匹配的文件

通常此類型的文件是以純文本格式存儲在OSS上的,每一行代表表中的一條記錄,並且每行可以用正則表達式匹配。
例如,Apache WebServer日誌文件就是這種類型的文件。

某日誌文件的內容爲:

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
127.0.0.1 - - [26/May/2009:00:00:00 +0000] "GET /someurl/?track=Blabla(Main) HTTP/1.1" 200 5864 - "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/525.19 (KHTML, like Gecko) Chrome/1.0.154.65 Safari/525.19"

每行文件可以用下面的正則表達式表示,列之間使用空格分隔:

([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?

針對上面的文件格式,建表語句可以表示爲:

CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  identity STRING,
  userName STRING,
  time STRING,
  request STRING,
  status STRING,
  size INT,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?"
)
STORED AS TEXTFILE
LOCATION 'oss://bucket-for-testing/datasets/serde/regex';

查詢結果

mysql> select * from serde_regex;
+-----------+----------+-------+------------------------------+---------------------------------------------+--------+------+---------+--------------------------------------------------------------------------------------------------------------------------+
| host      | identity | userName | time                         | request                                     | status | size | referer | agent                                                                                                                    |
+-----------+----------+-------+------------------------------+---------------------------------------------+--------+------+---------+--------------------------------------------------------------------------------------------------------------------------+
| 127.0.0.1 | -        | frank | [10/Oct/2000:13:55:36 -0700] | "GET /apache_pb.gif HTTP/1.0"               | 200    | 2326 | NULL    | NULL                                                                                                                     |
| 127.0.0.1 | -        | -     | [26/May/2009:00:00:00 +0000] | "GET /someurl/?track=Blabla(Main) HTTP/1.1" | 200    | 5864 | -       | "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/525.19 (KHTML, like Gecko) Chrome/1.0.154.65 Safari/525.19" |
+-----------+----------+-------+------------------------------+---------------------------------------------+--------+------+---------+--------------------------------------------------------------------------------------------------------------------------+

2.10 Esri ArcGIS的地理JSON數據文件

DLA支持Esri ArcGIS的地理JSON數據文件的SerDe處理,關於這種地理JSON數據格式說明,可以參考:https://github.com/Esri/spatial-framework-for-hadoop/wiki/JSON-Formats

示例:

CREATE EXTERNAL TABLE IF NOT EXISTS california_counties
(
    Name string,
    BoundaryShape binary
)
ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.JsonSerde'
STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'oss://test_bucket/datasets/geospatial/california-counties/'

3. 總結

通過以上例子可以看出,DLA可以支持大部分開源存儲格式的文件。對於同一份數據,使用不同的存儲格式,在OSS中存儲文件的大小,DLA的查詢分析速度上會有較大的差別。推薦使用ORC格式進行文件的存儲和查詢。

爲了獲得更快的查詢速度,DLA還在不斷的優化中,後續也會支持更多的數據源,爲用戶帶來更好的大數據分析體驗。



本文作者:金絡

閱讀原文

本文爲雲棲社區原創內容,未經允許不得轉載。

發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章