Introduction
Apache Drill is a low latency distributed query engine for large-scale datasets, including structured and semi-structured/nested
data.
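What Apache Drill is for: Drill is a SQL query engine that can sit on top of almost any NoSQL database or file system (for example Hive, HDFS, MongoDB, or Amazon S3) to speed up queries. Hive is the familiar comparison: it runs SQL-like queries over HDFS, but it is relatively slow. A query engine such as Drill can be used to speed those queries up, in scenarios such as near real-time queries over distributed big data.
Architecture
Drill builds its query engine on top of different data sources through the storage plugin interface, that is, through plugins.
Installation: Drill can be installed in embedded mode or in distributed mode.
This section covers embedded-mode installation on Linux:
Embedded mode needs no extra configuration and is the simpler option. First install JDK 7;
Change to the directory where Drill should be installed and open a shell;
Download the package by running one of the following commands:
wget http://getdrill.org/drill/download/apache-drill-1.1.0.tar.gz
or
curl -o apache-drill-1.1.0.tar.gz http://getdrill.org/drill/download/apache-drill-1.1.0.tar.gz
Once the file is in the installation directory (or has been moved there after the download),
extract the archive with the command tar -xvzf <.tar.gz file name>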
After extraction, change into the new directory, here apache-drill-1.1.0, and run bin/drill-embedded, as shown in the figure.
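Startup may fail at this point with an out-of-memory error. To fix it, lower the default memory allocation in the drill-env.sh file in the conf subdirectory. The default is 4G, which is more than a typical home machine can spare, so the error is to be expected there.
Embedded-mode Drill is now running.
The last line in the figure above shows that Drill has started and is ready to run queries. In the prompt on that line, 0 is the connection number, jdbc is the connection type, and zk=local means that Drill is using a local ZooKeeper node.
To quit, run !quit
Drill also has a web interface. Open http://<IP address or host name>:8047 in a browser; the result looks like the figure:
Drill is now installed, but it is not yet tied to any particular data source. The configuration steps below let it run queries against actual data.
1. Memory: as noted above, change the -XX:MaxDirectMemorySize parameter in drill-env.sh.
2. Multi-user settings
3. User permissions and roles
4. ... to be continued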
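As a rough sketch of item 1, the memory limits in conf/drill-env.sh can be lowered along these lines. The variable names DRILL_MAX_DIRECT_MEMORY and DRILL_MAX_HEAP are assumed from the drill-env.sh shipped with Drill 1.1.0 and may differ in other releases. The 2G and 1G values are only illustrative, and the option string is trimmed to the memory flags, so compare it with your own copy of the file before changing anything.

```shell
# Sketch of the memory section of conf/drill-env.sh (names assumed from Drill 1.1.0).
# Lower both limits so that they fit into the RAM the machine actually has.
DRILL_MAX_DIRECT_MEMORY="2G"   # illustrative value, feeds -XX:MaxDirectMemorySize
DRILL_MAX_HEAP="1G"            # illustrative value, feeds -Xmx

# The JVM options are built from the two variables above.
export DRILL_JAVA_OPTS="-Xms1G -Xmx$DRILL_MAX_HEAP -XX:MaxDirectMemorySize=$DRILL_MAX_DIRECT_MEMORY"
```

Drill has to be restarted for the new limits to take effect.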
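Connecting to data sources
Drill connects to data sources through storage plugins. This adds flexibility: each kind of data source gets its own plugin, so Drill can work with several sources side by side, including databases, files, distributed file systems, and the Hive metastore.
There are three ways to specify which storage plugin configuration to use:
(1) in the FROM clause of a query
(2) with a USE command before the query
(3) when starting Drill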
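The three options can be sketched as follows. This is an illustration, not output from a real session: it assumes the default dfs plugin with a workspace named tmp that points at /tmp, a hypothetical data file /tmp/sample.json, and a hypothetical script name drill_examples.sql. The startup command is left commented out because it needs a Drill installation; the schema= part of the JDBC URL is what selects the plugin at startup.

```shell
# Write the example queries to a file (file name and data file are hypothetical).
cat > drill_examples.sql <<'SQL'
-- (1) Name the storage plugin and workspace directly in the FROM clause.
SELECT * FROM dfs.tmp.`sample.json` LIMIT 5;

-- (2) Select the plugin and workspace once with USE, then query without the prefix.
USE dfs.tmp;
SELECT * FROM `sample.json` LIMIT 5;
SQL

# (3) Select the plugin when starting Drill, from the Drill installation directory:
# bin/sqlline -u "jdbc:drill:schema=dfs;zk=local"
```

The queries in the file can be pasted into the Drill prompt one at a time.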
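Configuring through the web interface
Storage plugins can be viewed and configured at http://<IP address>:8047/storage, which lists the following options:
cp: connects to JAR files
dfs: connects to the local file system or to any distributed file system, such as Hadoop or Amazon S3
hbase: connects to HBase
hive: connects to the Hive metastore
mongo: connects to MongoDB
Click Update next to a plugin to configure options such as the data formats.
To create a new storage plugin, enter a name for it, as shown in the figure.
An example dfs plugin configuration is shown in the figure.
Configurable attributes of Drill storage plugins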
Attribute | Example Values | Required | Description |
---|---|---|---|
"type" | "file" "hbase" "hive" "mongo" | yes | A valid storage plugin type name. |
"enabled" | true false | yes | State of the storage plugin. |
"connection" | "classpath:///" "file:///" "mongodb://localhost:27017/" "hdfs://" | implementation-dependent | The type of distributed file system, such as HDFS, Amazon S3, or files in your file system, and an address/path name. |
"workspaces" | null "logs" | no | One or more unique workspace names. If a workspace name is used more than once, only the last definition is effective. |
"workspaces" . . . "location" | "location": "/Users/johndoe/mydata" "location": "/tmp" | no | Full path to a directory on the file system. |
"workspaces" . . . "writable" | true false | no | Whether Drill may write to the workspace, for example to create tables or views there. |
"workspaces" . . . "defaultInputFormat" | null "parquet" "csv" "json" | no | Format for reading data, regardless of extension. Default = "parquet" |
"formats" | "psv" "csv" "tsv" "parquet" "json" "avro" "maprdb" * | yes | One or more valid file formats for reading. Drill implicitly detects formats of some files based on extension or bits of data in the file; others require configuration. |
"formats" . . . "type" | "text" "parquet" "json" "maprdb" * | yes | Format type. You can define two formats, csv and psv, as type "text", but with different delimiters. |
"formats" . . . "extensions" | ["csv"] | format-dependent | File name extensions that Drill can read. |
"formats" . . . "delimiter" | "\t" "," | format-dependent | Sequence of one or more characters that serve as a field separator in a delimited text file, such as CSV. Use a 4-digit hex code syntax \uXXXX for a non-printable delimiter. |
"formats" . . . "quote" | "\"" | no | A single character that starts/ends a value in a delimited text file. |
"formats" . . . "escape" | "`" | no | A single character that escapes a quotation mark inside a value. |
"formats" . . . "comment" | "#" | no | The line decoration that starts a comment line in the delimited text file. |
"formats" . . . "skipFirstLine" | true | no | To include or omit the header when reading a delimited text file. Set to true to avoid reading headers as data. |
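Storage plugins can also be configured through the Drill REST API, by sending a POST request that carries two properties, name and config. For example:
curl -X POST -H "Content-Type: application/json" -d '{"name":"myplugin", "config": {"type": "file", "enabled": false, "connection": "file:///", "workspaces": { "root": { "location": "/", "writable": false, "defaultInputFormat": null}}, "formats": null}}' http://localhost:8047/storage/myplugin.json
This command creates a plugin named myplugin for querying files of unknown type under the root directory of the local file system.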
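A one-line JSON payload gets hard to read once workspaces and formats are filled in. As a sketch under stated assumptions, the same kind of request can keep its payload in a file and reuse the attributes from the table above. The plugin name myfiles, the file name myfiles-plugin.json, the workspace name logs, and the /tmp/logs location are all made up for illustration, and the curl call is commented out because it needs a running Drill on localhost.

```shell
# Write the request body (name plus config) to a file; all names are hypothetical.
cat > myfiles-plugin.json <<'JSON'
{
  "name": "myfiles",
  "config": {
    "type": "file",
    "enabled": true,
    "connection": "file:///",
    "workspaces": {
      "logs": {
        "location": "/tmp/logs",
        "writable": false,
        "defaultInputFormat": "csv"
      }
    },
    "formats": {
      "csv": {
        "type": "text",
        "extensions": ["csv"],
        "delimiter": ",",
        "skipFirstLine": true
      },
      "json": {
        "type": "json"
      }
    }
  }
}
JSON

# Send the file as the POST body (curl reads the body from a file with -d @<file>):
# curl -X POST -H "Content-Type: application/json" -d @myfiles-plugin.json http://localhost:8047/storage/myfiles.json
```

With this configuration, files under /tmp/logs would be read as CSV by default, and the first line of each CSV file would be skipped as a header.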
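Next, the configuration for connecting to Hive. First make sure that the Hive metastore service, the one that hive.metastore.uris points to, is running. It can be started with:
hive --service metastore
Open the Drill web interface and go to the Storage tab at http://<IP address>:8047/storage
Click Update next to hive to configure it, as shown in the figure.
On the configuration page, add the following to the default content: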