select … from clause:
1.1array類型:
hive> SELECT name, subordinates FROM employees;
John Doe ["Mary Smith","Todd Jones"]
Todd Jones []
--( ‘subordinates’爲array類型,並且字符串帶有雙引號。)
hive> SELECT name, subordinates[0] FROM employees;
John Doe Mary Smith
Todd Jones NULL
--(可以和java一樣進行索引選擇,此時字符串沒有雙引號。空值返回NULL)
1.2 map類型:
hive> SELECT name, deductions FROM employees;
John Doe {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}
hive> SELECT name, deductions["State Taxes"] FROM mployees;
John Doe 0.05
--(’deductions’爲map類型,也可以和Java一樣對map通過鍵找到值)
1.3 struct類型:
hive> SELECT name, address FROM employees;
John Doe {"street":"1 Michigan Ave.","city":"Chicago","state":"IL","zip":60600}
hive> SELECT name, address.city FROM employees;
John Doe Chicago
--(struct類型的成員引用和c的一樣。)
2.LIKE 和RLIKE區別:
(1)LIKE只支持‘_’和‘%’,其中前者代表匹配一位佔位符,或者可以是一或多個佔位符。
(2)RLIKE是regular like的縮寫,顧名思義,其支持和Java一樣的正則表達式,比like功能更強大。
3.在select中使用聚集函數時記得設置:
hive> SET hive.map.aggr=true;
hive> SELECT count(*), avg(salary) FROM employees;
hive> SELECT count(DISTINCT symbol) FROM stocks;
4.Table generating functions
(1)生成單列表:
hive> SELECT explode(subordinates) AS sub FROM employees;
Mary Smith
Todd Jones
(2)生成多列表:
SELECT parse_url_tuple(url, 'HOST', 'PATH', 'QUERY') as (host, path, query)
FROM url_table;
--(其他生成功能函數見page 87 of programming hive )
5. 嵌套查詢:
hive> FROM (
> SELECT upper(name), salary, deductions["Federal Taxes"] as fed_taxes,
> FROM employees
> ) e
> SELECT e.name, e.salary_minus_fed_taxes
> WHERE e.salary_minus_fed_taxes > 70000 limit 2;
JOHN DOE 100000.0 0.2 80000
--(列的alias用‘as’,表的alias不用‘as’)
6.CASE …. WHEN字句:
hive> SELECT name, salary,
> CASE
> WHEN salary < 50000.0 THEN 'low'
> WHEN salary >= 50000.0 AND salary < 70000.0 THEN 'middle'
> WHEN salary >= 70000.0 AND salary < 100000.0 THEN 'high'
> ELSE 'very high'
> END AS bracket FROM employees;
John Doe 100000.0 very high
Mary Smith 80000.0 high
--(CASE… WHEN…THEN… ELSE …END 相當於if … else…字句)
7. When Hive Can Avoid MapReduce
set hive.exec.mode.local.auto=true;
--( Hive will attempt to run other operations in local mode,比如:
Select * from aa.)
--(同樣適用於分區表:SELECT * FROM employees
WHERE country = 'US' AND state = 'CA'
LIMIT 100;)
8. Gotchas with Floating-Point Comparisons
hive> SELECT name, salary, deductions['Federal Taxes']
> FROM employees WHERE deductions['Federal Taxes'] > 0.2;
John Doe 100000.0 0.2
Boss Man 200000.0 0.3
--(deductions['Federal Taxes']爲FLOAT類型,而hive會自動將小數(0.2)變爲double類型)
--(In this particular case, the closest exact value is just slightly greater than 0.2, with a few nonzero bits at the least significant end of the number.
To simplify things a bit, let’s say that 0.2 is actually 0.2000001 for FLOAT and 0.200000000001 for DOUBLE, because an 8-byte DOUBLEhas more significant digits (after the decimal point). When the FLOATvalue from the table is converted to DOUBLEby Hive, it produces the DOUBLEvalue 0.200000100000, which is greater than 0.200000000001. That’s why the query results appear to use >=not >!)
--(解決方案:1.將deductions['Federal Taxes']以字符串形式讀出再轉化爲double類型比較;
2. 可以將自己寫的小數(0.2)用cast(0.2 AS FLOAT)函數轉換到float類型再比較)
9.Having和Group by子句
hive> SELECT year(ymd), avg(price_close) FROM stocks
> WHERE exchange = 'NASDAQ' AND symbol = 'AAPL'
> GROUP BY year(ymd)
> HAVING avg(price_close) > 50.0;
1987 53.88968399108163
1991 52.49553383386182
--(GROUP BY要結合聚合函數使用)
10. JOIN Statements
10.1 hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
> WHERE s.symbol = 'AAPL';
1987-05-11 AAPL 77.0 0.015
1987-08-10 AAPL 48.25 0.015
--(Hive supports the classic SQL JOIN statement, but only equi-joins are supported.)
--(Also, Hive does not currently support using OR between predicates in ON clauses)
10.2多個表的join:
hive> SELECT a.ymd, a.price_close, b.price_close , c.price_close
> FROM stocks a JOIN stocks b ON a.ymd = b.ymd
> JOIN stocks c ON a.ymd = c.ymd
> WHERE a.symbol = 'AAPL' AND b.symbol = 'IBM' AND c.symbol = 'GE';
2010-01-04 214.01 132.45 15.45
2010-01-05 214.38 130.85 15.53
--(Most of the time, Hive will use a separate MapReduce job for each pair of things to join. In this example, it would use one job for tables a and b, then a second job to join the output of the first join with c. Why not join band c first? Hive goes from left to right.)
10.3 Join Optimizations
在多個表進行join時,hive會緩存小表,之後會把大表給stream化.
Hive also assumes that the lasttable in the query is the largest. It attempts to buffer the other tables and then stream the last table through, while performing joins on individual records. Therefore, you should structure your join queries so the largest table is last.
Fortunately, you don’t have to put the largest table last in the query. Hive also provides a “hint” mechanism to tell the query optimizer which table should be streamed:
SELECT /*+ STREAMTABLE(s) */ s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
WHERE s.symbol = 'AAPL';
10.4 LEFT OUTER JOIN
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM stocks s LEFT OUTER JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
> WHERE s.symbol = 'AAPL';
...
1987-05-01 AAPL 80.0 NULL
1987-05-11 AAPL 77.0 0.015
10.5 關於outer join 所要注意的:
You might wonder if you can move the predicates from the WHERE clause into the ON clause, at least the partition filters. This does not work for outer joins, despite documentation on the Hive Wiki that claims it should work.
However, using such filter predicates in ON clauses for inner joins does work!
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM stocks s LEFT OUTER JOIN dividends d
> ON s.ymd = d.ymd AND s.symbol = d.symbol
> AND s.symbol = 'AAPL' AND s.exchange = 'NASDAQ' AND d.exchange = 'NASDAQ';
1962-01-02 GE 74.75 NULL
1962-01-02 IBM 572.0 NULL
1962-01-03 GE 74.0 NULL
1962-01-03 IBM 577.0 NULL
10.6 outer join的其他類型:
RIGHT OUTER JOIN
Right-outer joins return all records in the right hand table that match the WHERE clause. NULL is used for fields of missing records in the left hand table.
FULL OUTER JOIN
Finally, a full-outer join returns all records from all tables that match the WHERE clause. NULL is used for fields in missing records in either table.
LEFT SEMI-JOIN
A left semi-join returns records from the left hand table if records are found in the right hand table that satisfy the ON predicates.相當於MySQL裏的IN語句,MySQL語句實例如下:.
Example 6-2. Query that will not work in Hive
SELECT s.ymd, s.symbol, s.price_close FROM stocks s
WHERE s.ymd, s.symbol IN
(SELECT d.ymd, d.symbol FROM dividends d);
Instead, you use the following LEFT SEMI JOIN syntax:
hive> SELECT s.ymd, s.symbol, s.price_close
> FROM stocks s LEFT SEMI JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol;
1962-08-07 IBM 373.25
1962-05-08 IBM 459.5
--( Note that the SELECT and WHERE clauses can’t reference columns from the right hand table.)
--( 注意: Right semi-joins are not supported in Hive.)
Cartesian Product JOINs
笛卡爾積跟inner join相似,只是沒有on 子句.
hive > SELECT * FROM stocks JOIN dividends
> WHERE stock.symbol = dividends.symbol and stock.symbol='AAPL';
--( 注意: In Hive, this query computes the full Cartesian product before applying the WHERE clause. It could take a very long time to finish. )
Map-side Joins
Ø 在mapper端把小表緩存,大表stream化,減少reducer端的處理步驟,甚至有時會減少mapper端的處理步驟.
Ø 在v0.7之前的hive需要添加一個暗語(hint),來達到優化.
SELECT /*+ MAPJOIN(d) */ s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
WHERE s.symbol = 'AAPL';
Ø 在v0.7開始不贊成這麼做,而是設置hive.auto.convert.join=true
Ø 多大的小表將被mapper端緩存,其臨界值可以進行設置,默認爲(byte)
hive.mapjoin.smalltable.filesize=25000000,
Ø Hive does not support the optimization for right- and full-outer joins.
11. ORDER BY and SORT BY
在hive中SORT BY相當於sql中的order by,如下:
SELECT s.ymd, s.symbol, s.price_close
FROM stocks s
SORT BY(ORDER BY ) s.ymd ASC, s.symbol DESC;
--(在hive中ORDER BY and SORT BY 的區別:
Order by:所有數據的排序通過一個reducer
Sort by:可以有多個reducer來進行排序,提高效率.)
--( 因爲order by用時太長,所以需要limit子句於order by一起用,當hive.mapred.mode=strict )
12. DISTRIBUTE BY with SORT BY
DISTRIBUTE BY相當於sql中的group by語句,其原理(By default, MapReduce computes a hash on the keys output by mappers and tries to
evenly distribute the key-value pairs among the available reducers using the hash values.這樣分配後比較亂,reducer在排序中容易重複比較同一鍵值對(即比較過一個條件後又比較另一個條件).所以We can use DISTRIBUTE BY to ensure that the records for each stock symbol go to the same reducer, then use SORT BY to order the data the way we want.)
實例:
hive> SELECT s.ymd, s.symbol, s.price_close
> FROM stocks s
> DISTRIBUTE BY s.symbol
> SORT BY s.symbol ASC, s.ymd ASC;
1984-09-07 AAPL 26.5
1984-09-10 AAPL 26.37
--( Note that Hive requires that the DISTRIBUTE BY clause come before the SORT BY clause.)
13. CLUSTER BY
--(Using DISTRIBUTE BY ... SORT BY or the shorthand CLUSTER BY clauses is a way to exploit the parallelism of SORT BY, yet achieve a total ordering across the output files.可以看成是”DISTRIBUTE BY ... SORT BY”的簡單寫法。)
實例1:
原數據:(string,string,int,string)
b,w,15,man
c,w,21,woman
c,w,20,man
muse,a,24,man
經過CLUSTER BY之後:
hive (wang)> select id,name,phone
> from wang_manage
> cluster by name;
muse a 24
b w 15
c w 21
c w 20
當CLUSTER BY 之後爲數字列時的排序如下:
b w 15
c w 20
c w 21
muse a 24
--(可以看出CLUSTER BY的列(不管是字符列還是數字列)默認按ascending,當該列相等時其他列若爲字符(第一列)則默認按ascending,其他列若爲數字類型(第三列)則默認按descending)
14. UNION ALL
-- (UNION ALL combines two or more tables. Each subquery of the union query must produce the same number of columns, and for each column, its type must match all the column types in the same position. For example, if the second column is a FLOAT, then the second column of all the other query results must be a FLOAT.)
SELECT log.ymd, log.level, log.message
FROM (
SELECT l1.ymd, l1.level,
l1.message, 'Log1' AS source
FROM log1 l1
UNION ALL
SELECT l2.ymd, l2.level,
l2.message, 'Log2' AS source
FROM log1 l2
) log
SORT BY log.ymd ASC;
--(UNION may be used when a clause selects from the same source table. Logically, the same results could be achieved with a single SELECT and WHERE clause. This technique increases readability by breaking up a long complex WHERE clause into two or more UNION queries. However, unless the source table is indexed, the query will have to make multiple passes over the same source data. )
For example:
FROM (
FROM src SELECT src.key, src.value WHERE src.key < 100
UNION ALL
FROM src SELECT src.* WHERE src.key > 110
) unioninput
INSERT OVERWRITE DIRECTORY '/tmp/union.out' SELECT unioninput.*
--( 可以看出當有多個select語句掃描同一個表時,這種寫法效率更高,避免掃描多次.)