Hive data query syntax notes

1. SELECT … FROM clause:

1.1 array type:

 hive> SELECT name, subordinates FROM employees;

John Doe  ["Mary Smith","Todd Jones"]

Todd Jones  []

--('subordinates' is an array type, and its string elements are printed with double quotes.)

hive> SELECT name, subordinates[0] FROM employees;

John Doe  Mary Smith 

Todd Jones  NULL

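--(Elements can be selected by index, as in Java; the string is then printed without double quotes. An index with no element returns NULL.)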

1.2 map type:

hive> SELECT name, deductions FROM employees;

John Doe  {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}

 

hive> SELECT name, deductions["State Taxes"] FROM employees;

John Doe  0.05

--('deductions' is a map type; as in Java, a value can be looked up by its key.)

1.3 struct type:

hive> SELECT name, address FROM employees;

John Doe  {"street":"1 Michigan Ave.","city":"Chicago","state":"IL","zip":60600}

 

hive> SELECT name, address.city FROM employees;

John Doe  Chicago

--(Members of a struct are referenced with dot notation, the same as in C.)

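2. Differences between LIKE and RLIKE:

1) LIKE supports only the wildcards '_' and '%': '_' matches exactly one character, and '%' matches zero or more characters.

2) RLIKE is short for "regular LIKE". As the name suggests, it supports the same regular expressions as Java and is more powerful than LIKE.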
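
To make the difference concrete, here is a minimal Python sketch of the two matching rules. It is only a model, not Hive code. The helper names like_match and rlike_match are hypothetical, and Python's re module stands in for Java's regex engine, whose dialect differs in places. The sketch assumes that LIKE has to match the whole string, while RLIKE succeeds as soon as any substring matches. The street names are sample values.

```python
import re

def like_match(value, pattern):
    """Model of SQL LIKE: '_' is exactly one character, '%' is zero or more."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))  # every other character is literal
    # LIKE has to match the entire value, hence fullmatch.
    return re.fullmatch("".join(parts), value, flags=re.DOTALL) is not None

def rlike_match(value, regex):
    """Model of RLIKE: true if the regex matches anywhere inside the value."""
    return re.search(regex, value) is not None

# Sample values for illustration only.
streets = ["1 Michigan Ave.", "100 Ontario St.", "200 Chicago Ave."]
like_hits = [s for s in streets if like_match(s, "%Ave.")]
rlike_hits = [s for s in streets if rlike_match(s, r"(Chicago|Ontario)")]
print(like_hits)
print(rlike_hits)
```

The alternation (Chicago|Ontario) has no single LIKE equivalent; with LIKE it would take two patterns joined by OR.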

3. When using aggregate functions in a SELECT, remember to set:

hive> SET hive.map.aggr=true;

hive> SELECT count(*), avg(salary) FROM employees;

hive> SELECT count(DISTINCT symbol) FROM stocks;

4. Table generating functions

1) Producing a single-column result:

hive> SELECT explode(subordinates) AS sub FROM employees;

Mary Smith

Todd Jones

2) Producing a multi-column result:

SELECT parse_url_tuple(url, 'HOST', 'PATH', 'QUERY') as (host, path, query)

FROM url_table;

--(For the other table generating functions, see page 87 of Programming Hive.)

5. Nested queries:
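
As a rough model of what explode does, the Python sketch below turns each array element into its own output row. The explode function here is a hypothetical stand-in written for illustration, not Hive's implementation, and the rows are the two employees shown in section 1.1.

```python
# The two employees from section 1.1.
employees = [
    {"name": "John Doe", "subordinates": ["Mary Smith", "Todd Jones"]},
    {"name": "Todd Jones", "subordinates": []},
]

def explode(rows, column):
    """Yield one output row per array element; an empty array yields nothing."""
    for row in rows:
        for element in row[column]:
            yield element

subs = list(explode(employees, "subordinates"))
print(subs)
```

Both output rows come from John Doe's array; Todd Jones has an empty array and so contributes no rows.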

hive> FROM (

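>  SELECT upper(name) AS name, salary, deductions["Federal Taxes"] as fed_taxes,

>  round(salary * (1 - deductions["Federal Taxes"])) as salary_minus_fed_taxes

>  FROM employees

> ) e

> SELECT e.name, e.salary, e.fed_taxes, e.salary_minus_fed_taxes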

> WHERE e.salary_minus_fed_taxes > 70000 limit 2;

JOHN DOE  100000.0  0.2  80000

--(Column aliases are introduced with 'AS'; table aliases are written without 'AS'.)

6. CASE … WHEN clause:

hive> SELECT name, salary,

>  CASE

>  WHEN salary <  50000.0 THEN 'low'

>  WHEN salary >= 50000.0 AND salary <  70000.0 THEN 'middle'

>  WHEN salary >= 70000.0 AND salary < 100000.0 THEN 'high'

>  ELSE 'very high'

>  END AS bracket FROM employees;

John Doe  100000.0  very high

Mary Smith  80000.0  high

--(CASE … WHEN … THEN … ELSE … END is the equivalent of an if … else … statement.)
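
The correspondence with if … else can be sketched in Python as follows. The function name bracket is hypothetical and simply mirrors the column alias in the query. Because the branches are tried in order and the first true one wins, the lower bounds in the WHEN conditions are implied by the branches before them.

```python
def bracket(salary):
    """Python counterpart of the CASE expression in the query above."""
    if salary < 50000.0:
        return "low"
    elif salary < 70000.0:      # reached only when salary >= 50000.0
        return "middle"
    elif salary < 100000.0:     # reached only when salary >= 70000.0
        return "high"
    else:
        return "very high"

print(bracket(100000.0), bracket(80000.0))
```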

7. When Hive Can Avoid MapReduce

set hive.exec.mode.local.auto=true;

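--(With this setting, Hive will attempt to run other operations in local mode as well. Some queries need no MapReduce job at all, for example:

SELECT * FROM aa;)

--(The same holds for a partitioned table when the WHERE clause filters only on partition keys:

SELECT * FROM employees

WHERE country = 'US' AND state = 'CA'

LIMIT 100;)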

8. Gotchas with Floating-Point Comparisons

hive> SELECT name, salary, deductions['Federal Taxes']

> FROM employees WHERE deductions['Federal Taxes'] > 0.2;

John Doe  100000.0  0.2

Boss Man  200000.0  0.3

--(deductions['Federal Taxes'] is of type FLOAT, whereas Hive treats the literal 0.2 as a DOUBLE.)

--(In this particular case, the closest exact value is just slightly greater than 0.2, with a few nonzero bits at the least significant end of the number.

To simplify things a bit, let's say that 0.2 is actually 0.2000001 for FLOAT and 0.200000000001 for DOUBLE, because an 8-byte DOUBLE has more significant digits (after the decimal point). When the FLOAT value from the table is converted to DOUBLE by Hive, it produces the DOUBLE value 0.200000100000, which is greater than 0.200000000001. That's why the query results appear to use >= not >!)

--(Workarounds: 1. Read deductions['Federal Taxes'] as a string and convert it to DOUBLE before comparing.

2. Convert the literal 0.2 to FLOAT with cast(0.2 AS FLOAT) and compare in FLOAT.)
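
The effect can be reproduced outside Hive. The Python sketch below rounds 0.2 to a 4-byte float and widens it back to a double, which is a model of what happens when a FLOAT column value is compared with a DOUBLE literal. The helper name as_float32 is hypothetical; struct is Python's standard library module.

```python
import struct

def as_float32(x):
    """Round a Python float (a double) to the nearest 4-byte float, then widen it back."""
    return struct.unpack("f", struct.pack("f", x))[0]

stored = as_float32(0.2)   # what a FLOAT column holds, seen as a double
literal = 0.2              # a plain literal is a double

print(repr(stored))                    # slightly more than 0.2
print(stored > literal)                # the surprising comparison
print(stored > as_float32(literal))    # comparison after casting the literal to FLOAT
```

The first comparison is true, which is why the 0.2 row slips through the filter. The second is false, which corresponds to the cast(0.2 AS FLOAT) workaround.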

9. GROUP BY and HAVING clauses

hive> SELECT year(ymd), avg(price_close) FROM stocks

> WHERE exchange = 'NASDAQ' AND symbol = 'AAPL'

> GROUP BY year(ymd)

> HAVING avg(price_close) > 50.0;

1987  53.88968399108163

1991  52.49553383386182

--(GROUP BY is used together with aggregate functions.)
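
The order of operations in such a query can be sketched in Python: rows are grouped, each group is aggregated, and only then does HAVING filter the aggregated groups (WHERE, by contrast, filters rows before grouping). The rows below are invented sample values, not real stock data.

```python
from collections import defaultdict

# Invented sample rows of (year, price_close).
rows = [(1987, 60.0), (1987, 50.0), (1988, 40.0), (1988, 44.0), (1991, 52.0)]

groups = defaultdict(list)
for year, price_close in rows:            # GROUP BY year
    groups[year].append(price_close)

# avg(price_close) per group
averages = {year: sum(prices) / len(prices) for year, prices in groups.items()}

# HAVING avg(price_close) > 50.0 is applied to the aggregated groups
result = {year: avg for year, avg in averages.items() if avg > 50.0}
print(result)
```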

10. JOIN Statements

10.1 hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend

> FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol

> WHERE s.symbol = 'AAPL';

1987-05-11  AAPL  77.0  0.015

1987-08-10  AAPL  48.25  0.015

--(Hive supports the classic SQL JOIN statement, but only equi-joins are supported.)

--(Also, Hive does not currently support using OR between predicates in ON clauses.)

10.2 Joining more than two tables

hive> SELECT a.ymd,  a.price_close,  b.price_close ,  c.price_close

> FROM stocks a JOIN stocks b ON  a.ymd = b.ymd

>  JOIN stocks c ON a.ymd = c.ymd

> WHERE a.symbol = 'AAPL' AND b.symbol = 'IBM' AND c.symbol = 'GE';

2010-01-04  214.01  132.45  15.45

2010-01-05  214.38  130.85  15.53

--(Most of the time, Hive will use a separate MapReduce job for each pair of things to join. In this example, it would use one job for tables a and b, then a second job to join the output of the first join with c. Why not join b and c first? Hive goes from left to right.)

10.3 Join Optimizations

When several tables are joined, Hive buffers the smaller tables and then streams the large table through.

Hive also assumes that the last table in the query is the largest. It attempts to buffer the other tables and then stream the last table through, while performing joins on individual records. Therefore, you should structure your join queries so the largest table is last.

Fortunately, you don’t have to put the largest table last in the query. Hive also provides a “hint” mechanism to tell the query optimizer which table should be streamed:

SELECT /*+ STREAMTABLE(s) */ s.ymd, s.symbol, s.price_close, d.dividend

FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol

WHERE s.symbol = 'AAPL';

10.4 LEFT OUTER JOIN

hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend

> FROM stocks s LEFT OUTER JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol

> WHERE s.symbol = 'AAPL';

...

1987-05-01  AAPL  80.0  NULL

1987-05-11  AAPL  77.0  0.015

10.5 Things to watch for with outer joins:

You might wonder if you can move the predicates from the WHERE clause into the ON clause, at least the partition filters. This does not work for outer joins, despite documentation on the Hive Wiki that claims it should work.

However, using such filter predicates in ON clauses for inner joins does work!

hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend

> FROM stocks s LEFT OUTER JOIN dividends d

> ON s.ymd = d.ymd AND s.symbol = d.symbol

> AND s.symbol = 'AAPL' AND s.exchange = 'NASDAQ' AND d.exchange = 'NASDAQ';

1962-01-02  GE  74.75  NULL

1962-01-02  IBM  572.0  NULL

1962-01-03  GE  74.0  NULL

1962-01-03  IBM  577.0  NULL
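
Why the GE and IBM rows still show up can be seen from a small Python model of a left outer join. The function left_outer_join is a hypothetical helper and the rows are invented samples. The point it illustrates is that a predicate in ON only decides whether a right-hand row counts as a match, whereas a predicate in WHERE is applied to the joined rows afterwards.

```python
def left_outer_join(left, right, on):
    """Emit every left row; the right side is None when no right row satisfies `on`."""
    out = []
    for left_row in left:
        matches = [r for r in right if on(left_row, r)]
        if matches:
            out.extend((left_row, r) for r in matches)
        else:
            out.append((left_row, None))
    return out

# Invented sample rows: (ymd, symbol) for stocks, (ymd, symbol, dividend) for dividends.
stocks = [("1987-05-11", "AAPL"), ("1962-01-02", "GE"), ("1962-01-02", "IBM")]
dividends = [("1987-05-11", "AAPL", 0.015)]

# Filter written in ON: non-AAPL rows simply fail to match and come back with None.
in_on = left_outer_join(
    stocks, dividends,
    lambda s, d: s[0] == d[0] and s[1] == d[1] and s[1] == "AAPL",
)

# Filter written in WHERE: applied after the join, so non-AAPL rows are dropped.
joined = left_outer_join(stocks, dividends, lambda s, d: s[0] == d[0] and s[1] == d[1])
in_where = [(s, d) for s, d in joined if s[1] == "AAPL"]

print(len(in_on), len(in_where))
```

With the filter in ON all three stock rows survive; with the filter in WHERE only the AAPL row remains.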

10.6 Other types of outer join:

RIGHT OUTER JOIN

Right-outer joins return all records in the right hand table that match the WHERE clause.  NULL is used for fields of missing records in the left hand table.

FULL OUTER JOIN

Finally, a full-outer join returns all records from all tables that match the WHERE clause. NULL is used for fields in missing records in either table.

LEFT SEMI-JOIN

A left semi-join returns records from the left hand table if records are found in the right hand table that satisfy the ON predicates. It is the counterpart of an IN subquery in MySQL; a MySQL-style statement of that kind looks like this:

Example 6-2. Query that will not work in Hive

SELECT s.ymd, s.symbol, s.price_close FROM stocks s

WHERE s.ymd, s.symbol IN

(SELECT d.ymd, d.symbol FROM dividends d);

Instead, you use the following LEFT SEMI JOIN syntax:

hive> SELECT s.ymd, s.symbol, s.price_close

> FROM stocks s LEFT SEMI JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol;

1962-08-07  IBM  373.25

1962-05-08  IBM  459.5

--( Note that the  SELECT and  WHERE clauses can’t reference columns from the right hand table.)

--(Note: Right semi-joins are not supported in Hive.)
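
A semi-join can be modeled in Python as a membership test. The sketch below uses a hypothetical helper left_semi_join and invented sample rows. It shows the two properties noted above: only left-hand columns come back, and a left row is returned once even when several right rows match.

```python
def left_semi_join(left, right, left_key, right_key):
    """Keep each left row whose key occurs in the right table; emit it at most once."""
    right_keys = {right_key(r) for r in right}
    return [row for row in left if left_key(row) in right_keys]

# Invented sample rows: (ymd, symbol, price_close) and (ymd, symbol, dividend).
stocks = [("1962-08-07", "IBM", 373.25), ("1962-08-08", "IBM", 370.0)]
dividends = [("1962-08-07", "IBM", 0.1), ("1962-08-07", "IBM", 0.1)]

by_day_and_symbol = lambda row: (row[0], row[1])
result = left_semi_join(stocks, dividends, by_day_and_symbol, by_day_and_symbol)
print(result)
```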

Cartesian Product JOINs

A Cartesian product is written like an inner join, only without the ON clause.

hive> SELECT * FROM stocks JOIN dividends

> WHERE stocks.symbol = dividends.symbol and stocks.symbol='AAPL';

--(Note: In Hive, this query computes the full Cartesian product before applying the WHERE clause. It could take a very long time to finish.)

Map-side Joins

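Ø The small table is cached on the mapper side and the large table is streamed through, which cuts out processing on the reducer side and sometimes mapper-side steps as well.

Ø In Hive versions before v0.7, a hint has to be added to the query to get this optimization: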

SELECT /*+ MAPJOIN(d) */ s.ymd, s.symbol, s.price_close, d.dividend

FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol

WHERE s.symbol = 'AAPL';

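Ø Starting with v0.7 the hint is deprecated; set hive.auto.convert.join=true instead.

Ø The size threshold below which a table is small enough to be cached on the mapper side can be configured; the default (in bytes) is:

hive.mapjoin.smalltable.filesize=25000000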

Ø Hive does not support the optimization for right- and full-outer joins.
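
The idea behind a map-side join can be sketched as an in-memory hash join. This is a simplified Python model under that assumption, not Hive's implementation, and map_side_join is a hypothetical name. The sample rows reuse the values shown in the LEFT OUTER JOIN output earlier.

```python
def map_side_join(big_rows, small_rows, big_key, small_key):
    """Inner join that holds only the small table in memory."""
    # Build phase: load the whole small table into a hash table.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(small_key(row), []).append(row)
    # Probe phase: stream the big table one row at a time; no shuffle or reduce step.
    for row in big_rows:
        for match in lookup.get(big_key(row), []):
            yield (row, match)

stocks = [("1987-05-01", "AAPL", 80.0), ("1987-05-11", "AAPL", 77.0)]   # large table
dividends = [("1987-05-11", "AAPL", 0.015)]                             # small table

key = lambda row: (row[0], row[1])   # join on (ymd, symbol)
joined = list(map_side_join(stocks, dividends, key, key))
print(joined)
```

Memory use is governed by the small table alone, which is why a size threshold decides when the optimization applies.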

11. ORDER BY and SORT BY

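Hive's SORT BY is written the same way as ORDER BY in SQL (either keyword fits in the query below):

SELECT s.ymd, s.symbol, s.price_close

FROM stocks s

SORT BY s.ymd ASC, s.symbol DESC;

--(The difference between ORDER BY and SORT BY in Hive:

ORDER BY: all of the data is sorted by a single reducer, which yields a total order.

SORT BY: several reducers can each sort their own share of the data, which is more efficient but only orders the rows within each reducer.)

--(Because ORDER BY can take a very long time, Hive requires a LIMIT clause together with ORDER BY when hive.mapred.mode=strict.)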

12. DISTRIBUTE BY with SORT BY

DISTRIBUTE BY resembles GROUP BY in SQL in that it controls which rows each reducer receives. The principle: by default, MapReduce computes a hash on the keys output by mappers and tries to evenly distribute the key-value pairs among the available reducers using the hash values. Distributed that way, the rows for one stock symbol can end up on different reducers, so each reducer's sorted output holds only fragments of a symbol's data. We can use DISTRIBUTE BY to ensure that the records for each stock symbol go to the same reducer, then use SORT BY to order the data the way we want.

Example:

hive> SELECT s.ymd, s.symbol, s.price_close

> FROM stocks s

> DISTRIBUTE BY s.symbol

> SORT BY  s.symbol ASC, s.ymd ASC;

1984-09-07  AAPL  26.5

1984-09-10  AAPL  26.37

--( Note that Hive requires that the DISTRIBUTE BY clause come before the SORT BY clause.)
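
The combination can be modeled in Python as "partition by a hash of the DISTRIBUTE BY key, then sort inside each partition". The function run_reducers is a hypothetical helper, and zlib.crc32 is used only as a convenient deterministic hash; it is not the hash function Hive or MapReduce uses. The AAPL rows come from the output above and the IBM and GE rows are invented.

```python
import zlib

def run_reducers(rows, num_reducers, partition_key, sort_key):
    """Route each row to reducer hash(key) % n, then sort within each reducer."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        bucket = zlib.crc32(partition_key(row).encode("utf-8")) % num_reducers
        reducers[bucket].append(row)                            # DISTRIBUTE BY
    return [sorted(bucket, key=sort_key) for bucket in reducers]  # SORT BY

rows = [
    ("1984-09-10", "AAPL", 26.37),
    ("1984-09-07", "IBM", 100.0),    # invented row
    ("1984-09-07", "AAPL", 26.5),
    ("1984-09-10", "GE", 50.0),      # invented row
    ("1984-09-10", "IBM", 101.0),    # invented row
]

output = run_reducers(
    rows, num_reducers=2,
    partition_key=lambda r: r[1],        # DISTRIBUTE BY s.symbol
    sort_key=lambda r: (r[1], r[0]),     # SORT BY s.symbol ASC, s.ymd ASC
)
for reducer_id, bucket in enumerate(output):
    print(reducer_id, bucket)
```

Every row of a given symbol lands in the same bucket, so each reducer's output lists a symbol's rows contiguously and in date order.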

13. CLUSTER BY

--(Using DISTRIBUTE BY ... SORT BY or the shorthand CLUSTER BY clauses is a way to exploit the parallelism of SORT BY, yet achieve a total ordering across the output files. CLUSTER BY can be seen as shorthand for "DISTRIBUTE BY ... SORT BY".)

Example 1

Original data (string, string, int, string):

b,w,15,man

c,w,21,woman

c,w,20,man

muse,a,24,man

After CLUSTER BY:

hive (wang)> select id,name,phone                 

           > from wang_manage

           > cluster by name;

muse    a       24

b       w       15

c       w       21

c       w       20

When CLUSTER BY is applied to the numeric column, the order is:

b       w       15

c       w       20

c       w       21

muse    a       24

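--(As the output shows, the CLUSTER BY column, whether a string or a numeric column, is sorted in ascending order by default. Among rows that tie on that column, the string column (the first) came out ascending and the numeric column (the third) came out descending in this run; CLUSTER BY sorts only on the named column, so the order among ties is not guaranteed.)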

14. UNION ALL

-- (UNION ALL combines two or more tables. Each subquery of the union query must produce the same number of columns, and for each column, its type must match all the column types in the same position. For example, if the second column is a FLOAT, then the second column of all the other query results must be a FLOAT.)

SELECT log.ymd, log.level, log.message

FROM (

SELECT l1.ymd, l1.level,

l1.message, 'Log1' AS source

FROM log1 l1

UNION ALL

SELECT l2.ymd, l2.level,

l2.message, 'Log2' AS source

FROM log2 l2

) log

SORT BY log.ymd ASC;

--(UNION may be used when a clause selects from the same source table. Logically, the same results could be achieved with a single  SELECT and  WHERE clause. This technique increases readability by breaking up a long complex  WHERE clause into two or more  UNION queries. However, unless the source table is indexed, the query will have to make multiple passes over the same source data. )

For example:

FROM (

FROM src SELECT src.key, src.value WHERE src.key < 100

UNION ALL

FROM src SELECT src.* WHERE src.key > 110

) unioninput

INSERT OVERWRITE DIRECTORY '/tmp/union.out' SELECT unioninput.*

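--(As can be seen, when several SELECT statements scan the same table, this FROM-first form is more efficient, because it is meant to avoid scanning the table multiple times.)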
