The Hadoop Ecosystem: Hive

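Introduction to Hive

What Is Hive

    Hive was open-sourced by Facebook and donated to the Apache Software Foundation, where it is now a top-level project: hive.apache.org
    Hive is a data warehouse (DataWareHouse) technology built on big-data infrastructure.
        Database (DataBase)
               small data volume, high value per record
        Data warehouse (DataWareHouse)
               large data volume, low value per record
    Underneath, it relies on HDFS for storage and MapReduce for computation.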

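Benefits of Hive

With Hive, developers write SQL statements and Hive translates them into MapReduce jobs, so they no longer have to hand-code MapReduce programs.

How Hive Works

Hive converts most Hive SQL statements into MapReduce jobs, which then process the data.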

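Setting Up the Hive Environment

1. Prepare the Linux server: map the IP to the hostname, disable the firewall, disable SELinux, set up passwordless SSH login, install the JDK
2. Set up the Hadoop environment
3. Install Hive
   3.1 Unpack the Hive archive
   3.2 Rename hive_home/conf/hive-env.sh.template to hive-env.sh and set:
       HADOOP_HOME=/opt/install/hadoop-2.5.2
       export HIVE_CONF_DIR=/opt/install/apache-hive-0.13.1-bin/conf
   3.3 Create two directories in HDFS
       /tmp
       /user/hive/warehouse
       bin/hdfs dfs -mkdir -p /tmp
       bin/hdfs dfs -mkdir -p /user/hive/warehouse
   3.4 Start Hive
       bin/hive
   3.5 Check with jps that the Hive process is running
       RunJar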

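Basic Hive Operations

# Create a database
create database [if not exists] test;
# List all databases
 show databases;
# Switch to a database
 use db_name;
# Drop an empty database; add cascade to drop one that still contains tables
 drop database db_name;
 drop database db_name cascade;
# What a database really is
 A Hive database is just an HDFS directory: /user/hive/warehouse/test.db

# List all tables in the current database
  show tables;
# Create a table
  create table t_user(
    id int ,
    name string
   )row format delimited fields terminated by '\t';
# What a table really is
  A Hive table is just an HDFS directory: /user/hive/warehouse/test.db/t_user
# Drop a table
  drop table t_user;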
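
The clause row format delimited fields terminated by '\t' tells Hive how to split each line of a data file into columns. A minimal Python sketch of that mapping for the t_user schema (id int, name string) follows. parse_row is a hypothetical helper written for illustration, not a Hive API, and the sample lines are made up.

```python
# Sketch: how one tab-delimited text line maps onto (id int, name string).
# parse_row is a hypothetical helper, not part of Hive.

def parse_row(line, delimiter="\t"):
    """Split one line of a data file into an (id, name) tuple."""
    fields = line.rstrip("\n").split(delimiter)
    # Pad missing columns with None, mirroring Hive's NULL for absent fields.
    fields += [None] * (2 - len(fields))
    raw_id, name = fields[0], fields[1]
    try:
        user_id = int(raw_id)
    except (TypeError, ValueError):
        # A value that cannot be read as int comes back as NULL in Hive.
        user_id = None
    return user_id, name


# Made-up sample lines; the third one has a non-numeric id.
rows = [parse_row(line) for line in ["1\talice\n", "2\tbob\n", "x\tcarol\n"]]
print(rows)
```

The schema is applied when the file is read, not when it is loaded, which is why a malformed field shows up as NULL in query results instead of failing the load.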
  
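# Load data into a Hive table
  load data local inpath '/root/hive/data' into table t_user;
# What loading data really does
  1. Loading is essentially an HDFS file upload, equivalent to:
  bin/hdfs dfs -put /root/hive/data /user/hive/warehouse/test.db/t_user
  2. If a file with the same name is loaded again, Hive renames the new file automatically instead of overwriting the existing one
  3. When a table is queried, Hive reads every file in the table's directory and returns their combined contents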
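
The three points above can be imitated on a local file system. The sketch below uses a temporary directory in place of HDFS. load_into_table and select_all are hypothetical helpers, and the _copy_N suffix for duplicate file names is an assumption about the naming scheme, not something stated in these notes.

```python
import os
import shutil
import tempfile


def load_into_table(src_file, table_dir):
    """Copy src_file into table_dir, renaming it if the name is already taken."""
    os.makedirs(table_dir, exist_ok=True)
    base = os.path.basename(src_file)
    target = os.path.join(table_dir, base)
    n = 1
    while os.path.exists(target):
        # Assumed naming scheme for duplicates: <name>_copy_<n>
        target = os.path.join(table_dir, f"{base}_copy_{n}")
        n += 1
    shutil.copy(src_file, target)
    return target


def select_all(table_dir):
    """Read every file in the table directory, like select * from t_user."""
    rows = []
    for name in sorted(os.listdir(table_dir)):
        with open(os.path.join(table_dir, name)) as f:
            rows.extend(line.rstrip("\n").split("\t") for line in f)
    return rows


# A local stand-in for /root/hive/data and the warehouse directory.
work = tempfile.mkdtemp()
data = os.path.join(work, "data")
with open(data, "w") as f:
    f.write("1\talice\n2\tbob\n")

table_dir = os.path.join(work, "warehouse", "test.db", "t_user")
first = load_into_table(data, table_dir)
second = load_into_table(data, table_dir)  # the same file loaded a second time
all_rows = select_all(table_dir)
print(os.listdir(table_dir), len(all_rows))
```

Loading the same file twice leaves two files in the table directory, so the query returns every row twice.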
  
  
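# SQL (SQL-like: HQL, the Hive Query Language)
select * from t_user;
select id from t_user;
1. Hive translates SQL into MapReduce (a pure data-cleansing query runs map-only, with no Reduce phase)
2. Hive runs MR in the vast majority of cases, but a plain select * query (including select * ... limit) does not start MR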

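Replacing the MetaStore Database

The Hive MetaStore maps the HDFS directory structure to the corresponding table structure. By default the MetaStore uses an embedded Derby database, which supports only one client at a time.

Replacing the Derby metadata database with MySQL (or Oracle)

0. Delete the HDFS directory /user/hive/warehouse and recreate it
1. Install MySQL on Linux
   yum -y install mysql-server
2. Start the MySQL service and set the administrator password
   service mysqld start
   /usr/bin/mysqladmin -u root password '123456'
3. Enable remote access to MySQL (run the SQL statements in the mysql client, then restart the service from the shell)
   GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY '123456';
   flush privileges;
   use mysql;
   delete from user where host like 'hadoop%';
   delete from user where host like 'l%';
   delete from user where host like '1%';
   service mysqld restart
4. Create conf/hive-site.xml
   mv hive-default.xml.template hive-site.xml
   Add the following properties to hive-site.xml: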
   <property>
	  <name>javax.jdo.option.ConnectionURL</name>
	  <value>jdbc:mysql://CentOSA:3306/metastore?createDatabaseIfNotExist=true</value>
	  <description>the URL of the MySQL database</description>
	</property>

	<property>
	  <name>javax.jdo.option.ConnectionDriverName</name>
	  <value>com.mysql.jdbc.Driver</value>
	  <description>Driver class name for a JDBC metastore</description>
	</property>

	<property>
	  <name>javax.jdo.option.ConnectionUserName</name>
	  <value>root</value>
	</property>

	<property>
	  <name>javax.jdo.option.ConnectionPassword</name>
	  <value>123456</value>
	</property>
5. Upload the MySQL driver JAR into hive_home/lib
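
Renaming the template gives a very large hive-site.xml. As an alternative sketch, the four MetaStore properties above could be written into a minimal file wrapped in a configuration root element. build_hive_site is a hypothetical helper, and the host name, user and password are simply the values used in these notes.

```python
import xml.etree.ElementTree as ET


def build_hive_site(props):
    """Return a minimal hive-site.xml document as a string."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Values copied from the notes above; adjust them for your own cluster.
metastore_props = {
    "javax.jdo.option.ConnectionURL":
        "jdbc:mysql://CentOSA:3306/metastore?createDatabaseIfNotExist=true",
    "javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
    "javax.jdo.option.ConnectionUserName": "root",
    "javax.jdo.option.ConnectionPassword": "123456",
}

xml_text = build_hive_site(metastore_props)
print(xml_text)
```

Writing xml_text to conf/hive-site.xml would leave every other setting at its built-in default.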

Basic Hive Syntax

1. HQL

1. Basic queries
   select * from table_name # does not start MR
   select id from table_name # starts MR
2. Conditional queries: where
   select id,name from t_users where name = 'mask1';
   2.1 Comparison operators  =  !=  >  <  >=  <=
       select id,name from t_users where age > 20;
   2.2 Logical operators  and  or  not
       select id,name,age from t_users where name = 'mask' or age>30;
   2.3 Predicates
       between and
       select name,salary from t_users where salary between 100 and 300;
       in
       select name,salary from t_users where salary in (100,300);
       is null
       select name,salary from t_users where salary is null;
       like (% matches any number of characters, _ matches exactly one)
       select name,salary from t_users where name like 'mask%';
       select name,salary from t_users where name like 'mask__';
       select name,salary from t_users where name like 'mask%' and length(name) = 6;
3. Sorting: order by [implemented underneath by the MapReduce sort: map-side sort, grouping sort, compareTo]
   select name,salary from t_users order by salary desc;
4. Deduplication: distinct
   select distinct(age) from t_users;
5. Paging [MySQL lets you specify a starting offset; this version of Hive does not]
   select * from t_users limit 3;
6. Aggregate (grouping) functions: count() avg() max() min() sum()
   count(*) counts every row; count(id) counts only the rows where id is not null
7. group by
   select max(salary) from t_users group by age;
   Rule: the select list may contain only the grouping columns and aggregate functions (Oracle raises an error otherwise; MySQL does not, but the result is wrong)
8. having
   After grouping, use having to filter on aggregate results
   select max(salary) from t_users group by age having max(salary) > 800;
9. Hive has only limited subquery support: subqueries work in the from clause, and since Hive 0.13 also in the where clause with in / not in / exists / not exists
10. Built-in functions
    show functions;

    length(column_name)  returns the length of the string in the column
    substring(column_name,start_pos,total_count)  start_pos is 1-based
    concat(col1,col2)
    to_date('yyyy-mm-dd')
    year(date)  returns the year
    month(date)
    date_add
    ....
    select year(to_date('1999-10-11')) ;
11. Multi-table operations
    inner join
    select e.name,e.salary,d.dname
    from t_emp as e
    inner join t_dept as d
    on e.dept_id = d.id;

    left join
    select e.name,e.salary,d.dname
    from t_emp as e
    left join t_dept as d
    on e.dept_id = d.id;

    right join
    select e.name,e.salary,d.dname
    from t_emp as e
    right join t_dept as d
    on e.dept_id = d.id;

    full join [not supported by MySQL]
    select e.name,e.salary,d.dname
    from t_emp as e
    full join t_dept as d
    on e.dept_id = d.id;
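
The four join types in item 11 differ only in what happens to rows without a match. The Python sketch below shows this on two small in-memory lists standing in for t_emp and t_dept. The sample rows and the join helper are hypothetical, and the row order of a real Hive result is not guaranteed.

```python
# Made-up sample data: (name, salary, dept_id) and (id, dname).
emps = [("ann", 100, 1), ("bob", 200, 2), ("cat", 300, None)]
depts = [(1, "dev"), (2, "ops"), (3, "hr")]


def join(emps, depts, how):
    """Return (name, salary, dname) rows for an inner/left/right/full join."""
    out = []
    matched_depts = set()
    for name, salary, dept_id in emps:
        # A NULL dept_id never equals anything, so it never matches.
        hits = [d for d in depts if dept_id is not None and d[0] == dept_id]
        for d in hits:
            matched_depts.add(d[0])
            out.append((name, salary, d[1]))
        # left/full: keep employees that found no department.
        if not hits and how in ("left", "full"):
            out.append((name, salary, None))
    # right/full: keep departments that no employee referenced.
    if how in ("right", "full"):
        for d in depts:
            if d[0] not in matched_depts:
                out.append((None, None, d[1]))
    return out


for how in ("inner", "left", "right", "full"):
    print(how, join(emps, depts, how))
```

With this data the inner join returns two rows, the left and right joins three each, and the full join four.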

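2. Table Operations

1) Managed tables (MANAGED_TABLE)

1. Creating a basic managed table
create table if not exists table_name(
column_name data_type,
column_name data_type
)row format delimited fields terminated by '\t' [location 'hdfs_path'];

2. Creating a managed table with the as keyword
create table if not exists table_name [location 'hdfs_path'] as select id,name from t_users;
The table structure is determined by the selected columns, and the query result is inserted into the new table.

3. Creating a managed table with the like keyword
create table if not exists table_name like t_users [location 'hdfs_path'];
The table structure matches the table named after like, but no data is copied, so the new table is empty.

Details

1. Data types: int string varchar char double float boolean
2. location hdfs_path
   Sets where the table is created; the default is /user/hive/warehouse/db_name.db/table_name
   create table t_mask(
   id int,
   name string
   )row format delimited fields terminated by '\t' stored as textfile location '/xiaohei';
   Takeaway: in practice the HDFS directory and files often exist first, and the table is created on top of them afterwards.
3. Commands for inspecting a Hive table's structure
   desc table_name        describe table_name
   desc extended table_name
   desc formatted table_name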

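2) External tables

1. Basic
create external table if not exists table_name(
id int,
name string
) row format delimited fields terminated by '\t' stored as textfile [location 'hdfs_path'];
2. as [not supported: Hive rejects create external table ... as select with a semantic error; create the external table first, then fill it with insert ... select]
3. like
create external table if not exists table_name like t_users [location 'hdfs_path'];

4. Differences between managed and external tables
drop table t_users_as; Dropping a managed table deletes its MetaStore entry and also the HDFS directory and data files
drop table t_user_ex;  Dropping an external table deletes only the MetaStore entry; the HDFS directory and data files remain
5. Differences in how external and managed tables are used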

3) Partitioned tables [query optimization]

Partitioned tables speed up queries that filter on the partition column.

create table t_user_part(
id int,
name string,
age int,
salary int)partitioned by (data string) row format delimited fields terminated by '\t';

load data local inpath '/root/data15' into table t_user_part partition (data='15');
load data local inpath '/root/data16' into table t_user_part partition (data='16');

select * from t_user_part;  scans the data of every partition

select id from t_user_part where data='15' and age>20;  reads only the data='15' partition
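
Partitioning helps because each partition value is stored in its own subdirectory named column=value under the table directory, so a filter on the partition column lets Hive skip whole directories. A small sketch of that layout follows, assuming the default warehouse path and the test database used earlier. partition_dir and dirs_to_scan are hypothetical helpers.

```python
import posixpath

WAREHOUSE = "/user/hive/warehouse"  # default warehouse location


def partition_dir(db, table, column, value):
    """Build the HDFS directory that holds one partition of a table."""
    return posixpath.join(WAREHOUSE, f"{db}.db", table, f"{column}={value}")


# The two partitions loaded above, keyed by partition value.
partitions = {
    v: partition_dir("test", "t_user_part", "data", v) for v in ("15", "16")
}


def dirs_to_scan(partitions, wanted=None):
    """Directories a query must read: all of them, or just one partition."""
    if wanted is None:
        return sorted(partitions.values())
    return [partitions[wanted]] if wanted in partitions else []


print(dirs_to_scan(partitions))        # select * from t_user_part
print(dirs_to_scan(partitions, "15"))  # ... where data='15'
```

The partition column is not stored inside the data files; its value comes from the directory name.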

4) Bucketed tables

5) Temporary tables

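3. Importing Data

1). Basic import

   load data local inpath 'local_path' into table table_name

2). Importing with the as keyword

   Creates the table and loads it with the query result in one step
   create table if not exists table_name as select id,name from t_users

3). Importing with insert

   # The table already exists; load it from a query.
   create table t_users_like like t_users;

   insert into table t_users_like select id,name,age,salary from t_users;

4). Importing from HDFS

   load data inpath 'hdfs_path' into table table_name
   Note: without local, the source file is moved (not copied) into the table's directory

5). Overwriting existing data during import

   load data inpath 'hdfs_path' overwrite into table table_name
   In effect, all files in the table's directory are deleted and the new ones are put in their place

6). Uploading files with the HDFS shell

   bin/hdfs dfs -put /xxxx  /user/hive/warehouse/db_name.db/table_name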

4. Exporting Data

1). sqoop

     A Hadoop companion tool for transfers in both directions:  HDFS/Hive  <------> RDB (MySQL,Oracle)

2). Exporting with insert

      # Hive creates /root/xiaohei automatically; it should not already exist, because any existing contents are overwritten
      insert overwrite [local] directory '/root/xiaohei' select name from t_user;

3). Downloading files with the HDFS shell

      bin/hdfs dfs -get /user/hive/warehouse/db_name.db/table_name /root/xxxx

4). Command-line script

      bin/hive --database 'test' -f /root/hive.sql > /root/result

5. Hive's Built-in Import and Export Tools

      1. export
      	export table tb_name to 'hdfs_path'
      2. import
      	import table tb_name from 'hdfs_path'

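6. MR-Related Configuration

# MR-related parameters
# Number of map tasks: one map task per split, and by default a split corresponds to one HDFS block
Map --> Split  ---> Block
# Number of reduce tasks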
mapred-site.xml
<property>
     <name>mapreduce.job.reduces</name>
     <value>1</value>
</property>
hive-site.xml
<!--1G-->
<property>
	  <name>hive.exec.reducers.bytes.per.reducer</name>
	  <value>1000000000</value>
</property>
<property>
     <name>hive.exec.reducers.max</name>
     <value>999</value>
</property>
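
The two hive-site.xml settings above work together when Hive has to pick the number of reducers itself. A simplified sketch of that estimate follows, assuming that a positive mapreduce.job.reduces value takes precedence and that the estimate is otherwise the input size divided by hive.exec.reducers.bytes.per.reducer, capped at hive.exec.reducers.max. estimate_reducers is a hypothetical helper, not Hive's actual code.

```python
import math


def estimate_reducers(input_bytes,
                      bytes_per_reducer=1_000_000_000,  # hive.exec.reducers.bytes.per.reducer
                      max_reducers=999,                 # hive.exec.reducers.max
                      fixed_reducers=-1):               # mapreduce.job.reduces, -1 = not fixed
    """Simplified estimate of how many reducers a job would get."""
    if fixed_reducers > 0:
        # An explicitly configured reducer count wins over the estimate.
        return fixed_reducers
    estimate = math.ceil(input_bytes / bytes_per_reducer)
    # Never fewer than 1 reducer, never more than the configured maximum.
    return max(1, min(max_reducers, estimate))


print(estimate_reducers(2_500_000_000))          # 2.5 GB of input
print(estimate_reducers(5_000_000_000_000))      # 5 TB of input, hits the cap
print(estimate_reducers(2_500_000_000, fixed_reducers=1))
```

With the values shown in these notes, 2.5 GB of input gives 3 reducers, while setting mapreduce.job.reduces to 1 forces a single reducer regardless of input size.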

Author: 豆比米大
