Overview

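This article uses the Hive data warehouse tool to analyze a website access log, including an analysis of client IP addresses. Along the way it covers preprocessing the log file with a regular expression when the data is loaded, cleaning the data with UDFs, and storing the result in ORC format with SNAPPY compression.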

1. Why data preprocessing is needed

Each record in the original log file has 11 fields.

A sample record from the log file (shown here one field per line):

"27.38.5.159" 
"-" 
"31/Aug/2015:00:04:37 +0800" 
"GET /course/view.php?id=27 HTTP/1.1" 
"303" 
"440" 
- 
"http://www.ibeifeng.com/user.php?act=mycourse" 
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36" 
"-" 
"learn.ibeifeng.com"

Creating the table the usual way

-- Create the table, splitting fields on spaces
create table IF NOT EXISTS default.bf_log_src (
remote_addr string,
remote_user string,
time_local string,
request string,
status string,
body_bytes_sent string,
request_body string,
http_referer string,
http_user_agent string,
http_x_forwarded_for string,
host string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
stored as textfile ;
-- Load the data
load data local inpath '/opt/datas/moodle.ibeifeng.access.log' into table bf_log_src ;
-- Show the table structure
desc formatted bf_log_src;
-- Count the rows
select count(*) from bf_log_src ;
-- Show the first 5 rows
select * from  bf_log_src limit 5 ;

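The problem
After running select * from bf_log_src limit 5 ; you will see that the table does not hold the 11 log fields correctly. Every space starts a new column, so the 11 columns are filled by the first 11 space-separated tokens, which cover only the first 8 log fields; the remaining fields are lost.
A closer look at the log file shows why: some fields contain spaces themselves.
The solution
The recommended fix is to parse each line with a regular expression. Preprocessing can also be done with a Python script; see the article "Preprocessing with Python and analyzing the MovieLens dataset with Hive".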
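
The misalignment described above can be reproduced outside Hive. The sketch below is a minimal illustration in plain Java, assuming the sample record sits on one line with single spaces between fields; the class name SpaceSplitDemo and its helper are made up for this example.

```java
public class SpaceSplitDemo {

    // Sample record from the log, assumed to be one space-separated line.
    static final String SAMPLE = String.join(" ",
        "\"27.38.5.159\"",
        "\"-\"",
        "\"31/Aug/2015:00:04:37 +0800\"",
        "\"GET /course/view.php?id=27 HTTP/1.1\"",
        "\"303\"",
        "\"440\"",
        "-",
        "\"http://www.ibeifeng.com/user.php?act=mycourse\"",
        "\"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36\"",
        "\"-\"",
        "\"learn.ibeifeng.com\"");

    // Split on every single space, as a space-delimited table definition would.
    static String[] splitOnSpaces(String line) {
        return line.split(" ");
    }

    public static void main(String[] args) {
        String[] tokens = splitOnSpaces(SAMPLE);
        System.out.println("tokens: " + tokens.length);
        // An 11-column table keeps only the first 11 tokens.
        for (int i = 0; i < 11; i++) {
            System.out.println("column " + (i + 1) + ": " + tokens[i]);
        }
    }
}
```

The timestamp and the request are each cut into several tokens, so the eleventh column already holds the referer and everything after it is dropped.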

2. Using RegexSerDe to process Apache or Nginx log files

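The example from the official Apache Hive documentation for handling log files (the regular expression in this example contains errors; for instance, [^]* appears to have lost the space in [^ ]*):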

CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?"
)
STORED AS TEXTFILE;

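Simply follow the official Apache example and change the regular expression to one that fits your own log format.
The correct way to create the table for this project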

drop table if exists default.bf_log_src ;
create table IF NOT EXISTS default.bf_log_src (
remote_addr string,
remote_user string,
time_local string,
request string,
status string,
body_bytes_sent string,
request_body string,
http_referer string,
http_user_agent string,
http_x_forwarded_for string,
host string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(\"[^ ]*\") (\"-|[^ ]*\") (\"[^\]]*\") (\"[^\"]*\") (\"[0-9]*\") (\"[0-9]*\") (-|[^ ]*) (\"[^ ]*\") (\"[^\"]*\") (-|[^ ]*) (\"[^ ]*\")"
)
STORED AS TEXTFILE;

load data local inpath '/opt/datas/moodle.ibeifeng.access.log' into table default.bf_log_src ;

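Regular expression syntax used above
( ) groups the enclosed pattern into a single unit and captures it as one field
\ escapes the next character
| means alternation (or)
^ at the start of a character class, as in [^ ], negates the class
* matches the preceding element zero or more times
.* matches any sequence of characters
[0-9] matches a single digit
[0-9]* matches zero or more digits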
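
Before loading a large file, it can help to try the pattern on one record. The following sketch uses java.util.regex with the same pattern as the input.regex above, rewritten as a Java string literal. The class name LogRegexDemo, the parse helper, and the assumption that the sample record is a single space-separated line are illustrative only. It requires the whole line to match, and returns null otherwise to mirror the way RegexSerDe is documented to produce NULL columns for rows that do not match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogRegexDemo {

    // Same pattern as input.regex, escaped for a Java string literal.
    static final Pattern LOG_PATTERN = Pattern.compile(
        "(\"[^ ]*\") (\"-|[^ ]*\") (\"[^\\]]*\") (\"[^\"]*\") "
        + "(\"[0-9]*\") (\"[0-9]*\") (-|[^ ]*) (\"[^ ]*\") "
        + "(\"[^\"]*\") (-|[^ ]*) (\"[^ ]*\")");

    // Sample record, assumed to be one space-separated line in the log.
    static final String SAMPLE = String.join(" ",
        "\"27.38.5.159\"",
        "\"-\"",
        "\"31/Aug/2015:00:04:37 +0800\"",
        "\"GET /course/view.php?id=27 HTTP/1.1\"",
        "\"303\"",
        "\"440\"",
        "-",
        "\"http://www.ibeifeng.com/user.php?act=mycourse\"",
        "\"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36\"",
        "\"-\"",
        "\"learn.ibeifeng.com\"");

    // Returns the captured fields, or null when the whole line does not match.
    static String[] parse(String line) {
        Matcher m = LOG_PATTERN.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[m.groupCount()];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1);
        }
        return fields;
    }

    public static void main(String[] args) {
        String[] fields = parse(SAMPLE);
        for (int i = 0; i < fields.length; i++) {
            System.out.println((i + 1) + ": " + fields[i]);
        }
    }
}
```

Each captured field still carries its surrounding double quotes, which is why a cleaning step is needed later.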

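3. Splitting tables by business need

3.1 Requirements analysis

IP address
- Determine the user's region from the IP address for targeted marketing
- User statistics: count visits to a given site
Access time
- Analyze the times of day when users visit the site
- For the sales team, schedule shifts sensibly
Request URL
- Find out which products users care about most
- Target product promotion
Referring link
- See how users arrive at our products
- Target a specific region for advertising

3.2 Splitting the table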

drop table if exists default.bf_log_comm ;
create table IF NOT EXISTS default.bf_log_comm (
remote_addr string,
time_local string,
request string,
http_referer string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="SNAPPY");
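-- The last line of the create statement sets ORC as the storage format (ORC takes little storage space) and specifies Snappy compression

-- Insert data from a query
insert into table default.bf_log_comm select remote_addr, time_local, request,http_referer from  default.bf_log_src ;

-- Show the first 5 records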
select * from bf_log_comm limit 5 ;

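4. Data cleaning

4.1 How to write a Hive user-defined function

Add the dependencies to the Maven project: hadoop-client, hive-exec, hive-jdbc
Extend the UDF class: import org.apache.hadoop.hive.ql.exec.UDF;
Implement at least one evaluate method. evaluate may be overloaded, and its return type must not be void
After writing the program, package it as a jar and upload it to the Linux machine (if an IDE is available on Linux, work there directly). See the article "Screenshots and notes: a detailed record of exporting a jar from IDEA"
Register the jar: add jar /opt/datas/udf.jar;
Create the function: create temporary function my_udf as '<package name>.BigDataUdf';
Call the custom function in SQL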

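4.2 A UDF to strip double quotes

Running show functions; lists the functions available in Hive, and none of them is dedicated to stripping quotes. Built-in functions can do the job (regexp_replace, for example, or a combination of string functions), but chaining several function calls can cost performance, so a purpose-built UDF is preferred here. Sensible use of UDFs is one way to tune Hive, and companies often define hundreds or even thousands of them.
For these reasons, the problem is solved here with a custom function.

The UDF code is as follows: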

package com.bigdata.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
/**
 * 1. Implement one or more methods named "evaluate" which will be called by Hive.
 * 2. "evaluate" should never be a void method. However it can return "null" if needed.
 *
 */
public class RemoveQuotesUDF extends UDF {

    public Text evaluate(Text str){
        // validate
        if(null == str){
            return null ;
        }
        if(null == str.toString()){
            return null ;
        }

        // remove
        return new Text (str.toString().replaceAll("\"", ""))  ;
    }

    public static void main(String[] args) {
        System.out.println(new RemoveQuotesUDF().evaluate(new Text("\"12\"")));
    }
}
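
The core of the UDF is an ordinary string operation, so it can be checked without Hive on the classpath. The sketch below is a hypothetical standalone version: the class QuoteStripDemo and both method names are invented for illustration. stripAll mirrors what the UDF does, while stripEnclosing is an alternative that removes only one leading and one trailing quote, which may be preferable if a field can legitimately contain quotes inside it.

```java
public class QuoteStripDemo {

    // Removes every double quote, like the UDF above.
    static String stripAll(String s) {
        if (s == null) {
            return null;
        }
        return s.replace("\"", "");
    }

    // Removes only one enclosing pair of double quotes, if present.
    static String stripEnclosing(String s) {
        if (s == null) {
            return null;
        }
        if (s.length() >= 2 && s.startsWith("\"") && s.endsWith("\"")) {
            return s.substring(1, s.length() - 1);
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(stripAll("\"27.38.5.159\""));
        System.out.println(stripEnclosing("\"say \"hi\"\""));
    }
}
```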

Use the UDF and insert the data in overwrite mode

-- Add the jar
add jar /opt/datas/jar/hiveUDF.jar ;
-- Create the custom function
create temporary function my_removequotes as "com.bigdata.hive.udf.RemoveQuotesUDF" ;
-- List the jars that have been added
list jars;

insert overwrite table default.bf_log_comm select my_removequotes(remote_addr), my_removequotes(time_local), my_removequotes(request), my_removequotes(http_referer) from  default.bf_log_src ;

select * from bf_log_comm limit 5 ;

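4.3 A UDF to convert the date-time format

Following the same pattern as 4.2, write the program, upload it to Linux, add the jar to Hive, and run create temporary function (the query further down calls the function my_datetransform).

The date conversion function: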

package com.bigdata.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class DateTransformUDF extends UDF {

    // set date format for input and output
    // HH is the 24-hour clock; hh (1-12) would misread 12:xx as 00:xx
    private final SimpleDateFormat inputFormat = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss",Locale.ENGLISH);
    private final SimpleDateFormat outputFormat = new SimpleDateFormat("yyyyMMddHHmmss");

    public Text evaluate(Text input){

        Text output = new Text();
        if(null == input)
            return null;
        String inputDate = input.toString().trim();
        if(inputDate.equals(""))
            return null;

        try {
            // parse
            Date parseDate = inputFormat.parse(inputDate);
            // date transform(set format)
            String outputDate = outputFormat.format(parseDate);
            // String to Text
            output.set(outputDate);
        } catch (ParseException e) {
            e.printStackTrace();
            return output;
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(new DateTransformUDF().evaluate(new Text("31/Aug/2015:00:04:37 +0800")));
    }

}
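
The conversion itself can be tried without Hive. The sketch below is a minimal standalone version, assuming the input has already had its quotes removed; the class DateTransformDemo and the transform method are invented for this example. It uses the 24-hour pattern letter HH for both parsing and formatting, and relies on SimpleDateFormat.parse stopping at the end of the pattern, so the trailing " +0800" is ignored rather than applied as an offset.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;

public class DateTransformDemo {

    // Converts "dd/MMM/yyyy:HH:mm:ss ..." into "yyyyMMddHHmmss".
    // Returns null when the input cannot be parsed.
    static String transform(String raw) {
        if (raw == null) {
            return null;
        }
        // New formatter instances per call: SimpleDateFormat is not thread-safe.
        SimpleDateFormat in =
            new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
        SimpleDateFormat out =
            new SimpleDateFormat("yyyyMMddHHmmss", Locale.ENGLISH);
        try {
            return out.format(in.parse(raw.trim()));
        } catch (ParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(transform("31/Aug/2015:00:04:37 +0800"));
        System.out.println(transform("31/Aug/2015:12:30:00 +0800"));
    }
}
```

Both formatters use the JVM default time zone, so the wall-clock time in the log is carried over unchanged.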

Overwrite the data

insert overwrite table default.bf_log_comm select my_removequotes(remote_addr), my_datetransform(my_removequotes(time_local)), my_removequotes(request), my_removequotes(http_referer) from  default.bf_log_src ;

select * from bf_log_comm limit 5 ;

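5. Writing HQL to analyze the data

5.1 Analyzing the times of day when users visit the site

The time_local field now holds values in yyyyMMddHHmmss format
To analyze when users visit the site, only HH (the hour) is needed
Use Hive's built-in substring function to extract it from time_local (positions start at 1)
To see detailed usage of the function: desc function extended substring;

HQL: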

select hour, count(hour) cnt from 
(select substring(time_local, 9, 2) hour from bf_log_comm) t 
group by hour order by cnt desc;

Result analysis
Users mostly visit the site between 3 p.m. and 5 p.m.
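
The extraction and grouping logic of this query can be sketched in plain Java to make the indexing explicit. Hive's substring(time_local, 9, 2) is 1-based, which corresponds to substring(8, 10) in Java's 0-based indexing. The class HourHistogramDemo, its methods, and the timestamps in main are hypothetical and not taken from the log.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class HourHistogramDemo {

    // Equivalent of Hive's substring(time_local, 9, 2) on yyyyMMddHHmmss.
    static String hourOf(String timeLocal) {
        return timeLocal.substring(8, 10);
    }

    // Counts records per hour, like the GROUP BY in the query.
    static Map<String, Integer> countByHour(List<String> timestamps) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String ts : timestamps) {
            counts.merge(hourOf(ts), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up timestamps, only to exercise the code.
        List<String> sample = Arrays.asList(
            "20150831150000", "20150831153000", "20150831090000");
        System.out.println(countByHour(sample));
    }
}
```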

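5.2 Analyzing users' IP addresses

The first two segments of an IP address are enough to determine the region, so only those two segments need to be queried
For addresses in China, the first two segments take at most 7 characters and at least 5, so substring(remote_addr,1,7) is one option. A fixed-width cut picks up part of the third segment whenever the prefix is shorter than 7 characters, however, so a UDF is used here instead.

The UDF code is as follows: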

    public Text evaluate(Text input){
        // verify
        if(null == input || null == input.toString()) return null;

        Text output = new Text();
        // split by "."
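        String[] inputSplit = input.toString().trim().split("\\.");
        // guard: fewer than two segments means the value is not a dotted IP
        if (inputSplit.length < 2) return null;
        // join the first two segments
        String outputStr = inputSplit[0]+"."+inputSplit[1];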
        // set output
        output.set(outputStr);
        return output;
    }
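
To see the difference between the fixed-width substring and the split-based approach, the two can be compared in plain Java. This is a sketch only: IpPrefixDemo and its method names are hypothetical, and the null return for malformed input is an assumption about how such rows should be handled.

```java
public class IpPrefixDemo {

    // Split-based approach: keep exactly the first two dotted segments.
    static String firstTwoSegments(String ip) {
        if (ip == null) {
            return null;
        }
        String[] parts = ip.trim().split("\\.");
        if (parts.length < 2) {
            return null; // not a dotted address
        }
        return parts[0] + "." + parts[1];
    }

    // Fixed-width approach: first 7 characters, like substring(remote_addr,1,7).
    static String firstSevenChars(String ip) {
        return ip.length() <= 7 ? ip : ip.substring(0, 7);
    }

    public static void main(String[] args) {
        String ip = "27.38.5.159";
        System.out.println(firstTwoSegments(ip));
        System.out.println(firstSevenChars(ip));
    }
}
```

For the sample address the split-based version yields the two-segment prefix, while the fixed-width cut also includes part of the third segment.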

The HQL is as follows:

add jar /opt/datas/hiveUDF.jar ;
create temporary function my_getRegion as 'com.bigdata.hive.udf.GetRegionUDF';

select t.addr, count(addr) cnt from 
(
select my_getRegion(remote_addr) addr from  bf_log_comm
) t 
group by t.addr order by cnt desc limit 12;

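The geographic location for each IP prefix can be kept in a separate table and joined with the query result. Because this joins a small table against a large one, a map join is a good fit.

Summary

Only two of the fields were analyzed here; the other two can be analyzed in a similar way. Anyone who has learned Spark knows that the analysis above takes roughly a single line there. However, Spark can replace Hive only as a query engine, not as a data warehouse tool, so Hive is still worth learning properly. Another article on my blog covers a more involved Hive project that also uses Sqoop, MySQL, and other tools.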

pomelorange

Hive: Website Log Analysis Based on Hive