Pig Deployment

1. Environment

Linux ISO: CentOS-6.0-i386-bin-DVD.iso (32-bit)

JDK version: 1.6.0_25-ea

Hadoop version: hadoop-0.20.2.tar.gz

HBase version: hbase-0.90.5

Pig version: pig-0.9.2.tar.gz   http://mirror.bjtu.edu.cn/apache/pig/pig-0.9.2/pig-0.9.2.tar.gz  downloaded from the Beijing Jiaotong University Apache mirror. This is not the latest release, but it matches Hadoop 0.20.2. Pig releases have version-compatibility requirements with Hadoop, so check which Hadoop version you are running and look up the matching Pig release. The mirror above carries the full Pig series, e.g. pig-0.10.0.tar.gz, so I won't list them all here.
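For example, to fetch the package directly onto the node (a sketch, assuming the node has outbound network access):

[grid@h1 grid]$ wget http://mirror.bjtu.edu.cn/apache/pig/pig-0.9.2/pig-0.9.2.tar.gz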

2. Pig Installation Modes

Local mode: essentially single-machine mode. Pig can only access the local host; there is no distribution, and you do not even need Hadoop installed. All command execution and file reads/writes happen locally. It is commonly used for experiments and homework.

Local mode needs only one environment variable:

export PATH=/usr/java/jdk1.6.0_25/bin:/home/grid/hadoop-0.20.2/bin:/home/grid/pig-0.9.2/bin:$PATH
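As a quick sketch (assuming the PATH above has been loaded), local mode is entered with the -x local flag and works against the local filesystem:

[grid@h1 grid]$ pig -x local          -- start the Grunt shell in local mode; no Hadoop cluster required
grunt> ls                             -- lists the local working directory instead of HDFS
grunt> quit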

MapReduce mode: this is the mode used in real-world work. Files are uploaded into HDFS, and jobs written in Pig Latin are distributed across the Hadoop cluster as they run, which is exactly the MapReduce idea: the Pig client connects to the Hadoop cluster to manage and analyze the data.

It requires configuring PATH and PIG_CLASSPATH, checking the hosts file, and then starting pig.

This article walks through the MapReduce-mode installation, since it is the most common and most useful setup in practice.

Pig acts as a Hadoop client: the Pig package can be installed on any node of the cluster and can submit jobs from any node. I installed it on the master node this time to keep the deployment architecture easy to follow.

3. Verify the Hadoop Cluster Status

Verify from the shell command line:

[grid@h1 hadoop-0.20.2]$ bin/hadoop dfsadmin -report

Configured Capacity: 19865944064 (18.5 GB)

Present Capacity: 8833888256 (8.23 GB)

DFS Remaining: 8833495040 (8.23 GB)

DFS Used: 393216 (384 KB)

DFS Used%: 0%

Under replicated blocks: 4

Blocks with corrupt replicas: 0

Missing blocks: 0

-------------------------------------------------

Datanodes available: 2 (2 total, 0 dead)   -- both datanodes alive, none shut down

Name: 192.168.2.103:50010           -- slave node h2

Decommission Status : Normal         -- status normal

Configured Capacity: 9932972032 (9.25 GB)

DFS Used: 196608 (192 KB)

Non DFS Used: 5401513984 (5.03 GB)

DFS Remaining: 4531261440(4.22 GB)

DFS Used%: 0%

DFS Remaining%: 45.62%

Last contact: Fri Nov 02 18:58:02 CST 2012

Name: 192.168.2.105:50010                      -- slave node h4

Decommission Status : Normal                    -- status normal

Configured Capacity: 9932972032 (9.25 GB)

DFS Used: 196608 (192 KB)

Non DFS Used: 5630541824 (5.24 GB)

DFS Remaining: 4302233600(4.01 GB)

DFS Used%: 0%

DFS Remaining%: 43.31%

Last contact: Fri Nov 02 18:58:02 CST 2012

[grid@h1 hadoop-0.20.2]$ jps           master -> Hadoop and HBase daemons are all up

22926 HQuorumPeer

4709 JobTracker

22977 HMaster

4515 NameNode

4650 SecondaryNameNode

31681 Jps

[grid@h2 tmp]$ jps                    slave1 -> Hadoop and HBase daemons are all up

17188 TaskTracker

22181 Jps

13800 HRegionServer

13727 HQuorumPeer

17077 DataNode

[grid@h4 logs]$ jps                   slave2 -> Hadoop and HBase daemons are all up

27829 TaskTracker

19978 Jps

26875 Jps

17119 DataNode

11636 HRegionServer

11557 HQuorumPeer

4. Pig Installation and Configuration

(1) Upload pig-0.9.2.tar.gz to h1:/home/grid/ and untar it

[grid@h1 grid]$ pwd

/home/grid

[grid@h1 grid]$ ll

total 46832

-rwxrwxrwx.  1 grid hadoop       44 Sep 18 19:10 abc.txt

-rwxrwxrwx.  1 grid hadoop     5519 Oct 12 22:09 Exercise_1.jar

drwxr-xr-x. 14 grid hadoop     4096 Sep 18 07:05 hadoop-0.20.2

drwxr-xr-x. 10 grid hadoop     4096 Oct 28 21:13 hbase-0.90.5

-rwxrw-rw-.  1 grid hadoop 47875717 Nov  2 06:44 pig-0.9.2.tar.gz

[grid@h1 grid]$ tar -zxvf pig-0.9.2.tar.gz

[grid@h1 grid]$ ll

total 46836

-rwxrwxrwx.  1 grid hadoop       44 Sep 18 19:10 abc.txt

-rwxrwxrwx.  1 grid hadoop     5519 Oct 12 22:09 Exercise_1.jar

drwxr-xr-x. 14 grid hadoop     4096 Sep 18 07:05 hadoop-0.20.2

drwxr-xr-x. 10 grid hadoop     4096 Oct 28 21:13 hbase-0.90.5

drwxr-xr-x.  2 grid hadoop     4096 Sep 16 19:57 input

drwxr-xr-x. 15 grid hadoop     4096 Jan 18  2012 pig-0.9.2

-rwxrw-rw-.  1 grid hadoop 47875717 Nov  2 06:44 pig-0.9.2.tar.gz


(2) Configure Pig's environment variables (the entries below are the ones to add or modify)

[grid@h1 grid]$ vim .bashrc

export JAVA_HOME=/usr              -- do not point this at the Java directory itself; it only takes effect with the parent directory

export JRE_HOME=/usr/java/jdk1.6.0_25/jre

export PATH=/usr/java/jdk1.6.0_25/bin:/home/grid/hadoop-0.20.2/bin:/home/grid/pig-0.9.2/bin:$PATH

-- add the Hadoop and Pig command directories so the shell knows where to look for their commands and programs

export CLASSPATH=./:/usr/java/jdk1.6.0_25/lib:/usr/java/jdk1.6.0_25/jre/lib

export PIG_CLASSPATH=/home/grid/hadoop-0.20.2/conf      -- since this is MapReduce mode, Pig must be able to locate the Hadoop cluster. This tells Pig where the Hadoop configuration files live; from core-site.xml, hdfs-site.xml, and mapred-site.xml it can read the key parameters, namely the locations and ports of the NameNode and JobTracker, and with that information it can drive the whole cluster.
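For reference, the entries Pig reads from those files look roughly like this (a sketch; the values match the cluster used in this article):

<!-- core-site.xml -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://h1:9000</value>
</property>

<!-- mapred-site.xml -->
<property>
  <name>mapred.job.tracker</name>
  <value>h1:9001</value>
</property>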


Method two: edit /home/grid/pig-0.9.2/conf/pig.properties, which can also enable MapReduce mode.

Add:

fs.default.name=hdfs://h1:9000             -- locates the NameNode

mapred.job.tracker=h1:9001                 -- locates the JobTracker


(3) Apply the environment variables

[grid@h1 grid]$ source .bashrc                  -- reload the environment variables so they take effect


(4) Check the hosts file

[grid@h1 grid]$ cat /etc/hosts

192.168.2.102     h1   # Added by NetworkManager

127.0.0.1      localhost.localdomain       localhost

::1   h1   localhost6.localdomain6   localhost6

192.168.2.102   h1

192.168.2.103   h2

192.168.2.105   h4

This file maps hostnames to IP addresses. Hadoop clusters generally communicate by hostname, and the configuration files also reference nodes by hostname; a quick resolution check is sketched below.
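A simple sanity check (a sketch; not strictly required) that the names resolve:

[grid@h1 grid]$ ping -c 1 h2        -- should answer from 192.168.2.103
[grid@h1 grid]$ ping -c 1 h4        -- should answer from 192.168.2.105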


(5) Start Pig

[grid@h1 grid]$ pig -x mapreduce              -- explicitly request MapReduce mode (the bare pig command works too, see below)

2012-11-02 20:09:22,149 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/grid/pig_1351858162147.log

2012-11-02 20:09:23,314 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://h1:9000            -- Pig found the NameNode

2012-11-02 20:09:27,950 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: h1:9001                 -- Pig found the JobTracker

grunt> quit                                 -- exit the Pig client

[grid@h1 grid]$

[grid@h1 grid]$ pig                   -- the bare pig command also enters the shell (MapReduce mode is the default)

2012-11-02 20:16:17,968 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/grid/pig_1351858577966.log

2012-11-02 20:16:18,100 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://h1:9000

2012-11-02 20:16:18,338 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: h1:9001

grunt> help                                -- show the command reference

Commands:

<pig latin statement>; - See the PigLatin manual for details: http://hadoop.apache.org/pig

File system commands:

    fs <fs arguments> - Equivalent to Hadoop dfs command: http://hadoop.apache.org/common/docs/current/hdfs_shell.html

Diagnostic commands:

    describe <alias>[::<alias>] - Show the schema for the alias. Inner aliases can be described as A::B.

    explain [-script <pigscript>] [-out <path>] [-brief] [-dot] [-param <param_name>=<param_value>]

        [-param_file <file_name>] [<alias>] - Show the execution plan to compute the alias or for entire script.

        -script - Explain the entire script.

        -out - Store the output into directory rather than print to stdout.

        -brief - Don't expand nested plans (presenting a smaller graph for overview).

        -dot - Generate the output in .dot format. Default is text format.

        -param <param_name> - See parameter substitution for details.

        -param_file <file_name> - See parameter substitution for details.

        alias - Alias to explain.

    dump <alias> - Compute the alias and writes the results to stdout.

Utility Commands:

    exec [-param <param_name>=param_value] [-param_file <file_name>] <script> -

        Execute the script with access to grunt environment including aliases.

        -param <param_name> - See parameter substitution for details.

        -param_file <file_name> - See parameter substitution for details.

        script - Script to be executed.

    run [-param <param_name>=param_value] [-param_file <file_name>] <script> -

        Execute the script with access to grunt environment.

        -param <param_name> - See parameter substitution for details.

        -param_file <file_name> - See parameter substitution for details.

        script - Script to be executed.

    kill <job_id> - Kill the hadoop job specified by the hadoop job id.

    set <key> <value> - Provide execution parameters to Pig. Keys and values are case sensitive.

        The following keys are supported:

        default_parallel - Script-level reduce parallelism. Basic input size heuristics used by default.

        debug - Set debug on or off. Default is off.

        job.name - Single-quoted name for jobs. Default is PigLatin:<script name>

        job.priority - Priority for jobs. Values: very_low, low, normal, high, very_high. Default is normal

        stream.skippath - String that contains the path. This is used by streaming.

        any hadoop property.

    help - Display this message.

    quit - Quit the grunt shell.


(6) Pig operation commands

Auto-completion (case sensitive): just like command completion in Linux, type part of a command and press Tab to complete it. Note that it does not complete file names!


grunt> ls                                      -- show the contents of the HDFS home directory

hdfs://h1:9000/user/grid/in        <dir>         -- <dir> marks a directory; <r 3> marks a file (the 3 is its replication factor)

hdfs://h1:9000/user/grid/out1       <dir>

hdfs://h1:9000/user/grid/out2       <dir>

grunt> cd in                                -- enter the in subdirectory

grunt> ls

hdfs://h1:9000/user/grid/in/test_1<r 3>     324       -- 324 bytes

hdfs://h1:9000/user/grid/in/test_2<r 3>     134       -- 134 bytes

grunt> cat test_1                               -- show the contents of test_1

Apr 23 11:49:54 hostapd: wlan0: STA 14:7d:c5:9e:fb:84

Apr 23 11:49:52 hostapd: wlan0: STA 74:e5:0b:04:28:f2

Apr 23 11:49:50 hostapd: wlan0: STA cc:af:78:cc:d5:5d

Apr 23 11:49:44 hostapd: wlan0: STA cc:af:78:cc:d5:5d

Apr 23 11:49:43 hostapd: wlan0: STA 74:e5:0b:04:28:f2

Apr 23 11:49:42 hostapd: wlan0: STA 14:7d:c5:9e:fb:84

grunt> cat test_2                               -- show the contents of test_2

13599999999 10086

13899999999 120

13944444444 13800138000

13722222222 13800138000

18800000000 120

13722222222 10086

18944444444 10086


At the grunt> prompt everything is shown as an absolute path; there is no relative-path display.

grunt> has the concept of a current directory, which it remembers and manages.

grunt> operates on HDFS directly, so you no longer need to type the long-winded HDFS commands (if you do want the raw Hadoop syntax, see the fs sketch below).
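As the help output above notes, the fs command passes its arguments straight through to Hadoop; a small sketch:

grunt> fs -ls /user/grid           -- equivalent to bin/hadoop dfs -ls /user/grid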


copyFromLocal            copies from the local OS filesystem -> into HDFS

grunt> copyFromLocal /home/grid/access_log.txt pig/access_log.txt

grunt> ls

hdfs://h1:9000/user/grid/in        <dir>

hdfs://h1:9000/user/grid/out1       <dir>

hdfs://h1:9000/user/grid/out2       <dir>

hdfs://h1:9000/user/grid/pig     <dir>

grunt> cd pig

grunt> ls

hdfs://h1:9000/user/grid/pig/access_log.txt<r 2>    7118627     -- the byte count matches


copyToLocal                    copies from HDFS -> into the local OS filesystem

grunt> copyToLocal test_1 ttt

grunt> ls

hdfs://h1:9000/user/grid/in/test_1<r 3>     324

hdfs://h1:9000/user/grid/in/test_2<r 3>     134

[grid@h1 grid]$ cat ttt                               -- a perfect copy

Apr 23 11:49:54 hostapd: wlan0: STA 14:7d:c5:9e:fb:84

Apr 23 11:49:52 hostapd: wlan0: STA 74:e5:0b:04:28:f2

Apr 23 11:49:50 hostapd: wlan0: STA cc:af:78:cc:d5:5d

Apr 23 11:49:44 hostapd: wlan0: STA cc:af:78:cc:d5:5d

Apr 23 11:49:43 hostapd: wlan0: STA 74:e5:0b:04:28:f2

Apr 23 11:49:42 hostapd: wlan0: STA 14:7d:c5:9e:fb:84


The sh command                             runs operating-system commands directly from grunt>

grunt> sh pwd

/home/grid

grunt> sh cat ttt

Apr 23 11:49:54 hostapd: wlan0: STA 14:7d:c5:9e:fb:84

Apr 23 11:49:52 hostapd: wlan0: STA 74:e5:0b:04:28:f2

Apr 23 11:49:50 hostapd: wlan0: STA cc:af:78:cc:d5:5d

Apr 23 11:49:44 hostapd: wlan0: STA cc:af:78:cc:d5:5d

Apr 23 11:49:43 hostapd: wlan0: STA 74:e5:0b:04:28:f2

Apr 23 11:49:42 hostapd: wlan0: STA 14:7d:c5:9e:fb:84

5. Pig Case Study

Task: use Pig Latin to process the access_log.txt log and compute the number of hits per IP.

First, the command list; these are the Grunt shell commands we commonly use:

<EOF>

    "cat" ...

    "fs" ...

    "sh" ...

    "cd" ...

    "cp" ...

    "copyFromLocal" ...

    "copyToLocal" ...

    "dump" ...

    "describe" ...

    "aliases" ...

    "explain" ...

    "help" ...

    "kill" ...

    "ls" ...

    "mv" ...

    "mkdir" ...

    "pwd" ...

    "quit" ...

    "register" ...

    "rm" ...

    "rmf" ...

    "set" ...

    "illustrate" ...

    "run" ...

    "exec" ...

    "scriptDone" ...

    "" ...

    <EOL> ...

    ";" ...

grunt> pwd

hdfs://h1:9000/user/grid/pig

grunt> ls

hdfs://h1:9000/user/grid/pig/access_log.txt<r 2>    7118627      -- this is the file we will process

grunt> cat access_log.txt      -- take a look at the contents before analyzing the data

119.146.220.12 - - [31/Jan/2012:23:59:51 +0800] "GET /static/js/jquery-1.6.js HTTP/1.1" 404 299 "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"

119.146.220.12 - - [31/Jan/2012:23:59:52 +0800] "GET /static/js/floating-jf.js HTTP/1.1" 404 300 "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"

119.146.220.12 - - [31/Jan/2012:23:59:55 +0800] "GET /popwin_js.php?fid=53 HTTP/1.1" 404 289 "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"

119.146.220.12 - - [31/Jan/2012:23:59:55 +0800] "GET /static/js/smilies.js?AZH HTTP/1.1" 304 - "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"

119.146.220.12 - - [31/Jan/2012:23:59:55 +0800] "GET /data/cache/common_smilies_var.js?AZH HTTP/1.1" 304 - "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"

The algorithm:

This is a slice of the dataguru access log. From the line structure, the IP address comes first, so we only need to extract the IP into an ip_text table, group on the ip column (which effectively splits it into many small tables, one per IP), and then count the rows of each small table; that row count is the IP's number of hits. The full pipeline is sketched below.
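As a preview, here is the whole pipeline in one place (each statement is explained step by step below):

ip_text       = LOAD 'pig/access_log.txt' USING PigStorage(' ') AS (ip:chararray);
group_ip      = GROUP ip_text BY ip;
count_ip      = FOREACH group_ip GENERATE group, COUNT($1) AS count_ip;
sort_count_ip = ORDER count_ip BY count_ip DESC;
STORE sort_count_ip INTO 'pig/sort_count_ip';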


(1) Load access_log.txt from HDFS into a Pig relation (table), using a space as the delimiter; only the ip column is needed.

grunt> ip_text = LOAD 'pig/access_log.txt' USING PigStorage(' ') AS (ip:chararray);

ip_text: a relation, a table, a variable; it holds all the ip records

LOAD 'pig/access_log.txt': the file to load

USING PigStorage(' '): use a space as the field delimiter

ip:chararray: the first column is named ip, with the character type chararray


(2) Inspect the structure and contents of ip_text

Be careful with the details, e.g. do not drop the statement terminator. When a Pig Latin statement runs, Pig transparently converts it into a MapReduce job: it first builds a jar, then submits the MR job, which executes under a Hadoop job id, and finally prints the result.

grunt> DESCRIBE ip_text;                    -- show the structure: a single column of character type

ip_text: {ip: chararray}

grunt> DUMP ip_text;                       -- show the contents (only an excerpt is reproduced here)

creating jar file Job2594979755419279957.jar

1 map-reduce job(s) waiting for submission

HadoopJobId: job_201210121146_0002

(119.146.220.12)

(180.153.227.41)

(180.153.227.44)

(180.153.227.44)

(180.153.227.44)

(221.194.180.166)

(119.146.220.12)

(119.146.220.12)

(119.146.220.12)

(119.146.220.12)

(119.146.220.12)

(119.146.220.12)

(119.146.220.12)

(119.146.220.12)

(119.146.220.12)

(119.146.220.12)

(119.146.220.12)

(119.146.220.12)

(119.146.220.12)

(220.181.94.221)

(119.146.220.12)

(119.146.220.12)

(119.146.220.12)

(119.146.220.12)

(119.146.220.12)

(119.146.220.12)


(3) Group on the ip column, then inspect the grouped table's contents and structure; note that keywords are case sensitive

Each IP's records become their own small table; the grouped result is stored in the group_ip table.

grunt> group_ip = GROUP ip_text BY ip;                -- group by ip and assign to group_ip

grunt> DESCRIBE group_ip;                          -- inspect group_ip's structure

group_ip: {group: chararray,ip_text: {(ip: chararray)}}

At a glance group_ip is a nested table. The first field is group: the grouped ip value.

The second field is a nested small table, also called a bag: the whole collection of records for that ip.

grunt> DUMP group_ip;                             -- submits another MR job

Pig script settings are added to the job                 -- the Pig script is converted into an MR job

creating jar file Job2785495206577164389.jar           -- building the jar

jar file Job2785495206577164389.jar created            -- jar build complete

map-reduce job(s) waiting for submission.              -- submitting the job

HadoopJobId: job_201210121146_0003                -- the Hadoop job id

(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),


(4) Count each small table's rows, i.e. the hits per IP

grunt> count_ip = FOREACH group_ip GENERATE group,COUNT($1) AS count_ip;

FOREACH group_ip: scan group_ip row by row, assigning the result to count_ip

GENERATE group: emit the grouped ip value

COUNT($1) AS count_ip: count the rows of the nested small table (the bag), i.e. the IP's hits, and alias the column count_ip so it is easy to sort in descending order. $1 is the second field of group_ip (positions start at $0), namely the bag, so this is equivalent to COUNT(ip_text.ip)

grunt> sort_count_ip = ORDER count_ip BY count_ip DESC;  -- sort by count_ip, largest first

# grunt> sort_count_ip = ORDER count_ip BY count_ip ASC;  -- or smallest first


(5) Inspect sort_count_ip's structure and contents

grunt> DESCRIBE sort_count_ip;                -- show the structure: two columns

sort_count_ip: {group: chararray,count_ip: long}   -- the first field is group, chararray (the ip value); the second is count_ip, long (the hit count)

grunt> DUMP sort_count_ip;    -- show the contents (excerpt); the job statistics are printed first, then the results

HadoopVersion    PigVersion    UserId    StartedAt      FinishedAt    Features

0.20.2    0.9.2      grid 2012-11-03 21:13:05  2012-11-03 21:18:39  GROUP_BY,ORDER_BY

Success!

Input(s):

Successfully read 28134 records (7118627 bytes) from: "hdfs://h1:9000/user/grid/pig/access_log.txt"


Output(s):

Successfully stored 476 records (14515 bytes) in: "hdfs://h1:9000/tmp/temp1703385752/tmp-1916755802"

Counters:

Total records written : 476

Total bytes written : 14515

Spillable Memory Manager spill count : 0

Total bags proactively spilled: 0

Total records proactively spilled: 0


Job DAG:

job_201210121146_0004 ->    job_201210121146_0005,

job_201210121146_0005 ->    job_201210121146_0006,

job_201210121146_0006

(218.20.24.203,4597)

(221.194.180.166,4576)

(119.146.220.12,1850)

(117.136.31.144,1647)

(121.28.95.48,1597)

(113.109.183.126,1596)

(182.48.112.2,870)

(120.84.24.200,773)

(61.144.125.162,750)

(27.115.124.75,470)

(115.236.48.226,439)

(59.41.62.100,339)

(89.126.54.40,305)

(114.247.10.132,243)

(125.46.45.78,236)

(220.181.94.221,205)

(218.19.42.168,181)

(118.112.183.164,179)

(116.235.194.89,171)


(6) Write sort_count_ip into HDFS, i.e. persist it to a file on disk

grunt> STORE sort_count_ip INTO 'pig/sort_count_ip';

Counters:

Total records written : 476

Total bytes written : 8051

Spillable Memory Manager spill count : 0

Total bags proactively spilled: 0

Total records proactively spilled: 0


Job DAG:

job_201210121146_0007 ->    job_201210121146_0008,

job_201210121146_0008 ->    job_201210121146_0009,

job_201210121146_0009


2012-11-03 21:28:41,520 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!

Seeing Success! means the result was saved.
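As an aside, the stored result can be loaded back into a relation later. A sketch (the column names ip and hits are made up here; STORE used PigStorage's default tab delimiter):

grunt> result = LOAD 'pig/sort_count_ip' USING PigStorage('\t') AS (ip:chararray, hits:long);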


(7) View the result file saved in HDFS

grunt> ls

hdfs://h1:9000/user/grid/in    <dir>

hdfs://h1:9000/user/grid/out1       <dir>

hdfs://h1:9000/user/grid/out2       <dir>

hdfs://h1:9000/user/grid/pig  <dir>

grunt> cd pig

grunt> ls

hdfs://h1:9000/user/grid/pig/access_log.txt<r 2>    7118627

hdfs://h1:9000/user/grid/pig/sort_count_ip      <dir>

grunt> cat sort_count_ip

218.20.24.203     4597

221.194.180.166 4576

119.146.220.12   1850

117.136.31.144   1647

121.28.95.48       1597

113.109.183.126 1596

182.48.112.2       870

120.84.24.200     773

61.144.125.162   750

27.115.124.75     470

115.236.48.226   439

59.41.62.100       339

89.126.54.40       305

114.247.10.132   243

125.46.45.78       236

220.181.94.221   205

218.19.42.168     181

118.112.183.164 179

116.235.194.89   171

With that, the task is complete.


