Linux ISO: CentOS-6.0-i386-bin-DVD.iso (32-bit)
JDK version: "1.6.0_25-ea"
Hadoop version: hadoop-0.20.2.tar.gz
HBase version: hbase-0.90.5
Pig version: pig-0.9.2.tar.gz, downloaded from the Apache mirror at Beijing Jiaotong University: http://mirror.bjtu.edu.cn/apache/pig/pig-0.9.2/pig-0.9.2.tar.gz. This is not the latest release, but it matches Hadoop 0.20.2. Pig releases must be matched to your Hadoop version, so check which Hadoop version you are running and look up the corresponding Pig release. The mirror above carries the full Pig series (e.g. pig-0.10.0.tar.gz), so I won't list them all here.
2. Pig Installation Modes
Local mode: effectively single-machine mode. Pig can only access the local host, nothing is distributed, and Hadoop does not even need to be installed; all command execution and file I/O happen locally. This mode is mainly used for experimenting.
Local mode needs only one environment variable:
export PATH=/usr/java/jdk1.6.0_25/bin:/home/grid/hadoop-0.20.2/bin:/home/grid/pig-0.9.2/bin:$PATH
MapReduce mode: this is the mode used in real deployments. Files are uploaded into HDFS, and when a Pig Latin job runs, the work is distributed across the Hadoop cluster, which is exactly the MapReduce model: the Pig client connects to the Hadoop cluster to manage and analyze data.
MapReduce mode requires configuring PATH and PIG_CLASSPATH, checking the hosts file, and then starting pig.
This article covers the MapReduce-mode installation, since it is the most common and the most useful in practice.
Pig is a Hadoop client, so the Pig package can be installed on any node of the cluster, and jobs can be submitted from any node. I install it on the master node here to make the deployment layout easier to follow.
3. Verify the Hadoop Cluster Status
Verify from the shell command line:
[grid@h1 hadoop-0.20.2]$ bin/hadoop dfsadmin -report
Configured Capacity: 19865944064 (18.5 GB)
Present Capacity: 8833888256 (8.23 GB)
DFS Remaining: 8833495040 (8.23 GB)
DFS Used: 393216 (384 KB)
DFS Used%: 0%
Under replicated blocks: 4
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 2 (2 total, 0 dead) -- both datanodes alive, none shut down
Name: 192.168.2.103:50010 -- slave h2
Decommission Status : Normal
Configured Capacity: 9932972032 (9.25 GB)
DFS Used: 196608 (192 KB)
Non DFS Used: 5401513984 (5.03 GB)
DFS Remaining: 4531261440(4.22 GB)
DFS Used%: 0%
DFS Remaining%: 45.62%
Last contact: Fri Nov 02 18:58:02 CST 2012
Name: 192.168.2.105:50010 -- slave h4
Decommission Status : Normal
Configured Capacity: 9932972032 (9.25 GB)
DFS Used: 196608 (192 KB)
Non DFS Used: 5630541824 (5.24 GB)
DFS Remaining: 4302233600(4.01 GB)
DFS Used%: 0%
DFS Remaining%: 43.31%
Last contact: Fri Nov 02 18:58:02 CST 2012
[grid@h1 hadoop-0.20.2]$ jps master -> Hadoop and HBase are both running
22926 HQuorumPeer
4709 JobTracker
22977 HMaster
4515 NameNode
4650 SecondaryNameNode
31681 Jps
[grid@h2 tmp]$ jps slave1 -> Hadoop and HBase are both running
17188 TaskTracker
22181 Jps
13800 HRegionServer
13727 HQuorumPeer
17077 DataNode
[grid@h4 logs]$ jps slave2 -> Hadoop and HBase are both running
27829 TaskTracker
19978 Jps
26875 Jps
17119 DataNode
11636 HRegionServer
11557 HQuorumPeer
4. Pig Installation and Configuration
(1) Upload pig-0.9.2.tar.gz to h1:/home/grid/ and untar it
[grid@h1 grid]$ pwd
/home/grid
[grid@h1 grid]$ ll
total 46832
-rwxrwxrwx. 1 grid hadoop 44 Sep 18 19:10 abc.txt
-rwxrwxrwx. 1 grid hadoop 5519 Oct 12 22:09 Exercise_1.jar
drwxr-xr-x. 14 grid hadoop 4096 Sep 18 07:05 hadoop-0.20.2
drwxr-xr-x. 10 grid hadoop 4096 Oct 28 21:13 hbase-0.90.5
-rwxrw-rw-. 1 grid hadoop 47875717 Nov 2 06:44 pig-0.9.2.tar.gz
[grid@h1 grid]$ tar -zxvf pig-0.9.2.tar.gz
[grid@h1 grid]$ ll
total 46836
-rwxrwxrwx. 1 grid hadoop 44 Sep 18 19:10 abc.txt
-rwxrwxrwx. 1 grid hadoop 5519 Oct 12 22:09 Exercise_1.jar
drwxr-xr-x. 14 grid hadoop 4096 Sep 18 07:05 hadoop-0.20.2
drwxr-xr-x. 10 grid hadoop 4096 Oct 28 21:13 hbase-0.90.5
drwxr-xr-x. 2 grid hadoop 4096 Sep 16 19:57 input
drwxr-xr-x. 15 grid hadoop 4096 Jan 18 2012 pig-0.9.2
-rwxrw-rw-. 1 grid hadoop 47875717 Nov 2 06:44 pig-0.9.2.tar.gz
(2) Configure Pig's environment variables; the entries below are the ones that need to be modified
[grid@h1 grid]$ vim .bashrc
export JAVA_HOME=/usr --point at the parent directory, not the Java directory itself, or it won't take effect
export JRE_HOME=/usr/java/jdk1.6.0_25/jre
export PATH=/usr/java/jdk1.6.0_25/bin:/home/grid/hadoop-0.20.2/bin:/home/grid/pig-0.9.2/bin:$PATH
--add the Hadoop and Pig bin directories, which tells the shell where to look for their commands and programs
export CLASSPATH=./:/usr/java/jdk1.6.0_25/lib:/usr/java/jdk1.6.0_25/jre/lib
export PIG_CLASSPATH=/home/grid/hadoop-0.20.2/conf --since this is MapReduce mode, Pig has to find the Hadoop cluster. This tells Pig where the Hadoop configuration files live: from core-site.xml, hdfs-site.xml, and mapred-site.xml it reads the addresses and ports of the key components, the NameNode and the JobTracker, and with that information it can drive the whole cluster.
Alternative: edit /home/grid/pig-0.9.2/conf/pig.properties, which also enables MapReduce mode.
Add:
fs.default.name=hdfs://h1:9000 --locates the NameNode
mapred.job.tracker=h1:9001 --locates the JobTracker
(3) Apply the environment variables
[grid@h1 grid]$ source .bashrc --reload so the variables take effect
(4) Check the hosts file
[grid@h1 grid]$ cat /etc/hosts
192.168.2.102 h1 # Added by NetworkManager
127.0.0.1 localhost.localdomain localhost
::1 h1 localhost6.localdomain6 localhost6
192.168.2.102 h1
192.168.2.103 h2
192.168.2.105 h4
This file maps host names to IP addresses. A Hadoop cluster normally communicates by host name, and host names are also what appear in the configuration files.
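To illustrate the name-to-address mapping the cluster relies on, here is a minimal Python sketch that parses hosts-style lines. The `parse_hosts` helper is hypothetical and for illustration only; Hadoop itself uses the OS resolver rather than reading this file directly.

```python
# Hypothetical sketch: how hosts-file lines map names to IPs.
# The sample entries mirror the cluster above; nothing here touches
# the real /etc/hosts.
def parse_hosts(text):
    """Return a dict mapping each host name/alias to its IP."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping[name] = ip
    return mapping

hosts = parse_hosts("""\
192.168.2.102 h1
192.168.2.103 h2
192.168.2.105 h4
""")
print(hosts["h2"])  # -> 192.168.2.103
```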
(5) Start Pig
[grid@h1 grid]$ pig -x mapreduce --start the grunt shell explicitly in MapReduce mode
2012-11-02 20:09:22,149 [main] INFO org.apache.pig.Main - Logging error messages to: /home/grid/pig_1351858162147.log
2012-11-02 20:09:23,314 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://h1:9000 --pig found the NameNode
2012-11-02 20:09:27,950 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: h1:9001 --pig found the JobTracker
grunt> quit --exit the pig client
[grid@h1 grid]$
[grid@h1 grid]$ pig --the plain pig command also enters the shell
2012-11-02 20:16:17,968 [main] INFO org.apache.pig.Main - Logging error messages to: /home/grid/pig_1351858577966.log
2012-11-02 20:16:18,100 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://h1:9000
2012-11-02 20:16:18,338 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: h1:9001
grunt> help --list the available commands
Commands:
<pig latin statement>; - See the PigLatin manual for details: http://hadoop.apache.org/pig
File system commands:
fs <fs arguments> - Equivalent to Hadoop dfs command: http://hadoop.apache.org/common/docs/current/hdfs_shell.html
Diagnostic commands:
describe <alias>[::<alias>] - Show the schema for the alias. Inner aliases can be described as A::B.
explain [-script <pigscript>] [-out <path>] [-brief] [-dot] [-param <param_name>=<param_value>]
[-param_file <file_name>] [<alias>] - Show the execution plan to compute the alias or for the entire script.
-script - Explain the entire script.
-out - Store the output into a directory rather than print to stdout.
-brief - Don't expand nested plans (presenting a smaller graph for overview).
-dot - Generate the output in .dot format. Default is text format.
-param <param_name>=<param_value> - See parameter substitution for details.
-param_file <file_name> - See parameter substitution for details.
alias - Alias to explain.
dump <alias> - Compute the alias and writes the results to stdout.
Utility Commands:
exec [-param <param_name>=<param_value>] [-param_file <file_name>] <script> -
Execute the script with access to the grunt environment, including aliases.
-param <param_name>=<param_value> - See parameter substitution for details.
-param_file <file_name> - See parameter substitution for details.
script - Script to be executed.
run [-param <param_name>=<param_value>] [-param_file <file_name>] <script> -
Execute the script with access to the grunt environment.
-param <param_name>=<param_value> - See parameter substitution for details.
-param_file <file_name> - See parameter substitution for details.
script - Script to be executed.
kill <job_id> - Kill the hadoop job specified by the hadoop job id.
set <key> <value> - Provide execution parameters to Pig. Keys and values are case sensitive.
The following keys are supported:
default_parallel - Script-level reduce parallelism. Basic input size heuristics used by default.
debug - Set debug on or off. Default is off.
job.name - Single-quoted name for jobs. Default is PigLatin:<script name>
job.priority - Priority for jobs. Values: very_low, low, normal, high, very_high. Default is normal
stream.skippath - String that contains the path. This is used by streaming.
any hadoop property.
help - Display this message.
quit - Quit the grunt shell.
(6) Pig Shell Commands
Auto-completion (case-sensitive): just like command completion in Linux, type part of a command and press Tab to complete it. Note that it does not complete file names.
grunt> ls --list the contents of the current HDFS directory
hdfs://h1:9000/user/grid/in <dir> --<dir> marks a directory; <r 3> marks a file (r is the replication factor)
hdfs://h1:9000/user/grid/out1 <dir>
hdfs://h1:9000/user/grid/out2 <dir>
grunt> cd in --enter the in subdirectory
grunt> ls
hdfs://h1:9000/user/grid/in/test_1<r 3> 324 --324 bytes
hdfs://h1:9000/user/grid/in/test_2<r 3> 134 --134 bytes
grunt> cat test_1 --display the contents of test_1
Apr 23 11:49:54 hostapd: wlan0: STA 14:7d:c5:9e:fb:84
Apr 23 11:49:52 hostapd: wlan0: STA 74:e5:0b:04:28:f2
Apr 23 11:49:50 hostapd: wlan0: STA cc:af:78:cc:d5:5d
Apr 23 11:49:44 hostapd: wlan0: STA cc:af:78:cc:d5:5d
Apr 23 11:49:43 hostapd: wlan0: STA 74:e5:0b:04:28:f2
Apr 23 11:49:42 hostapd: wlan0: STA 14:7d:c5:9e:fb:84
grunt> cat test_2 --display the contents of test_2
13599999999 10086
13899999999 120
13944444444 13800138000
13722222222 13800138000
18800000000 120
13722222222 10086
18944444444 10086
In grunt> every path is displayed as an absolute path; there is no relative-path display.
grunt> has the notion of a current working directory, which it remembers and manages.
grunt> operates on HDFS directly, so there is no need for the wordier HDFS shell commands.
copyFromLocal: copy a file from the local OS into HDFS
grunt> copyFromLocal /home/grid/access_log.txt pig/access_log.txt
grunt> ls
hdfs://h1:9000/user/grid/in <dir>
hdfs://h1:9000/user/grid/out1 <dir>
hdfs://h1:9000/user/grid/out2 <dir>
hdfs://h1:9000/user/grid/pig <dir>
grunt> cd pig
grunt> ls
hdfs://h1:9000/user/grid/pig/access_log.txt<r 2> 7118627 --byte count matches the source file
copyToLocal: copy a file from HDFS back into the local OS
grunt> copyToLocal test_1 ttt
grunt> ls
hdfs://h1:9000/user/grid/in/test_1<r 3> 324
hdfs://h1:9000/user/grid/in/test_2<r 3> 134
[grid@h1 grid]$ cat ttt --a perfect copy
Apr 23 11:49:54 hostapd: wlan0: STA 14:7d:c5:9e:fb:84
Apr 23 11:49:52 hostapd: wlan0: STA 74:e5:0b:04:28:f2
Apr 23 11:49:50 hostapd: wlan0: STA cc:af:78:cc:d5:5d
Apr 23 11:49:44 hostapd: wlan0: STA cc:af:78:cc:d5:5d
Apr 23 11:49:43 hostapd: wlan0: STA 74:e5:0b:04:28:f2
Apr 23 11:49:42 hostapd: wlan0: STA 14:7d:c5:9e:fb:84
sh: run operating-system commands directly inside grunt>
grunt> sh pwd
/home/grid
grunt> sh cat ttt
Apr 23 11:49:54 hostapd: wlan0: STA 14:7d:c5:9e:fb:84
Apr 23 11:49:52 hostapd: wlan0: STA 74:e5:0b:04:28:f2
Apr 23 11:49:50 hostapd: wlan0: STA cc:af:78:cc:d5:5d
Apr 23 11:49:44 hostapd: wlan0: STA cc:af:78:cc:d5:5d
Apr 23 11:49:43 hostapd: wlan0: STA 74:e5:0b:04:28:f2
Apr 23 11:49:42 hostapd: wlan0: STA 14:7d:c5:9e:fb:84
5. A Pig Case Study
Task: use Pig Latin to process the access_log.txt log and compute the click count for each IP address.
First, a look at the command list; these are the grunt commands we use most often:
<EOF>
"cat" ...
"fs" ...
"sh" ...
"cd" ...
"cp" ...
"copyFromLocal" ...
"copyToLocal" ...
"dump" ...
"describe" ...
"aliases" ...
"explain" ...
"help" ...
"kill" ...
"ls" ...
"mv" ...
"mkdir" ...
"pwd" ...
"quit" ...
"register" ...
"rm" ...
"rmf" ...
"set" ...
"illustrate" ...
"run" ...
"exec" ...
"scriptDone" ...
"" ...
<EOL> ...
";" ...
grunt> pwd
hdfs://h1:9000/user/grid/pig
grunt> ls
hdfs://h1:9000/user/grid/pig/access_log.txt<r 2> 7118627 --this is the file to process
grunt> cat access_log.txt --inspect the file contents before analyzing the data
119.146.220.12 - - [31/Jan/2012:23:59:51 +0800] "GET /static/js/jquery-1.6.js HTTP/1.1" 404 299 "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:52 +0800] "GET /static/js/floating-jf.js HTTP/1.1" 404 300 "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:55 +0800] "GET /popwin_js.php?fid=53 HTTP/1.1" 404 289 "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:55 +0800] "GET /static/js/smilies.js?AZH HTTP/1.1" 304 - "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:55 +0800] "GET /data/cache/common_smilies_var.js?AZH HTTP/1.1" 304 - "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
Algorithm:
This is a slice of the dataguru access log. From its structure, the IP address is the first field on each line. We only need to extract the IP into an ip_text relation, group that relation by the ip column, which effectively splits it into one small table per IP, and then count the rows in each small table to get that IP's click count.
(1) Load access_log.txt from HDFS into a Pig relation (table), using a space as the field delimiter and keeping only the ip column.
grunt> ip_text = LOAD 'pig/access_log.txt' USING PigStorage(' ') AS (ip:chararray);
ip_text: a relation (think of it as a table, or a variable) holding every IP record
LOAD 'pig/access_log.txt': the file to load
USING PigStorage(' '): use a space as the field delimiter
ip:chararray: the first column is named ip, with type chararray (a string)
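As a rough illustration of what this LOAD does, here is a hypothetical Python sketch (not how Pig is implemented): PigStorage(' ') splits each line on single spaces, and because the schema declares only one field, everything after the first token is dropped.

```python
# Hypothetical sketch of LOAD ... USING PigStorage(' ') AS (ip:chararray):
# split each line on single spaces and keep only the first field.
sample_lines = [
    '119.146.220.12 - - [31/Jan/2012:23:59:51 +0800] "GET /static/js/jquery-1.6.js HTTP/1.1" 404 299',
    '180.153.227.41 - - [31/Jan/2012:23:59:52 +0800] "GET /popwin_js.php?fid=53 HTTP/1.1" 404 289',
]

# ip_text ends up as a bag of single-field tuples, one per input line
ip_text = [(line.split(' ')[0],) for line in sample_lines]
print(ip_text)  # -> [('119.146.220.12',), ('180.153.227.41',)]
```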
(2) Examine the structure and contents of ip_text
Be careful with details such as the statement terminator. When a Pig Latin statement runs, Pig transparently turns it into a MapReduce job: it first builds a jar, then submits the MR job, a Hadoop job id is generated, the job executes, and finally the result is printed.
grunt> DESCRIBE ip_text; --show the schema: a single column of type chararray
ip_text: {ip: chararray}
grunt> DUMP ip_text; --show the contents (only an excerpt follows)
creating jar file Job2594979755419279957.jar
1 map-reduce job(s) waiting for submission
HadoopJobId: job_201210121146_0002
(119.146.220.12)
(180.153.227.41)
(180.153.227.44)
(180.153.227.44)
(180.153.227.44)
(221.194.180.166)
(119.146.220.12)
(119.146.220.12)
(119.146.220.12)
(119.146.220.12)
(119.146.220.12)
(119.146.220.12)
(119.146.220.12)
(119.146.220.12)
(119.146.220.12)
(119.146.220.12)
(119.146.220.12)
(119.146.220.12)
(119.146.220.12)
(220.181.94.221)
(119.146.220.12)
(119.146.220.12)
(119.146.220.12)
(119.146.220.12)
(119.146.220.12)
(119.146.220.12)
(3) Group on the ip column, then examine the grouped relation's contents and structure; note that keywords are case-sensitive.
This splits the rows into one small table per IP and stores the grouped result in the group_ip relation.
grunt> group_ip = GROUP ip_text BY ip; --group by ip and assign the result to group_ip
grunt> DESCRIBE group_ip; --show the schema of group_ip
group_ip: {group: chararray,ip_text: {(ip: chararray)}}
At a glance, group_ip is a nested table: the first field, group, is the grouped ip value;
the second field is a nested inner table, also called a bag, holding the whole collection of rows for that ip.
grunt> DUMP group_ip; --this submits another MR job
Pig script settings are added to the job --the Pig script is converted into an MR job automatically
creating jar file Job2785495206577164389.jar --build the jar
jar file Job2785495206577164389.jar created --jar built
map-reduce job(s) waiting for submission. --submit the job
HadoopJobId: job_201210121146_0003 --job id: job_201210121146_0003
(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),(221.194.180.166),
(4) Count the rows of each small table, i.e. each IP's click count
grunt> count_ip = FOREACH group_ip GENERATE group,COUNT($1) AS count_ip;
FOREACH group_ip: scan group_ip row by row and assign the result to count_ip
GENERATE group: emit the grouped ip value
COUNT($1) AS count_ip: count the rows of the nested inner table (the bag), i.e. the IP's click count; the column is aliased count_ip so we can sort on it. $1 refers to the second positional field (fields are numbered from $0), here the bag, and is equivalent to COUNT(ip_text.ip).
grunt> sort_count_ip = ORDER count_ip BY count_ip DESC; --sort by count_ip, largest first
# grunt> sort_count_ip = ORDER count_ip BY count_ip ASC; --or smallest first
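The GROUP, COUNT, and ORDER steps above can be sketched locally in Python (a hypothetical illustration over a few sample IPs; Pig executes the same logic as MapReduce jobs over HDFS):

```python
# Hypothetical sketch of GROUP ... BY ip, COUNT($1), and ORDER ... DESC.
from collections import defaultdict

ip_text = ['119.146.220.12', '221.194.180.166', '119.146.220.12',
           '119.146.220.12', '221.194.180.166']

# GROUP ip_text BY ip: one bag of rows per distinct ip
group_ip = defaultdict(list)
for ip in ip_text:
    group_ip[ip].append((ip,))

# FOREACH group_ip GENERATE group, COUNT($1): (ip, bag size) pairs
count_ip = [(ip, len(bag)) for ip, bag in group_ip.items()]

# ORDER count_ip BY count_ip DESC: largest click count first
sort_count_ip = sorted(count_ip, key=lambda pair: pair[1], reverse=True)
print(sort_count_ip)  # -> [('119.146.220.12', 3), ('221.194.180.166', 2)]
```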
(5) Examine the structure and contents of sort_count_ip
grunt> DESCRIBE sort_count_ip; --show the schema: two columns
sort_count_ip: {group: chararray,count_ip: long} --the first field, group, is the grouped ip value (chararray); the second, count_ip, is the click count (long)
grunt> DUMP sort_count_ip; --show the contents (excerpt); the job statistics are printed first, then the result
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
0.20.2 0.9.2 grid 2012-11-03 21:13:05 2012-11-03 21:18:39 GROUP_BY,ORDER_BY
Success!
Input(s):
Successfully read 28134 records (7118627 bytes) from: "hdfs://h1:9000/user/grid/pig/access_log.txt"
Output(s):
Successfully stored 476 records (14515 bytes) in: "hdfs://h1:9000/tmp/temp1703385752/tmp-1916755802"
Counters:
Total records written : 476
Total bytes written : 14515
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_201210121146_0004 -> job_201210121146_0005,
job_201210121146_0005 -> job_201210121146_0006,
job_201210121146_0006
(218.20.24.203,4597)
(221.194.180.166,4576)
(119.146.220.12,1850)
(117.136.31.144,1647)
(121.28.95.48,1597)
(113.109.183.126,1596)
(182.48.112.2,870)
(120.84.24.200,773)
(61.144.125.162,750)
(27.115.124.75,470)
(115.236.48.226,439)
(59.41.62.100,339)
(89.126.54.40,305)
(114.247.10.132,243)
(125.46.45.78,236)
(220.181.94.221,205)
(218.19.42.168,181)
(118.112.183.164,179)
(116.235.194.89,171)
(6) Write the contents of sort_count_ip into HDFS, i.e. persist the result to a file
grunt> STORE sort_count_ip INTO 'pig/sort_count_ip';
Counters:
Total records written : 476
Total bytes written : 8051
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_201210121146_0007 -> job_201210121146_0008,
job_201210121146_0008 -> job_201210121146_0009,
job_201210121146_0009
2012-11-03 21:28:41,520 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
Seeing Success! means the result was saved.
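The file layout STORE produces can be sketched as follows: by default PigStorage writes one tuple per line with tab-separated fields, which matches the ip-and-count layout we see when we cat the result. This is a hypothetical local sketch using a temp directory; Pig itself writes its part files into the HDFS output directory.

```python
# Hypothetical sketch of STORE sort_count_ip INTO 'pig/sort_count_ip':
# one tuple per line, fields separated by tabs (PigStorage's default).
import os
import tempfile

sort_count_ip = [('218.20.24.203', 4597), ('221.194.180.166', 4576)]

out_dir = tempfile.mkdtemp()                 # stands in for the HDFS directory
part = os.path.join(out_dir, 'part-00000')   # Pig names its output files part-*
with open(part, 'w') as f:
    for ip, count in sort_count_ip:
        f.write('%s\t%d\n' % (ip, count))

print(open(part).read())
```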
(7) View the result file saved in HDFS
grunt> ls
hdfs://h1:9000/user/grid/in <dir>
hdfs://h1:9000/user/grid/out1 <dir>
hdfs://h1:9000/user/grid/out2 <dir>
hdfs://h1:9000/user/grid/pig <dir>
grunt> cd pig
grunt> ls
hdfs://h1:9000/user/grid/pig/access_log.txt<r 2> 7118627
hdfs://h1:9000/user/grid/pig/sort_count_ip <dir>
grunt> cat sort_count_ip
218.20.24.203 4597
221.194.180.166 4576
119.146.220.12 1850
117.136.31.144 1647
121.28.95.48 1597
113.109.183.126 1596
182.48.112.2 870
120.84.24.200 773
61.144.125.162 750
27.115.124.75 470
115.236.48.226 439
59.41.62.100 339
89.126.54.40 305
114.247.10.132 243
125.46.45.78 236
220.181.94.221 205
218.19.42.168 181
118.112.183.164 179
116.235.194.89 171
With that, the task is complete.