Hadoop Streaming: Notes and Takeaways

Hadoop Streaming

Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.

For example, the following job uses /bin/cat as the mapper and /usr/bin/wc as the reducer:
hadoop jar hadoop-streaming-2.6.5.jar -input /user/root/input1/int.txt -output /user/root/streaming_out -mapper /bin/cat -reducer /usr/bin/wc

How Streaming Works

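In the example above, both the mapper and the reducer are executables that read their input line by line from standard input (stdin) and write their output to standard output (stdout). This is where Streaming differs completely from Java MapReduce: the mapper and reducer both read plain lines from stdin, rather than receiving an explicit key/value as the map method does, or an explicit <key, list<value>> as the reduce method does.
The utility creates a MapReduce job, submits it to the cluster, and monitors its status and progress until the job completes. You can also pass -background so that the command returns immediately after submitting the job, without monitoring its progress.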

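When an executable is specified as the mapper, each mapper task launches the executable as a separate process when the mapper is initialized. As the mapper task runs, it splits its input into lines and feeds each line to the stdin of that process.
Meanwhile, the mapper collects the lines written to the process's stdout and converts each line into a key/value pair, which becomes the mapper's output. By default, the part of a line up to the first tab is the key and the rest of the line (excluding the tab) is the value. If the line contains no tab, the entire line is the key and the value is null. The input format can also be customized with -inputformat; see the options table below.
The reducer works in the same way, and its output format can be customized with -outputformat; see the options table below.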
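
To make the default splitting rule concrete, here is a minimal Python sketch of it. This is only an illustration of the rule described above, not Hadoop's actual implementation, and the function name split_key_value is made up for this example.

```python
def split_key_value(line, separator="\t"):
    """Split one line of mapper output into (key, value) at the first separator.

    Illustrates the default rule: text before the first tab is the key, the
    rest (without that tab) is the value. With no tab, the whole line is the
    key and the value is None (standing in for null).
    """
    line = line.rstrip("\n")
    key, found, value = line.partition(separator)
    if not found:
        # No separator in the line: the whole line is the key.
        return line, None
    return key, value


# Only the FIRST tab splits the line; later tabs stay inside the value.
for raw in ["hello\t1\n", "a\tb\tc\n", "no-tab-here\n"]:
    print(split_key_value(raw))
```

Note that a line with several tabs is still split only once, so everything after the first tab ends up in the value.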

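This is the basis of the communication protocol between the Map/Reduce framework and the streaming mapper/reducer.
You can set stream.non.zero.exit.is.failure to define whether a streaming task that exits with a non-zero status counts as a success or a failure. The default is true, meaning that a non-zero exit status marks the streaming task as failed.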

Streaming Command Options

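Streaming supports both streaming command options and generic command options. The general command-line syntax is:
hadoop command [genericOptions] [streamingOptions]
Note that genericOptions must be placed before streamingOptions; otherwise the command will fail.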
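
If you launch jobs from a script, one way to avoid getting the order wrong is to build the argument list in a single place. The sketch below is just one possible approach; build_streaming_command is a hypothetical helper, and the options and paths are the ones already used in this post.

```python
def build_streaming_command(jar, generic_options, streaming_options):
    """Return an argv list with generic options before streaming options."""
    return ["hadoop", "jar", jar] + list(generic_options) + list(streaming_options)


cmd = build_streaming_command(
    "hadoop-streaming-2.6.5.jar",
    # Generic options (-D, -files, ...) always go first.
    ["-D", "mapreduce.job.reduces=0", "-files", "map.py"],
    # Streaming options (-input, -output, -mapper, ...) follow.
    ["-input", "/user/root/steam/input",
     "-output", "/user/root/steam/output",
     "-mapper", "map.py"],
)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd) on a machine where the hadoop command is available.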

Streaming supports the following streamingOptions. You can also run hadoop jar hadoop-streaming-2.6.5.jar -info for more detailed information.

Parameter	Optional/Required	Description
-input directoryname or filename	Required	Input location for the job; `-input` may be given multiple times
-output directoryname	Required	Output location for the job
-mapper executable or JavaClassName	Required	Mapper executable
-reducer executable or JavaClassName	Required	Reducer executable
-file filename	Optional	Makes the mapper, reducer, or combiner executable available on the compute nodes, i.e. ships the executable to them. Deprecated; use -files from `genericOptions` instead
-inputformat JavaClassName	Optional	Custom InputFormat; it must return <k,v> pairs of `Text` type. `TextInputFormat` is the default
-outputformat JavaClassName	Optional	Custom OutputFormat; it must accept key/value pairs of `Text` type. `TextOutputFormat` is the default
-partitioner JavaClassName	Optional	Same role as the Partitioner component: decides which reducer a key is sent to
-combiner streamingCommand or JavaClassName	Optional	Combiner executable for the map output. A combiner reduces network traffic and so speeds up the shuffle, but it must not change the final result; a median, for instance, cannot be computed with a combiner. With data split as `1 2 3 \| 3 4`, the per-split medians are 2 and 3.5, from which the true median 3 cannot be recovered
-cmdenv name=value	Optional	Pass an environment variable to the streaming commands
-inputreader	Optional	For backwards-compatibility: specifies a record reader class (instead of an input format class)
-verbose	Optional	Verbose output
-lazyOutput	Optional	Create output lazily: the output file is created only on the first call to `context.write`
-numReduceTasks	Optional	Number of reduce tasks
-mapdebug	Optional	Script to call when a map task fails (for debugging; see the Debugging section of the MapReduce Tutorial)
-reducedebug	Optional	Script to call when a reduce task fails; same as above
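
The restriction on -combiner is easier to see with numbers. The following sketch uses the sample data from the -combiner row (1 2 3 | 3 4, read here as two map-side splits, which is my assumption about what the example means) and compares a sum, which is safe to combine, with a median, which is not.

```python
from statistics import median

# Two map-side splits, taken from the -combiner row's sample data.
splits = [[1, 2, 3], [3, 4]]
all_values = [v for split in splits for v in split]

# Sum: combining partial sums gives the same answer as summing everything.
sum_with_combiner = sum(sum(split) for split in splits)
sum_without_combiner = sum(all_values)

# Median: the median of per-split medians is NOT the overall median.
median_with_combiner = median(median(split) for split in splits)
median_without_combiner = median(all_values)

print(sum_with_combiner, sum_without_combiner)        # 13 13
print(median_with_combiner, median_without_combiner)  # 2.75 3
```

Sums agree either way, which is why word count can reuse its reducer as a combiner; the median does not, so a combiner would silently change the result.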

Generic Command Options

Parameter	Optional/Required	Description
-conf configuration_file	Optional	Specify an application configuration file
-D property=value	Optional	Set a property, e.g. `-D mapreduce.job.reduces=0`
-fs host:port or local	Optional	Specify the namenode
-jt host:port or local	Optional	Specify the resourcemanager
-files	Optional	Comma-separated files to be copied to the Map/Reduce cluster
-libjars	Optional	Comma-separated jar files to add to the classpath
-archives	Optional	Comma-separated archives to be unpacked on the compute nodes

Specifying Map-Only Jobs

If a job needs only the map phase and no reduce phase, set -D mapreduce.job.reduces=0. For backward compatibility, -reducer NONE is also supported and has the same effect.
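
A typical use of a map-only job is filtering, where the mapper's output is already the final result. Below is a minimal sketch of such a mapper; the function name filter_lines and the keyword "ERROR" are made up for illustration.

```python
def filter_lines(lines, keyword):
    """Yield only the lines containing keyword, like a map-only 'grep'."""
    for line in lines:
        if keyword in line:
            yield line.rstrip("\n")


# Demo on in-memory data. As a real streaming mapper, the script would
# iterate over sys.stdin instead and print each surviving line.
sample = ["INFO start\n", "ERROR disk full\n", "INFO done\n"]
for kept in filter_lines(sample, "ERROR"):
    print(kept)
```

With mapreduce.job.reduces=0 there is no shuffle or sort, so each mapper's output is written directly to the output directory.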

Word Count Example

Shell version

The map.sh script is as follows:

#!/bin/bash

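# Read the input line by line from standard input.
while read -r line;
do
    # Split the line into words.
    for word in $line
    do
        # -e enables interpretation of escape sequences such as \t.
        echo -e "$word\t1"
    done
done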

The reduce.sh script is as follows:

#!/bin/bash

count=0
started=0
word=""
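while read -r line;
do
    # Each line has the form word\t1.
    # "$line" must be double-quoted; otherwise the TAB emitted by the mapper is lost to word splitting.
    # cut splits on TAB by default: -f 1 is the word, -f 2 is its count.
    newword=$(printf '%s\n' "$line" | cut -f 1)
    newcount=$(printf '%s\n' "$line" | cut -f 2)
    if [ "$word" != "$newword" ]; then
        # A new word has started, so emit the total for the previous word.
        # cmd1 && cmd2 runs cmd2 only when cmd1 succeeds.
        [ $started -ne 0 ] && echo -e "$word\t$count"
        word=$newword
        count=$newcount
        started=1
    else
        count=$((count + newcount))
    fi
done
# Emit the total for the last word, unless the input was empty.
if [ $started -ne 0 ]; then
    echo -e "$word\t$count"
fi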

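The command to run the job is shown below. Because this is a word count, -combiner can be used to make the shuffle more efficient and reduce the amount of data transferred.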
hadoop jar hadoop-streaming-2.6.5.jar -files map.sh,reduce.sh -input /user/root/steam/input -output /user/root/steam/output -mapper map.sh -reducer reduce.sh -combiner reduce.sh

One advantage of Hadoop Streaming is that it is very easy to debug locally:
cat words.txt | ./map.sh | sort | ./reduce.sh

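Python version

The map.py script is as follows. Note that in the first line, #!/usr/bin/env python, there must be no space between # and !. I once added a space there so that it lined up nicely with the lines below; as a result, local debugging worked but the job failed when submitted to the MapReduce framework.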

#!/usr/bin/env python
# encoding: utf-8
# **********************************************
#
#      Filename: map.py
#
#        Author: WangTian
#   Description: ---
#        Create: 2020-02-18 19:16:19
# Last Modified: 2020-02-18 19:16:19
#
# **********************************************


import sys

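# Read the input line by line from standard input.
for line in sys.stdin:
    # Strip leading and trailing whitespace.
    line = line.strip()
    # Split the line into words on whitespace.
    words = line.split()
    for word in words:
        # Equivalent: print("{}\t{}".format(word, 1))
        print("%s\t%s" % (word, 1))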

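The reduce.py script is as follows. Pay particular attention to the commented-out approach in the script: I originally planned to keep a <word, count> dictionary and iterate over it once counting was finished. That not only wastes a great deal of memory, it also loses the sorted order that the MapReduce framework gives our data.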

#!/usr/bin/env python
# encoding: utf-8
# **********************************************
#
#      Filename: reduce.py
#
#        Author: WangTian
#   Description: Reducer for word count
#        Create: 2020-02-18 19:17:58
# Last Modified: 2020-02-18 19:17:58
#
# **********************************************

import sys

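# A <word, count> dictionary is not needed at all; it would only waste memory.
# word2count = {}
# Temporary variables: word holds the current key, count accumulates its total.
word = ""
count = 0
started = 0
# Read the input line by line from standard input.
for line in sys.stdin:
    # Strip leading and trailing whitespace.
    line = line.strip()
    # Skip blank lines, which would break the unpacking below.
    if not line:
        continue
    # Extract the word and its count.
    newword, newcount = line.split()
    if word != newword:
        if started == 1:
            # Emit the total for the previous word.
            print("{}\t{}".format(word, count))
        word = newword
        count = int(newcount)
        started = 1
    else:
        count = count + int(newcount)
    # word2count[word] = word2count.get(word, 0) + int(count)

# Emit the total for the last word, unless the input was empty.
if started == 1:
    print("{}\t{}".format(word, count))
# Do not emit results this way: it loses the ordering of the map output.
# for word, count in word2count.items():
#     print("{}\t{}".format(word, count))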
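
The same "compare with the previous key" idea can also be written with itertools.groupby from the standard library, which groups adjacent equal keys for you. This is an alternative sketch rather than the script used in this post; the names parse and reduce_counts are made up, and it relies on the framework delivering equal words on adjacent lines.

```python
from itertools import groupby


def parse(lines):
    """Turn 'word<TAB>count' lines into (word, int(count)) pairs."""
    for line in lines:
        word, found, count = line.rstrip("\n").partition("\t")
        if found:
            yield word, int(count)


def reduce_counts(lines):
    """Sum the counts per word, assuming equal words arrive adjacently."""
    for word, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


# Sorted input, as the reducer would receive it from the framework.
sorted_input = ["a\t1\n", "a\t1\n", "b\t1\n", "c\t2\n", "c\t1\n"]
for word, total in reduce_counts(sorted_input):
    print("%s\t%d" % (word, total))
```

Like the hand-written loop, this holds only one word's running total in memory at a time and emits the words in the order they arrive.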

The command to run the scripts is shown below; a combiner is added here as well.
hadoop jar hadoop-streaming-2.6.5.jar -files map.py,reduce.py -input /user/root/steam/input -output /user/root/steam/output -mapper map.py -reducer reduce.py -combiner reduce.py

The local debugging command is:
cat words.txt | python map.py | sort | python reduce.py

Customizing How Lines are Split into Key/Value Pairs

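As mentioned in How Streaming Works above, the Map/Reduce framework also reads the mapper's output line by line, and by default it splits each line into a <k,v> pair at the first tab: the part before the tab is the key and the part after it is the value.
However, you can specify a custom separator and choose the nth (n >= 1) occurrence of it as the split point. See the example below: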

hadoop jar hadoop-streaming-2.6.5.jar \
    -D stream.map.output.field.separator=. \
    -D stream.num.map.output.key.fields=4 \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /bin/cat

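-D stream.map.output.field.separator=. sets the separator, and -D stream.num.map.output.key.fields=4 makes the fourth separator the split point: everything before it is the key and everything after it is the value. If a line contains fewer than four separators, the whole line is the key and the value is empty (like new Text("")).
Similar parameters exist for the other stages, although you will rarely need them:
-D stream.reduce.output.field.separator=SEP and -D stream.num.reduce.output.fields=NUM control how the reduce output is split into key and value.
stream.map.input.field.separator and stream.reduce.input.field.separator set the separator used for the map and reduce inputs.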
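
The "split at the nth separator" rule can be sketched in a few lines of Python. This only illustrates the behaviour described above and is not Hadoop's code; split_at_nth and its parameter names are made up for the example.

```python
def split_at_nth(line, separator=".", num_key_fields=4):
    """Split line into (key, value) at the num_key_fields-th separator."""
    parts = line.split(separator)
    if len(parts) <= num_key_fields:
        # Fewer than num_key_fields separators: whole line is the key.
        return line, ""
    key = separator.join(parts[:num_key_fields])
    value = separator.join(parts[num_key_fields:])
    return key, value


print(split_at_nth("11.12.1.2.rest.of.line"))  # ('11.12.1.2', 'rest.of.line')
print(split_at_nth("11.12.1"))                 # ('11.12.1', '')
```

The separator that marks the split point is dropped, while any later separators stay inside the value.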

More Usage Examples

Hadoop Partitioner Class

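-D mapreduce.map.output.key.field.separator specifies the separator used to split the key for partitioning, and -D mapreduce.partition.keypartitioner.options=-k1,2 specifies that the first two fields of the split key are used for partitioning, which guarantees that records sharing those two fields are sent to the same reduce task.
You must also add -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner to select the Partitioner. For example: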

hadoop jar hadoop-streaming-2.6.5.jar \
    -D stream.map.output.field.separator=. \
    -D stream.num.map.output.key.fields=4 \
    -D mapreduce.map.output.key.field.separator=. \
    -D mapreduce.partition.keypartitioner.options=-k1,2 \
    -D mapreduce.job.reduces=12 \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /bin/cat \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

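This is effectively equivalent to making the first two fields the primary key and the last two fields the secondary key. The primary key is used for partitioning, and the combination of the primary and secondary keys is used for sorting.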

Suppose the map output keys are:

11.12.1.2
11.14.2.3
11.11.4.1
11.12.1.1
11.14.2.2

They are split across 3 reducers, with the first two fields used for partitioning:

11.11.4.1
-----------
11.12.1.2
11.12.1.1
-----------
11.14.2.3
11.14.2.2
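
The grouping above can be reproduced with a small Python sketch. It only shows which keys share a partition key under -k1,2. The real KeyFieldBasedPartitioner additionally hashes those fields to pick a reducer number, which is not modelled here. The helper name partition_key is made up.

```python
def partition_key(key, separator=".", start=1, end=2):
    """Return fields start..end (1-based, inclusive), as selected by -k1,2."""
    fields = key.split(separator)
    return separator.join(fields[start - 1:end])


keys = ["11.12.1.2", "11.14.2.3", "11.11.4.1", "11.12.1.1", "11.14.2.2"]

# Collect keys that share the same first two fields.
groups = {}
for key in keys:
    groups.setdefault(partition_key(key), []).append(key)

for prefix in sorted(groups):
    print(prefix, groups[prefix])
```

Keys with the same first two fields always land in the same group, which is the guarantee the partitioner gives for reduce tasks.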

Hadoop Comparator Class

Set -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator to specify the comparator class:

hadoop jar hadoop-streaming-2.6.5.jar \
    -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator \
    -D stream.map.output.field.separator=. \
    -D stream.num.map.output.key.fields=4 \
    -D mapreduce.map.output.key.field.separator=. \
    -D mapreduce.partition.keycomparator.options=-k2,2nr \
    -D mapreduce.job.reduces=1 \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /bin/cat

-D mapreduce.partition.keycomparator.options=-k2,2nr sorts by the second field numerically in reverse order: -n selects numerical sorting and -r reverses the order.

Suppose the map output keys are:

11.12.1.2
11.14.2.3
11.11.4.1
11.12.1.1
11.14.2.2

After sorting, the result is:

11.14.2.3
11.14.2.2
11.12.1.2
11.12.1.1
11.11.4.1
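
The effect of -k2,2nr on this sample can be imitated in Python as a quick sanity check. This is only a sketch of the ordering rule, not the comparator itself; second_field_numeric is a made-up helper, and the order among keys with an equal second field comes from Python's stable sort, not from any guarantee made by Hadoop.

```python
def second_field_numeric(key, separator="."):
    """Return the second field as an integer, the sort key for -k2,2n."""
    return int(key.split(separator)[1])


keys = ["11.12.1.2", "11.14.2.3", "11.11.4.1", "11.12.1.1", "11.14.2.2"]

# reverse=True plays the role of the r flag; ties keep their input order.
ordered = sorted(keys, key=second_field_numeric, reverse=True)
for key in ordered:
    print(key)
```

Only the second field takes part in the comparison, so the third and fourth fields do not influence the order.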

Hadoop Aggregate Package

See the Aggregate section of the official Hadoop Streaming documentation.

Hadoop Field Selection Class

See the Hadoop Field Selection Class section of the official Hadoop Streaming documentation.

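Summary

Hadoop Streaming is just a utility that lets you write relatively simple MapReduce jobs in other languages. For more complex jobs, writing MapReduce in Java is still the better choice.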
