Hadoop source code analysis: map output

The official documentation describes the Mapper output as follows:

The Mapper outputs are sorted and then partitioned per Reducer. The total number of partitions is the same as the number of reduce tasks for the job. Users can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner.

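So the mapper output is already sorted and partitioned per reducer. How does the Hadoop code do this partitioning? The analysis below follows the code to find out.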
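To make "partitioned per Reducer" concrete, here is a minimal stand-alone sketch of hash-style partitioning. It does not implement Hadoop's Partitioner interface; the class name PartitionSketch and the method partitionFor are hypothetical names used only for illustration, and the mask-and-modulo formula is shown as the general idea rather than as the code path traced in this post.

```java
// Hypothetical stand-alone sketch; this is NOT Hadoop's Partitioner interface.
class PartitionSketch {

    // Maps a key to a partition number in [0, numReduceTasks).
    static int partitionFor(String key, int numReduceTasks) {
        // Mask off the sign bit so a negative hashCode cannot yield a negative partition.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] words = {"c", "b", "a"};
        int numReduceTasks = 2; // assumed number of reduce tasks
        for (String w : words) {
            System.out.println(w + " -> partition " + partitionFor(w, numReduceTasks));
        }
    }
}
```

With a single reduce task every key lands in partition 0, which matches the simple case analyzed in this post.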

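As before, the analysis uses the official WordCount example.

To keep the map output simple for this first pass,

only one document is analyzed, and it contains just 10 "words": "J", ..., "c", "b", "a" (these 10 letters should ideally be out of order, since the sorting shows up later),

and the code that sets the combiner class is commented out.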

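1. Step through context.write in map (produces kvbuffer and kvmeta)

The call can be traced down to org.apache.hadoop.mapred.MapTask.MapOutputBuffer.collect(K, V, int), which does the actual work.

Because our output has only 10 records, each of them small, the spill and combine handling is skipped. The main code is as follows: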

public synchronized void collect(K key, V value, final int partition) throws IOException {
  ...
  keySerializer.serialize(key);
  ...
  valSerializer.serialize(value);
  ...
  kvmeta.put(kvindex + PARTITION, partition);
  kvmeta.put(kvindex + KEYSTART, keystart);
  kvmeta.put(kvindex + VALSTART, valstart);
  kvmeta.put(kvindex + VALLEN, distanceTo(valstart, valend));
  ...
}

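What happens here is that (K, V) is serialized into the byte array org.apache.hadoop.mapred.MapTask.MapOutputBuffer.kvbuffer,

and the in-memory position of (K, V), together with its partition (records with the same partition are handled by the same reducer), is stored in kvmeta.

At this point all of the map output is held in memory.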
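The split between a byte buffer and a separate metadata array can be sketched as follows. This is a simplified model under stated assumptions, not the real MapOutputBuffer: it uses a growable ByteArrayOutputStream and a List of ints instead of one shared circular buffer, the slot order of the four metadata fields is chosen arbitrarily here, and the class CollectSketch and the helper keyAt are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical simplified model of "serialize into kvbuffer, index in kvmeta".
class CollectSketch {
    // Metadata slots per record; the names follow the excerpt above, the order is assumed.
    static final int PARTITION = 0, KEYSTART = 1, VALSTART = 2, VALLEN = 3, NMETA = 4;

    final ByteArrayOutputStream kvbuffer = new ByteArrayOutputStream(); // serialized bytes
    final List<Integer> kvmeta = new ArrayList<>();                     // NMETA ints per record

    void collect(String key, int value, int partition) {
        int keystart = kvbuffer.size();
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        kvbuffer.write(k, 0, k.length);          // "serialize" the key
        int valstart = kvbuffer.size();
        byte[] v = Integer.toString(value).getBytes(StandardCharsets.UTF_8);
        kvbuffer.write(v, 0, v.length);          // "serialize" the value
        int valend = kvbuffer.size();
        // Record where the pair lives and which partition it belongs to.
        kvmeta.add(partition);
        kvmeta.add(keystart);
        kvmeta.add(valstart);
        kvmeta.add(valend - valstart);
    }

    // Reads a key back using only the metadata, without touching other records.
    String keyAt(int record) {
        int base = record * NMETA;
        int keystart = kvmeta.get(base + KEYSTART);
        int valstart = kvmeta.get(base + VALSTART);
        return new String(kvbuffer.toByteArray(), keystart, valstart - keystart,
                StandardCharsets.UTF_8);
    }
}
```

The point of the design is that later stages can reorder the small fixed-size metadata entries without moving the serialized bytes.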

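2. Search for references to kvmeta to find the code that consumes kvbuffer and kvmeta and produces the spillRec added to indexCacheList

kvmeta turns out to be used in org.apache.hadoop.mapred.MapTask.MapOutputBuffer.sortAndSpill(). Setting a breakpoint there shows the following: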

private void sortAndSpill() throws IOException, ClassNotFoundException, InterruptedException {
  ...
  sorter.sort(MapOutputBuffer.this, mstart, mend, reporter);
  ...
  for (int i = 0; i < partitions; ++i) {
    ...
    if (combinerRunner == null) {
      // spill directly
      DataInputBuffer key = new DataInputBuffer();
      while (spindex < mend &&
          kvmeta.get(offsetFor(spindex % maxRec) + PARTITION) == i) {
        ...
        writer.append(key, value);
        ++spindex;
      }
    } ...
    spillRec.putIndex(rec, i);
  }
  ...
  indexCacheList.add(spillRec);
  ...
}

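Three operations happen here:

1. sorter.sort: sorts by partition and then by key, so that records of the same partition are grouped together and ordered by key.

2. writer.append: writes the serialized record to the output stream; here it is written to the file spill0.out.

3. indexCacheList.add: each spillRec records the partition information contained in one spill output file.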
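The first two operations can be sketched as a sort on (partition, key) followed by a single pass that emits records partition by partition. This is a minimal illustration, assuming plain Java objects instead of serialized bytes; the class SortSpillSketch, the record type Rec, and the tab-separated output lines are hypothetical and stand in for the real sorter and writer.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of "sort by partition then key, then write per partition".
class SortSpillSketch {

    // One map output record together with its partition number.
    static final class Rec {
        final int partition;
        final String key;
        final int value;
        Rec(int partition, String key, int value) {
            this.partition = partition;
            this.key = key;
            this.value = value;
        }
    }

    // Orders records by partition first, then by key within a partition.
    static void sort(List<Rec> recs) {
        recs.sort(Comparator.comparingInt((Rec r) -> r.partition)
                .thenComparing(r -> r.key));
    }

    // Walks the sorted records once; counts[i] receives the number of records in partition i.
    static List<String> spill(List<Rec> sorted, int partitions, int[] counts) {
        List<String> out = new ArrayList<>(); // stands in for the spill file
        int spindex = 0;
        for (int i = 0; i < partitions; ++i) {
            while (spindex < sorted.size() && sorted.get(spindex).partition == i) {
                Rec r = sorted.get(spindex);
                out.add(r.key + "\t" + r.value);
                counts[i]++;
                ++spindex;
            }
        }
        return out;
    }
}
```

Because the records are already grouped by partition after the sort, the inner while loop never has to search: each partition is a contiguous run.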

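3. Find the code that consumes indexCacheList: org.apache.hadoop.mapred.MapTask.MapOutputBuffer.mergeParts()

Setting a breakpoint here shows that we have only one spill file, so no merge is needed.

The single spillRec is simply written to the file file.out.index,

and spill0.out is renamed to file.out. Opening that file in vim shows the characters in sorted order.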

private void mergeParts() throws IOException, InterruptedException, ClassNotFoundException {
  ...
  sameVolRename(filename[0],
      mapOutputFile.getOutputFileForWriteInVolume(filename[0]));
  ...
  indexCacheList.get(0).writeToFile(
      mapOutputFile.getOutputIndexFileForWriteInVolume(filename[0]), job);
  ...
}
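The role of the index next to the data file can be illustrated with a small sketch: the data holds all partitions back to back, and the index holds one (start offset, length) pair per partition so a reader can jump straight to its own segment. This is an assumption-laden model, not the on-disk format: the class IndexSketch is hypothetical, a byte array stands in for file.out, and the real files carry additional information that is not modeled here.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical model of a data file plus a per-partition index.
class IndexSketch {
    final long[] startOffset; // where partition i begins in data
    final long[] partLength;  // how many bytes partition i occupies
    final byte[] data;        // stands in for file.out

    // segments[i] is the already-sorted content of partition i.
    IndexSketch(String[] segments) {
        startOffset = new long[segments.length];
        partLength = new long[segments.length];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < segments.length; ++i) {
            byte[] bytes = segments[i].getBytes(StandardCharsets.UTF_8);
            startOffset[i] = out.size();
            partLength[i] = bytes.length;
            out.write(bytes, 0, bytes.length);
        }
        data = out.toByteArray();
    }

    // Reads only partition i, using the index instead of scanning the data.
    String readPartition(int i) {
        return new String(data, (int) startOffset[i], (int) partLength[i],
                StandardCharsets.UTF_8);
    }
}
```

An empty partition simply gets a zero length, so every partition number has an index entry even when no record was written for it.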

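Summary:

1. The map output is first serialized in memory into kvbuffer and kvmeta.

2. sortAndSpill writes the in-memory records to a file.

3. merge combines the spilled files into a single file, file.out, and writes the partition information from each file to file.out.index.

Cases not yet analyzed:

The details of the more complex case where map outputs a large amount of data and multiple spill files are produced (1. asynchronous spill, 2. merging multiple files).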
