Flink Source Code Analysis Series Documentation Index
Please see: Flink Source Code Analysis Series Documentation Index
Preface
FLIP-27: Refactor Source Interface - Apache Flink - Apache Software Foundation introduced a new Source architecture. For an analysis of that architecture, see "Flink 源碼之新 Source 架構" (Flink Source Code: The New Source Architecture). On top of this new architecture the Flink community released a new Kafka connector, KafkaSource. The old implementation, FlinkKafkaConsumer, is now marked as Deprecated and no longer recommended. This post walks through the source code of KafkaSource.
This post covers four parts of the source code:
- KafkaSource creation
- Data reading
- Partition discovery
- Checkpointing
Creating a KafkaSource
As shown in the official documentation, to write a Flink application that consumes from Kafka we can create a KafkaSource as follows:
KafkaSource<String> source = KafkaSource.<String>builder()
.setBootstrapServers(brokers)
.setTopics("input-topic")
.setGroupId("my-group")
.setStartingOffsets(OffsetsInitializer.earliest())
.setValueOnlyDeserializer(new SimpleStringSchema())
.build();
env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source");
env.fromSource produces a DataStreamSource. The DataStreamSource corresponds to a SourceTransformation, which SourceTransformationTranslator translates into a Source node of the StreamGraph; at execution time this node is backed by a SourceOperator. SourceOperator is the operator for the new Source API. It interacts directly with the SourceReader and calls sourceReader.pollNext to pull data. This chain of logic has little to do with Kafka itself, so we only sketch it here (the entry point's signature is shown below for reference).
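For orientation, here is a minimal sketch of the fromSource entry point on StreamExecutionEnvironment (simplified signature; an overload that also takes an explicit TypeInformation exists as well):
public <OUT> DataStreamSource<OUT> fromSource(
        Source<OUT, ?, ?> source,
        WatermarkStrategy<OUT> timestampsAndWatermarks,
        String sourceName) {
    // wraps the Source into a DataStreamSource / SourceTransformation
    ...
}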
Finally, KafkaSourceBuilder returns a KafkaSource object built from the parameters we configured.
public KafkaSource<OUT> build() {
sanityCheck();
parseAndSetRequiredProperties();
return new KafkaSource<>(
subscriber,
startingOffsetsInitializer,
stoppingOffsetsInitializer,
boundedness,
deserializationSchema,
props);
}
The createReader method of KafkaSource creates the KafkaSourceReader. The code is as follows:
@Internal
@Override
public SourceReader<OUT, KafkaPartitionSplit> createReader(SourceReaderContext readerContext)
throws Exception {
return createReader(readerContext, (ignore) -> {});
}
@VisibleForTesting
SourceReader<OUT, KafkaPartitionSplit> createReader(
SourceReaderContext readerContext, Consumer<Collection<String>> splitFinishedHook)
throws Exception {
// elementsQueue buffers the ConsumerRecords obtained from the fetchers
// the reader consumes the cached Kafka records from this queue
FutureCompletingBlockingQueue<RecordsWithSplitIds<ConsumerRecord<byte[], byte[]>>>
elementsQueue = new FutureCompletingBlockingQueue<>();
// initialize the deserializationSchema
deserializationSchema.open(
new DeserializationSchema.InitializationContext() {
@Override
public MetricGroup getMetricGroup() {
return readerContext.metricGroup().addGroup("deserializer");
}
@Override
public UserCodeClassLoader getUserCodeClassLoader() {
return readerContext.getUserCodeClassLoader();
}
});
// create the Kafka source reader metrics
final KafkaSourceReaderMetrics kafkaSourceReaderMetrics =
new KafkaSourceReaderMetrics(readerContext.metricGroup());
// create a supplier (factory) for KafkaPartitionSplitReader, which reads Kafka records per partition split
Supplier<KafkaPartitionSplitReader> splitReaderSupplier =
() -> new KafkaPartitionSplitReader(props, readerContext, kafkaSourceReaderMetrics);
KafkaRecordEmitter<OUT> recordEmitter = new KafkaRecordEmitter<>(deserializationSchema);
return new KafkaSourceReader<>(
elementsQueue,
new KafkaSourceFetcherManager(
elementsQueue, splitReaderSupplier::get, splitFinishedHook),
recordEmitter,
toConfiguration(props),
readerContext,
kafkaSourceReaderMetrics);
}
Data Reading Flow
KafkaSourceFetcherManager extends SingleThreadFetcherManager. When new splits are discovered, it grabs an already running SplitFetcher and assigns the splits to it; if no fetcher is running, it creates a new one.
@Override
// called when new splits are discovered
// assigns the splits to a fetcher
public void addSplits(List<SplitT> splitsToAdd) {
SplitFetcher<E, SplitT> fetcher = getRunningFetcher();
if (fetcher == null) {
fetcher = createSplitFetcher();
// Add the splits to the fetchers.
fetcher.addSplits(splitsToAdd);
startFetcher(fetcher);
} else {
fetcher.addSplits(splitsToAdd);
}
}
Next, let's look at how the fetcher pulls data. The startFetcher method above starts the SplitFetcher thread.
protected void startFetcher(SplitFetcher<E, SplitT> fetcher) {
executors.submit(fetcher);
}
SplitFetcher executes the tasks that pull data from the external system; it keeps running SplitFetcherTasks in a loop. SplitFetcherTask has several subclasses (a sketch of the common interface follows the list):
- AddSplitTask: assigns splits to the reader
- PauseOrResumeSplitsTask: pauses or resumes reading of splits
- FetchTask: fetches data into the elementsQueue
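All of these implement the small SplitFetcherTask interface in flink-connector-base, which roughly looks like this (sketch, details may vary by version):
interface SplitFetcherTask {
    // runs one unit of work; the boolean indicates whether the task finished
    boolean run() throws IOException;
    // interrupts a blocked run(), e.g. when the fetcher is woken up or shut down
    void wakeUp();
}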
Next, let's analyze the run method of SplitFetcher:
@Override
public void run() {
LOG.info("Starting split fetcher {}", id);
try {
// run runOnce in a loop
while (runOnce()) {
// nothing to do, everything is inside #runOnce.
}
} catch (Throwable t) {
errorHandler.accept(t);
} finally {
try {
splitReader.close();
} catch (Exception e) {
errorHandler.accept(e);
} finally {
LOG.info("Split fetcher {} exited.", id);
// This executes after possible errorHandler.accept(t). If these operations bear
// a happens-before relation, then we can checking side effect of
// errorHandler.accept(t)
// to know whether it happened after observing side effect of shutdownHook.run().
shutdownHook.run();
}
}
}
boolean runOnce() {
// first blocking call = get next task. blocks only if there are no active splits and queued
// tasks.
SplitFetcherTask task;
lock.lock();
try {
if (closed) {
return false;
}
// the important logic is here
// take a task from the taskQueue
// if there are queued tasks, run them first
// if the taskQueue is empty, check whether there are assigned splits; if so, return the FetchTask
// the FetchTask is created when the SplitFetcher is constructed
task = getNextTaskUnsafe();
if (task == null) {
// (spurious) wakeup, so just repeat
return true;
}
LOG.debug("Prepare to run {}", task);
// store task for #wakeUp
this.runningTask = task;
} finally {
lock.unlock();
}
// execute the task outside of lock, so that it can be woken up
boolean taskFinished;
try {
// execute the task's run method
taskFinished = task.run();
} catch (Exception e) {
throw new RuntimeException(
String.format(
"SplitFetcher thread %d received unexpected exception while polling the records",
id),
e);
}
// re-acquire lock as all post-processing steps, need it
lock.lock();
try {
this.runningTask = null;
processTaskResultUnsafe(task, taskFinished);
} finally {
lock.unlock();
}
return true;
}
The SplitFetcherTask subclass used to pull data is FetchTask. Its run method looks like this:
@Override
public boolean run() throws IOException {
try {
// skip this round if the fetcher has been woken up
if (!isWakenUp() && lastRecords == null) {
// use the splitReader to fetch a batch of records from the splits
lastRecords = splitReader.fetch();
}
if (!isWakenUp()) {
// The order matters here. We must first put the last records into the queue.
// This ensures the handling of the fetched records is atomic to wakeup.
// put the fetched records into the elementsQueue
if (elementsQueue.put(fetcherIndex, lastRecords)) {
// if some splits have been fully consumed
if (!lastRecords.finishedSplits().isEmpty()) {
// The callback does not throw InterruptedException.
// invoke the split-finished callback
splitFinishedCallback.accept(lastRecords.finishedSplits());
}
lastRecords = null;
}
}
} catch (InterruptedException e) {
// this should only happen on shutdown
throw new IOException("Source fetch execution was interrupted", e);
} finally {
// clean up the potential wakeup effect. It is possible that the fetcher is waken up
// after the clean up. In that case, either the wakeup flag will be set or the
// running thread will be interrupted. The next invocation of run() will see that and
// just skip.
if (isWakenUp()) {
wakeup = false;
}
}
// The return value of fetch task does not matter.
return true;
}
splitReader.fetch() in the snippet above corresponds to the fetch method of KafkaPartitionSplitReader.
@Override
public RecordsWithSplitIds<ConsumerRecord<byte[], byte[]>> fetch() throws IOException {
ConsumerRecords<byte[], byte[]> consumerRecords;
try {
// poll a batch of records from the KafkaConsumer with a 10 s timeout
consumerRecords = consumer.poll(Duration.ofMillis(POLL_TIMEOUT));
} catch (WakeupException | IllegalStateException e) {
// IllegalStateException will be thrown if the consumer is not assigned any partitions.
// This happens if all assigned partitions are invalid or empty (starting offset >=
// stopping offset). We just mark empty partitions as finished and return an empty
// record container, and this consumer will be closed by SplitFetcherManager.
// as the comment above says, an IllegalStateException is thrown if the consumer has no partitions assigned
// this also happens when all assigned partitions are invalid or empty (starting offset >= stopping offset)
// in that case return an empty KafkaPartitionSplitRecords and mark the empty splits as finished
// this consumer will later be closed by the SplitFetcherManager
KafkaPartitionSplitRecords recordsBySplits =
new KafkaPartitionSplitRecords(
ConsumerRecords.empty(), kafkaSourceReaderMetrics);
markEmptySplitsAsFinished(recordsBySplits);
return recordsBySplits;
}
// wrap the consumerRecords in a KafkaPartitionSplitRecords and return it
// KafkaPartitionSplitRecords exposes both a partition iterator and a record iterator
KafkaPartitionSplitRecords recordsBySplits =
new KafkaPartitionSplitRecords(consumerRecords, kafkaSourceReaderMetrics);
List<TopicPartition> finishedPartitions = new ArrayList<>();
// iterate over the partitions contained in consumerRecords
for (TopicPartition tp : consumerRecords.partitions()) {
// get the stopping offset of this partition
long stoppingOffset = getStoppingOffset(tp);
// all records fetched for this partition
final List<ConsumerRecord<byte[], byte[]>> recordsFromPartition =
consumerRecords.records(tp);
// if any records were fetched
if (recordsFromPartition.size() > 0) {
// the last record fetched from this partition
final ConsumerRecord<byte[], byte[]> lastRecord =
recordsFromPartition.get(recordsFromPartition.size() - 1);
// After processing a record with offset of "stoppingOffset - 1", the split reader
// should not continue fetching because the record with stoppingOffset may not
// exist. Keep polling will just block forever.
// if the offset of the last record reaches stoppingOffset - 1 or beyond
// (the stopping offset is obtained, e.g., via the consumer's endOffsets method)
// record the stopping offset in recordsBySplits
// and mark this split as finished
if (lastRecord.offset() >= stoppingOffset - 1) {
recordsBySplits.setPartitionStoppingOffset(tp, stoppingOffset);
finishSplitAtRecord(
tp,
stoppingOffset,
lastRecord.offset(),
finishedPartitions,
recordsBySplits);
}
}
// Track this partition's record lag if it never appears before
// add the records-lag metric for this partition if not tracked yet
kafkaSourceReaderMetrics.maybeAddRecordsLagMetric(consumer, tp);
}
// mark empty splits as finished
markEmptySplitsAsFinished(recordsBySplits);
// Unassign the partitions that has finished.
// stop tracking the records lag of finished partitions
// and unassign those partitions
if (!finishedPartitions.isEmpty()) {
finishedPartitions.forEach(kafkaSourceReaderMetrics::removeRecordsLagMetric);
unassignPartitions(finishedPartitions);
}
// Update numBytesIn
// update the numBytesIn metric
kafkaSourceReaderMetrics.updateNumBytesInCounter();
return recordsBySplits;
}
Up to this point we have covered the path from reading records with the KafkaConsumer to placing them into the elementsQueue. Next comes the part where Flink takes the data out of the elementsQueue and sends it downstream.
SourceReaderBase reads the data out of the elementsQueue and hands it to the recordEmitter. Its getNextFetch method looks like this:
@Nullable
private RecordsWithSplitIds<E> getNextFetch(final ReaderOutput<T> output) {
splitFetcherManager.checkErrors();
LOG.trace("Getting next source data batch from queue");
// poll a batch of records from the elementsQueue
final RecordsWithSplitIds<E> recordsWithSplitId = elementsQueue.poll();
// if the current fetch has no data and there is no next split, return null
if (recordsWithSplitId == null || !moveToNextSplit(recordsWithSplitId, output)) {
// No element available, set to available later if needed.
return null;
}
currentFetch = recordsWithSplitId;
return recordsWithSplitId;
}
getNextFetch is called from pollNext. The SourceOperator keeps calling the reader's pollNext method, which pulls data and hands it to the recordEmitter to be sent downstream.
@Override
public InputStatus pollNext(ReaderOutput<T> output) throws Exception {
// make sure we have a fetch we are working on, or move to the next
RecordsWithSplitIds<E> recordsWithSplitId = this.currentFetch;
if (recordsWithSplitId == null) {
recordsWithSplitId = getNextFetch(output);
if (recordsWithSplitId == null) {
return trace(finishedOrAvailableLater());
}
}
// we need to loop here, because we may have to go across splits
while (true) {
// Process one record.
final E record = recordsWithSplitId.nextRecordFromSplit();
if (record != null) {
// emit the record.
numRecordsInCounter.inc(1);
recordEmitter.emitRecord(record, currentSplitOutput, currentSplitContext.state);
LOG.trace("Emitted record: {}", record);
// We always emit MORE_AVAILABLE here, even though we do not strictly know whether
// more is available. If nothing more is available, the next invocation will find
// this out and return the correct status.
// That means we emit the occasional 'false positive' for availability, but this
// saves us doing checks for every record. Ultimately, this is cheaper.
return trace(InputStatus.MORE_AVAILABLE);
} else if (!moveToNextSplit(recordsWithSplitId, output)) {
// The fetch is done and we just discovered that and have not emitted anything, yet.
// We need to move to the next fetch. As a shortcut, we call pollNext() here again,
// rather than emitting nothing and waiting for the caller to call us again.
return pollNext(output);
}
}
}
Finally we arrive at the emitRecord method of KafkaRecordEmitter. It deserializes each received Kafka record and emits it to the downstream output, from where it is passed on to the downstream operators.
@Override
public void emitRecord(
ConsumerRecord<byte[], byte[]> consumerRecord,
SourceOutput<T> output,
KafkaPartitionSplitState splitState)
throws Exception {
try {
sourceOutputWrapper.setSourceOutput(output);
sourceOutputWrapper.setTimestamp(consumerRecord.timestamp());
deserializationSchema.deserialize(consumerRecord, sourceOutputWrapper);
splitState.setCurrentOffset(consumerRecord.offset() + 1);
} catch (Exception e) {
throw new IOException("Failed to deserialize consumer record due to", e);
}
}
Partition Discovery
Flink's KafkaSource supports dynamic Kafka partition discovery: based on the configured subscription rules (a topic list, a topic regex pattern, or explicitly specified partitions), it periodically scans the Kafka partitions as a scheduled task.
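A minimal sketch of enabling periodic discovery, assuming the property key "partition.discovery.interval.ms" (KafkaSourceOptions.PARTITION_DISCOVERY_INTERVAL_MS); the default behavior differs between Flink versions, and Pattern comes from java.util.regex while brokers is the variable from the earlier example:
KafkaSource<String> source = KafkaSource.<String>builder()
        .setBootstrapServers(brokers)
        .setTopicPattern(Pattern.compile("input-topic-.*"))
        .setGroupId("my-group")
        .setValueOnlyDeserializer(new SimpleStringSchema())
        // scan for new partitions every 60 seconds
        .setProperty("partition.discovery.interval.ms", "60000")
        .build();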
The start method of KafkaSourceEnumerator creates a KafkaAdminClient and then, depending on partitionDiscoveryIntervalMs (the partition discovery interval), decides whether to invoke the partition discovery logic periodically.
@Override
public void start() {
// create the Kafka admin client
adminClient = getKafkaAdminClient();
// if a partition discovery interval is configured
if (partitionDiscoveryIntervalMs > 0) {
LOG.info(
"Starting the KafkaSourceEnumerator for consumer group {} "
+ "with partition discovery interval of {} ms.",
consumerGroupId,
partitionDiscoveryIntervalMs);
// periodically invoke getSubscribedTopicPartitions followed by checkPartitionChanges
context.callAsync(
this::getSubscribedTopicPartitions,
this::checkPartitionChanges,
0,
partitionDiscoveryIntervalMs);
} else {
// otherwise invoke them only once, at startup
LOG.info(
"Starting the KafkaSourceEnumerator for consumer group {} "
+ "without periodic partition discovery.",
consumerGroupId);
context.callAsync(this::getSubscribedTopicPartitions, this::checkPartitionChanges);
}
}
The getSubscribedTopicPartitions method:
private Set<TopicPartition> getSubscribedTopicPartitions() {
return subscriber.getSubscribedTopicPartitions(adminClient);
}
This method delegates to the KafkaSubscriber, which returns the subscribed partitions according to the configured rules. KafkaSubscriber has three subclasses, one per discovery rule (the corresponding builder calls are sketched after the list):
- PartitionSetSubscriber: created via KafkaSourceBuilder's setPartitions method; subscribes directly to the named partitions.
- TopicListSubscriber: derives the subscribed partitions from a list of topics; this is the subscriber used when you call KafkaSourceBuilder's setTopics to subscribe to a set of topics.
- TopicPatternSubscriber: derives the subscribed partitions by matching topic names against a regular expression; created when you use KafkaSourceBuilder's setTopicPattern method.
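The mapping from builder call to subscriber, as a sketch (only one of the three calls would be used on a given builder; TopicPartition comes from kafka-clients, Pattern/HashSet/Arrays from the JDK):
KafkaSourceBuilder<String> builder = KafkaSource.<String>builder();
// PartitionSetSubscriber: explicitly named partitions
builder.setPartitions(new HashSet<>(Arrays.asList(
        new TopicPartition("input-topic", 0),
        new TopicPartition("input-topic", 1))));
// TopicListSubscriber: a fixed list of topics
builder.setTopics("input-topic", "another-topic");
// TopicPatternSubscriber: all topics matching a regex
builder.setTopicPattern(Pattern.compile("input-topic-.*"));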
Next, taking TopicListSubscriber as an example, let's analyze how the subscribed partitions are obtained.
@Override
public Set<TopicPartition> getSubscribedTopicPartitions(AdminClient adminClient) {
LOG.debug("Fetching descriptions for topics: {}", topics);
// use the admin client to read the Kafka topic metadata,
// which contains the partition info of the requested topics
final Map<String, TopicDescription> topicMetadata =
getTopicMetadata(adminClient, new HashSet<>(topics));
// add every partition of every topic to the subscribedPartitions set and return it
Set<TopicPartition> subscribedPartitions = new HashSet<>();
for (TopicDescription topic : topicMetadata.values()) {
for (TopicPartitionInfo partition : topic.partitions()) {
subscribedPartitions.add(new TopicPartition(topic.name(), partition.partition()));
}
}
return subscribedPartitions;
}
The logic for obtaining the subscribed partitions is not particularly complex; the other two subscribers are not analyzed here.
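Conceptually (a rough sketch only, not the actual Flink implementation), the pattern-based subscriber lists all topics via the admin client, keeps the names that match the configured pattern (topicPattern is an assumed field name here), and then expands them into TopicPartitions the same way as above:
// exception handling omitted; listTopics()/describeTopics() return futures
Set<String> allTopics = adminClient.listTopics().names().get();
Set<String> matchedTopics = allTopics.stream()
        .filter(name -> topicPattern.matcher(name).matches())
        .collect(Collectors.toSet());
Map<String, TopicDescription> topicMetadata =
        adminClient.describeTopics(matchedTopics).all().get();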
The return value of getSubscribedTopicPartitions, together with the exception if one was thrown, is passed to the checkPartitionChanges method, which detects whether the partition set has changed. The logic is as follows:
private void checkPartitionChanges(Set<TopicPartition> fetchedPartitions, Throwable t) {
if (t != null) {
throw new FlinkRuntimeException(
"Failed to list subscribed topic partitions due to ", t);
}
// detect partition changes
final PartitionChange partitionChange = getPartitionChange(fetchedPartitions);
// if nothing changed, return immediately
if (partitionChange.isEmpty()) {
return;
}
// if changes were detected, invoke initializePartitionSplits and handlePartitionSplitChanges
context.callAsync(
() -> initializePartitionSplits(partitionChange),
this::handlePartitionSplitChanges);
}
@VisibleForTesting
PartitionChange getPartitionChange(Set<TopicPartition> fetchedPartitions) {
// collects the removed partitions
final Set<TopicPartition> removedPartitions = new HashSet<>();
Consumer<TopicPartition> dedupOrMarkAsRemoved =
(tp) -> {
if (!fetchedPartitions.remove(tp)) {
removedPartitions.add(tp);
}
};
// a partition that exists in assignedPartitions (already assigned) but not in fetchedPartitions has been removed,
// so add it to removedPartitions
assignedPartitions.forEach(dedupOrMarkAsRemoved);
// pendingPartitionSplitAssignment holds partitions discovered in a previous round but not yet assigned to readers
// also look for removed partitions among them
pendingPartitionSplitAssignment.forEach(
(reader, splits) ->
splits.forEach(
split -> dedupOrMarkAsRemoved.accept(split.getTopicPartition())));
// any partitions still left in fetchedPartitions are newly discovered
if (!fetchedPartitions.isEmpty()) {
LOG.info("Discovered new partitions: {}", fetchedPartitions);
}
if (!removedPartitions.isEmpty()) {
LOG.info("Discovered removed partitions: {}", removedPartitions);
}
// wrap the newly added and removed partitions into a PartitionChange and return it
return new PartitionChange(fetchedPartitions, removedPartitions);
}
After comparing the newly discovered partitions with the previously subscribed ones, the next step is to react to the changes.
The initializePartitionSplits method wraps the partition changes into a PartitionSplitChange. This object records the newly added partitions and the removed partitions. Unlike PartitionChange, the new partitions in PartitionSplitChange are of type KafkaPartitionSplit, which additionally stores the starting and stopping offset of each partition.
private PartitionSplitChange initializePartitionSplits(PartitionChange partitionChange) {
// the newly added partitions
Set<TopicPartition> newPartitions =
Collections.unmodifiableSet(partitionChange.getNewPartitions());
// the partition offsets retriever
OffsetsInitializer.PartitionOffsetsRetriever offsetsRetriever = getOffsetsRetriever();
// determine the starting offsets
Map<TopicPartition, Long> startingOffsets =
startingOffsetInitializer.getPartitionOffsets(newPartitions, offsetsRetriever);
// determine the stopping offsets
Map<TopicPartition, Long> stoppingOffsets =
stoppingOffsetInitializer.getPartitionOffsets(newPartitions, offsetsRetriever);
Set<KafkaPartitionSplit> partitionSplits = new HashSet<>(newPartitions.size());
// wrap each partition together with its starting and stopping offset
for (TopicPartition tp : newPartitions) {
Long startingOffset = startingOffsets.get(tp);
long stoppingOffset =
stoppingOffsets.getOrDefault(tp, KafkaPartitionSplit.NO_STOPPING_OFFSET);
partitionSplits.add(new KafkaPartitionSplit(tp, startingOffset, stoppingOffset));
}
// return the result
return new PartitionSplitChange(partitionSplits, partitionChange.getRemovedPartitions());
}
The key part of this method is obtaining the starting offset (startingOffsetInitializer) and the stopping offset (stoppingOffsetInitializer) for each partition.
startingOffsetInitializer is created in KafkaSourceBuilder and defaults to OffsetsInitializer.earliest():
static OffsetsInitializer earliest() {
return new ReaderHandledOffsetsInitializer(
KafkaPartitionSplit.EARLIEST_OFFSET, OffsetResetStrategy.EARLIEST);
}
It creates a ReaderHandledOffsetsInitializer, which means that all newly discovered partitions are read from the very beginning. Its getPartitionOffsets method is shown below: it sets the offset of every partition to startingOffset, which in the scenario above is KafkaPartitionSplit.EARLIEST_OFFSET.
@Override
public Map<TopicPartition, Long> getPartitionOffsets(
Collection<TopicPartition> partitions,
PartitionOffsetsRetriever partitionOffsetsRetriever) {
Map<TopicPartition, Long> initialOffsets = new HashMap<>();
for (TopicPartition tp : partitions) {
initialOffsets.put(tp, startingOffset);
}
return initialOffsets;
}
For stoppingOffsetInitializer, KafkaSourceBuilder creates a NoStoppingOffsetsInitializer by default, meaning there is no stopping offset; this is intended for unbounded Kafka streams. Its code is trivial, so it is not analyzed here.
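For comparison, a sketch of configuring other initializers (these builder methods exist on KafkaSourceBuilder; OffsetResetStrategy comes from kafka-clients, and brokers is the variable from the first example):
KafkaSource<String> source = KafkaSource.<String>builder()
        .setBootstrapServers(brokers)
        .setTopics("input-topic")
        .setGroupId("my-group")
        .setValueOnlyDeserializer(new SimpleStringSchema())
        // start from the committed group offsets, falling back to earliest
        .setStartingOffsets(OffsetsInitializer.committedOffsets(OffsetResetStrategy.EARLIEST))
        // configure a stopping offset initializer, which makes the source bounded
        .setBounded(OffsetsInitializer.latest())
        .build();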
Back to handlePartitionSplitChanges, the method that reacts to partition changes. It assigns the newly discovered partitions to the pending and registered readers.
private void handlePartitionSplitChanges(
PartitionSplitChange partitionSplitChange, Throwable t) {
if (t != null) {
throw new FlinkRuntimeException("Failed to initialize partition splits due to ", t);
}
if (partitionDiscoveryIntervalMs < 0) {
LOG.debug("Partition discovery is disabled.");
noMoreNewPartitionSplits = true;
}
// TODO: Handle removed partitions.
addPartitionSplitChangeToPendingAssignments(partitionSplitChange.newPartitionSplits);
assignPendingPartitionSplits(context.registeredReaders().keySet());
}
addPartitionSplitChangeToPendingAssignments adds the partitions to the pending (to-be-read) assignments.
private void addPartitionSplitChangeToPendingAssignments(
Collection<KafkaPartitionSplit> newPartitionSplits) {
int numReaders = context.currentParallelism();
for (KafkaPartitionSplit split : newPartitionSplits) {
// distribute the partitions evenly over all readers
int ownerReader = getSplitOwner(split.getTopicPartition(), numReaders);
pendingPartitionSplitAssignment
.computeIfAbsent(ownerReader, r -> new HashSet<>())
.add(split);
}
LOG.debug(
"Assigned {} to {} readers of consumer group {}.",
newPartitionSplits,
numReaders,
consumerGroupId);
}
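For reference, the split-to-reader mapping in getSplitOwner roughly hashes the topic name to a start index and then spreads the topic's partitions round-robin from there (a sketch based on KafkaSourceEnumerator; details may differ between versions):
static int getSplitOwner(TopicPartition tp, int numReaders) {
    // topic-dependent start index, so different topics start at different readers
    int startIndex = ((tp.topic().hashCode() * 31) & 0x7FFFFFFF) % numReaders;
    // partitions of the same topic are then assigned round-robin from that index
    return (startIndex + tp.partition()) % numReaders;
}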
The assignPendingPartitionSplits method assigns the pending partitions to the readers. Its logic is analyzed below:
private void assignPendingPartitionSplits(Set<Integer> pendingReaders) {
Map<Integer, List<KafkaPartitionSplit>> incrementalAssignment = new HashMap<>();
// Check if there's any pending splits for given readers
for (int pendingReader : pendingReaders) {
// check that the reader has registered with the SourceCoordinator
checkReaderRegistered(pendingReader);
// Remove pending assignment for the reader
// take all partitions assigned to this reader and remove them from pendingPartitionSplitAssignment
final Set<KafkaPartitionSplit> pendingAssignmentForReader =
pendingPartitionSplitAssignment.remove(pendingReader);
// if there are partitions for this reader, add them to incrementalAssignment
if (pendingAssignmentForReader != null && !pendingAssignmentForReader.isEmpty()) {
// Put pending assignment into incremental assignment
incrementalAssignment
.computeIfAbsent(pendingReader, (ignored) -> new ArrayList<>())
.addAll(pendingAssignmentForReader);
// Mark pending partitions as already assigned
// mark these partitions as assigned
pendingAssignmentForReader.forEach(
split -> assignedPartitions.add(split.getTopicPartition()));
}
}
// Assign pending splits to readers
// hand these partitions over to the readers
if (!incrementalAssignment.isEmpty()) {
LOG.info("Assigning splits to readers {}", incrementalAssignment);
context.assignSplits(new SplitsAssignment<>(incrementalAssignment));
}
// If periodically partition discovery is disabled and the initializing discovery has done,
// signal NoMoreSplitsEvent to pending readers
// if there will be no more new splits (partition discovery is disabled) and the source is bounded,
// signal the readers that there are no more splits (signalNoMoreSplits)
if (noMoreNewPartitionSplits && boundedness == Boundedness.BOUNDED) {
LOG.debug(
"No more KafkaPartitionSplits to assign. Sending NoMoreSplitsEvent to reader {}"
+ " in consumer group {}.",
pendingReaders,
consumerGroupId);
pendingReaders.forEach(context::signalNoMoreSplits);
}
}
assignPendingPartitionSplits is called from three places (rough sketches of the first two follow the list):
- addSplitsBack: a reader failed, and the splits assigned to it after the last successful checkpoint have to be added back to the SplitEnumerator.
- addReader: a new reader was added and needs splits assigned to it.
- handlePartitionSplitChanges: as described above, when a partition change is detected, the newly discovered partitions have to be assigned to the readers.
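Rough sketches of the first two call sites in KafkaSourceEnumerator (simplified, logging omitted; may differ slightly between versions):
@Override
public void addSplitsBack(List<KafkaPartitionSplit> splits, int subtaskId) {
    // put the returned splits back into the pending assignments
    addPartitionSplitChangeToPendingAssignments(splits);
    // if the failed reader has already re-registered, assign its pending splits right away
    if (context.registeredReaders().containsKey(subtaskId)) {
        assignPendingPartitionSplits(Collections.singleton(subtaskId));
    }
}

@Override
public void addReader(int subtaskId) {
    // a newly registered reader simply receives whatever splits are pending for it
    assignPendingPartitionSplits(Collections.singleton(subtaskId));
}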
The next question is how these splits actually reach the readers. Let's expand the context.assignSplits call. Here the implementation class of context is SourceCoordinatorContext, so we continue with SourceCoordinatorContext::assignSplits:
@Override
public void assignSplits(SplitsAssignment<SplitT> assignment) {
// Ensure the split assignment is done by the coordinator executor.
// executed in the SourceCoordinator thread
callInCoordinatorThread(
() -> {
// Ensure all the subtasks in the assignment have registered.
// check that the reader of every subtask in the assignment has registered
assignment
.assignment()
.forEach(
(id, splits) -> {
if (!registeredReaders.containsKey(id)) {
throw new IllegalArgumentException(
String.format(
"Cannot assign splits %s to subtask %d because the subtask is not registered.",
splits, id));
}
});
// record the assignment (add it to the set of assignments not yet covered by a checkpoint)
assignmentTracker.recordSplitAssignment(assignment);
// assign the splits
assignSplitsToAttempts(assignment);
return null;
},
String.format("Failed to assign splits %s due to ", assignment));
}
assignSplitsToAttempts has several overloads. Following them to the end, it creates an AddSplitEvent and sends this event to the SourceOperator through the OperatorCoordinator. The code is shown below:
private void assignSplitsToAttempts(SplitsAssignment<SplitT> assignment) {
assignment.assignment().forEach((index, splits) -> assignSplitsToAttempts(index, splits));
}
private void assignSplitsToAttempts(int subtaskIndex, List<SplitT> splits) {
getRegisteredAttempts(subtaskIndex)
.forEach(attempt -> assignSplitsToAttempt(subtaskIndex, attempt, splits));
}
private void assignSplitsToAttempt(int subtaskIndex, int attemptNumber, List<SplitT> splits) {
if (splits.isEmpty()) {
return;
}
checkAttemptReaderReady(subtaskIndex, attemptNumber);
final AddSplitEvent<SplitT> addSplitEvent;
try {
// create the AddSplitEvent (add-split event)
addSplitEvent = new AddSplitEvent<>(splits, splitSerializer);
} catch (IOException e) {
throw new FlinkRuntimeException("Failed to serialize splits.", e);
}
final OperatorCoordinator.SubtaskGateway gateway =
subtaskGateways.getGatewayAndCheckReady(subtaskIndex, attemptNumber);
// send the event to the SourceOperator of the corresponding subtaskIndex
gateway.sendEvent(addSplitEvent);
}
gateway.sendEvent() -> SourceOperator::handleOperatorEvent
The network communication in between is not analyzed here. Let's look at handleOperatorEvent, the method with which SourceOperator receives the event:
public void handleOperatorEvent(OperatorEvent event) {
if (event instanceof WatermarkAlignmentEvent) {
updateMaxDesiredWatermark((WatermarkAlignmentEvent) event);
checkWatermarkAlignment();
checkSplitWatermarkAlignment();
} else if (event instanceof AddSplitEvent) {
handleAddSplitsEvent(((AddSplitEvent<SplitT>) event));
} else if (event instanceof SourceEventWrapper) {
sourceReader.handleSourceEvents(((SourceEventWrapper) event).getSourceEvent());
} else if (event instanceof NoMoreSplitsEvent) {
sourceReader.notifyNoMoreSplits();
} else {
throw new IllegalStateException("Received unexpected operator event " + event);
}
}
If the received event is an AddSplitEvent, handleAddSplitsEvent is called. Analysis below:
private void handleAddSplitsEvent(AddSplitEvent<SplitT> event) {
try {
// deserialize the split information
List<SplitT> newSplits = event.splits(splitSerializer);
numSplits += newSplits.size();
// if the main output has not been initialized yet, cache the splits in the pending list
// otherwise create outputs for the new splits directly
if (operatingMode == OperatingMode.OUTPUT_NOT_INITIALIZED) {
// For splits arrived before the main output is initialized, store them into the
// pending list. Outputs of these splits will be created once the main output is
// ready.
outputPendingSplits.addAll(newSplits);
} else {
// Create output directly for new splits if the main output is already initialized.
createOutputForSplits(newSplits);
}
// add the splits to the sourceReader
sourceReader.addSplits(newSplits);
} catch (IOException e) {
throw new FlinkRuntimeException("Failed to deserialize the splits.", e);
}
}
Finally we trace our way to the addSplits method of SourceReaderBase.
@Override
public void addSplits(List<SplitT> splits) {
LOG.info("Adding split(s) to reader: {}", splits);
// Initialize the state for each split.
splits.forEach(
s ->
splitStates.put(
s.splitId(), new SplitContext<>(s.splitId(), initializedState(s))));
// Hand over the splits to the split fetcher to start fetch.
splitFetcherManager.addSplits(splits);
}
It hands the splits over to the splitFetcherManager. In the KafkaSource setting discussed here, its implementation class is KafkaSourceFetcherManager, and its addSplits method lives in the parent class SingleThreadFetcherManager.
At this point we are back at the "add splits" method from the beginning of the previous section, "Data Reading Flow". This concludes the analysis of KafkaSource partition discovery.
Checkpoint Logic
The snapshotState method of KafkaSourceReader returns the splits that need to be checkpointed, i.e. the splits assigned to this reader. If the user configured commit.offsets.on.checkpoint=true, the partition and offset of each split are also stored in offsetsToCommit.
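A minimal sketch of this setting (the property key is "commit.offsets.on.checkpoint", i.e. KafkaSourceOptions.COMMIT_OFFSETS_ON_CHECKPOINT, and it defaults to true):
KafkaSource<String> source = KafkaSource.<String>builder()
        .setBootstrapServers(brokers)
        .setTopics("input-topic")
        .setGroupId("my-group")
        .setValueOnlyDeserializer(new SimpleStringSchema())
        // disable committing offsets back to Kafka on checkpoints
        .setProperty("commit.offsets.on.checkpoint", "false")
        .build();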
@Override
public List<KafkaPartitionSplit> snapshotState(long checkpointId) {
// get the splits assigned to this reader (the checkpointId parameter is actually not used)
List<KafkaPartitionSplit> splits = super.snapshotState(checkpointId);
// controlled by the commit.offsets.on.checkpoint option:
// whether to commit offsets when a checkpoint is taken
if (!commitOffsetsOnCheckpoint) {
return splits;
}
// the logic below only runs when commit.offsets.on.checkpoint is enabled
// offsetsToCommit holds the offsets that need to be committed
// it is a Map<checkpointId, Map<partition, offset>>
// if this reader has no splits and no finished splits,
// record an empty map for this checkpoint id in offsetsToCommit
if (splits.isEmpty() && offsetsOfFinishedSplits.isEmpty()) {
offsetsToCommit.put(checkpointId, Collections.emptyMap());
} else {
// create an offsets map for the current checkpoint id and store it in offsetsToCommit
Map<TopicPartition, OffsetAndMetadata> offsetsMap =
offsetsToCommit.computeIfAbsent(checkpointId, id -> new HashMap<>());
// Put the offsets of the active splits.
// iterate over the splits, storing each split's partition and offset in offsetsMap
for (KafkaPartitionSplit split : splits) {
// If the checkpoint is triggered before the partition starting offsets
// is retrieved, do not commit the offsets for those partitions.
if (split.getStartingOffset() >= 0) {
offsetsMap.put(
split.getTopicPartition(),
new OffsetAndMetadata(split.getStartingOffset()));
}
}
// also store the partitions and offsets of all finished splits
// Put offsets of all the finished splits.
offsetsMap.putAll(offsetsOfFinishedSplits);
}
return splits;
}
The notifyCheckpointComplete method is invoked when a checkpoint has completed; the SourceCoordinator sends the checkpoint-complete notification. This is where the Kafka source commits the Kafka offsets.
@Override
public void notifyCheckpointComplete(long checkpointId) throws Exception {
LOG.debug("Committing offsets for checkpoint {}", checkpointId);
// as above, if committing offsets on checkpoint is disabled, do nothing and return
if (!commitOffsetsOnCheckpoint) {
return;
}
// look up the partition offsets to commit for this checkpoint in offsetsToCommit
Map<TopicPartition, OffsetAndMetadata> committedPartitions =
offsetsToCommit.get(checkpointId);
// if there are none, return
if (committedPartitions == null) {
LOG.debug(
"Offsets for checkpoint {} either do not exist or have already been committed.",
checkpointId);
return;
}
// ask the KafkaSourceFetcherManager to commit the offsets to Kafka
// (analyzed below)
((KafkaSourceFetcherManager) splitFetcherManager)
.commitOffsets(
committedPartitions,
(ignored, e) -> {
// The offset commit here is needed by the external monitoring. It won't
// break Flink job's correctness if we fail to commit the offset here.
// this is the offset-commit callback
// on error, record the failed commit in the metrics
if (e != null) {
kafkaSourceReaderMetrics.recordFailedCommit();
LOG.warn(
"Failed to commit consumer offsets for checkpoint {}",
checkpointId,
e);
} else {
LOG.debug(
"Successfully committed offsets for checkpoint {}",
checkpointId);
// record the successful commit in the metrics
kafkaSourceReaderMetrics.recordSucceededCommit();
// If the finished topic partition has been committed, we remove it
// from the offsets of the finished splits map.
committedPartitions.forEach(
(tp, offset) ->
kafkaSourceReaderMetrics.recordCommittedOffset(
tp, offset.offset()));
// since the offsets have been committed, remove them from the finished-splits map
offsetsOfFinishedSplits
.entrySet()
.removeIf(
entry ->
committedPartitions.containsKey(
entry.getKey()));
// remove the offsets of this and all earlier checkpoint ids; they have been committed and no longer need to be kept
while (!offsetsToCommit.isEmpty()
&& offsetsToCommit.firstKey() <= checkpointId) {
offsetsToCommit.remove(offsetsToCommit.firstKey());
}
}
});
}
Next we look at KafkaSourceFetcherManager. This class is responsible for committing offsets through the KafkaConsumer; the logic is in its commitOffsets method:
public void commitOffsets(
Map<TopicPartition, OffsetAndMetadata> offsetsToCommit, OffsetCommitCallback callback) {
LOG.debug("Committing offsets {}", offsetsToCommit);
// if there are no offsets to commit, return
if (offsetsToCommit.isEmpty()) {
return;
}
// get a running SplitFetcher
SplitFetcher<ConsumerRecord<byte[], byte[]>, KafkaPartitionSplit> splitFetcher =
fetchers.get(0);
if (splitFetcher != null) {
// The fetcher thread is still running. This should be the majority of the cases.
// if a fetcher is still running, create an offset-commit task and enqueue it
enqueueOffsetsCommitTask(splitFetcher, offsetsToCommit, callback);
} else {
// if no SplitFetcher is running, create a new one,
// enqueue the task the same way as above, and then start the fetcher
splitFetcher = createSplitFetcher();
enqueueOffsetsCommitTask(splitFetcher, offsetsToCommit, callback);
startFetcher(splitFetcher);
}
}
Let's continue with the method that creates the offset-commit task:
private void enqueueOffsetsCommitTask(
SplitFetcher<ConsumerRecord<byte[], byte[]>, KafkaPartitionSplit> splitFetcher,
Map<TopicPartition, OffsetAndMetadata> offsetsToCommit,
OffsetCommitCallback callback) {
// get the KafkaPartitionSplitReader behind this splitFetcher
KafkaPartitionSplitReader kafkaReader =
(KafkaPartitionSplitReader) splitFetcher.getSplitReader();
// enqueue a SplitFetcherTask for the fetcher
splitFetcher.enqueueTask(
new SplitFetcherTask() {
@Override
public boolean run() throws IOException {
kafkaReader.notifyCheckpointComplete(offsetsToCommit, callback);
return true;
}
@Override
public void wakeUp() {}
});
}
At this point a SplitFetcherTask has been added to the SplitFetcher's taskQueue. As concluded in the "Data Reading Flow" section above, the SplitFetcher takes the queued tasks out of the taskQueue one by one in runOnce and executes them. When it picks up this SplitFetcherTask it runs its run method, which calls kafkaReader.notifyCheckpointComplete. That method in turn calls commitAsync, the KafkaConsumer's asynchronous offset-commit method.
public void notifyCheckpointComplete(
Map<TopicPartition, OffsetAndMetadata> offsetsToCommit,
OffsetCommitCallback offsetCommitCallback) {
consumer.commitAsync(offsetsToCommit, offsetCommitCallback);
}
This concludes the analysis of how KafkaSource commits offsets on checkpoints.
This post is the author's original work. Discussion and corrections are welcome. Please credit the source when reposting.