Flink Source Code: KafkaSource

Flink Source Code Analysis Series

Please see: Flink Source Code Analysis Series Table of Contents

Preface

FLIP-27: Refactor Source Interface - Apache Flink - Apache Software Foundation proposed a new Source architecture; for an analysis of that architecture, see Flink Source Code: The New Source Architecture. To go with it, the Flink community introduced a new Kafka connector, KafkaSource. The old implementation, FlinkKafkaConsumer, is now marked Deprecated and no longer recommended. This post walks through the KafkaSource source code.

This post covers four areas of the source code:

  • Creating the KafkaSource
  • Reading data
  • Partition discovery
  • Checkpointing

Creating the KafkaSource

As shown in the official documentation, a Flink job that consumes from Kafka can create a KafkaSource as follows:

KafkaSource<String> source = KafkaSource.<String>builder()
    .setBootstrapServers(brokers)
    .setTopics("input-topic")
    .setGroupId("my-group")
    .setStartingOffsets(OffsetsInitializer.earliest())
    .setValueOnlyDeserializer(new SimpleStringSchema())
    .build();

env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source");

env.fromSource produces a DataStreamSource. The DataStreamSource corresponds to a SourceTransformation, which SourceTransformationTranslator translates into a Source node of the StreamGraph. At execution time this node runs as a SourceOperator, the operator backing the new Source API. It interacts directly with the SourceReader and calls sourceReader.pollNext to pull data. This chain of logic has little to do with Kafka itself, so we only mention it here and do not go into detail.

Finally, KafkaSourceBuilder assembles the parameters we configured and returns a matching KafkaSource object.

public KafkaSource<OUT> build() {
    sanityCheck();
    parseAndSetRequiredProperties();
    return new KafkaSource<>(
            subscriber,
            startingOffsetsInitializer,
            stoppingOffsetsInitializer,
            boundedness,
            deserializationSchema,
            props);
}

KafkaSource's createReader method creates the KafkaSourceReader. The code is as follows:

@Internal
@Override
public SourceReader<OUT, KafkaPartitionSplit> createReader(SourceReaderContext readerContext)
        throws Exception {
    return createReader(readerContext, (ignore) -> {});
}

@VisibleForTesting
SourceReader<OUT, KafkaPartitionSplit> createReader(
        SourceReaderContext readerContext, Consumer<Collection<String>> splitFinishedHook)
        throws Exception {
    // elementsQueue stores the ConsumerRecords obtained from the fetcher
    // the reader consumes the buffered Kafka records from this queue
    FutureCompletingBlockingQueue<RecordsWithSplitIds<ConsumerRecord<byte[], byte[]>>>
            elementsQueue = new FutureCompletingBlockingQueue<>();
    // initialize the deserializationSchema
    deserializationSchema.open(
            new DeserializationSchema.InitializationContext() {
                @Override
                public MetricGroup getMetricGroup() {
                    return readerContext.metricGroup().addGroup("deserializer");
                }

                @Override
                public UserCodeClassLoader getUserCodeClassLoader() {
                    return readerContext.getUserCodeClassLoader();
                }
            });
    // create the Kafka source reader metrics
    final KafkaSourceReaderMetrics kafkaSourceReaderMetrics =
            new KafkaSourceReaderMetrics(readerContext.metricGroup());

    // a supplier that creates KafkaPartitionSplitReader instances, which read Kafka records per partition split
    Supplier<KafkaPartitionSplitReader> splitReaderSupplier =
            () -> new KafkaPartitionSplitReader(props, readerContext, kafkaSourceReaderMetrics);
    KafkaRecordEmitter<OUT> recordEmitter = new KafkaRecordEmitter<>(deserializationSchema);

    return new KafkaSourceReader<>(
            elementsQueue,
            new KafkaSourceFetcherManager(
                    elementsQueue, splitReaderSupplier::get, splitFinishedHook),
            recordEmitter,
            toConfiguration(props),
            readerContext,
            kafkaSourceReaderMetrics);
}

Data Reading Flow

KafkaSourceFetcherManager extends SingleThreadFetcherManager. When splits are discovered, it fetches a running SplitFetcher and assigns the splits to it; if no fetcher is running, it creates a new one.

@Override
// this method is called when new splits are discovered
// it assigns the splits to a fetcher
public void addSplits(List<SplitT> splitsToAdd) {
    SplitFetcher<E, SplitT> fetcher = getRunningFetcher();
    if (fetcher == null) {
        fetcher = createSplitFetcher();
        // Add the splits to the fetchers.
        fetcher.addSplits(splitsToAdd);
        startFetcher(fetcher);
    } else {
        fetcher.addSplits(splitsToAdd);
    }
}

Next, let's look at how the fetcher pulls data. The startFetcher method above starts the SplitFetcher thread.

protected void startFetcher(SplitFetcher<E, SplitT> fetcher) {
    executors.submit(fetcher);
}

SplitFetcher executes the tasks that pull data from the external system; it keeps running SplitFetcherTasks in a loop. SplitFetcherTask has several subclasses:

  • AddSplitsTask: assigns splits to the reader
  • PauseOrResumeSplitsTask: pauses or resumes reading of splits
  • FetchTask: pulls data into the elementsQueue

Next, let's analyze SplitFetcher's run method:

@Override
public void run() {
    LOG.info("Starting split fetcher {}", id);
    try {
        // run runOnce in a loop
        while (runOnce()) {
            // nothing to do, everything is inside #runOnce.
        }
    } catch (Throwable t) {
        errorHandler.accept(t);
    } finally {
        try {
            splitReader.close();
        } catch (Exception e) {
            errorHandler.accept(e);
        } finally {
            LOG.info("Split fetcher {} exited.", id);
            // This executes after possible errorHandler.accept(t). If these operations bear
            // a happens-before relation, then we can checking side effect of
            // errorHandler.accept(t)
            // to know whether it happened after observing side effect of shutdownHook.run().
            shutdownHook.run();
        }
    }
}

boolean runOnce() {
    // first blocking call = get next task. blocks only if there are no active splits and queued
    // tasks.
    SplitFetcherTask task;
    lock.lock();
    try {
        if (closed) {
            return false;
        }

        // the important logic is here:
        // take a task from the taskQueue
        // if there are queued tasks, run them first
        // if the taskQueue is empty, check whether any splits are assigned; if so, return a FetchTask
        // the FetchTask is created when the SplitFetcher is constructed
        task = getNextTaskUnsafe();
        if (task == null) {
            // (spurious) wakeup, so just repeat
            return true;
        }

        LOG.debug("Prepare to run {}", task);
        // store task for #wakeUp
        this.runningTask = task;
    } finally {
        lock.unlock();
    }

    // execute the task outside of lock, so that it can be woken up
    boolean taskFinished;
    try {
        // execute the task's run method
        taskFinished = task.run();
    } catch (Exception e) {
        throw new RuntimeException(
                String.format(
                        "SplitFetcher thread %d received unexpected exception while polling the records",
                        id),
                e);
    }

    // re-acquire lock as all post-processing steps, need it
    lock.lock();
    try {
        this.runningTask = null;
        processTaskResultUnsafe(task, taskFinished);
    } finally {
        lock.unlock();
    }
    return true;
}

The SplitFetcherTask subclass that pulls data is FetchTask. Its run method is shown below:

@Override
public boolean run() throws IOException {
    try {
        // skip this round if the fetcher has been woken up
        if (!isWakenUp() && lastRecords == null) {
            // call the splitReader to fetch a batch of records from the splits
            lastRecords = splitReader.fetch();
        }

        if (!isWakenUp()) {
            // The order matters here. We must first put the last records into the queue.
            // This ensures the handling of the fetched records is atomic to wakeup.
            // put the fetched records into the elementsQueue
            if (elementsQueue.put(fetcherIndex, lastRecords)) {
                // if some splits have been fully consumed
                if (!lastRecords.finishedSplits().isEmpty()) {
                    // The callback does not throw InterruptedException.
                    // invoke the split-finished callback
                    splitFinishedCallback.accept(lastRecords.finishedSplits());
                }
                lastRecords = null;
            }
        }
    } catch (InterruptedException e) {
        // this should only happen on shutdown
        throw new IOException("Source fetch execution was interrupted", e);
    } finally {
        // clean up the potential wakeup effect. It is possible that the fetcher is waken up
        // after the clean up. In that case, either the wakeup flag will be set or the
        // running thread will be interrupted. The next invocation of run() will see that and
        // just skip.
        if (isWakenUp()) {
            wakeup = false;
        }
    }
    // The return value of fetch task does not matter.
    return true;
}

splitReader.fetch() in the snippet above corresponds to KafkaPartitionSplitReader's fetch method.

@Override
public RecordsWithSplitIds<ConsumerRecord<byte[], byte[]>> fetch() throws IOException {
    ConsumerRecords<byte[], byte[]> consumerRecords;
    try {
        // call KafkaConsumer to poll a batch of records, with a 10 s timeout
        consumerRecords = consumer.poll(Duration.ofMillis(POLL_TIMEOUT));
    } catch (WakeupException | IllegalStateException e) {
        // IllegalStateException will be thrown if the consumer is not assigned any partitions.
        // This happens if all assigned partitions are invalid or empty (starting offset >=
        // stopping offset). We just mark empty partitions as finished and return an empty
        // record container, and this consumer will be closed by SplitFetcherManager.
        // as the comment above says, an IllegalStateException is thrown if the consumer has no partitions assigned
        // this also happens when all assigned partitions are invalid or empty (starting offset >= stopping offset)
        // return an empty KafkaPartitionSplitRecords and mark the empty partitions as finished
        // this consumer will later be closed by the SplitFetcherManager
        KafkaPartitionSplitRecords recordsBySplits =
                new KafkaPartitionSplitRecords(
                        ConsumerRecords.empty(), kafkaSourceReaderMetrics);
        markEmptySplitsAsFinished(recordsBySplits);
        return recordsBySplits;
    }
    // wrap the consumerRecords in a KafkaPartitionSplitRecords and return it
    // KafkaPartitionSplitRecords exposes two iterators: one over partitions and one over records
    KafkaPartitionSplitRecords recordsBySplits =
            new KafkaPartitionSplitRecords(consumerRecords, kafkaSourceReaderMetrics);
    List<TopicPartition> finishedPartitions = new ArrayList<>();
    // iterate over the partitions in consumerRecords
    for (TopicPartition tp : consumerRecords.partitions()) {
        // get the stopping offset of this partition
        long stoppingOffset = getStoppingOffset(tp);
        // get all records fetched for this partition
        final List<ConsumerRecord<byte[], byte[]>> recordsFromPartition =
                consumerRecords.records(tp);

        // if any records were fetched
        if (recordsFromPartition.size() > 0) {
            // get the last record fetched from this partition
            final ConsumerRecord<byte[], byte[]> lastRecord =
                    recordsFromPartition.get(recordsFromPartition.size() - 1);

            // After processing a record with offset of "stoppingOffset - 1", the split reader
            // should not continue fetching because the record with stoppingOffset may not
            // exist. Keep polling will just block forever.
            // if the offset of the last record is >= stoppingOffset - 1
            // (the stopping offset is obtained e.g. via the consumer's endOffsets method),
            // set the stopping offset on recordsBySplits
            // and mark this split as finished
            if (lastRecord.offset() >= stoppingOffset - 1) {
                recordsBySplits.setPartitionStoppingOffset(tp, stoppingOffset);
                finishSplitAtRecord(
                        tp,
                        stoppingOffset,
                        lastRecord.offset(),
                        finishedPartitions,
                        recordsBySplits);
            }
        }
        // Track this partition's record lag if it never appears before
        // add the Kafka records-lag metric for this partition
        kafkaSourceReaderMetrics.maybeAddRecordsLagMetric(consumer, tp);
    }

    // mark empty splits as finished
    markEmptySplitsAsFinished(recordsBySplits);

    // Unassign the partitions that has finished.
    // stop tracking record lag for finished partitions
    // and unassign those partitions
    if (!finishedPartitions.isEmpty()) {
        finishedPartitions.forEach(kafkaSourceReaderMetrics::removeRecordsLagMetric);
        unassignPartitions(finishedPartitions);
    }

    // Update numBytesIn
    // update the numBytesIn metric
    kafkaSourceReaderMetrics.updateNumBytesInCounter();

    return recordsBySplits;
}

So far we have covered how records are read from the KafkaConsumer and placed into the elementsQueue. Next comes the part where Flink reads the data back out of the elementsQueue and sends it downstream.

SourceReaderBase reads the data out of the elementsQueue and hands it to the recordEmitter.

SourceReaderBase's getNextFetch method looks like this:

@Nullable
private RecordsWithSplitIds<E> getNextFetch(final ReaderOutput<T> output) {
    splitFetcherManager.checkErrors();

    LOG.trace("Getting next source data batch from queue");
    // take a batch of data from the elementsQueue
    final RecordsWithSplitIds<E> recordsWithSplitId = elementsQueue.poll();
    // if the current split produced no data and there is no next split, return null
    if (recordsWithSplitId == null || !moveToNextSplit(recordsWithSplitId, output)) {
        // No element available, set to available later if needed.
        return null;
    }

    currentFetch = recordsWithSplitId;
    return recordsWithSplitId;
}

getNextFetch is called from pollNext. The SourceOperator keeps calling the reader's pollNext method to pull data, which is handed to the recordEmitter and sent downstream.

@Override
public InputStatus pollNext(ReaderOutput<T> output) throws Exception {
    // make sure we have a fetch we are working on, or move to the next
    RecordsWithSplitIds<E> recordsWithSplitId = this.currentFetch;
    if (recordsWithSplitId == null) {
        recordsWithSplitId = getNextFetch(output);
        if (recordsWithSplitId == null) {
            return trace(finishedOrAvailableLater());
        }
    }

    // we need to loop here, because we may have to go across splits
    while (true) {
        // Process one record.
        final E record = recordsWithSplitId.nextRecordFromSplit();
        if (record != null) {
            // emit the record.
            numRecordsInCounter.inc(1);
            recordEmitter.emitRecord(record, currentSplitOutput, currentSplitContext.state);
            LOG.trace("Emitted record: {}", record);

            // We always emit MORE_AVAILABLE here, even though we do not strictly know whether
            // more is available. If nothing more is available, the next invocation will find
            // this out and return the correct status.
            // That means we emit the occasional 'false positive' for availability, but this
            // saves us doing checks for every record. Ultimately, this is cheaper.
            return trace(InputStatus.MORE_AVAILABLE);
        } else if (!moveToNextSplit(recordsWithSplitId, output)) {
            // The fetch is done and we just discovered that and have not emitted anything, yet.
            // We need to move to the next fetch. As a shortcut, we call pollNext() here again,
            // rather than emitting nothing and waiting for the caller to call us again.
            return pollNext(output);
        }
    }
}

Finally we arrive at KafkaRecordEmitter's emitRecord method. It deserializes the received Kafka records one by one and emits them to the downstream output, which passes them on to the downstream operators.

@Override
public void emitRecord(
        ConsumerRecord<byte[], byte[]> consumerRecord,
        SourceOutput<T> output,
        KafkaPartitionSplitState splitState)
        throws Exception {
    try {
        sourceOutputWrapper.setSourceOutput(output);
        sourceOutputWrapper.setTimestamp(consumerRecord.timestamp());
        deserializationSchema.deserialize(consumerRecord, sourceOutputWrapper);
        splitState.setCurrentOffset(consumerRecord.offset() + 1);
    } catch (Exception e) {
        throw new IOException("Failed to deserialize consumer record due to", e);
    }
}

Partition Discovery

Flink's KafkaSource supports dynamic partition discovery: according to the configured subscription rule (a topic list, a topic regex pattern, or explicitly specified partitions), it can periodically scan Kafka partitions with a scheduled task.
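
To enable periodic discovery in a job, the interval is passed as a source property on the builder. A minimal sketch, assuming the property key partition.discovery.interval.ms (the key used by KafkaSourceOptions) and an arbitrary 10-second interval:

KafkaSource<String> source = KafkaSource.<String>builder()
        .setBootstrapServers(brokers)
        .setTopics("input-topic")
        .setGroupId("my-group")
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setValueOnlyDeserializer(new SimpleStringSchema())
        // assumption: rescan partitions every 10 s; a non-positive value disables periodic discovery
        .setProperty("partition.discovery.interval.ms", "10000")
        .build();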

KafkaSourceEnumerator's start method creates a Kafka AdminClient. Then, depending on partitionDiscoveryIntervalMs (the partition discovery interval), it decides whether to invoke the partition discovery logic periodically.

@Override
public void start() {
    // create the Kafka admin client
    adminClient = getKafkaAdminClient();
    // if a partition discovery interval is configured
    if (partitionDiscoveryIntervalMs > 0) {
        LOG.info(
                "Starting the KafkaSourceEnumerator for consumer group {} "
                        + "with partition discovery interval of {} ms.",
                consumerGroupId,
                partitionDiscoveryIntervalMs);
        // periodically invoke getSubscribedTopicPartitions and checkPartitionChanges
        context.callAsync(
                this::getSubscribedTopicPartitions,
                this::checkPartitionChanges,
                0,
                partitionDiscoveryIntervalMs);
    } else {
        // otherwise invoke them only once, at startup
        LOG.info(
                "Starting the KafkaSourceEnumerator for consumer group {} "
                        + "without periodic partition discovery.",
                consumerGroupId);
        context.callAsync(this::getSubscribedTopicPartitions, this::checkPartitionChanges);
    }
}

The getSubscribedTopicPartitions method:

private Set<TopicPartition> getSubscribedTopicPartitions() {
    return subscriber.getSubscribedTopicPartitions(adminClient);
}

This method delegates to the KafkaSubscriber, which returns the subscribed partitions according to the configured rule.

KafkaSubscriber has three subclasses, each corresponding to a different partition discovery rule (a builder sketch follows the list):

  • PartitionSetSubscriber: created via KafkaSourceBuilder's setPartitions method; subscribes directly to the named partitions.
  • TopicListSubscriber: derives the subscribed partitions from a list of topics. This is the subscriber used when a set of topics is subscribed with KafkaSourceBuilder's setTopics.
  • TopicPatternSubscriber: derives the subscribed partitions by matching topic names against a regular expression; it is created when KafkaSourceBuilder's setTopicPattern method is used.
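
A minimal sketch of how each subscriber is selected through the builder (topic names, partition numbers and the pattern below are placeholders):

// TopicListSubscriber: subscribe to a fixed list of topics
KafkaSource.<String>builder()
        .setTopics("topic-a", "topic-b");

// PartitionSetSubscriber: subscribe to explicitly named partitions
KafkaSource.<String>builder()
        .setPartitions(new HashSet<>(Arrays.asList(
                new TopicPartition("topic-a", 0),
                new TopicPartition("topic-a", 1))));

// TopicPatternSubscriber: subscribe to every topic whose name matches a regex
KafkaSource.<String>builder()
        .setTopicPattern(Pattern.compile("topic-.*"));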

Next, let's take TopicListSubscriber as an example and analyze how the subscribed partitions are obtained.

@Override
public Set<TopicPartition> getSubscribedTopicPartitions(AdminClient adminClient) {
    LOG.debug("Fetching descriptions for topics: {}", topics);
    // use the admin client to read the Kafka topic metadata,
    // which contains the partition info for the requested topics
    final Map<String, TopicDescription> topicMetadata =
            getTopicMetadata(adminClient, new HashSet<>(topics));

    // add every partition of every topic to the subscribedPartitions set and return it
    Set<TopicPartition> subscribedPartitions = new HashSet<>();
    for (TopicDescription topic : topicMetadata.values()) {
        for (TopicPartitionInfo partition : topic.partitions()) {
            subscribedPartitions.add(new TopicPartition(topic.name(), partition.partition()));
        }
    }

    return subscribedPartitions;
}

The logic for obtaining the subscribed partitions is not particularly complex; the other two subscribers are not analyzed here.

The return value of getSubscribedTopicPartitions, together with the exception if one was thrown, is passed to the checkPartitionChanges method, which detects whether the partition set has changed. The code is as follows:

private void checkPartitionChanges(Set<TopicPartition> fetchedPartitions, Throwable t) {
    if (t != null) {
        throw new FlinkRuntimeException(
                "Failed to list subscribed topic partitions due to ", t);
    }
    // detect partition changes
    final PartitionChange partitionChange = getPartitionChange(fetchedPartitions);
    // if nothing changed, return immediately
    if (partitionChange.isEmpty()) {
        return;
    }
    // if a change is detected, invoke initializePartitionSplits and handlePartitionSplitChanges
    context.callAsync(
            () -> initializePartitionSplits(partitionChange),
            this::handlePartitionSplitChanges);
}

@VisibleForTesting
PartitionChange getPartitionChange(Set<TopicPartition> fetchedPartitions) {
    // holds the partitions that were removed
    final Set<TopicPartition> removedPartitions = new HashSet<>();
    Consumer<TopicPartition> dedupOrMarkAsRemoved =
            (tp) -> {
                if (!fetchedPartitions.remove(tp)) {
                    removedPartitions.add(tp);
                }
            };
    // a partition that exists in assignedPartitions (already assigned) but not in fetchedPartitions has been removed,
    // so add it to removedPartitions
    assignedPartitions.forEach(dedupOrMarkAsRemoved);
    // pendingPartitionSplitAssignment holds partitions discovered in a previous round but not yet assigned to readers
    // find removed partitions among them as well
    pendingPartitionSplitAssignment.forEach(
            (reader, splits) ->
                    splits.forEach(
                            split -> dedupOrMarkAsRemoved.accept(split.getTopicPartition())));

    // any partitions still left in fetchedPartitions are newly discovered
    if (!fetchedPartitions.isEmpty()) {
        LOG.info("Discovered new partitions: {}", fetchedPartitions);
    }
    if (!removedPartitions.isEmpty()) {
        LOG.info("Discovered removed partitions: {}", removedPartitions);
    }

    // wrap the new and removed partitions in a PartitionChange and return it
    return new PartitionChange(fetchedPartitions, removedPartitions);
}

Having compared the newly discovered partitions with the previously subscribed ones, the enumerator next has to react to the changes.

The initializePartitionSplits method wraps the partition changes into a PartitionSplitChange. This object records the newly added and the removed partitions. Unlike PartitionChange, the new partitions in PartitionSplitChange are of type KafkaPartitionSplit, which additionally stores each partition's starting and stopping offsets.

private PartitionSplitChange initializePartitionSplits(PartitionChange partitionChange) {
    // get the newly added partitions
    Set<TopicPartition> newPartitions =
            Collections.unmodifiableSet(partitionChange.getNewPartitions());
    // get the partition offsets retriever
    OffsetsInitializer.PartitionOffsetsRetriever offsetsRetriever = getOffsetsRetriever();

    // get the starting offsets
    Map<TopicPartition, Long> startingOffsets =
            startingOffsetInitializer.getPartitionOffsets(newPartitions, offsetsRetriever);
    // get the stopping offsets
    Map<TopicPartition, Long> stoppingOffsets =
            stoppingOffsetInitializer.getPartitionOffsets(newPartitions, offsetsRetriever);

    Set<KafkaPartitionSplit> partitionSplits = new HashSet<>(newPartitions.size());
    // wrap each partition together with its starting and stopping offset
    for (TopicPartition tp : newPartitions) {
        Long startingOffset = startingOffsets.get(tp);
        long stoppingOffset =
                stoppingOffsets.getOrDefault(tp, KafkaPartitionSplit.NO_STOPPING_OFFSET);
        partitionSplits.add(new KafkaPartitionSplit(tp, startingOffset, stoppingOffset));
    }
    // return the result
    return new PartitionSplitChange(partitionSplits, partitionChange.getRemovedPartitions());
}

The key logic of this method is obtaining each partition's starting offset (startingOffsetInitializer) and stopping offset (stoppingOffsetInitializer).

startingOffsetInitializer is created in KafkaSourceBuilder and defaults to OffsetsInitializer.earliest(). The code is:

static OffsetsInitializer earliest() {
    return new ReaderHandledOffsetsInitializer(
            KafkaPartitionSplit.EARLIEST_OFFSET, OffsetResetStrategy.EARLIEST);
}

It creates a ReaderHandledOffsetsInitializer object, meaning that every newly discovered topic partition is read from its very beginning.

ReaderHandledOffsetsInitializer's getPartitionOffsets method is shown below. It sets the offset of every partition to startingOffset, which in the scenario above is KafkaPartitionSplit.EARLIEST_OFFSET.

@Override
public Map<TopicPartition, Long> getPartitionOffsets(
        Collection<TopicPartition> partitions,
        PartitionOffsetsRetriever partitionOffsetsRetriever) {
    Map<TopicPartition, Long> initialOffsets = new HashMap<>();
    for (TopicPartition tp : partitions) {
        initialOffsets.put(tp, startingOffset);
    }
    return initialOffsets;
}

As for stoppingOffsetInitializer, KafkaSourceBuilder creates a NoStoppingOffsetsInitializer by default, meaning there is no stopping offset; this is used for unbounded Kafka streams. Its code is minimal and is not analyzed here.
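
Both initializers can be overridden when building the source. A brief sketch of the builder calls involved (the strategies chosen here are only examples):

KafkaSource.<String>builder()
        // start from the committed group offsets, falling back to earliest if none exist
        .setStartingOffsets(OffsetsInitializer.committedOffsets(OffsetResetStrategy.EARLIEST))
        // bounded mode: stop at the latest offsets observed when the job starts
        .setBounded(OffsetsInitializer.latest());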

Let's return to handlePartitionSplitChanges, the method that reacts to partition changes. It adds the newly discovered partitions to the pending assignments and then assigns them to the registered readers.

private void handlePartitionSplitChanges(
        PartitionSplitChange partitionSplitChange, Throwable t) {
    if (t != null) {
        throw new FlinkRuntimeException("Failed to initialize partition splits due to ", t);
    }
    if (partitionDiscoveryIntervalMs < 0) {
        LOG.debug("Partition discovery is disabled.");
        noMoreNewPartitionSplits = true;
    }
    // TODO: Handle removed partitions.
    addPartitionSplitChangeToPendingAssignments(partitionSplitChange.newPartitionSplits);
    assignPendingPartitionSplits(context.registeredReaders().keySet());
}

addPartitionSplitChangeToPendingAssignments adds the partitions to the pending (to-be-read) set.

private void addPartitionSplitChangeToPendingAssignments(
        Collection<KafkaPartitionSplit> newPartitionSplits) {
    int numReaders = context.currentParallelism();
    for (KafkaPartitionSplit split : newPartitionSplits) {
        // distribute these partitions evenly across all readers
        int ownerReader = getSplitOwner(split.getTopicPartition(), numReaders);
        pendingPartitionSplitAssignment
                .computeIfAbsent(ownerReader, r -> new HashSet<>())
                .add(split);
    }
    LOG.debug(
            "Assigned {} to {} readers of consumer group {}.",
            newPartitionSplits,
            numReaders,
            consumerGroupId);
}

The assignPendingPartitionSplits method assigns splits to readers. Its logic is analyzed below:

private void assignPendingPartitionSplits(Set<Integer> pendingReaders) {
    Map<Integer, List<KafkaPartitionSplit>> incrementalAssignment = new HashMap<>();

    // Check if there's any pending splits for given readers
    for (int pendingReader : pendingReaders) {
        // check that the reader has registered with the SourceCoordinator
        checkReaderRegistered(pendingReader);

        // Remove pending assignment for the reader
        // take all splits assigned to this reader and remove them from pendingPartitionSplitAssignment
        final Set<KafkaPartitionSplit> pendingAssignmentForReader =
                pendingPartitionSplitAssignment.remove(pendingReader);

        // if there are splits for this reader, add them to incrementalAssignment
        if (pendingAssignmentForReader != null && !pendingAssignmentForReader.isEmpty()) {
            // Put pending assignment into incremental assignment
            incrementalAssignment
                    .computeIfAbsent(pendingReader, (ignored) -> new ArrayList<>())
                    .addAll(pendingAssignmentForReader);

            // Mark pending partitions as already assigned
            // mark these partitions as assigned
            pendingAssignmentForReader.forEach(
                    split -> assignedPartitions.add(split.getTopicPartition()));
        }
    }

    // Assign pending splits to readers
    // assign these splits to the readers
    if (!incrementalAssignment.isEmpty()) {
        LOG.info("Assigning splits to readers {}", incrementalAssignment);
        context.assignSplits(new SplitsAssignment<>(incrementalAssignment));
    }

    // If periodically partition discovery is disabled and the initializing discovery has done,
    // signal NoMoreSplitsEvent to pending readers
    // if there are no more new splits (partition discovery is disabled) and the source is bounded,
    // signal the readers that there are no more splits (signalNoMoreSplits)
    if (noMoreNewPartitionSplits && boundedness == Boundedness.BOUNDED) {
        LOG.debug(
                "No more KafkaPartitionSplits to assign. Sending NoMoreSplitsEvent to reader {}"
                        + " in consumer group {}.",
                pendingReaders,
                consumerGroupId);
        pendingReaders.forEach(context::signalNoMoreSplits);
    }
}

assignPendingPartitionSplits is called from three places:

  • addSplitsBack: a reader failed, and the splits assigned to it after the last successful checkpoint have to be added back to the SplitEnumerator.
  • addReader: a new reader is added and needs splits assigned to it.
  • handlePartitionSplitChanges: as described above, newly discovered partitions need to be assigned to readers.

The next question is how these splits are delivered from the enumerator to the readers. Let's expand the context.assignSplits call; here the context implementation is SourceCoordinatorContext. Its assignSplits method looks like this:

@Override
public void assignSplits(SplitsAssignment<SplitT> assignment) {
    // Ensure the split assignment is done by the coordinator executor.
    // run inside the SourceCoordinator thread
    callInCoordinatorThread(
            () -> {
                // Ensure all the subtasks in the assignment have registered.
                // check, one by one, that the reader owning each split to be assigned has registered
                assignment
                        .assignment()
                        .forEach(
                                (id, splits) -> {
                                    if (!registeredReaders.containsKey(id)) {
                                        throw new IllegalArgumentException(
                                                String.format(
                                                        "Cannot assign splits %s to subtask %d because the subtask is not registered.",
                                                        splits, id));
                                    }
                                });

                // record the assignment (add it to the set of assignments not yet checkpointed)
                assignmentTracker.recordSplitAssignment(assignment);
                // assign the splits
                assignSplitsToAttempts(assignment);
                return null;
            },
            String.format("Failed to assign splits %s due to ", assignment));
}

assignSplitsToAttempts has several overloads. Following them to the end, it creates an AddSplitEvent and sends this event to the SourceOperator through the OperatorCoordinator. The code is shown below:

private void assignSplitsToAttempts(SplitsAssignment<SplitT> assignment) {
    assignment.assignment().forEach((index, splits) -> assignSplitsToAttempts(index, splits));
}

private void assignSplitsToAttempts(int subtaskIndex, List<SplitT> splits) {
    getRegisteredAttempts(subtaskIndex)
            .forEach(attempt -> assignSplitsToAttempt(subtaskIndex, attempt, splits));
}

private void assignSplitsToAttempt(int subtaskIndex, int attemptNumber, List<SplitT> splits) {
    if (splits.isEmpty()) {
        return;
    }

    checkAttemptReaderReady(subtaskIndex, attemptNumber);

    final AddSplitEvent<SplitT> addSplitEvent;
    try {
        // create an AddSplitEvent (add-split event)
        addSplitEvent = new AddSplitEvent<>(splits, splitSerializer);
    } catch (IOException e) {
        throw new FlinkRuntimeException("Failed to serialize splits.", e);
    }

    final OperatorCoordinator.SubtaskGateway gateway =
            subtaskGateways.getGatewayAndCheckReady(subtaskIndex, attemptNumber);
    // send the event to the SourceOperator corresponding to subtaskIndex
    gateway.sendEvent(addSplitEvent);
}

gateway.sendEvent() -> SourceOperator::handleOperatorEvent

The network communication in between is not analyzed here. Let's look at handleOperatorEvent, the method with which the SourceOperator receives events:

public void handleOperatorEvent(OperatorEvent event) {
    if (event instanceof WatermarkAlignmentEvent) {
        updateMaxDesiredWatermark((WatermarkAlignmentEvent) event);
        checkWatermarkAlignment();
        checkSplitWatermarkAlignment();
    } else if (event instanceof AddSplitEvent) {
        handleAddSplitsEvent(((AddSplitEvent<SplitT>) event));
    } else if (event instanceof SourceEventWrapper) {
        sourceReader.handleSourceEvents(((SourceEventWrapper) event).getSourceEvent());
    } else if (event instanceof NoMoreSplitsEvent) {
        sourceReader.notifyNoMoreSplits();
    } else {
        throw new IllegalStateException("Received unexpected operator event " + event);
    }
}

If the received event is an AddSplitEvent, handleAddSplitsEvent is called. Its analysis follows:

private void handleAddSplitsEvent(AddSplitEvent<SplitT> event) {
    try {
        // deserialize the split information
        List<SplitT> newSplits = event.splits(splitSerializer);
        numSplits += newSplits.size();
        // if the downstream output has not been initialized yet, buffer the splits in the pending list;
        // otherwise create outputs and assign the splits to them
        if (operatingMode == OperatingMode.OUTPUT_NOT_INITIALIZED) {
            // For splits arrived before the main output is initialized, store them into the
            // pending list. Outputs of these splits will be created once the main output is
            // ready.
            outputPendingSplits.addAll(newSplits);
        } else {
            // Create output directly for new splits if the main output is already initialized.
            createOutputForSplits(newSplits);
        }
        // add the splits to the sourceReader
        sourceReader.addSplits(newSplits);
    } catch (IOException e) {
        throw new FlinkRuntimeException("Failed to deserialize the splits.", e);
    }
}

Finally, we follow the trail to SourceReaderBase's addSplits method.

@Override
public void addSplits(List<SplitT> splits) {
    LOG.info("Adding split(s) to reader: {}", splits);
    // Initialize the state for each split.
    splits.forEach(
            s ->
                    splitStates.put(
                            s.splitId(), new SplitContext<>(s.splitId(), initializedState(s))));
    // Hand over the splits to the split fetcher to start fetch.
    splitFetcherManager.addSplits(splits);
}

It hands the splits to the splitFetcherManager. In the KafkaSource context of this post, its implementation is KafkaSourceFetcherManager, whose addSplits method lives in the parent class SingleThreadFetcherManager.

At this point we are back at the "add splits" method from the beginning of the previous section, "Data Reading Flow". This concludes the analysis of KafkaSource's partition discovery logic.

Checkpoint Logic

KafkaSourceReader's snapshotState method returns the splits that need to be checkpointed, i.e. the splits assigned to this reader. If the user has configured commit.offsets.on.checkpoint=true, it also stores each split's partition and offset in offsetsToCommit.
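
For reference, a minimal sketch of enabling this option via a builder property (the key commit.offsets.on.checkpoint is the one named above):

KafkaSource.<String>builder()
        // commit consumer offsets back to Kafka whenever a checkpoint completes
        .setProperty("commit.offsets.on.checkpoint", "true");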

@Override
public List<KafkaPartitionSplit> snapshotState(long checkpointId) {
    // get the splits assigned to the current reader (the checkpointId parameter is not actually used)
    List<KafkaPartitionSplit> splits = super.snapshotState(checkpointId);
    // whether offsets are committed on checkpoint is controlled by
    // the commit.offsets.on.checkpoint option
    if (!commitOffsetsOnCheckpoint) {
        return splits;
    }

    // the logic below only runs when commit.offsets.on.checkpoint is enabled
    // offsetsToCommit holds the offset information that needs to be committed;
    // it is a Map<checkpointId, Map<partition, offset>> structure
    // if the current reader has no splits and no finished splits either,
    // record an empty map for this checkpoint id in offsetsToCommit
    if (splits.isEmpty() && offsetsOfFinishedSplits.isEmpty()) {
        offsetsToCommit.put(checkpointId, Collections.emptyMap());
    } else {
        // create an offsets map for the current checkpoint id and store it in offsetsToCommit
        Map<TopicPartition, OffsetAndMetadata> offsetsMap =
                offsetsToCommit.computeIfAbsent(checkpointId, id -> new HashMap<>());
        // Put the offsets of the active splits.
        // iterate over the splits, storing each split's partition and offset in offsetsMap
        for (KafkaPartitionSplit split : splits) {
            // If the checkpoint is triggered before the partition starting offsets
            // is retrieved, do not commit the offsets for those partitions.
            if (split.getStartingOffset() >= 0) {
                offsetsMap.put(
                        split.getTopicPartition(),
                        new OffsetAndMetadata(split.getStartingOffset()));
            }
        }
        // store the partition and offset of every split that has finished reading
        // Put offsets of all the finished splits.
        offsetsMap.putAll(offsetsOfFinishedSplits);
    }
    return splits;
}

The notifyCheckpointComplete method is executed when a checkpoint has completed; the completion notification is sent by the SourceCoordinator. It is in this method that the Kafka source commits its Kafka offsets.

@Override
public void notifyCheckpointComplete(long checkpointId) throws Exception {
    LOG.debug("Committing offsets for checkpoint {}", checkpointId);
    // as before, if committing offsets on checkpoint is not enabled, do nothing and return
    if (!commitOffsetsOnCheckpoint) {
        return;
    }

    // get the partition offsets that need to be committed for this checkpoint from offsetsToCommit
    Map<TopicPartition, OffsetAndMetadata> committedPartitions =
            offsetsToCommit.get(checkpointId);
    // if there is nothing to commit, return
    if (committedPartitions == null) {
        LOG.debug(
                "Offsets for checkpoint {} either do not exist or have already been committed.",
                checkpointId);
        return;
    }

    // ask the KafkaSourceFetcherManager to commit the offsets to Kafka
    // (analyzed below)
    ((KafkaSourceFetcherManager) splitFetcherManager)
            .commitOffsets(
                    committedPartitions,
                    (ignored, e) -> {
                        // The offset commit here is needed by the external monitoring. It won't
                        // break Flink job's correctness if we fail to commit the offset here.
                        // this is the offset-commit callback
                        // on error, record the failed commit in the metrics
                        if (e != null) {
                            kafkaSourceReaderMetrics.recordFailedCommit();
                            LOG.warn(
                                    "Failed to commit consumer offsets for checkpoint {}",
                                    checkpointId,
                                    e);
                        } else {
                            LOG.debug(
                                    "Successfully committed offsets for checkpoint {}",
                                    checkpointId);
                            // record the successful commit in the metrics
                            kafkaSourceReaderMetrics.recordSucceededCommit();
                            // If the finished topic partition has been committed, we remove it
                            // from the offsets of the finished splits map.
                            committedPartitions.forEach(
                                    (tp, offset) ->
                                            kafkaSourceReaderMetrics.recordCommittedOffset(
                                                    tp, offset.offset()));
                            // the offsets have been committed, so remove them from the finished-splits map
                            offsetsOfFinishedSplits
                                    .entrySet()
                                    .removeIf(
                                            entry ->
                                                    committedPartitions.containsKey(
                                                            entry.getKey()));
                            // remove the offsets of this and all earlier checkpoint ids: they have been committed and no longer need to be kept
                            while (!offsetsToCommit.isEmpty()
                                    && offsetsToCommit.firstKey() <= checkpointId) {
                                offsetsToCommit.remove(offsetsToCommit.firstKey());
                            }
                        }
                    });
}

Next we turn to KafkaSourceFetcherManager. This class is responsible for committing offsets to Kafka through the KafkaConsumer; the logic lives in the commitOffsets method:

public void commitOffsets(
        Map<TopicPartition, OffsetAndMetadata> offsetsToCommit, OffsetCommitCallback callback) {
    LOG.debug("Committing offsets {}", offsetsToCommit);
    // if there are no offsets to commit, return
    if (offsetsToCommit.isEmpty()) {
        return;
    }
    // get a running SplitFetcher
    SplitFetcher<ConsumerRecord<byte[], byte[]>, KafkaPartitionSplit> splitFetcher =
            fetchers.get(0);
    if (splitFetcher != null) {
        // The fetcher thread is still running. This should be the majority of the cases.
        // if a fetcher is still running, create an offset-commit task and enqueue it
        enqueueOffsetsCommitTask(splitFetcher, offsetsToCommit, callback);
    } else {
        // if no SplitFetcher is running, create a new one,
        // enqueue the task as above, and then start the fetcher
        splitFetcher = createSplitFetcher();
        enqueueOffsetsCommitTask(splitFetcher, offsetsToCommit, callback);
        startFetcher(splitFetcher);
    }
}

Let's continue with the method that creates the offset-commit task. The code is as follows:

private void enqueueOffsetsCommitTask(
        SplitFetcher<ConsumerRecord<byte[], byte[]>, KafkaPartitionSplit> splitFetcher,
        Map<TopicPartition, OffsetAndMetadata> offsetsToCommit,
        OffsetCommitCallback callback) {
    // get the KafkaPartitionSplitReader behind this splitFetcher
    KafkaPartitionSplitReader kafkaReader =
            (KafkaPartitionSplitReader) splitFetcher.getSplitReader();

    // enqueue a SplitFetcherTask for this fetcher
    splitFetcher.enqueueTask(
            new SplitFetcherTask() {
                @Override
                public boolean run() throws IOException {
                    kafkaReader.notifyCheckpointComplete(offsetsToCommit, callback);
                    return true;
                }

                @Override
                public void wakeUp() {}
            });
}

At this point, a SplitFetcherTask has been added to the SplitFetcher's taskQueue. As we concluded in the earlier "Data Reading Flow" section, the SplitFetcher takes the queued tasks one by one in runOnce and executes them. When it takes this SplitFetcherTask, it runs its run method, which calls kafkaReader.notifyCheckpointComplete. That method invokes KafkaConsumer's asynchronous offset commit, commitAsync:

public void notifyCheckpointComplete(
        Map<TopicPartition, OffsetAndMetadata> offsetsToCommit,
        OffsetCommitCallback offsetCommitCallback) {
    consumer.commitAsync(offsetsToCommit, offsetCommitCallback);
}

This completes the analysis of how KafkaSource commits offsets on checkpoint.

This post is the author's original work. Discussion and corrections are welcome. Please credit the source if you repost it.
