Flink Source Code: KafkaSource

Flink Source Code Analysis Series

Please see: Flink Source Code Analysis Series Table of Contents

Preface

FLIP-27: Refactor Source Interface - Apache Flink - Apache Software Foundation proposed a new Source architecture; for an analysis of that architecture, see Flink Source Code: The New Source Architecture. To go with it, the Flink community introduced a new Kafka connector, KafkaSource. The old implementation, FlinkKafkaConsumer, is now marked Deprecated and no longer recommended. This post walks through the KafkaSource source code.

This post covers four areas of the source code:

  • Creating the KafkaSource
  • Reading data
  • Partition discovery
  • Checkpointing

Creating the KafkaSource

As shown in the official documentation, a Flink job that consumes from Kafka can create a KafkaSource as follows:

KafkaSource<String> source = KafkaSource.<String>builder()
    .setBootstrapServers(brokers)
    .setTopics("input-topic")
    .setGroupId("my-group")
    .setStartingOffsets(OffsetsInitializer.earliest())
    .setValueOnlyDeserializer(new SimpleStringSchema())
    .build();

env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source");

env.fromSource produces a DataStreamSource. The DataStreamSource corresponds to a SourceTransformation, which SourceTransformationTranslator translates into a Source node of the StreamGraph. At execution time this node runs as a SourceOperator, the operator backing the new Source API. It interacts directly with the SourceReader and calls sourceReader.pollNext to pull data. This chain of logic has little to do with Kafka itself, so we only mention it here and do not go into detail.

Finally, KafkaSourceBuilder assembles the parameters we configured and returns a matching KafkaSource object.

public KafkaSource<OUT> build() {
    sanityCheck();
    parseAndSetRequiredProperties();
    return new KafkaSource<>(
            subscriber,
            startingOffsetsInitializer,
            stoppingOffsetsInitializer,
            boundedness,
            deserializationSchema,
            props);
}

KafkaSource's createReader method creates the KafkaSourceReader. The code is as follows:

@Internal
@Override
public SourceReader<OUT, KafkaPartitionSplit> createReader(SourceReaderContext readerContext)
        throws Exception {
    return createReader(readerContext, (ignore) -> {});
}

@VisibleForTesting
SourceReader<OUT, KafkaPartitionSplit> createReader(
        SourceReaderContext readerContext, Consumer<Collection<String>> splitFinishedHook)
        throws Exception {
    // elementsQueue stores the ConsumerRecords obtained from the fetcher
    // the reader consumes the buffered Kafka records from this queue
    FutureCompletingBlockingQueue<RecordsWithSplitIds<ConsumerRecord<byte[], byte[]>>>
            elementsQueue = new FutureCompletingBlockingQueue<>();
    // initialize the deserializationSchema
    deserializationSchema.open(
            new DeserializationSchema.InitializationContext() {
                @Override
                public MetricGroup getMetricGroup() {
                    return readerContext.metricGroup().addGroup("deserializer");
                }

                @Override
                public UserCodeClassLoader getUserCodeClassLoader() {
                    return readerContext.getUserCodeClassLoader();
                }
            });
    // create the Kafka source reader metrics
    final KafkaSourceReaderMetrics kafkaSourceReaderMetrics =
            new KafkaSourceReaderMetrics(readerContext.metricGroup());

    // a supplier that creates KafkaPartitionSplitReader instances, which read Kafka records per partition split
    Supplier<KafkaPartitionSplitReader> splitReaderSupplier =
            () -> new KafkaPartitionSplitReader(props, readerContext, kafkaSourceReaderMetrics);
    KafkaRecordEmitter<OUT> recordEmitter = new KafkaRecordEmitter<>(deserializationSchema);

    return new KafkaSourceReader<>(
            elementsQueue,
            new KafkaSourceFetcherManager(
                    elementsQueue, splitReaderSupplier::get, splitFinishedHook),
            recordEmitter,
            toConfiguration(props),
            readerContext,
            kafkaSourceReaderMetrics);
}

Data Reading Flow

KafkaSourceFetcherManager extends SingleThreadFetcherManager. When splits are discovered, it fetches a running SplitFetcher and assigns the splits to it; if no fetcher is running, it creates a new one.

@Override
// this method is called when new splits are discovered
// it assigns the splits to a fetcher
public void addSplits(List<SplitT> splitsToAdd) {
    SplitFetcher<E, SplitT> fetcher = getRunningFetcher();
    if (fetcher == null) {
        fetcher = createSplitFetcher();
        // Add the splits to the fetchers.
        fetcher.addSplits(splitsToAdd);
        startFetcher(fetcher);
    } else {
        fetcher.addSplits(splitsToAdd);
    }
}

Next, let's look at how the fetcher pulls data. The startFetcher method above starts the SplitFetcher thread.

protected void startFetcher(SplitFetcher<E, SplitT> fetcher) {
    executors.submit(fetcher);
}

SplitFetcher executes the tasks that pull data from the external system; it keeps running SplitFetcherTasks in a loop. SplitFetcherTask has several subclasses:

  • AddSplitsTask: assigns splits to the reader
  • PauseOrResumeSplitsTask: pauses or resumes reading of splits
  • FetchTask: pulls data into the elementsQueue

Next, let's analyze SplitFetcher's run method:

@Override
public void run() {
    LOG.info("Starting split fetcher {}", id);
    try {
        // run runOnce in a loop
        while (runOnce()) {
            // nothing to do, everything is inside #runOnce.
        }
    } catch (Throwable t) {
        errorHandler.accept(t);
    } finally {
        try {
            splitReader.close();
        } catch (Exception e) {
            errorHandler.accept(e);
        } finally {
            LOG.info("Split fetcher {} exited.", id);
            // This executes after possible errorHandler.accept(t). If these operations bear
            // a happens-before relation, then we can checking side effect of
            // errorHandler.accept(t)
            // to know whether it happened after observing side effect of shutdownHook.run().
            shutdownHook.run();
        }
    }
}

boolean runOnce() {
    // first blocking call = get next task. blocks only if there are no active splits and queued
    // tasks.
    SplitFetcherTask task;
    lock.lock();
    try {
        if (closed) {
            return false;
        }

        // the important logic is here:
        // take a task from the taskQueue
        // if there are queued tasks, run them first
        // if the taskQueue is empty, check whether any splits are assigned; if so, return a FetchTask
        // the FetchTask is created when the SplitFetcher is constructed
        task = getNextTaskUnsafe();
        if (task == null) {
            // (spurious) wakeup, so just repeat
            return true;
        }

        LOG.debug("Prepare to run {}", task);
        // store task for #wakeUp
        this.runningTask = task;
    } finally {
        lock.unlock();
    }

    // execute the task outside of lock, so that it can be woken up
    boolean taskFinished;
    try {
        // execute the task's run method
        taskFinished = task.run();
    } catch (Exception e) {
        throw new RuntimeException(
                String.format(
                        "SplitFetcher thread %d received unexpected exception while polling the records",
                        id),
                e);
    }

    // re-acquire lock as all post-processing steps, need it
    lock.lock();
    try {
        this.runningTask = null;
        processTaskResultUnsafe(task, taskFinished);
    } finally {
        lock.unlock();
    }
    return true;
}

The SplitFetcherTask subclass that pulls data is FetchTask. Its run method is shown below:

@Override
public boolean run() throws IOException {
    try {
        // skip this round if the fetcher has been woken up
        if (!isWakenUp() && lastRecords == null) {
            // call the splitReader to fetch a batch of records from the splits
            lastRecords = splitReader.fetch();
        }

        if (!isWakenUp()) {
            // The order matters here. We must first put the last records into the queue.
            // This ensures the handling of the fetched records is atomic to wakeup.
            // put the fetched records into the elementsQueue
            if (elementsQueue.put(fetcherIndex, lastRecords)) {
                // if some splits have been fully consumed
                if (!lastRecords.finishedSplits().isEmpty()) {
                    // The callback does not throw InterruptedException.
                    // invoke the split-finished callback
                    splitFinishedCallback.accept(lastRecords.finishedSplits());
                }
                lastRecords = null;
            }
        }
    } catch (InterruptedException e) {
        // this should only happen on shutdown
        throw new IOException("Source fetch execution was interrupted", e);
    } finally {
        // clean up the potential wakeup effect. It is possible that the fetcher is waken up
        // after the clean up. In that case, either the wakeup flag will be set or the
        // running thread will be interrupted. The next invocation of run() will see that and
        // just skip.
        if (isWakenUp()) {
            wakeup = false;
        }
    }
    // The return value of fetch task does not matter.
    return true;
}

splitReader.fetch() in the snippet above corresponds to KafkaPartitionSplitReader's fetch method.

@Override
public RecordsWithSplitIds<ConsumerRecord<byte[], byte[]>> fetch() throws IOException {
    ConsumerRecords<byte[], byte[]> consumerRecords;
    try {
        // call KafkaConsumer to poll a batch of records, with a 10 s timeout
        consumerRecords = consumer.poll(Duration.ofMillis(POLL_TIMEOUT));
    } catch (WakeupException | IllegalStateException e) {
        // IllegalStateException will be thrown if the consumer is not assigned any partitions.
        // This happens if all assigned partitions are invalid or empty (starting offset >=
        // stopping offset). We just mark empty partitions as finished and return an empty
        // record container, and this consumer will be closed by SplitFetcherManager.
        // as the comment above says, an IllegalStateException is thrown if the consumer has no partitions assigned
        // this also happens when all assigned partitions are invalid or empty (starting offset >= stopping offset)
        // return an empty KafkaPartitionSplitRecords and mark the empty partitions as finished
        // this consumer will later be closed by the SplitFetcherManager
        KafkaPartitionSplitRecords recordsBySplits =
                new KafkaPartitionSplitRecords(
                        ConsumerRecords.empty(), kafkaSourceReaderMetrics);
        markEmptySplitsAsFinished(recordsBySplits);
        return recordsBySplits;
    }
    // wrap the consumerRecords in a KafkaPartitionSplitRecords and return it
    // KafkaPartitionSplitRecords exposes two iterators: one over partitions and one over records
    KafkaPartitionSplitRecords recordsBySplits =
            new KafkaPartitionSplitRecords(consumerRecords, kafkaSourceReaderMetrics);
    List<TopicPartition> finishedPartitions = new ArrayList<>();
    // iterate over the partitions in consumerRecords
    for (TopicPartition tp : consumerRecords.partitions()) {
        // get the stopping offset of this partition
        long stoppingOffset = getStoppingOffset(tp);
        // get all records fetched for this partition
        final List<ConsumerRecord<byte[], byte[]>> recordsFromPartition =
                consumerRecords.records(tp);

        // if any records were fetched
        if (recordsFromPartition.size() > 0) {
            // get the last record fetched from this partition
            final ConsumerRecord<byte[], byte[]> lastRecord =
                    recordsFromPartition.get(recordsFromPartition.size() - 1);

            // After processing a record with offset of "stoppingOffset - 1", the split reader
            // should not continue fetching because the record with stoppingOffset may not
            // exist. Keep polling will just block forever.
            // if the offset of the last record is >= stoppingOffset - 1
            // (the stopping offset is obtained e.g. via the consumer's endOffsets method),
            // set the stopping offset on recordsBySplits
            // and mark this split as finished
            if (lastRecord.offset() >= stoppingOffset - 1) {
                recordsBySplits.setPartitionStoppingOffset(tp, stoppingOffset);
                finishSplitAtRecord(
                        tp,
                        stoppingOffset,
                        lastRecord.offset(),
                        finishedPartitions,
                        recordsBySplits);
            }
        }
        // Track this partition's record lag if it never appears before
        // add the Kafka records-lag metric for this partition
        kafkaSourceReaderMetrics.maybeAddRecordsLagMetric(consumer, tp);
    }

    // mark empty splits as finished
    markEmptySplitsAsFinished(recordsBySplits);

    // Unassign the partitions that has finished.
    // stop tracking record lag for finished partitions
    // and unassign those partitions
    if (!finishedPartitions.isEmpty()) {
        finishedPartitions.forEach(kafkaSourceReaderMetrics::removeRecordsLagMetric);
        unassignPartitions(finishedPartitions);
    }

    // Update numBytesIn
    // update the numBytesIn metric
    kafkaSourceReaderMetrics.updateNumBytesInCounter();

    return recordsBySplits;
}

So far we have covered how records are read from the KafkaConsumer and placed into the elementsQueue. Next comes the part where Flink reads the data back out of the elementsQueue and sends it downstream.

SourceReaderBase reads the data out of the elementsQueue and hands it to the recordEmitter.

SourceReaderBase's getNextFetch method looks like this:

@Nullable
private RecordsWithSplitIds<E> getNextFetch(final ReaderOutput<T> output) {
    splitFetcherManager.checkErrors();

    LOG.trace("Getting next source data batch from queue");
    // take a batch of data from the elementsQueue
    final RecordsWithSplitIds<E> recordsWithSplitId = elementsQueue.poll();
    // if the current split produced no data and there is no next split, return null
    if (recordsWithSplitId == null || !moveToNextSplit(recordsWithSplitId, output)) {
        // No element available, set to available later if needed.
        return null;
    }

    currentFetch = recordsWithSplitId;
    return recordsWithSplitId;
}

getNextFetch is called from pollNext. The SourceOperator keeps calling the reader's pollNext method to pull data, which is handed to the recordEmitter and sent downstream.

@Override
public InputStatus pollNext(ReaderOutput<T> output) throws Exception {
    // make sure we have a fetch we are working on, or move to the next
    RecordsWithSplitIds<E> recordsWithSplitId = this.currentFetch;
    if (recordsWithSplitId == null) {
        recordsWithSplitId = getNextFetch(output);
        if (recordsWithSplitId == null) {
            return trace(finishedOrAvailableLater());
        }
    }

    // we need to loop here, because we may have to go across splits
    while (true) {
        // Process one record.
        final E record = recordsWithSplitId.nextRecordFromSplit();
        if (record != null) {
            // emit the record.
            numRecordsInCounter.inc(1);
            recordEmitter.emitRecord(record, currentSplitOutput, currentSplitContext.state);
            LOG.trace("Emitted record: {}", record);

            // We always emit MORE_AVAILABLE here, even though we do not strictly know whether
            // more is available. If nothing more is available, the next invocation will find
            // this out and return the correct status.
            // That means we emit the occasional 'false positive' for availability, but this
            // saves us doing checks for every record. Ultimately, this is cheaper.
            return trace(InputStatus.MORE_AVAILABLE);
        } else if (!moveToNextSplit(recordsWithSplitId, output)) {
            // The fetch is done and we just discovered that and have not emitted anything, yet.
            // We need to move to the next fetch. As a shortcut, we call pollNext() here again,
            // rather than emitting nothing and waiting for the caller to call us again.
            return pollNext(output);
        }
    }
}

Finally we arrive at KafkaRecordEmitter's emitRecord method. It deserializes the received Kafka records one by one and emits them to the downstream output, which passes them on to the downstream operators.

@Override
public void emitRecord(
        ConsumerRecord<byte[], byte[]> consumerRecord,
        SourceOutput<T> output,
        KafkaPartitionSplitState splitState)
        throws Exception {
    try {
        sourceOutputWrapper.setSourceOutput(output);
        sourceOutputWrapper.setTimestamp(consumerRecord.timestamp());
        deserializationSchema.deserialize(consumerRecord, sourceOutputWrapper);
        splitState.setCurrentOffset(consumerRecord.offset() + 1);
    } catch (Exception e) {
        throw new IOException("Failed to deserialize consumer record due to", e);
    }
}

Partition Discovery

Flink's KafkaSource supports dynamic partition discovery: according to the configured subscription rule (a topic list, a topic regex pattern, or explicitly specified partitions), it can periodically scan Kafka partitions with a scheduled task.
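
To enable periodic discovery in a job, the interval is passed as a source property on the builder. A minimal sketch, assuming the property key partition.discovery.interval.ms (the key used by KafkaSourceOptions) and an arbitrary 10-second interval:

KafkaSource<String> source = KafkaSource.<String>builder()
        .setBootstrapServers(brokers)
        .setTopics("input-topic")
        .setGroupId("my-group")
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setValueOnlyDeserializer(new SimpleStringSchema())
        // assumption: rescan partitions every 10 s; a non-positive value disables periodic discovery
        .setProperty("partition.discovery.interval.ms", "10000")
        .build();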

KafkaSourceEnumerator's start method creates a Kafka AdminClient. Then, depending on partitionDiscoveryIntervalMs (the partition discovery interval), it decides whether to invoke the partition discovery logic periodically.

@Override
public void start() {
    // create the Kafka admin client
    adminClient = getKafkaAdminClient();
    // if a partition discovery interval is configured
    if (partitionDiscoveryIntervalMs > 0) {
        LOG.info(
                "Starting the KafkaSourceEnumerator for consumer group {} "
                        + "with partition discovery interval of {} ms.",
                consumerGroupId,
                partitionDiscoveryIntervalMs);
        // periodically invoke getSubscribedTopicPartitions and checkPartitionChanges
        context.callAsync(
                this::getSubscribedTopicPartitions,
                this::checkPartitionChanges,
                0,
                partitionDiscoveryIntervalMs);
    } else {
        // otherwise invoke them only once, at startup
        LOG.info(
                "Starting the KafkaSourceEnumerator for consumer group {} "
                        + "without periodic partition discovery.",
                consumerGroupId);
        context.callAsync(this::getSubscribedTopicPartitions, this::checkPartitionChanges);
    }
}

The getSubscribedTopicPartitions method:

private Set<TopicPartition> getSubscribedTopicPartitions() {
    return subscriber.getSubscribedTopicPartitions(adminClient);
}

This method delegates to the KafkaSubscriber, which returns the subscribed partitions according to the configured rule.

KafkaSubscriber has three subclasses, each corresponding to a different partition discovery rule (a builder sketch follows the list):

  • PartitionSetSubscriber: created via KafkaSourceBuilder's setPartitions method; subscribes directly to the named partitions.
  • TopicListSubscriber: derives the subscribed partitions from a list of topics. This is the subscriber used when a set of topics is subscribed with KafkaSourceBuilder's setTopics.
  • TopicPatternSubscriber: derives the subscribed partitions by matching topic names against a regular expression; it is created when KafkaSourceBuilder's setTopicPattern method is used.
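
A minimal sketch of how each subscriber is selected through the builder (topic names, partition numbers and the pattern below are placeholders):

// TopicListSubscriber: subscribe to a fixed list of topics
KafkaSource.<String>builder()
        .setTopics("topic-a", "topic-b");

// PartitionSetSubscriber: subscribe to explicitly named partitions
KafkaSource.<String>builder()
        .setPartitions(new HashSet<>(Arrays.asList(
                new TopicPartition("topic-a", 0),
                new TopicPartition("topic-a", 1))));

// TopicPatternSubscriber: subscribe to every topic whose name matches a regex
KafkaSource.<String>builder()
        .setTopicPattern(Pattern.compile("topic-.*"));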

Next, let's take TopicListSubscriber as an example and analyze how the subscribed partitions are obtained.

@Override
public Set<TopicPartition> getSubscribedTopicPartitions(AdminClient adminClient) {
    LOG.debug("Fetching descriptions for topics: {}", topics);
    // use the admin client to read the Kafka topic metadata,
    // which contains the partition info for the requested topics
    final Map<String, TopicDescription> topicMetadata =
            getTopicMetadata(adminClient, new HashSet<>(topics));

    // add every partition of every topic to the subscribedPartitions set and return it
    Set<TopicPartition> subscribedPartitions = new HashSet<>();
    for (TopicDescription topic : topicMetadata.values()) {
        for (TopicPartitionInfo partition : topic.partitions()) {
            subscribedPartitions.add(new TopicPartition(topic.name(), partition.partition()));
        }
    }

    return subscribedPartitions;
}

The logic for obtaining the subscribed partitions is not particularly complex; the other two subscribers are not analyzed here.

The return value of getSubscribedTopicPartitions, together with the exception if one was thrown, is passed to the checkPartitionChanges method, which detects whether the partition set has changed. The code is as follows:

private void checkPartitionChanges(Set<TopicPartition> fetchedPartitions, Throwable t) {
    if (t != null) {
        throw new FlinkRuntimeException(
                "Failed to list subscribed topic partitions due to ", t);
    }
    // detect partition changes
    final PartitionChange partitionChange = getPartitionChange(fetchedPartitions);
    // if nothing changed, return immediately
    if (partitionChange.isEmpty()) {
        return;
    }
    // if a change is detected, invoke initializePartitionSplits and handlePartitionSplitChanges
    context.callAsync(
            () -> initializePartitionSplits(partitionChange),
            this::handlePartitionSplitChanges);
}

@VisibleForTesting
PartitionChange getPartitionChange(Set<TopicPartition> fetchedPartitions) {
    // holds the partitions that were removed
    final Set<TopicPartition> removedPartitions = new HashSet<>();
    Consumer<TopicPartition> dedupOrMarkAsRemoved =
            (tp) -> {
                if (!fetchedPartitions.remove(tp)) {
                    removedPartitions.add(tp);
                }
            };
    // a partition that exists in assignedPartitions (already assigned) but not in fetchedPartitions has been removed,
    // so add it to removedPartitions
    assignedPartitions.forEach(dedupOrMarkAsRemoved);
    // pendingPartitionSplitAssignment holds partitions discovered in a previous round but not yet assigned to readers
    // find removed partitions among them as well
    pendingPartitionSplitAssignment.forEach(
            (reader, splits) ->
                    splits.forEach(
                            split -> dedupOrMarkAsRemoved.accept(split.getTopicPartition())));

    // any partitions still left in fetchedPartitions are newly discovered
    if (!fetchedPartitions.isEmpty()) {
        LOG.info("Discovered new partitions: {}", fetchedPartitions);
    }
    if (!removedPartitions.isEmpty()) {
        LOG.info("Discovered removed partitions: {}", removedPartitions);
    }

    // wrap the new and removed partitions in a PartitionChange and return it
    return new PartitionChange(fetchedPartitions, removedPartitions);
}

Having compared the newly discovered partitions with the previously subscribed ones, the enumerator next has to react to the changes.

The initializePartitionSplits method wraps the partition changes into a PartitionSplitChange. This object records the newly added and the removed partitions. Unlike PartitionChange, the new partitions in PartitionSplitChange are of type KafkaPartitionSplit, which additionally stores each partition's starting and stopping offsets.

private PartitionSplitChange initializePartitionSplits(PartitionChange partitionChange) {
    // get the newly added partitions
    Set<TopicPartition> newPartitions =
            Collections.unmodifiableSet(partitionChange.getNewPartitions());
    // get the partition offsets retriever
    OffsetsInitializer.PartitionOffsetsRetriever offsetsRetriever = getOffsetsRetriever();

    // get the starting offsets
    Map<TopicPartition, Long> startingOffsets =
            startingOffsetInitializer.getPartitionOffsets(newPartitions, offsetsRetriever);
    // get the stopping offsets
    Map<TopicPartition, Long> stoppingOffsets =
            stoppingOffsetInitializer.getPartitionOffsets(newPartitions, offsetsRetriever);

    Set<KafkaPartitionSplit> partitionSplits = new HashSet<>(newPartitions.size());
    // wrap each partition together with its starting and stopping offset
    for (TopicPartition tp : newPartitions) {
        Long startingOffset = startingOffsets.get(tp);
        long stoppingOffset =
                stoppingOffsets.getOrDefault(tp, KafkaPartitionSplit.NO_STOPPING_OFFSET);
        partitionSplits.add(new KafkaPartitionSplit(tp, startingOffset, stoppingOffset));
    }
    // return the result
    return new PartitionSplitChange(partitionSplits, partitionChange.getRemovedPartitions());
}

The key logic of this method is obtaining each partition's starting offset (startingOffsetInitializer) and stopping offset (stoppingOffsetInitializer).

startingOffsetInitializer is created in KafkaSourceBuilder and defaults to OffsetsInitializer.earliest(). The code is:

static OffsetsInitializer earliest() {
    return new ReaderHandledOffsetsInitializer(
            KafkaPartitionSplit.EARLIEST_OFFSET, OffsetResetStrategy.EARLIEST);
}

It creates a ReaderHandledOffsetsInitializer object, meaning that every newly discovered topic partition is read from its very beginning.

ReaderHandledOffsetsInitializer's getPartitionOffsets method is shown below. It sets the offset of every partition to startingOffset, which in the scenario above is KafkaPartitionSplit.EARLIEST_OFFSET.

@Override
public Map<TopicPartition, Long> getPartitionOffsets(
        Collection<TopicPartition> partitions,
        PartitionOffsetsRetriever partitionOffsetsRetriever) {
    Map<TopicPartition, Long> initialOffsets = new HashMap<>();
    for (TopicPartition tp : partitions) {
        initialOffsets.put(tp, startingOffset);
    }
    return initialOffsets;
}

As for stoppingOffsetInitializer, KafkaSourceBuilder creates a NoStoppingOffsetsInitializer by default, meaning there is no stopping offset; this is used for unbounded Kafka streams. Its code is minimal and is not analyzed here.
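
Both initializers can be overridden when building the source. A brief sketch of the builder calls involved (the strategies chosen here are only examples):

KafkaSource.<String>builder()
        // start from the committed group offsets, falling back to earliest if none exist
        .setStartingOffsets(OffsetsInitializer.committedOffsets(OffsetResetStrategy.EARLIEST))
        // bounded mode: stop at the latest offsets observed when the job starts
        .setBounded(OffsetsInitializer.latest());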

Let's return to handlePartitionSplitChanges, the method that reacts to partition changes. It adds the newly discovered partitions to the pending assignments and then assigns them to the registered readers.

private void handlePartitionSplitChanges(
        PartitionSplitChange partitionSplitChange, Throwable t) {
    if (t != null) {
        throw new FlinkRuntimeException("Failed to initialize partition splits due to ", t);
    }
    if (partitionDiscoveryIntervalMs < 0) {
        LOG.debug("Partition discovery is disabled.");
        noMoreNewPartitionSplits = true;
    }
    // TODO: Handle removed partitions.
    addPartitionSplitChangeToPendingAssignments(partitionSplitChange.newPartitionSplits);
    assignPendingPartitionSplits(context.registeredReaders().keySet());
}

addPartitionSplitChangeToPendingAssignments adds the partitions to the pending (to-be-read) set.

private void addPartitionSplitChangeToPendingAssignments(
        Collection<KafkaPartitionSplit> newPartitionSplits) {
    int numReaders = context.currentParallelism();
    for (KafkaPartitionSplit split : newPartitionSplits) {
        // distribute these partitions evenly across all readers
        int ownerReader = getSplitOwner(split.getTopicPartition(), numReaders);
        pendingPartitionSplitAssignment
                .computeIfAbsent(ownerReader, r -> new HashSet<>())
                .add(split);
    }
    LOG.debug(
            "Assigned {} to {} readers of consumer group {}.",
            newPartitionSplits,
            numReaders,
            consumerGroupId);
}

The assignPendingPartitionSplits method assigns splits to readers. Its logic is analyzed below:

private void assignPendingPartitionSplits(Set<Integer> pendingReaders) {
    Map<Integer, List<KafkaPartitionSplit>> incrementalAssignment = new HashMap<>();

    // Check if there's any pending splits for given readers
    for (int pendingReader : pendingReaders) {
        // check that the reader has registered with the SourceCoordinator
        checkReaderRegistered(pendingReader);

        // Remove pending assignment for the reader
        // take all splits assigned to this reader and remove them from pendingPartitionSplitAssignment
        final Set<KafkaPartitionSplit> pendingAssignmentForReader =
                pendingPartitionSplitAssignment.remove(pendingReader);

        // if there are splits for this reader, add them to incrementalAssignment
        if (pendingAssignmentForReader != null && !pendingAssignmentForReader.isEmpty()) {
            // Put pending assignment into incremental assignment
            incrementalAssignment
                    .computeIfAbsent(pendingReader, (ignored) -> new ArrayList<>())
                    .addAll(pendingAssignmentForReader);

            // Mark pending partitions as already assigned
            // mark these partitions as assigned
            pendingAssignmentForReader.forEach(
                    split -> assignedPartitions.add(split.getTopicPartition()));
        }
    }

    // Assign pending splits to readers
    // assign these splits to the readers
    if (!incrementalAssignment.isEmpty()) {
        LOG.info("Assigning splits to readers {}", incrementalAssignment);
        context.assignSplits(new SplitsAssignment<>(incrementalAssignment));
    }

    // If periodically partition discovery is disabled and the initializing discovery has done,
    // signal NoMoreSplitsEvent to pending readers
    // if there are no more new splits (partition discovery is disabled) and the source is bounded,
    // signal the readers that there are no more splits (signalNoMoreSplits)
    if (noMoreNewPartitionSplits && boundedness == Boundedness.BOUNDED) {
        LOG.debug(
                "No more KafkaPartitionSplits to assign. Sending NoMoreSplitsEvent to reader {}"
                        + " in consumer group {}.",
                pendingReaders,
                consumerGroupId);
        pendingReaders.forEach(context::signalNoMoreSplits);
    }
}

assignPendingPartitionSplits is called from three places:

  • addSplitsBack: a reader failed, and the splits assigned to it after the last successful checkpoint have to be added back to the SplitEnumerator.
  • addReader: a new reader is added and needs splits assigned to it.
  • handlePartitionSplitChanges: as described above, newly discovered partitions need to be assigned to readers.

The next question is how these splits are delivered from the enumerator to the readers. Let's expand the context.assignSplits call; here the context implementation is SourceCoordinatorContext. Its assignSplits method looks like this:

@Override
public void assignSplits(SplitsAssignment<SplitT> assignment) {
    // Ensure the split assignment is done by the coordinator executor.
    // run inside the SourceCoordinator thread
    callInCoordinatorThread(
            () -> {
                // Ensure all the subtasks in the assignment have registered.
                // check, one by one, that the reader owning each split to be assigned has registered
                assignment
                        .assignment()
                        .forEach(
                                (id, splits) -> {
                                    if (!registeredReaders.containsKey(id)) {
                                        throw new IllegalArgumentException(
                                                String.format(
                                                        "Cannot assign splits %s to subtask %d because the subtask is not registered.",
                                                        splits, id));
                                    }
                                });

                // record the assignment (add it to the set of assignments not yet checkpointed)
                assignmentTracker.recordSplitAssignment(assignment);
                // assign the splits
                assignSplitsToAttempts(assignment);
                return null;
            },
            String.format("Failed to assign splits %s due to ", assignment));
}

assignSplitsToAttempts has several overloads. Following them to the end, it creates an AddSplitEvent and sends this event to the SourceOperator through the OperatorCoordinator. The code is shown below:

private void assignSplitsToAttempts(SplitsAssignment<SplitT> assignment) {
    assignment.assignment().forEach((index, splits) -> assignSplitsToAttempts(index, splits));
}

private void assignSplitsToAttempts(int subtaskIndex, List<SplitT> splits) {
    getRegisteredAttempts(subtaskIndex)
            .forEach(attempt -> assignSplitsToAttempt(subtaskIndex, attempt, splits));
}

private void assignSplitsToAttempt(int subtaskIndex, int attemptNumber, List<SplitT> splits) {
    if (splits.isEmpty()) {
        return;
    }

    checkAttemptReaderReady(subtaskIndex, attemptNumber);

    final AddSplitEvent<SplitT> addSplitEvent;
    try {
        // create an AddSplitEvent (add-split event)
        addSplitEvent = new AddSplitEvent<>(splits, splitSerializer);
    } catch (IOException e) {
        throw new FlinkRuntimeException("Failed to serialize splits.", e);
    }

    final OperatorCoordinator.SubtaskGateway gateway =
            subtaskGateways.getGatewayAndCheckReady(subtaskIndex, attemptNumber);
    // send the event to the SourceOperator corresponding to subtaskIndex
    gateway.sendEvent(addSplitEvent);
}

gateway.sendEvent() -> SourceOperator::handleOperatorEvent

The network communication in between is not analyzed here. Let's look at handleOperatorEvent, the method with which the SourceOperator receives events:

public void handleOperatorEvent(OperatorEvent event) {
    if (event instanceof WatermarkAlignmentEvent) {
        updateMaxDesiredWatermark((WatermarkAlignmentEvent) event);
        checkWatermarkAlignment();
        checkSplitWatermarkAlignment();
    } else if (event instanceof AddSplitEvent) {
        handleAddSplitsEvent(((AddSplitEvent<SplitT>) event));
    } else if (event instanceof SourceEventWrapper) {
        sourceReader.handleSourceEvents(((SourceEventWrapper) event).getSourceEvent());
    } else if (event instanceof NoMoreSplitsEvent) {
        sourceReader.notifyNoMoreSplits();
    } else {
        throw new IllegalStateException("Received unexpected operator event " + event);
    }
}

If the received event is an AddSplitEvent, handleAddSplitsEvent is called. Its analysis follows:

private void handleAddSplitsEvent(AddSplitEvent<SplitT> event) {
    try {
        // deserialize the split information
        List<SplitT> newSplits = event.splits(splitSerializer);
        numSplits += newSplits.size();
        // if the downstream output has not been initialized yet, buffer the splits in the pending list;
        // otherwise create outputs and assign the splits to them
        if (operatingMode == OperatingMode.OUTPUT_NOT_INITIALIZED) {
            // For splits arrived before the main output is initialized, store them into the
            // pending list. Outputs of these splits will be created once the main output is
            // ready.
            outputPendingSplits.addAll(newSplits);
        } else {
            // Create output directly for new splits if the main output is already initialized.
            createOutputForSplits(newSplits);
        }
        // add the splits to the sourceReader
        sourceReader.addSplits(newSplits);
    } catch (IOException e) {
        throw new FlinkRuntimeException("Failed to deserialize the splits.", e);
    }
}

Finally, we follow the trail to SourceReaderBase's addSplits method.

@Override
public void addSplits(List<SplitT> splits) {
    LOG.info("Adding split(s) to reader: {}", splits);
    // Initialize the state for each split.
    splits.forEach(
            s ->
                    splitStates.put(
                            s.splitId(), new SplitContext<>(s.splitId(), initializedState(s))));
    // Hand over the splits to the split fetcher to start fetch.
    splitFetcherManager.addSplits(splits);
}

It hands the splits to the splitFetcherManager. In the KafkaSource context of this post, its implementation is KafkaSourceFetcherManager, whose addSplits method lives in the parent class SingleThreadFetcherManager.

At this point we are back at the "add splits" method from the beginning of the previous section, "Data Reading Flow". This concludes the analysis of KafkaSource's partition discovery logic.

Checkpoint Logic

KafkaSourceReader's snapshotState method returns the splits that need to be checkpointed, i.e. the splits assigned to this reader. If the user has configured commit.offsets.on.checkpoint=true, it also stores each split's partition and offset in offsetsToCommit.
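
For reference, a minimal sketch of enabling this option via a builder property (the key commit.offsets.on.checkpoint is the one named above):

KafkaSource.<String>builder()
        // commit consumer offsets back to Kafka whenever a checkpoint completes
        .setProperty("commit.offsets.on.checkpoint", "true");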

@Override
public List<KafkaPartitionSplit> snapshotState(long checkpointId) {
    // get the splits assigned to the current reader (the checkpointId parameter is not actually used)
    List<KafkaPartitionSplit> splits = super.snapshotState(checkpointId);
    // whether offsets are committed on checkpoint is controlled by
    // the commit.offsets.on.checkpoint option
    if (!commitOffsetsOnCheckpoint) {
        return splits;
    }

    // the logic below only runs when commit.offsets.on.checkpoint is enabled
    // offsetsToCommit holds the offset information that needs to be committed;
    // it is a Map<checkpointId, Map<partition, offset>> structure
    // if the current reader has no splits and no finished splits either,
    // record an empty map for this checkpoint id in offsetsToCommit
    if (splits.isEmpty() && offsetsOfFinishedSplits.isEmpty()) {
        offsetsToCommit.put(checkpointId, Collections.emptyMap());
    } else {
        // create an offsets map for the current checkpoint id and store it in offsetsToCommit
        Map<TopicPartition, OffsetAndMetadata> offsetsMap =
                offsetsToCommit.computeIfAbsent(checkpointId, id -> new HashMap<>());
        // Put the offsets of the active splits.
        // iterate over the splits, storing each split's partition and offset in offsetsMap
        for (KafkaPartitionSplit split : splits) {
            // If the checkpoint is triggered before the partition starting offsets
            // is retrieved, do not commit the offsets for those partitions.
            if (split.getStartingOffset() >= 0) {
                offsetsMap.put(
                        split.getTopicPartition(),
                        new OffsetAndMetadata(split.getStartingOffset()));
            }
        }
        // store the partition and offset of every split that has finished reading
        // Put offsets of all the finished splits.
        offsetsMap.putAll(offsetsOfFinishedSplits);
    }
    return splits;
}

The notifyCheckpointComplete method is executed when a checkpoint has completed; the completion notification is sent by the SourceCoordinator. It is in this method that the Kafka source commits its Kafka offsets.

@Override
public void notifyCheckpointComplete(long checkpointId) throws Exception {
    LOG.debug("Committing offsets for checkpoint {}", checkpointId);
    // as before, if committing offsets on checkpoint is not enabled, do nothing and return
    if (!commitOffsetsOnCheckpoint) {
        return;
    }

    // get the partition offsets that need to be committed for this checkpoint from offsetsToCommit
    Map<TopicPartition, OffsetAndMetadata> committedPartitions =
            offsetsToCommit.get(checkpointId);
    // if there is nothing to commit, return
    if (committedPartitions == null) {
        LOG.debug(
                "Offsets for checkpoint {} either do not exist or have already been committed.",
                checkpointId);
        return;
    }

    // ask the KafkaSourceFetcherManager to commit the offsets to Kafka
    // (analyzed below)
    ((KafkaSourceFetcherManager) splitFetcherManager)
            .commitOffsets(
                    committedPartitions,
                    (ignored, e) -> {
                        // The offset commit here is needed by the external monitoring. It won't
                        // break Flink job's correctness if we fail to commit the offset here.
                        // this is the offset-commit callback
                        // on error, record the failed commit in the metrics
                        if (e != null) {
                            kafkaSourceReaderMetrics.recordFailedCommit();
                            LOG.warn(
                                    "Failed to commit consumer offsets for checkpoint {}",
                                    checkpointId,
                                    e);
                        } else {
                            LOG.debug(
                                    "Successfully committed offsets for checkpoint {}",
                                    checkpointId);
                            // record the successful commit in the metrics
                            kafkaSourceReaderMetrics.recordSucceededCommit();
                            // If the finished topic partition has been committed, we remove it
                            // from the offsets of the finished splits map.
                            committedPartitions.forEach(
                                    (tp, offset) ->
                                            kafkaSourceReaderMetrics.recordCommittedOffset(
                                                    tp, offset.offset()));
                            // the offsets have been committed, so remove them from the finished-splits map
                            offsetsOfFinishedSplits
                                    .entrySet()
                                    .removeIf(
                                            entry ->
                                                    committedPartitions.containsKey(
                                                            entry.getKey()));
                            // remove the offsets of this and all earlier checkpoint ids: they have been committed and no longer need to be kept
                            while (!offsetsToCommit.isEmpty()
                                    && offsetsToCommit.firstKey() <= checkpointId) {
                                offsetsToCommit.remove(offsetsToCommit.firstKey());
                            }
                        }
                    });
}

Next we turn to KafkaSourceFetcherManager. This class is responsible for committing offsets to Kafka through the KafkaConsumer; the logic lives in the commitOffsets method:

public void commitOffsets(
        Map<TopicPartition, OffsetAndMetadata> offsetsToCommit, OffsetCommitCallback callback) {
    LOG.debug("Committing offsets {}", offsetsToCommit);
    // if there are no offsets to commit, return
    if (offsetsToCommit.isEmpty()) {
        return;
    }
    // get a running SplitFetcher
    SplitFetcher<ConsumerRecord<byte[], byte[]>, KafkaPartitionSplit> splitFetcher =
            fetchers.get(0);
    if (splitFetcher != null) {
        // The fetcher thread is still running. This should be the majority of the cases.
        // if a fetcher is still running, create an offset-commit task and enqueue it
        enqueueOffsetsCommitTask(splitFetcher, offsetsToCommit, callback);
    } else {
        // if no SplitFetcher is running, create a new one,
        // enqueue the task as above, and then start the fetcher
        splitFetcher = createSplitFetcher();
        enqueueOffsetsCommitTask(splitFetcher, offsetsToCommit, callback);
        startFetcher(splitFetcher);
    }
}

Let's continue with the method that creates the offset-commit task. The code is as follows:

private void enqueueOffsetsCommitTask(
        SplitFetcher<ConsumerRecord<byte[], byte[]>, KafkaPartitionSplit> splitFetcher,
        Map<TopicPartition, OffsetAndMetadata> offsetsToCommit,
        OffsetCommitCallback callback) {
    // get the KafkaPartitionSplitReader behind this splitFetcher
    KafkaPartitionSplitReader kafkaReader =
            (KafkaPartitionSplitReader) splitFetcher.getSplitReader();

    // enqueue a SplitFetcherTask for this fetcher
    splitFetcher.enqueueTask(
            new SplitFetcherTask() {
                @Override
                public boolean run() throws IOException {
                    kafkaReader.notifyCheckpointComplete(offsetsToCommit, callback);
                    return true;
                }

                @Override
                public void wakeUp() {}
            });
}

At this point, a SplitFetcherTask has been added to the SplitFetcher's taskQueue. As we concluded in the earlier "Data Reading Flow" section, the SplitFetcher takes the queued tasks one by one in runOnce and executes them. When it takes this SplitFetcherTask, it runs its run method, which calls kafkaReader.notifyCheckpointComplete. That method invokes KafkaConsumer's asynchronous offset commit, commitAsync:

public void notifyCheckpointComplete(
        Map<TopicPartition, OffsetAndMetadata> offsetsToCommit,
        OffsetCommitCallback offsetCommitCallback) {
    consumer.commitAsync(offsetsToCommit, offsetCommitCallback);
}

This completes the analysis of how KafkaSource commits offsets on checkpoint.

This post is the author's original work. Discussion and corrections are welcome. Please credit the source if you repost it.
