Kafka Streams in Practice: Streams and State

Stateful operations
Using state stores
Joining two streams
Timestamps in Kafka Streams

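1. Stateful operations

1.1 Transformation processor

KStream.transformValues is the most basic stateful method. The figure below shows how it works:

It is semantically the same as KStream.mapValues; the main difference is that transformValues can access a state store instance to do its work.

1.2 Initializing the transformer

In the ZMart application from the previous getting-started article, the Rewards node used KStream.mapValues to map a Purchase object to a RewardAccumulator object for calculating reward points. To compute accumulated points, however, the points earned by each purchase must be saved. The first parameter of KStream.transformValues is the interface ValueTransformerSupplier<? super V, ? extends VR>, which you implement to create instances of a ValueTransformer<V, VR>. The sample code below uses a KeyValueStore state store to hold the accumulated points: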


public class PurchaseRewardTransformer implements ValueTransformer<Purchase, RewardAccumulator> {
 
    // State store
    private KeyValueStore<String, Integer> stateStore;
    private final String storeName;
    private ProcessorContext context;
 
    public PurchaseRewardTransformer(String storeName) {
        Objects.requireNonNull(storeName, "Store Name can't be null");
        this.storeName = storeName;
    }
 
    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        // Look up the KeyValueStore state store by name
        stateStore = (KeyValueStore<String, Integer>) this.context.getStateStore(storeName);
    }
 
    @Override
    public RewardAccumulator transform(Purchase value) {
        // TODO
        return null;
    }
 
    @Override
    public void close() {
    }
 
}

Below is the ValueTransformerSupplier implementation, which returns PurchaseRewardTransformer instances:


public class PurchaseTransformerSupplier implements ValueTransformerSupplier<Purchase, RewardAccumulator> {
    
    private final String storeName;
    
    public PurchaseTransformerSupplier(String storeName) {
        Objects.requireNonNull(storeName, "Store Name can't be null");
        this.storeName = storeName;
    }
 
    @Override
    public ValueTransformer<Purchase, RewardAccumulator> get() {
        // Return a new transformer on every call so that stream tasks never share one instance
        return new PurchaseRewardTransformer(this.storeName);
    }
 
}
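The Kafka Streams documentation asks that a supplier's get() return a newly constructed instance on each call, because every stream task needs its own transformer state. The difference can be seen in a minimal plain-Java sketch. CountingTransformer and both supplier helpers are hypothetical stand-ins invented for illustration, not Kafka classes:

```java
import java.util.function.Supplier;

/** Plain-Java illustration; CountingTransformer is a hypothetical stand-in, not a Kafka class. */
final class SupplierSketch {

    /** Holds per-instance mutable state, much as a real transformer holds its store and context. */
    static final class CountingTransformer {
        private int seen;

        /** Counts how many values this particular instance has processed. */
        int transform() {
            return ++seen;
        }
    }

    /** Anti-pattern: every caller receives the same object and therefore shares its state. */
    static Supplier<CountingTransformer> sharedInstanceSupplier() {
        final CountingTransformer single = new CountingTransformer();
        return () -> single;
    }

    /** Preferred: each call to get() builds an independent object. */
    static Supplier<CountingTransformer> freshInstanceSupplier() {
        return CountingTransformer::new;
    }

    public static void main(String[] args) {
        Supplier<CountingTransformer> shared = sharedInstanceSupplier();
        Supplier<CountingTransformer> fresh = freshInstanceSupplier();
        System.out.println("shared supplier reuses instance: " + (shared.get() == shared.get()));
        System.out.println("fresh supplier reuses instance: " + (fresh.get() == fresh.get()));
    }
}
```

With the shared supplier, two tasks would increment the same counter; with the fresh supplier, each task starts from its own state.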

1.3 Implementing the transform method

Implement PurchaseRewardTransformer.transform to convert a Purchase object into a RewardAccumulator object:


@Override
public RewardAccumulator transform(Purchase value) {
    RewardAccumulator rewardAccumulator = RewardAccumulator.builder(value).build();
    // Read the points accumulated so far for this customer ID
    Integer accumulatedSoFar = stateStore.get(rewardAccumulator.getCustomerId());
    // Add them to the points from the current purchase
    if (accumulatedSoFar != null) {
        rewardAccumulator.addRewardPoints(accumulatedSoFar);
    }
    // Save the updated total
    stateStore.put(rewardAccumulator.getCustomerId(), rewardAccumulator.getTotalRewardPoints());
    return rewardAccumulator;
}

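Note that in a Kafka cluster, purchase records produced without a key are distributed across partitions round-robin, so records with the same customer ID do not all land in the same partition, as the figure below shows:

Partitions are managed by StreamTasks, and each StreamTask has its own state store. It is therefore essential that records with the same customer ID go to the same partition, so that they are held in the same state store. To solve this, we need to repartition the data by customer ID.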
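To make the problem concrete, here is a small plain-Java simulation, offered as a sketch rather than as how Kafka assigns partitions internally. Each Map stands in for the state store of one StreamTask, and the class and method names are invented for illustration:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java simulation; each Map stands in for the state store of one StreamTask. */
final class PartitionLocalStateSketch {

    /** Creates one empty store per partition. */
    static List<Map<String, Integer>> newStores(int numPartitions) {
        List<Map<String, Integer>> stores = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            stores.add(new HashMap<>());
        }
        return stores;
    }

    /** Adds points to the customer's running total in the store of the given partition. */
    static void accumulate(List<Map<String, Integer>> stores, int partition, String customerId, int points) {
        stores.get(partition).merge(customerId, points, Integer::sum);
    }

    /** Round-robin: the n-th record goes to partition n % numPartitions, ignoring the customer. */
    static List<Map<String, Integer>> roundRobin(String[] customerIds, int[] points, int numPartitions) {
        List<Map<String, Integer>> stores = newStores(numPartitions);
        for (int n = 0; n < customerIds.length; n++) {
            accumulate(stores, n % numPartitions, customerIds[n], points[n]);
        }
        return stores;
    }

    /** Key-based: the partition is derived from the customer ID, so one customer maps to one store. */
    static List<Map<String, Integer>> byCustomer(String[] customerIds, int[] points, int numPartitions) {
        List<Map<String, Integer>> stores = newStores(numPartitions);
        for (int n = 0; n < customerIds.length; n++) {
            int partition = Math.floorMod(customerIds[n].hashCode(), numPartitions);
            accumulate(stores, partition, customerIds[n], points[n]);
        }
        return stores;
    }
}
```

With two purchases by the same customer and two partitions, roundRobin leaves a partial total in each store, while byCustomer keeps the full total in a single store.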

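1.4 Repartitioning data

To repartition data, change the key of the original records and write them to a new topic, as the figure below shows:

In this simple example a concrete key replaces a null key, but repartitioning does not always require changing the key. With a StreamPartitioner you can apply any partitioning strategy you can think of, such as partitioning on the value or on part of the value.

1.5 Repartitioning in Kafka Streams

KStream.through() makes repartitioning in Kafka Streams easy, as the figure below shows. The current KStream instance writes its records to an intermediate topic (which, according to the Kafka Streams documentation, should be created manually before it is used). The new KStream instance returned by through() reads from that intermediate topic, so the data is repartitioned seamlessly. Since Kafka 2.6, through() is deprecated in favor of KStream.repartition().

Internally, the method creates a sink node and a source node. The sink node is a child processor of the current KStream instance, and the new KStream instance uses the new source node as its data source. You could build the same kind of sub-topology yourself with the DSL, but through() is more convenient. The sample code below uses the default partitioner: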


KStream<String, Purchase> transByCustomerStream = purchaseKStream.through("customer_transactions",
    // Use the default partitioner (DefaultPartitioner)
    Produced.with(stringSerde, purchaseSerde));

1.6 Using a StreamPartitioner

If you do not want the default partitioner, you can supply your own by implementing the StreamPartitioner interface:


public class RewardsStreamPartitioner implements StreamPartitioner<String, Purchase> {
 
    @Override
    public Integer partition(String topic, String key, Purchase value, int numPartitions) {
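        // Partition on the customer ID so that records with the same customer ID land in the same partition.
        // Math.floorMod keeps the index non-negative even when hashCode() is negative.
        return Math.floorMod(value.getCustomerId().hashCode(), numPartitions);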
    }
 
}
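One pitfall when deriving a partition from hashCode(): Java's % operator keeps the sign of the dividend, so a negative hash produces a negative index, which is not a valid partition number. A minimal sketch of a safe mapping follows; the helper names are illustrative and not part of the Kafka API:

```java
/** Plain-Java sketch; the helpers are illustrative, not part of the Kafka API. */
final class PartitionIndexSketch {

    /** Maps any hash code, including negative ones, to an index in [0, numPartitions). */
    static int partitionForHash(int hash, int numPartitions) {
        return Math.floorMod(hash, numPartitions);
    }

    /** Derives the partition from the customer ID, so equal IDs always map to the same index. */
    static int partitionFor(String customerId, int numPartitions) {
        return partitionForHash(customerId.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        System.out.println("-7 % 4 = " + (-7 % 4));
        System.out.println("floorMod(-7, 4) = " + partitionForHash(-7, 4));
    }
}
```

Math.floorMod is one option; masking the sign bit before taking the remainder is another common choice.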

Then update the code to use the custom partitioner:


RewardsStreamPartitioner streamPartitioner = new RewardsStreamPartitioner();
KStream<String, Purchase> transByCustomerStream = purchaseKStream.through("customer_transactions",
    // Use the custom partitioner
    Produced.with(stringSerde, purchaseSerde, streamPartitioner));

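1.7 Updating the processing topology

So far we have created a new processing node that partitions purchase records by customer ID, which ensures that all purchases by the same customer are written to the same partition and are therefore held in the same state store. The figure below shows the updated topology, with the new through processor between the Masking node and the Rewards processor:

Here is the updated code: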


String rewardsStateStoreName = "rewardsPointsStore";
KStream<String, RewardAccumulator> statefulRewardAccumulator = transByCustomerStream
    // Use the new stateful transformer
    .transformValues(new PurchaseTransformerSupplier(rewardsStateStoreName), rewardsStateStoreName);
statefulRewardAccumulator.to("rewards", Produced.with(stringSerde, rewardAccumulatorSerde));

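2. Using state stores

2.1 Data locality

Data locality is critical to performance. Key lookups are usually fast, but at scale the latency of a remote store tends to become the bottleneck. The figure below illustrates why locality matters: the dashed line represents fetching data from a remote database, and the solid line represents reading from an in-memory store on the same server. The latter is more efficient.

Data locality also means that the store is local to each processing node and is not shared across processes or threads. If one process fails, it should therefore have no effect on other stream-processing processes or threads.

2.2 Failure recovery and fault tolerance

Application failures are inevitable, especially in distributed applications. The focus should be on recovering quickly rather than on preventing failure. The figure below illustrates data locality and fault tolerance: each processor has its own local store and a changelog topic that backs it up. Backing a state store with a topic may look expensive, but it is what provides fault tolerance: when a process fails or restarts, the store can be restored quickly by reading the topic.

2.3 Using a state store

Adding a state store is simple: use one of the static factory methods in the Stores class to create a StoreSupplier instance. Two further classes customize a state store, Materialized and StoreBuilder; which one you use depends on how the store is added to the topology. With the high-level DSL you typically use Materialized; with the lower-level Processor API you typically use StoreBuilder.

Although the current example uses the high-level DSL, the transformer above uses a state store and is therefore effectively using the lower-level Processor API, so StoreBuilder is used here to customize the store: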


KeyValueBytesStoreSupplier storeSupplier = Stores.inMemoryKeyValueStore(rewardsStateStoreName);
StoreBuilder<KeyValueStore<String, Integer>> storeBuilder = Stores.keyValueStoreBuilder(storeSupplier,
    Serdes.String(), Serdes.Integer());
streamsBuilder.addStateStore(storeBuilder);

The PurchaseRewardTransformer shown earlier can now use this in-memory key-value store.

2.4 Other key/value store suppliers

Besides Stores.inMemoryKeyValueStore, the following static factory methods also create store suppliers:

Stores.persistentKeyValueStore
Stores.lruMap
Stores.persistentWindowStore
Stores.persistentSessionStore

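Note that all persistent StateStore instances use RocksDB for local storage.

2.5 StateStore fault tolerance

All StateStoreSupplier types have logging enabled by default. Logging here means a Kafka topic that serves as a changelog, backing up the values in the store and providing fault tolerance. For example, suppose a server running Kafka Streams fails. Once it has recovered and the Kafka Streams application restarts, the state store of that instance is restored to its original contents (up to the last offset committed to the changelog before the failure). Logging can be turned off with StoreBuilder.withLoggingDisabled(), but this is not recommended.

2.6 Configuring changelog topics

Kafka Streams creates the changelog topic automatically, and it is a compacted topic. To delete a record from a state store, call put(key, null), which sets the value to be deleted to null. For ordinary Kafka topics, by contrast, the default retention period is one week, the size is unlimited, and the default cleanup policy is delete.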
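Restoring from a changelog amounts to replaying its entries in offset order, with a null value acting as a delete. The plain-Java sketch below models that idea only; the in-memory list stands in for the changelog topic, and the class and helper names are hypothetical rather than Kafka APIs:

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of changelog replay; the List stands in for a changelog topic. */
final class ChangelogReplaySketch {

    /** Rebuilds a store by applying changelog entries in order; a null value deletes the key. */
    static Map<String, Integer> restore(List<Map.Entry<String, Integer>> changelog) {
        Map<String, Integer> store = new HashMap<>();
        for (Map.Entry<String, Integer> entry : changelog) {
            if (entry.getValue() == null) {
                store.remove(entry.getKey());
            } else {
                store.put(entry.getKey(), entry.getValue());
            }
        }
        return store;
    }

    /** Convenience factory for a changelog entry; the value may be null to model a delete. */
    static Map.Entry<String, Integer> entry(String key, Integer value) {
        return new SimpleImmutableEntry<>(key, value);
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> changelog = new ArrayList<>();
        changelog.add(entry("alice", 10));
        changelog.add(entry("bob", 5));
        changelog.add(entry("alice", 25)); // later value for the same key wins
        changelog.add(entry("bob", null)); // null value removes the key
        System.out.println(restore(changelog));
    }
}
```

Only the latest value per key matters for the restored state, which is why a compacted topic is enough to back a key-value store.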

Let's configure a changelog topic to retain at most 10 GB of data for 2 days, with a cleanup policy of compact followed by delete:


Map<String, String> changeLogConfigs = new HashMap<String, String>();
// Topic-level settings: 2 days in milliseconds, 10 GB, compact then delete
changeLogConfigs.put("retention.ms", "172800000");
changeLogConfigs.put("retention.bytes", "10000000000");
changeLogConfigs.put("cleanup.policy", "compact,delete");
 
KeyValueBytesStoreSupplier storeSupplier = Stores.inMemoryKeyValueStore("foo");
StoreBuilder<KeyValueStore<String, Integer>> storeBuilder = Stores.keyValueStoreBuilder(storeSupplier,
    Serdes.String(), Serdes.Integer());
// With StoreBuilder
storeBuilder.withLoggingEnabled(changeLogConfigs);
// With Materialized
Materialized.<String, Integer>as(storeSupplier).withLoggingEnabled(changeLogConfigs);
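Changelog settings are topic-level configurations, so the keys are Kafka's topic config names (retention.ms, retention.bytes, cleanup.policy) rather than the broker-level log.* names. The helper below is a hedged sketch that builds such a map from a Duration, which avoids hand-computing milliseconds; the class and method names are assumptions for illustration:

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;

/** Builds a topic-level config map; class and method names are illustrative only. */
final class ChangelogConfigSketch {

    /** Returns compact-then-delete settings with the given retention time and size limit. */
    static Map<String, String> changelogConfigs(Duration retention, long retentionBytes) {
        Map<String, String> configs = new HashMap<>();
        configs.put("retention.ms", Long.toString(retention.toMillis()));
        configs.put("retention.bytes", Long.toString(retentionBytes));
        configs.put("cleanup.policy", "compact,delete");
        return configs;
    }

    public static void main(String[] args) {
        // 2 days and 10 GB, matching the requirement stated in the text
        System.out.println(changelogConfigs(Duration.ofDays(2), 10_000_000_000L));
    }
}
```

The resulting map can be passed to withLoggingEnabled in the same way as the hand-written one.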

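3. Joining two streams

ZMart now wants to keep traffic in its electronics store high by giving away coupons for its coffee shop (in the hope that more traffic leads to more sales). The goal is to identify customers who buy both coffee and electronics within a given period and to issue the coupon immediately after the second purchase, as the figure below shows:

3.1 Generating a key that contains the customer ID

To decide when to issue a coupon, the coffee-shop and electronics-store streams must be joined. Joining them requires a join key (the customer ID here) and splitting the purchases into a coffee stream and an electronics stream: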


// Re-key the records by customer ID
KStream<String, Purchase> kstreamByKey = purchaseKStream.selectKey((key, purchase) -> purchase.getCustomerId());
 
// Split the coffee-shop and electronics-store purchases
@SuppressWarnings("unchecked")
KStream<String, Purchase>[] branchesStream = kstreamByKey.branch(
        (key, purchase) -> purchase.getDepartment().equalsIgnoreCase("coffee"),
        (key, purchase) -> purchase.getDepartment().equalsIgnoreCase("electronics"));
        
KStream<String, Purchase> coffeeStream = branchesStream[0];
KStream<String, Purchase> electronicsStream = branchesStream[1];
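KStream.branch routes each record to the stream of the first predicate it matches and drops records that match none. That routing rule can be sketched in plain Java as follows; the class and method names are hypothetical, and -1 is simply this sketch's way of marking a dropped record:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Plain-Java model of first-match routing; not a Kafka class. */
final class BranchSketch {

    /** Returns the index of the first matching predicate, or -1 when the record matches none. */
    static <T> int firstMatch(T value, List<Predicate<T>> predicates) {
        for (int i = 0; i < predicates.size(); i++) {
            if (predicates.get(i).test(value)) {
                return i;
            }
        }
        return -1;
    }

    /** Applies the same two department predicates as the example, in the same order. */
    static int branchForDepartment(String department) {
        List<Predicate<String>> predicates = new ArrayList<>();
        predicates.add(d -> d.equalsIgnoreCase("coffee"));
        predicates.add(d -> d.equalsIgnoreCase("electronics"));
        return firstMatch(department, predicates);
    }

    public static void main(String[] args) {
        System.out.println(branchForDepartment("Coffee"));
        System.out.println(branchForDepartment("books"));
    }
}
```

In the example above, purchases from any department other than coffee or electronics therefore appear in neither branch.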

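Note that calling KStream.selectKey causes the data to be repartitioned; Kafka Streams does this automatically when a later operation such as a join requires it. The figure below shows the updated topology:

3.2 Creating the joiner

KStream.join performs an inner join of two streams. Its second parameter is a ValueJoiner instance, so first create a joiner by implementing the interface's apply method: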


public class PurchaseJoiner implements ValueJoiner<Purchase, Purchase, CorrelatedPurchase> {
 
    @Override
    public CorrelatedPurchase apply(Purchase purchase, Purchase otherPurchase) {
        CorrelatedPurchase.Builder builder = CorrelatedPurchase.newBuilder();
 
        Date purchaseDate = purchase != null ? purchase.getPurchaseDate() : null;
        Double price = purchase != null ? purchase.getPrice() : 0.0;
        String itemPurchased = purchase != null ? purchase.getItemPurchased() : null;
 
        Date otherPurchaseDate = otherPurchase != null ? otherPurchase.getPurchaseDate() : null;
        Double otherPrice = otherPurchase != null ? otherPurchase.getPrice() : 0.0;
        String otherItemPurchased = otherPurchase != null ? otherPurchase.getItemPurchased() : null;
 
        List<String> purchasedItems = new ArrayList<String>();
 
        if (itemPurchased != null) {
            purchasedItems.add(itemPurchased);
        }
 
        if (otherItemPurchased != null) {
            purchasedItems.add(otherItemPurchased);
        }
 
        String customerId = purchase != null ? purchase.getCustomerId() : null;
        String otherCustomerId = otherPurchase != null ? otherPurchase.getCustomerId() : null;
 
        builder.withCustomerId(customerId != null ? customerId : otherCustomerId)
                .withFirstPurchaseDate(purchaseDate)
                .withSecondPurchaseDate(otherPurchaseDate)
                .withItemsPurchased(purchasedItems)
                .withTotalAmount(price + otherPrice);
 
        return builder.build();
    }
 
}

The join returns a CorrelatedPurchase object:


import java.util.Date;
import java.util.List;
 
public class CorrelatedPurchase {
 
    private String customerId;
    private List<String> itemsPurchased;
    private double totalAmount;
    private Date firstPurchaseTime;
    private Date secondPurchaseTime;
 
    private CorrelatedPurchase(Builder builder) {
        customerId = builder.customerId;
        itemsPurchased = builder.itemsPurchased;
        totalAmount = builder.totalAmount;
        firstPurchaseTime = builder.firstPurchasedItem;
        secondPurchaseTime = builder.secondPurchasedItem;
    }
 
    public static Builder newBuilder() {
        return new Builder();
    }
 
    // Getters omitted
 
    @Override
    public String toString() {
        return "CorrelatedPurchase{" + "customerId='" + customerId + '\'' + ", itemsPurchased=" + itemsPurchased
                + ", totalAmount=" + totalAmount + ", firstPurchaseTime=" + firstPurchaseTime + ", secondPurchaseTime="
                + secondPurchaseTime + '}';
    }
 
    public static final class Builder {
        private String customerId;
        private List<String> itemsPurchased;
        private double totalAmount;
        private Date firstPurchasedItem;
        private Date secondPurchasedItem;
 
        private Builder() {
        }
 
        public Builder withCustomerId(String val) {
            customerId = val;
            return this;
        }
 
        public Builder withItemsPurchased(List<String> val) {
            itemsPurchased = val;
            return this;
        }
 
        public Builder withTotalAmount(double val) {
            totalAmount = val;
            return this;
        }
 
        public Builder withFirstPurchaseDate(Date val) {
            firstPurchasedItem = val;
            return this;
        }
 
        public Builder withSecondPurchaseDate(Date val) {
            secondPurchasedItem = val;
            return this;
        }
 
        public CorrelatedPurchase build() {
            return new CorrelatedPurchase(this);
        }
    }
 
}

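3.3 Inner-joining the two streams

We can now call KStream.join to inner-join the coffee-shop and electronics-store streams. The figure below shows the updated topology:

The join code: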


// 20-minute join window
JoinWindows twentyMinuteWindow = JoinWindows.of(60 * 1000 * 20);
KStream<String, CorrelatedPurchase> joinedKStream = coffeeStream.join(electronicsStream, new PurchaseJoiner(), twentyMinuteWindow,
        Joined.with(stringSerde, purchaseSerde, purchaseSerde));

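This example uses a 20-minute join window with no constraint on order: the two records only need timestamps within 20 minutes of each other. Two further methods constrain the order:

JoinWindows.after: the other stream's record occurs at most N milliseconds after this stream's record
JoinWindows.before: the other stream's record occurs at most N milliseconds before this stream's record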
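The window check itself is a simple range test on the two timestamps. The sketch below expresses it in plain Java; treating both bounds as inclusive is an assumption of this sketch, and the class and method names are illustrative rather than Kafka APIs:

```java
/** Plain-Java model of a join-window check; inclusive bounds are an assumption of this sketch. */
final class JoinWindowSketch {

    /** True when otherTs lies in [thisTs - beforeMs, thisTs + afterMs]. */
    static boolean joinable(long thisTs, long otherTs, long beforeMs, long afterMs) {
        return otherTs >= thisTs - beforeMs && otherTs <= thisTs + afterMs;
    }

    /** Symmetric window: the order of the two records does not matter. */
    static boolean joinableSymmetric(long thisTs, long otherTs, long windowMs) {
        return joinable(thisTs, otherTs, windowMs, windowMs);
    }

    public static void main(String[] args) {
        long twentyMinutes = 60L * 1000 * 20;
        long coffeeTs = 10_000_000L;
        // Electronics purchase 5 minutes after the coffee purchase
        System.out.println(joinableSymmetric(coffeeTs, coffeeTs + 300_000L, twentyMinutes));
    }
}
```

Setting beforeMs to 0 models a window in which the other record may only arrive later, and setting afterMs to 0 models the opposite.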

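Note: before joining, make sure the streams are co-partitioned, meaning that their topics have the same number of partitions and are keyed by the same type. When join() is called, both KStream instances are therefore checked to see whether they need repartitioning. (Joining with a GlobalKTable instance does not require repartitioning.)

In the code in section 3.1, selectKey() is called on purchaseKStream and the returned KStream is branched immediately. Because selectKey() changes the key, both coffeeStream and electronicsStream need repartitioning. It bears repeating that repartitioning is necessary to ensure that records with the same key are written to the same partition; this repartitioning is handled automatically. In addition, when the Kafka Streams application starts, the topics involved in the join are checked to ensure they have the same number of partitions, and a TopologyBuilderException is thrown if they differ. It is the developer's responsibility to make sure the keys involved in the join are of the same type.

Co-partitioning also requires that all Kafka producers writing to the source topics of a Kafka Streams application use the same partitioner class. Likewise, any operation that writes to sink topics through KStream.to() must use the same StreamPartitioner. If you stick with the default partitioning strategy, none of this is a concern.

3.4 Outer join

For an outer join, use: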

coffeeStream.outerJoin(electronicsStream, ...)

The figure below illustrates the three possible outcomes of an outer join:

3.5 Left join

For a left join, use:

coffeeStream.leftJoin(electronicsStream, ...)

The figure below illustrates the three possible outcomes of a left join:
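The difference between the three join types comes down to which combinations of available records produce output. A compact plain-Java sketch of that rule follows; it models only whether a result is emitted, not the windowing or the joined value, and the enum and method names are invented for illustration:

```java
/** Plain-Java model of which record combinations yield a join result; not a Kafka class. */
final class JoinOutcomeSketch {

    enum JoinType { INNER, LEFT, OUTER }

    /**
     * Reports whether a result is emitted, where "left" is the stream on which the join
     * method is called (coffeeStream in the text) and "right" is the other stream.
     */
    static boolean emits(JoinType type, boolean leftPresent, boolean rightPresent) {
        switch (type) {
            case INNER:
                return leftPresent && rightPresent; // both records are required
            case LEFT:
                return leftPresent;                 // the right side may be missing
            case OUTER:
                return leftPresent || rightPresent; // either side alone is enough
            default:
                throw new IllegalArgumentException("Unknown join type: " + type);
        }
    }

    public static void main(String[] args) {
        for (JoinType type : JoinType.values()) {
            System.out.println(type + " with only the right record: " + emits(type, false, true));
        }
    }
}
```

When a side is missing, the ValueJoiner receives null for it, which is why PurchaseJoiner checks both arguments for null.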

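4. Timestamps in Kafka Streams

Timestamps play a key role in the following Kafka Streams features:

Joining streams
Updating a changelog (KTable API)
Deciding when Punctuator.punctuate() is triggered (Processor API)

(This article does not cover the KTable and Processor APIs.) In stream-processing systems, timestamps fall into three categories:

Event time: when the event was created
Ingestion time: when the event was stored on the Kafka broker
Processing time: when the stream-processing application received the event

Note: so far we have assumed that clients and brokers are in the same time zone, which is not always the case. When working with timestamps, it is safest to normalize times to UTC, which avoids time-zone differences between brokers and clients.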
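As a small illustration of UTC normalization, the sketch below converts a local wall-clock time plus its UTC offset into epoch milliseconds using java.time. The class and method names are hypothetical; the point is that the same instant yields the same number regardless of where it was observed:

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;

/** Converts local wall-clock times to zone-independent epoch milliseconds. */
final class UtcTimestampSketch {

    /** Interprets localTime at the given UTC offset and returns milliseconds since the epoch. */
    static long toEpochMillis(LocalDateTime localTime, ZoneOffset offset) {
        return localTime.toInstant(offset).toEpochMilli();
    }

    public static void main(String[] args) {
        // The same instant, observed by a client at UTC and by another at UTC+08:00
        long atUtc = toEpochMillis(LocalDateTime.of(2024, 1, 1, 12, 0), ZoneOffset.UTC);
        long atPlus8 = toEpochMillis(LocalDateTime.of(2024, 1, 1, 20, 0), ZoneOffset.ofHours(8));
        System.out.println(atUtc == atPlus8);
    }
}
```

Epoch milliseconds are what Kafka record timestamps carry, so converting at the edge keeps window comparisons consistent across time zones.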

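4.1 Built-in TimestampExtractor implementations

Almost all built-in TimestampExtractor implementations use the timestamp that the producer or broker sets in the message metadata. The default timestamp setting (broker config log.message.timestamp.type, or topic config message.timestamp.type) is CreateTime, which can be changed to LogAppendTime. ExtractRecordMetadataTimestamp is an abstract class whose extract method reads the metadata timestamp from the ConsumerRecord object. Most implementations extend this class and override its abstract onInvalidTimestamp method to handle invalid timestamps (timestamps less than 0).

The classes that extend ExtractRecordMetadataTimestamp are:

FailOnInvalidTimestamp: throws a StreamsException if the timestamp is invalid
LogAndSkipOnInvalidTimestamp: returns the invalid timestamp and logs a warning that the record will be dropped because of it
UsePreviousTimeOnInvalidTimestamp: returns the last valid timestamp if the timestamp is invalid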
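The three policies can be compared side by side in a simplified plain-Java model. This is a sketch of the behavior described above, not the library's code: the enum, the method, and the use of IllegalStateException in place of StreamsException are all assumptions made to keep the example free of Kafka dependencies:

```java
/** Simplified model of the three invalid-timestamp policies; not the Kafka implementation. */
final class InvalidTimestampPolicySketch {

    enum Policy { FAIL, LOG_AND_SKIP, USE_PREVIOUS }

    /**
     * Returns the timestamp to use for a record. Valid timestamps (>= 0) pass through;
     * invalid ones are handled according to the policy.
     */
    static long resolve(Policy policy, long recordTimestamp, long previousTimestamp) {
        if (recordTimestamp >= 0) {
            return recordTimestamp;
        }
        switch (policy) {
            case FAIL:
                // Stand-in for the StreamsException thrown by FailOnInvalidTimestamp
                throw new IllegalStateException("Invalid timestamp: " + recordTimestamp);
            case LOG_AND_SKIP:
                System.err.println("Dropping record with invalid timestamp " + recordTimestamp);
                return recordTimestamp; // a negative result tells the caller to drop the record
            case USE_PREVIOUS:
                if (previousTimestamp < 0) {
                    throw new IllegalStateException("No valid previous timestamp to fall back on");
                }
                return previousTimestamp;
            default:
                throw new IllegalArgumentException("Unknown policy: " + policy);
        }
    }
}
```

For example, resolve(Policy.USE_PREVIOUS, -1, 1000) falls back to 1000, while the same call with Policy.FAIL throws.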

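4.2 WallclockTimestampExtractor

This implementation returns the result of calling System.currentTimeMillis().

4.3 Custom TimestampExtractor

A custom TimestampExtractor only needs to implement the interface and its extract method. The sample code below uses the time of purchase: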


public class TransactionTimestampExtractor implements TimestampExtractor {
 
    @Override
    public long extract(ConsumerRecord<Object, Object> record, long previousTimestamp) {
        Purchase purchasePurchaseTransaction = (Purchase) record.value();
        return purchasePurchaseTransaction.getPurchaseDate().getTime();
    }
 
}

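Note: log retention and rolling are based on timestamps, and the timestamp returned by a custom TimestampExtractor may become the message timestamp used by changelogs and downstream output topics.

4.4 Specifying a TimestampExtractor

There are two ways to specify a TimestampExtractor. The first is to set it in the properties when configuring the Kafka Streams application; this is a global setting, and the default is FailOnInvalidTimestamp. For example: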

props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, TransactionTimestampExtractor.class);

The second is to specify it through a Consumed object. For example:

Consumed.with(stringSerde, purchaseSerde).withTimestampExtractor(new TransactionTimestampExtractor());

The advantage is that each input source gets its own TimestampExtractor, whereas the first option uses a single TimestampExtractor for messages from all topics.

END O(∩_∩)O
