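Flink 1.9 State Processor API: Introduction and Demo

Overview

The State Processor API is a feature added in Flink 1.9 that gives users direct access to the state stored by Flink. With it you can conveniently read, modify, and even rebuild the entire state of a job. One of its main strengths is that a savepoint can be built from external data, for example from records read out of a database. This solves the cold-start problem: a new job no longer has to replay all data from N days ago to rebuild its state.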

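Use cases

  • Offline verification or inspection of the state at a given stage. A Flink job normally persists its final results, but when something goes wrong it is hard to tell which stage introduced the problem. The State Processor API makes it possible to check whether the data in state matches expectations.
  • Correcting dirty data. If a bad record has polluted the state, the State Processor API can be used to repair and correct that state.
  • State migration. When the job logic has changed but most of the existing state should be reused, or when the structure of the state needs to be upgraded, this API can do the work.
  • Solving the cold-start problem, so that a job does not have to replay all data from N days ago.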
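
The verification use case comes down to comparing what the state holds with what you expect it to hold. Below is a minimal plain-Java sketch of that comparison, assuming the per-key values have already been read out of the savepoint into a Map. The class and method names are hypothetical and not part of Flink.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

/** Hypothetical helper, not part of Flink: compares expected counts with counts read from state. */
final class StateDiff {

    /** Returns one entry per key whose value in state differs from the expected one, sorted by key. */
    static Map<String, String> diff(Map<String, Long> expected, Map<String, Long> actual) {
        Map<String, String> mismatches = new TreeMap<>();
        // Keys we expect: report a missing or different value
        for (Map.Entry<String, Long> e : expected.entrySet()) {
            Long got = actual.get(e.getKey());
            if (!Objects.equals(e.getValue(), got)) {
                mismatches.put(e.getKey(), "expected " + e.getValue() + ", state has " + got);
            }
        }
        // Keys present in state that were never expected
        for (Map.Entry<String, Long> a : actual.entrySet()) {
            if (!expected.containsKey(a.getKey())) {
                mismatches.put(a.getKey(), "unexpected key, state has " + a.getValue());
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        Map<String, Long> expected = new HashMap<>();
        expected.put("a", 2L);
        expected.put("b", 1L);

        Map<String, Long> actual = new HashMap<>();
        actual.put("a", 2L);
        actual.put("b", 5L);
        actual.put("c", 1L);

        diff(expected, actual).forEach((k, v) -> System.out.println(k + ": " + v));
    }
}
```

In a real check, the `actual` map would be filled from the DataSet that the State Processor API returns when reading keyed state.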

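Limitations

  • Window state cannot be modified yet.
  • Every stateful operator must be given a uid explicitly.
  • Metadata (such as the existing operator IDs) cannot be obtained directly by reading a savepoint.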

Background

State comes in two kinds: (1) operator state and (2) keyed state.
When reading state, choose the read method that matches the kind of state.

Operator state        Keyed state
readListState         readKeyedState
readUnionState
readBroadcastState

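Generating a savepoint from batch-loaded data, and modifying savepoint state

A demo for each of the two follows below.
The basic flow is similar in both cases.

  • Loading data from batch

    1: Read the data in batch mode into a DataSet (for example, read a text file)
    2: Apply the business logic to get the transformed DataSet (for example, turn the text into Tuple2<key, num>)
    3: Convert the result into state with a KeyedStateBootstrapFunction
    4: Write out a new savepoint (pay attention to the uid and to the choice of StateBackend)

  • Modifying savepoint state

    1: Call Savepoint.load to load the existing savepoint (the StateBackend must match the one used by the job that produced it)
    2: Call savepoint.readKeyedState on the resulting ExistingSavepoint; the result is a DataSet
    3: Write batch logic to adjust that DataSet (for example, delete an element); the result is still a DataSet
    4: Convert the result into state with a custom KeyedStateBootstrapFunction
    5: Write out a new savepoint (pay attention to the uid and to the choice of StateBackend)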
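
Steps 1 and 2 of the batch flow can be checked without a cluster. The plain-Java sketch below mirrors the logic the batch demo applies (split each line on commas, count occurrences per key), so the expected per-key values are known before they are written into state. The class and method names are hypothetical; only the split-and-count logic is taken from the demo.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for the batch steps: computes the count each key should hold in state. */
final class ExpectedCounts {

    /** Splits every line on ',' and counts how often each token appears. */
    static Map<String, Long> countWords(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split(",")) {
                counts.merge(word, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a,b,a", "b,c");
        System.out.println(countWords(lines)); // prints {a=2, b=2, c=1}
    }
}
```

Each entry of the resulting map corresponds to one record handed to the KeyedStateBootstrapFunction in step 3.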

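Example: building streaming state from batch data

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.api.java.tuple.Tuple1;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.runtime.state.StateBackend;
import org.apache.flink.state.api.BootstrapTransformation;
import org.apache.flink.state.api.OperatorTransformation;
import org.apache.flink.state.api.Savepoint;
import org.apache.flink.util.Collector;

// The class name is illustrative; the original sample shows only the main method.
public class BuildSavepointFromBatch {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Read the external offline data source
        DataSource<String> textSource = env.readTextFile("D:\\sources\\data.txt");
        DataSet<Tuple2<String, Integer>> sourceDataSet = textSource.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {

            @Override
            public void flatMap(String value, Collector<Tuple2<String, Integer>> out) throws Exception {
                // Each comma-separated token becomes a (word, 1) pair
                for (String str : value.split(",")) {
                    out.collect(new Tuple2<>(str, 1));
                }
            }
        });

        // Compute the historical state that is needed: one count per word
        DataSet<ReadAndModifyState.KeyedValueState> dataSet = sourceDataSet
                .groupBy(0)
                .reduceGroup(new GroupReduceFunction<Tuple2<String, Integer>, ReadAndModifyState.KeyedValueState>() {
            @Override
            public void reduce(Iterable<Tuple2<String, Integer>> values, Collector<ReadAndModifyState.KeyedValueState> out) throws Exception {
                String wordKey = null;
                long countNum = 0L;
                for (Tuple2<String, Integer> info : values) {
                    if (wordKey == null) {
                        wordKey = info.f0;
                    }
                    countNum += info.f1;
                }

                ReadAndModifyState.KeyedValueState keyedValueState = new ReadAndModifyState.KeyedValueState();
                keyedValueState.key = new Tuple1<>(wordKey);
                keyedValueState.countNum = countNum;

                out.collect(keyedValueState);
            }
        });

        // Convert the historical state into Flink state and write it to HDFS as a savepoint
        BootstrapTransformation<ReadAndModifyState.KeyedValueState> transformation = OperatorTransformation
                .bootstrapWith(dataSet)
                .keyBy(new KeySelector<ReadAndModifyState.KeyedValueState, Tuple1<String>>() {
                    @Override
                    public Tuple1<String> getKey(ReadAndModifyState.KeyedValueState value) throws Exception {
                        return value.key;
                    }
                })
                .transform(new ReadAndModifyState.KeyedValueStateBootstrapper());

        // The uid must match the uid of the stateful operator in the streaming job
        String uid = "keyby_summarize";
        String savePointPath = "hdfs://ns1/user/xc/savepoint-from-batch";
        StateBackend rocksDBBackEnd = new RocksDBStateBackend("hdfs://ns1/user/xc");
        // 128 is the maximum parallelism of the new savepoint
        Savepoint.create(rocksDBBackEnd, 128)
                .withOperator(uid, transformation)
                .write(savePointPath);

        // write() only defines the sink; the batch job runs when execute() is called
        env.execute("batch build save point");
        System.out.println("-------end------------");
    }
}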
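
The second argument to Savepoint.create is the maximum parallelism of the savepoint (128 in the demo). Keyed state is stored per key group, and the number of key groups equals the maximum parallelism. The sketch below is a simplified plain-Java illustration of how a key ends up in a key group and then on a subtask. It assumes a plain `hashCode()` modulo in place of Flink's own hashing (Flink additionally applies a murmur hash to the key's hash code), so the group numbers it prints will not match Flink's. The class and method names are hypothetical.

```java
/** Simplified, hypothetical model of key-group assignment; not Flink's actual implementation. */
final class KeyGroupSketch {

    /** Maps a key to a key group in [0, maxParallelism). Flink hashes the hash code again first. */
    static int keyGroupFor(Object key, int maxParallelism) {
        return Math.floorMod(key.hashCode(), maxParallelism);
    }

    /** Maps a key group to the parallel subtask that owns it at the given parallelism. */
    static int subtaskFor(int keyGroup, int parallelism, int maxParallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128; // same value the demo passes to Savepoint.create
        int parallelism = 4;      // assumed parallelism of the restoring job
        for (String key : new String[] {"flink", "state", "savepoint"}) {
            int group = keyGroupFor(key, maxParallelism);
            System.out.println(key + " -> key group " + group
                    + ", subtask " + subtaskFor(group, parallelism, maxParallelism));
        }
    }
}
```

Since the placement of state is derived from the key object itself, bootstrapping with a different key type or key selector than the streaming job uses risks putting state where the restored job will not look for it.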

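Example: reading and modifying state

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple1;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.runtime.state.StateBackend;
import org.apache.flink.state.api.BootstrapTransformation;
import org.apache.flink.state.api.ExistingSavepoint;
import org.apache.flink.state.api.OperatorTransformation;
import org.apache.flink.state.api.Savepoint;
import org.apache.flink.state.api.functions.KeyedStateBootstrapFunction;
import org.apache.flink.state.api.functions.KeyedStateReaderFunction;
import org.apache.flink.util.Collector;

public class ReadAndModifyState {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment bEnv = ExecutionEnvironment.getExecutionEnvironment();
        String savePointPath = "hdfs://ns1/user/xc/savepoint-61b8e1-bbee958b3087";
        StateBackend rocksDBBackEnd = new RocksDBStateBackend("hdfs://ns1/user/xc");

        ExistingSavepoint savepoint = Savepoint.load(bEnv, savePointPath, rocksDBBackEnd);

        // Read the keyed state of the operator with this uid
        String uid = "keyby_summarize";
        DataSet<KeyedValueState> keyState = savepoint.readKeyedState(uid, new StateReaderFunc());

        // Modify: double every count
        DataSet<KeyedValueState> dataSet = keyState.flatMap((FlatMapFunction<KeyedValueState, KeyedValueState>) (value, out) -> {
            value.countNum = value.countNum * 2;
            out.collect(value);
        }).returns(KeyedValueState.class);

        BootstrapTransformation<KeyedValueState> transformation = OperatorTransformation
                .bootstrapWith(dataSet)
                // Note: the key used in keyBy must be the same as in the original job
                .keyBy(new KeySelector<KeyedValueState, Tuple1<String>>() {
                    @Override
                    public Tuple1<String> getKey(KeyedValueState value) throws Exception {
                        return value.key;
                    }
                })
                .transform(new KeyedValueStateBootstrapper());

        // Savepoint.create writes a savepoint that contains only this operator.
        // To keep the state of the other operators, ExistingSavepoint offers
        // removeOperator(uid).withOperator(uid, transformation) instead.
        Savepoint.create(rocksDBBackEnd, 128)
                .withOperator(uid, transformation)
                .write("hdfs://ns1/user/xc/savepoint-after-modify3");

        // write() only defines the sink; the batch job runs when execute() is called
        bEnv.execute("read and modify keyed state");
        System.out.println("-----end------------");
    }

    /** Reads the ValueState named "currentCountState" for every key of the operator. */
    public static class StateReaderFunc extends KeyedStateReaderFunction<Tuple1<String>, KeyedValueState> {

        private static final long serialVersionUID = -3616180524951046897L;
        private transient ValueState<Long> state;

        @Override
        public void open(Configuration parameters) {
            // Name and type must match the descriptor used by the job that wrote the state
            ValueStateDescriptor<Long> currentCountDescriptor = new ValueStateDescriptor<>("currentCountState", Long.class);
            state = getRuntimeContext().getState(currentCountDescriptor);
        }

        @Override
        public void readKey(Tuple1<String> key, Context ctx, Collector<KeyedValueState> out) throws Exception {
            System.out.println(key.f0 + ":" + state.value());

            KeyedValueState keyedValueState = new KeyedValueState();
            keyedValueState.key = new Tuple1<>(key.f0);
            keyedValueState.countNum = state.value();

            out.collect(keyedValueState);
        }
    }

    /** One record per key: the key and the count held in state for it. */
    public static class KeyedValueState {
        public Tuple1<String> key;
        public Long countNum;
    }

    /** Writes each record's count into the ValueState of its key. Public so the batch sample can use it. */
    public static class KeyedValueStateBootstrapper extends KeyedStateBootstrapFunction<Tuple1<String>, KeyedValueState> {

        private static final long serialVersionUID = 1893716139133502118L;
        private transient ValueState<Long> currentCount;

        @Override
        public void open(Configuration parameters) throws Exception {
            ValueStateDescriptor<Long> currentCountDescriptor = new ValueStateDescriptor<>("currentCountState", Long.class);
            currentCount = getRuntimeContext().getState(currentCountDescriptor);
        }

        @Override
        public void processElement(KeyedValueState value, Context ctx) throws Exception {
            currentCount.update(value.countNum);
        }
    }
}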