Partitioning Flink Output to HDFS by Event Time: A Solution

0x1 Summary

In a Hive offline data warehouse, almost every table is partitioned to make querying and analysis easier, most commonly by day. Flink writes data to HDFS with a configuration like the following:

BucketingSink<Object> sink = new BucketingSink<>(path);
//partition the data across days this way
sink.setBucketer(new DateTimeBucketer<>("yyyy/MM/dd"));
sink.setWriter(new StringWriter<>());
sink.setBatchSize(1024 * 1024 * 256L);
sink.setBatchRolloverInterval(30 * 60 * 1000L);
sink.setInactiveBucketThreshold(3 * 60 * 1000L);
sink.setInactiveBucketCheckInterval(30 * 1000L);
sink.setInProgressSuffix(".in-progress");
sink.setPendingSuffix(".pending");

0x2 The Problem

To land every record in exactly the right partition, the bucketing must be based on event time. Let's first look at the implementation of DateTimeBucketer:

public class DateTimeBucketer<T> implements Bucketer<T> {
 private static final long serialVersionUID = 1L;
 private static final String DEFAULT_FORMAT_STRING = "yyyy-MM-dd--HH";
 private final String formatString;
 private final ZoneId zoneId;
 private transient DateTimeFormatter dateTimeFormatter;

 /**
  * Creates a new {@code DateTimeBucketer} with format string {@code "yyyy-MM-dd--HH"} using JVM's default timezone.
  */
 public DateTimeBucketer() {
  this(DEFAULT_FORMAT_STRING);
 }

 /**
  * Creates a new {@code DateTimeBucketer} with the given date/time format string using JVM's default timezone.
  *
  * @param formatString The format string that will be given to {@code DateTimeFormatter} to determine
  * the bucket path.
  */
 public DateTimeBucketer(String formatString) {
  this(formatString, ZoneId.systemDefault());
 }

 /**
  * Creates a new {@code DateTimeBucketer} with the given date/time format string using the given timezone.
  *
  * @param formatString The format string that will be given to {@code DateTimeFormatter} to determine
  * the bucket path.
  * @param zoneId The timezone used to format {@code DateTimeFormatter} for bucket path.
  */
 public DateTimeBucketer(String formatString, ZoneId zoneId) {
  this.formatString = Preconditions.checkNotNull(formatString);
  this.zoneId = Preconditions.checkNotNull(zoneId);

  this.dateTimeFormatter = DateTimeFormatter.ofPattern(this.formatString).withZone(zoneId);
 }

 @Override
 public Path getBucketPath(Clock clock, Path basePath, T element) {
  //the key bucketing logic: take the current timestamp from clock and format it
  String newDateTimeString = dateTimeFormatter.format(Instant.ofEpochMilli(clock.currentTimeMillis()));
  return new Path(basePath + "/" + newDateTimeString);
 }
}

The clock instance used above is created in BucketingSink#open, as follows:

this.clock = new Clock() {
 @Override
 public long currentTimeMillis() {
  //simply returns the current processing time
  return processingTimeService.getCurrentProcessingTime();
 }
};

From the source analysis above, DateTimeBucketer buckets records by the current processing time. Processing time inevitably differs from event time, so records can land in the wrong partition in HDFS. For example, suppose a record's event time is 2019-09-29 23:59:58; it should land in the 2019/09/29 partition. But if the record arrives 3 seconds late, the processing time when it is handled is already 2019-09-30 00:00:01, so it gets written to the 2019/09/30 partition. For important data, this result is unacceptable.
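The skew in this example can be reproduced with a small, self-contained sketch. The `Asia/Shanghai` zone and the class/method names here are illustrative assumptions, not part of the Flink API:

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class BucketSkewDemo {
    // Assumed cluster timezone; substitute your own.
    static final ZoneId ZONE = ZoneId.of("Asia/Shanghai");
    static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy/MM/dd").withZone(ZONE);

    // Same formatting logic as DateTimeBucketer#getBucketPath.
    static String bucketFor(long epochMillis) {
        return FMT.format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        long eventTime = LocalDateTime.of(2019, 9, 29, 23, 59, 58)
                .atZone(ZONE).toInstant().toEpochMilli();
        long processingTime = eventTime + 3_000L; // record arrives 3 seconds late

        System.out.println(bucketFor(eventTime));      // 2019/09/29 (correct partition)
        System.out.println(bucketFor(processingTime)); // 2019/09/30 (wrong partition)
    }
}
```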

0x3 Solution

As the source analysis in section 0x2 shows, the crux of the problem is how the time is obtained in getBucketPath; we just need to use event time there instead. Conveniently, the method's third parameter, element, is the record itself, so as long as each record carries an event time we can read it from there. Since the existing implementation is not easy to modify, we can write our own EventTimeBucketer based on the Bucketer interface:

public class EventTimeBucketer implements Bucketer<BaseCountVO> {
    private static final long serialVersionUID = 1L;

    private static final String DEFAULT_FORMAT_STRING = "yyyy/MM/dd";

    private final String formatString;

    private final ZoneId zoneId;
    private transient DateTimeFormatter dateTimeFormatter;

    public EventTimeBucketer() {
        this(DEFAULT_FORMAT_STRING);
    }

    public EventTimeBucketer(String formatString) {
        this(formatString, ZoneId.systemDefault());
    }

    public EventTimeBucketer(ZoneId zoneId) {
        this(DEFAULT_FORMAT_STRING, zoneId);
    }

    public EventTimeBucketer(String formatString, ZoneId zoneId) {
        this.formatString = formatString;
        this.zoneId = zoneId;
        this.dateTimeFormatter = DateTimeFormatter.ofPattern(this.formatString).withZone(this.zoneId);
    }

    //Note: this method is required, otherwise dateTimeFormatter will be null. It is invoked during
    //deserialization, which is when dateTimeFormatter must be re-initialized.
    //But didn't the constructor above already initialize it? Constructors are not run during deserialization.
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();

        this.dateTimeFormatter = DateTimeFormatter.ofPattern(formatString).withZone(zoneId);
    }

    @Override
    public Path getBucketPath(Clock clock, Path basePath, BaseCountVO element) {
        String newDateTimeString = dateTimeFormatter.format(Instant.ofEpochMilli(element.getTimestamp()));
        return new Path(basePath + "/" + newDateTimeString);
    }
}
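Why the readObject hook matters can be shown with a minimal, standalone sketch (this is not the Flink class itself; `TransientDemo` and its fields are made up for illustration):

```java
import java.io.*;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

// Demonstrates that Java deserialization bypasses constructors and field
// initializers, so a transient field must be rebuilt in readObject.
public class TransientDemo implements Serializable {
    private static final long serialVersionUID = 1L;
    private final String formatString = "yyyy/MM/dd";
    private final ZoneId zoneId = ZoneId.systemDefault();
    private transient DateTimeFormatter formatter =
            DateTimeFormatter.ofPattern("yyyy/MM/dd").withZone(ZoneId.systemDefault());

    // Without this hook, formatter would be null after deserialization.
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        this.formatter = DateTimeFormatter.ofPattern(formatString).withZone(zoneId);
    }

    public DateTimeFormatter getFormatter() { return formatter; }

    // Serializes and deserializes the object in memory.
    static TransientDemo roundTrip(TransientDemo original) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(original);
        }
        try (ObjectInputStream ois =
                     new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
            return (TransientDemo) ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        TransientDemo copy = roundTrip(new TransientDemo());
        // readObject rebuilt the transient field, so this prints true
        System.out.println(copy.getFormatter() != null);
    }
}
```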

In your own project, just replace BaseCountVO with your own entity class. To use the bucketer, simply change the setBucketer value:

BucketingSink<BaseCountVO> sink = new BucketingSink<>(path);
//partition the data across days by event time
sink.setBucketer(new EventTimeBucketer("yyyy/MM/dd"));
sink.setWriter(new StringWriter<>());
sink.setBatchSize(1024 * 1024 * 256L);
sink.setBatchRolloverInterval(30 * 60 * 1000L);
sink.setInactiveBucketThreshold(3 * 60 * 1000L);
sink.setInactiveBucketCheckInterval(30 * 1000L);
sink.setInProgressSuffix(".in-progress");
sink.setPendingSuffix(".pending");
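For completeness, here is a minimal sketch of what the BaseCountVO entity might look like. The field names and the extra `payload` field are assumptions; the only real requirement from the bucketer above is a `getTimestamp()` accessor returning the event time in epoch milliseconds:

```java
import java.io.Serializable;

// Hypothetical entity class: any POJO works as long as it carries
// the record's event time in epoch milliseconds.
public class BaseCountVO implements Serializable {
    private static final long serialVersionUID = 1L;

    private final long timestamp; // event time, epoch millis
    private final String payload; // illustrative business field

    public BaseCountVO(long timestamp, String payload) {
        this.timestamp = timestamp;
        this.payload = payload;
    }

    public long getTimestamp() { return timestamp; }
    public String getPayload() { return payload; }
}
```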