Partitioning Flink Output to HDFS by Event Time: A Solution

0x1 Summary

In a Hive offline data warehouse, almost every table is partitioned to make querying and analysis easier, most commonly by day. Flink writes data to HDFS with a configuration like the following:

BucketingSink<Object> sink = new BucketingSink<>(path);
//partition the data across days this way
sink.setBucketer(new DateTimeBucketer<>("yyyy/MM/dd"));
sink.setWriter(new StringWriter<>());
sink.setBatchSize(1024 * 1024 * 256L);
sink.setBatchRolloverInterval(30 * 60 * 1000L);
sink.setInactiveBucketThreshold(3 * 60 * 1000L);
sink.setInactiveBucketCheckInterval(30 * 1000L);
sink.setInProgressSuffix(".in-progress");
sink.setPendingSuffix(".pending");

0x2 The Problem

To land every record in exactly the right partition, the bucketing must be based on event time. Let's first look at the implementation of DateTimeBucketer:

public class DateTimeBucketer<T> implements Bucketer<T> {
 private static final long serialVersionUID = 1L;
 private static final String DEFAULT_FORMAT_STRING = "yyyy-MM-dd--HH";
 private final String formatString;
 private final ZoneId zoneId;
 private transient DateTimeFormatter dateTimeFormatter;

 /**
  * Creates a new {@code DateTimeBucketer} with format string {@code "yyyy-MM-dd--HH"} using JVM's default timezone.
  */
 public DateTimeBucketer() {
  this(DEFAULT_FORMAT_STRING);
 }

 /**
  * Creates a new {@code DateTimeBucketer} with the given date/time format string using JVM's default timezone.
  *
  * @param formatString The format string that will be given to {@code DateTimeFormatter} to determine
  * the bucket path.
  */
 public DateTimeBucketer(String formatString) {
  this(formatString, ZoneId.systemDefault());
 }

 /**
  * Creates a new {@code DateTimeBucketer} with the given date/time format string using the given timezone.
  *
  * @param formatString The format string that will be given to {@code DateTimeFormatter} to determine
  * the bucket path.
  * @param zoneId The timezone used to format {@code DateTimeFormatter} for bucket path.
  */
 public DateTimeBucketer(String formatString, ZoneId zoneId) {
  this.formatString = Preconditions.checkNotNull(formatString);
  this.zoneId = Preconditions.checkNotNull(zoneId);

  this.dateTimeFormatter = DateTimeFormatter.ofPattern(this.formatString).withZone(zoneId);
 }

 @Override
 public Path getBucketPath(Clock clock, Path basePath, T element) {
  //the key bucketing logic: take the current timestamp from clock and format it
  String newDateTimeString = dateTimeFormatter.format(Instant.ofEpochMilli(clock.currentTimeMillis()));
  return new Path(basePath + "/" + newDateTimeString);
 }
}

The clock instance used above is created in BucketingSink#open, as follows:

this.clock = new Clock() {
 @Override
 public long currentTimeMillis() {
  //simply returns the current processing time
  return processingTimeService.getCurrentProcessingTime();
 }
};

From the source analysis above, DateTimeBucketer buckets records by the current processing time. Processing time inevitably differs from event time, so records can land in the wrong partition in HDFS. For example, suppose a record's event time is 2019-09-29 23:59:58; it should land in the 2019/09/29 partition. But if the record arrives 3 seconds late, the processing time when it is handled is already 2019-09-30 00:00:01, so it gets written to the 2019/09/30 partition. For important data, this result is unacceptable.
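The skew in this example can be reproduced with a small, self-contained sketch. The `Asia/Shanghai` zone and the class/method names here are illustrative assumptions, not part of the Flink API:

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class BucketSkewDemo {
    // Assumed cluster timezone; substitute your own.
    static final ZoneId ZONE = ZoneId.of("Asia/Shanghai");
    static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy/MM/dd").withZone(ZONE);

    // Same formatting logic as DateTimeBucketer#getBucketPath.
    static String bucketFor(long epochMillis) {
        return FMT.format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        long eventTime = LocalDateTime.of(2019, 9, 29, 23, 59, 58)
                .atZone(ZONE).toInstant().toEpochMilli();
        long processingTime = eventTime + 3_000L; // record arrives 3 seconds late

        System.out.println(bucketFor(eventTime));      // 2019/09/29 (correct partition)
        System.out.println(bucketFor(processingTime)); // 2019/09/30 (wrong partition)
    }
}
```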

0x3 Solution

As the source analysis in section 0x2 shows, the crux of the problem is how the time is obtained in getBucketPath; we just need to use event time there instead. Conveniently, the method's third parameter, element, is the record itself, so as long as each record carries an event time we can read it from there. Since the existing implementation is not easy to modify, we can write our own EventTimeBucketer based on the Bucketer interface:

public class EventTimeBucketer implements Bucketer<BaseCountVO> {
    private static final long serialVersionUID = 1L;

    private static final String DEFAULT_FORMAT_STRING = "yyyy/MM/dd";

    private final String formatString;

    private final ZoneId zoneId;
    private transient DateTimeFormatter dateTimeFormatter;

    public EventTimeBucketer() {
        this(DEFAULT_FORMAT_STRING);
    }

    public EventTimeBucketer(String formatString) {
        this(formatString, ZoneId.systemDefault());
    }

    public EventTimeBucketer(ZoneId zoneId) {
        this(DEFAULT_FORMAT_STRING, zoneId);
    }

    public EventTimeBucketer(String formatString, ZoneId zoneId) {
        this.formatString = formatString;
        this.zoneId = zoneId;
        this.dateTimeFormatter = DateTimeFormatter.ofPattern(this.formatString).withZone(this.zoneId);
    }

    //Note: this method is required, otherwise dateTimeFormatter will be null. It is invoked during
    //deserialization, which is when dateTimeFormatter must be re-initialized.
    //But didn't the constructor above already initialize it? Constructors are not run during deserialization.
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();

        this.dateTimeFormatter = DateTimeFormatter.ofPattern(formatString).withZone(zoneId);
    }

    @Override
    public Path getBucketPath(Clock clock, Path basePath, BaseCountVO element) {
        String newDateTimeString = dateTimeFormatter.format(Instant.ofEpochMilli(element.getTimestamp()));
        return new Path(basePath + "/" + newDateTimeString);
    }
}
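Why the readObject hook matters can be shown with a minimal, standalone sketch (this is not the Flink class itself; `TransientDemo` and its fields are made up for illustration):

```java
import java.io.*;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

// Demonstrates that Java deserialization bypasses constructors and field
// initializers, so a transient field must be rebuilt in readObject.
public class TransientDemo implements Serializable {
    private static final long serialVersionUID = 1L;
    private final String formatString = "yyyy/MM/dd";
    private final ZoneId zoneId = ZoneId.systemDefault();
    private transient DateTimeFormatter formatter =
            DateTimeFormatter.ofPattern("yyyy/MM/dd").withZone(ZoneId.systemDefault());

    // Without this hook, formatter would be null after deserialization.
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        this.formatter = DateTimeFormatter.ofPattern(formatString).withZone(zoneId);
    }

    public DateTimeFormatter getFormatter() { return formatter; }

    // Serializes and deserializes the object in memory.
    static TransientDemo roundTrip(TransientDemo original) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(original);
        }
        try (ObjectInputStream ois =
                     new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
            return (TransientDemo) ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        TransientDemo copy = roundTrip(new TransientDemo());
        // readObject rebuilt the transient field, so this prints true
        System.out.println(copy.getFormatter() != null);
    }
}
```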

In your own project, just replace BaseCountVO with your own entity class. To use the bucketer, simply change the setBucketer value:

BucketingSink<BaseCountVO> sink = new BucketingSink<>(path);
//partition the data across days by event time
sink.setBucketer(new EventTimeBucketer("yyyy/MM/dd"));
sink.setWriter(new StringWriter<>());
sink.setBatchSize(1024 * 1024 * 256L);
sink.setBatchRolloverInterval(30 * 60 * 1000L);
sink.setInactiveBucketThreshold(3 * 60 * 1000L);
sink.setInactiveBucketCheckInterval(30 * 1000L);
sink.setInProgressSuffix(".in-progress");
sink.setPendingSuffix(".pending");
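For completeness, here is a minimal sketch of what the BaseCountVO entity might look like. The field names and the extra `payload` field are assumptions; the only real requirement from the bucketer above is a `getTimestamp()` accessor returning the event time in epoch milliseconds:

```java
import java.io.Serializable;

// Hypothetical entity class: any POJO works as long as it carries
// the record's event time in epoch milliseconds.
public class BaseCountVO implements Serializable {
    private static final long serialVersionUID = 1L;

    private final long timestamp; // event time, epoch millis
    private final String payload; // illustrative business field

    public BaseCountVO(long timestamp, String payload) {
        this.timestamp = timestamp;
        this.payload = payload;
    }

    public long getTimestamp() { return timestamp; }
    public String getPayload() { return payload; }
}
```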