0x1 Summary
In an offline Hive data warehouse, almost every table is partitioned for convenient querying and analysis, most commonly by day. Flink writes data to HDFS with a configuration like the following:
BucketingSink<Object> sink = new BucketingSink<>(path);
// Partition the data by day in this way
sink.setBucketer(new DateTimeBucketer<>("yyyy/MM/dd"));
sink.setWriter(new StringWriter<>());
sink.setBatchSize(1024 * 1024 * 256L);
sink.setBatchRolloverInterval(30 * 60 * 1000L);
sink.setInactiveBucketThreshold(3 * 60 * 1000L);
sink.setInactiveBucketCheckInterval(30 * 1000L);
sink.setInProgressSuffix(".in-progress");
sink.setPendingSuffix(".pending");
0x2 The Problem
To guarantee that every record lands in exactly the right partition, bucketing must be driven by event time. Let's first look at the implementation of the DateTimeBucketer bucketer:
public class DateTimeBucketer<T> implements Bucketer<T> {

    private static final long serialVersionUID = 1L;

    private static final String DEFAULT_FORMAT_STRING = "yyyy-MM-dd--HH";

    private final String formatString;

    private final ZoneId zoneId;

    private transient DateTimeFormatter dateTimeFormatter;

    /**
     * Creates a new {@code DateTimeBucketer} with format string {@code "yyyy-MM-dd--HH"} using JVM's default timezone.
     */
    public DateTimeBucketer() {
        this(DEFAULT_FORMAT_STRING);
    }

    /**
     * Creates a new {@code DateTimeBucketer} with the given date/time format string using JVM's default timezone.
     *
     * @param formatString The format string that will be given to {@code DateTimeFormatter} to determine
     *                     the bucket path.
     */
    public DateTimeBucketer(String formatString) {
        this(formatString, ZoneId.systemDefault());
    }

    /**
     * Creates a new {@code DateTimeBucketer} with the given date/time format string using the given timezone.
     *
     * @param formatString The format string that will be given to {@code DateTimeFormatter} to determine
     *                     the bucket path.
     * @param zoneId       The timezone used to format {@code DateTimeFormatter} for bucket path.
     */
    public DateTimeBucketer(String formatString, ZoneId zoneId) {
        this.formatString = Preconditions.checkNotNull(formatString);
        this.zoneId = Preconditions.checkNotNull(zoneId);
        this.dateTimeFormatter = DateTimeFormatter.ofPattern(this.formatString).withZone(zoneId);
    }

    @Override
    public Path getBucketPath(Clock clock, Path basePath, T element) {
        // The key line: the bucket path is derived from the current timestamp obtained via clock
        String newDateTimeString = dateTimeFormatter.format(Instant.ofEpochMilli(clock.currentTimeMillis()));
        return new Path(basePath + "/" + newDateTimeString);
    }
}
The clock instance above is created in BucketingSink#open as follows:
this.clock = new Clock() {
    @Override
    public long currentTimeMillis() {
        // Simply returns the current processing time
        return processingTimeService.getCurrentProcessingTime();
    }
};
Putting the source analysis together: DateTimeBucketer buckets by the current processing time, which inevitably differs from the event time, so records can land in the wrong day's HDFS partition. For example, suppose a record has event time 2019-09-29 23:59:58; it should land in the 2019/09/29 partition. But if it arrives 3 seconds late, the processing time is already 2019-09-30 00:00:01, so the record is written to the 2019/09/30 partition. For important data this result is unacceptable.
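The skew above can be reproduced with plain java.time, formatting the same record's event time and arrival time the way getBucketPath does. This is a minimal sketch; the Asia/Shanghai timezone and the epoch-millisecond values are assumptions chosen to match the example:

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class PartitionSkewDemo {

    // Same formatting logic as the bucketer: pattern "yyyy/MM/dd", fixed zone
    static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy/MM/dd").withZone(ZoneId.of("Asia/Shanghai"));

    // Format an epoch-millisecond timestamp as a partition path suffix
    static String bucket(long epochMillis) {
        return FMT.format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        // Event time: 2019-09-29 23:59:58 Asia/Shanghai
        long eventTime = 1569772798000L;
        // The record arrives 3 seconds late, so processing time is 2019-09-30 00:00:01
        long processingTime = eventTime + 3000L;

        System.out.println(bucket(eventTime));      // 2019/09/29 -> the correct partition
        System.out.println(bucket(processingTime)); // 2019/09/30 -> where processing time puts it
    }
}
```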
0x3 Solution
As the source analysis in the previous section shows, the crux is how getBucketPath obtains its time: we just need to use the event time instead. Conveniently, the method's third parameter is element, the record itself, so as long as each record carries its event time we can read it from there. Since the existing implementation is not easy to modify, we can implement our own EventTimeBucketer based on the Bucketer interface:
public class EventTimeBucketer implements Bucketer<BaseCountVO> {

    private static final String DEFAULT_FORMAT_STRING = "yyyy/MM/dd";

    private final String formatString;

    private final ZoneId zoneId;

    private transient DateTimeFormatter dateTimeFormatter;

    public EventTimeBucketer() {
        this(DEFAULT_FORMAT_STRING);
    }

    public EventTimeBucketer(String formatString) {
        this(formatString, ZoneId.systemDefault());
    }

    public EventTimeBucketer(ZoneId zoneId) {
        this(DEFAULT_FORMAT_STRING, zoneId);
    }

    public EventTimeBucketer(String formatString, ZoneId zoneId) {
        this.formatString = formatString;
        this.zoneId = zoneId;
        this.dateTimeFormatter = DateTimeFormatter.ofPattern(this.formatString).withZone(this.zoneId);
    }

    // This method is essential. It is invoked during deserialization, which bypasses the
    // constructors, so without it the transient dateTimeFormatter field would be left null.
    // "But didn't the constructor initialize it?" Constructors do not run on deserialization.
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        this.dateTimeFormatter = DateTimeFormatter.ofPattern(formatString).withZone(zoneId);
    }

    @Override
    public Path getBucketPath(Clock clock, Path basePath, BaseCountVO element) {
        // Use the record's own event time instead of the processing time from clock
        String newDateTimeString = dateTimeFormatter.format(Instant.ofEpochMilli(element.getTimestamp()));
        return new Path(basePath + "/" + newDateTimeString);
    }
}
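For reference, a minimal BaseCountVO could look like the following. This is a hypothetical sketch (the original article does not show the class); the only property EventTimeBucketer actually relies on is a getTimestamp() accessor returning the event time in epoch milliseconds:

```java
import java.io.Serializable;

// Hypothetical record type carried through the Flink pipeline.
// EventTimeBucketer only requires getTimestamp(); the other fields are illustrative.
public class BaseCountVO implements Serializable {

    private static final long serialVersionUID = 1L;

    private final String key;       // e.g. a metric or dimension key
    private final long count;       // e.g. an aggregated count
    private final long timestamp;   // event time in epoch milliseconds

    public BaseCountVO(String key, long count, long timestamp) {
        this.key = key;
        this.count = count;
        this.timestamp = timestamp;
    }

    public String getKey() { return key; }

    public long getCount() { return count; }

    public long getTimestamp() { return timestamp; }
}
```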
In your own project, simply replace BaseCountVO with your entity class. To use the new bucketer, just change the setBucketer argument:
BucketingSink<BaseCountVO> sink = new BucketingSink<>(path);
// Partition the data by event-time day
sink.setBucketer(new EventTimeBucketer("yyyy/MM/dd"));
sink.setWriter(new StringWriter<>());
sink.setBatchSize(1024 * 1024 * 256L);
sink.setBatchRolloverInterval(30 * 60 * 1000L);
sink.setInactiveBucketThreshold(3 * 60 * 1000L);
sink.setInactiveBucketCheckInterval(30 * 1000L);
sink.setInProgressSuffix(".in-progress");
sink.setPendingSuffix(".pending");
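The readObject point made earlier can be verified in isolation. The standalone class below (a sketch, not Flink code) follows the same pattern as EventTimeBucketer: it serializes and deserializes an object holding a transient DateTimeFormatter and shows the copy still formats correctly, because readObject rebuilt the formatter that default deserialization left null:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

// Demonstrates why a readObject hook is needed: transient fields come back
// null after deserialization unless re-initialized there.
public class ReadObjectDemo implements Serializable {

    private static final long serialVersionUID = 1L;

    private final String formatString = "yyyy/MM/dd";
    private transient DateTimeFormatter formatter =
            DateTimeFormatter.ofPattern("yyyy/MM/dd").withZone(ZoneId.of("UTC"));

    // Without this hook, formatter would be null in the deserialized copy
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        this.formatter = DateTimeFormatter.ofPattern(formatString).withZone(ZoneId.of("UTC"));
    }

    String format(long epochMillis) {
        return formatter.format(Instant.ofEpochMilli(epochMillis));
    }

    // Serialize to bytes and read the object back, as Flink does when shipping operators
    static ReadObjectDemo roundTrip(ReadObjectDemo original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in =
                         new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (ReadObjectDemo) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        ReadObjectDemo copy = roundTrip(new ReadObjectDemo());
        // Works only because readObject rebuilt the transient formatter
        System.out.println(copy.format(0L)); // 1970/01/01
    }
}
```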