Introduction
I introduced Beam itself in another post of mine (Beam-介紹), so I won't repeat that here. Recently I ran into something interesting, so let's talk about Beam's pipeline chain. Simply put, between the functions you apply yourself there are transform components that get registered into the chain; I've attached a screenshot from the official site.
That is roughly what a simple chain looks like: the transforms strung together one after another. Of course, in practice things are never this smooth; you run into all kinds of situations, and I'll share a few of them below.
Collection registration
PipelineOptionsFactory.register(IndexerPipelineOptions.class);
Pipeline pipeline = Pipeline.create(options);
PCollection<String> p1 = pipeline
    .apply(TextIO.read().from(""))  // source path left empty in the original
    .apply(ParDo.of(new DoFn<String, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) throws Exception {
            System.out.println(c.element());
            c.output(c.element()); // pass the element on so p1 is not empty
        }
    }));
PCollectionList<String> plist = PCollectionList.empty(pipeline);
// PCollectionList is immutable: and() returns a new list, so keep the result
plist = plist.and(p1);
pipeline.run();
By funneling data into Beam's collections this way and continually applying transforms, you can form all kinds of chains. Splitting a collection into branches or merging collections back together is straightforward, so I won't go into it. Note that what these collections store is only the execution plan, not data; the core idea is to move computation, not data.
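The "plan, not data" point can be illustrated with a plain-Java sketch. None of the names below are Beam API; ToyPipeline is a hypothetical stand-in showing that apply() only records a transform and nothing runs until run() is called:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// A toy "pipeline" that, like Beam, only records transforms when apply()
// is called and executes nothing until run() -- moving computation, not data.
class ToyPipeline {
    private final List<Function<String, String>> plan = new ArrayList<>();

    ToyPipeline apply(Function<String, String> transform) {
        plan.add(transform);          // registered into the chain, not executed
        return this;
    }

    String run(String input) {
        String value = input;
        for (Function<String, String> t : plan) {
            value = t.apply(value);   // only now does data flow through
        }
        return value;
    }
}

public class PlanDemo {
    public static void main(String[] args) {
        ToyPipeline p = new ToyPipeline()
                .apply(s -> s + "-a")
                .apply(s -> s + "-b");
        // Nothing has happened yet; the two lambdas are just a stored plan.
        System.out.println(p.run("x")); // prints x-a-b
    }
}
```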
Error case 1
public static void main(String[] args) throws PropertyVetoException {
    IndexerPipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(IndexerPipelineOptions.class);
    PipelineOptionsFactory.register(IndexerPipelineOptions.class);
    Pipeline pipeline = Pipeline.create(options);
    String s1 = "insert into test values('11','11')";
    String s2 = "insert into test values('12','12')";
    String s3 = "insert into test values('13','13')";
    String s4 = "insert into test values('14','14')";
    String s5 = "insert into test values('15','15')";
    String s6 = "insert into test values('16','16')";
    String s7 = "insert into test values('17','17')";
    save(pipeline, s1);
    save(pipeline, s2);
    save(pipeline, s3);
    save(pipeline, s4);
    save(pipeline, s5);
    save(pipeline, s6);
    save(pipeline, s7);
    pipeline.run();
}

public static void save(Pipeline pipeline, String sql) throws PropertyVetoException {
    ComboPooledDataSource cpds = new ComboPooledDataSource();
    cpds.setDriverClass("com.mysql.jdbc.Driver");
    cpds.setJdbcUrl("jdbc:mysql://xxxx:3306/bigdata?characterEncoding=utf8&useSSL=true");
    cpds.setUser("root");
    cpds.setPassword("root");
    Schema type = Schema.builder().addStringField("sass").build();
    Row build = Row.withSchema(type).addValue("123").build();
    pipeline
        .apply(Create.of(build))
        .apply(
            JdbcIO.<Row>write()
                .withDataSourceConfiguration(
                    JdbcIO.DataSourceConfiguration.create(cpds))
                .withStatement(sql)
                // the values are inlined in the SQL above, so the setter is a no-op
                .withPreparedStatementSetter((element, statement) -> {}));
}
This is a simple multi-statement, multi-output job that emits several PDone (POutput) values. Because the different outputs are dispatched within the same pipeline, and Beam's collections are inherently unordered with no dependency created between them at registration, the dispatched tasks are not sequenced, so the results come out in arbitrary order. This situation takes many forms: returning several pipeline objects and registering them one after another is just as unordered; so is registering PCollections into a chain and then producing multiple outputs together; so is writing out a PCollectionList in registration order. I went through many failures like these.
I normally use JdbcIO to connect to Hive and other big-data stores, which is where Beam's real strengths show; these test cases use MySQL because it is more convenient, and the principle is the same.
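The root cause, that independent tasks with no registered dependency get scheduled in arbitrary order, is easy to reproduce with a plain Java thread pool, no Beam involved. The task bodies below are stand-ins for the JdbcIO writes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class UnorderedDemo {
    // Submit independent tasks with no dependency between them, the way the
    // seven JdbcIO writes are registered on one pipeline, and record the
    // order in which they actually complete.
    static List<String> runTasks(List<String> sqls) {
        Queue<String> completionOrder = new ConcurrentLinkedQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (String sql : sqls) {
            pool.submit(() -> completionOrder.add(sql));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new ArrayList<>(completionOrder);
    }

    public static void main(String[] args) {
        // Submission order is s1..s7, but nothing forces completion order:
        // all seven run, yet the printed order can differ from run to run.
        System.out.println(runTasks(List.of("s1", "s2", "s3", "s4", "s5", "s6", "s7")));
    }
}
```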
Error case 2
Schema type = Schema.builder().addStringField("test").build();
Row row = Row.withSchema(type).addValue("test").build();
PCollection<Row> r1 = pipeline.apply("r1", Create.of(row));
PCollection<Row> r2 = pipeline.apply("r2", Create.of(row));
PCollection<Row> r3 = pipeline.apply("r3", Create.of(row)); // name fixed: using "r2" twice would collide
PCollection<Row> r4 = pipeline.apply("r4", Create.of(row));
PCollection<Row> r5 = pipeline.apply("r5", Create.of(row));
PCollection<Row> r6 = pipeline.apply("r6", Create.of(row));
PCollection<Row> r7 = pipeline.apply("r7", Create.of(row));
PCollectionList<Row> pl = PCollectionList.of(r1).and(r2).and(r3).and(r4).and(r5).and(r6).and(r7);
List<PCollection<Row>> all = pl.getAll();
for (int i = 0; i < all.size(); i++) {
    // l is the matching list of SQL statements; save2 applies the same
    // JdbcIO write as save in case 1
    save2(all.get(i), l.get(i));
}
A chain registered this way still produces its output out of order.
The correct approach
public static void main(String[] args) throws PropertyVetoException {
    IndexerPipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(IndexerPipelineOptions.class);
    PipelineOptionsFactory.register(IndexerPipelineOptions.class);
    // One pipeline per statement, each run to completion before the next starts
    Pipeline pipeline = Pipeline.create(options);
    Pipeline pipeline2 = Pipeline.create(options);
    Pipeline pipeline3 = Pipeline.create(options);
    Pipeline pipeline4 = Pipeline.create(options);
    Pipeline pipeline5 = Pipeline.create(options);
    Pipeline pipeline6 = Pipeline.create(options);
    Pipeline pipeline7 = Pipeline.create(options);
    String s1 = "insert into test values('11','11')";
    String s2 = "insert into test values('12','12')";
    String s3 = "insert into test values('13','13')";
    String s4 = "insert into test values('14','14')";
    String s5 = "insert into test values('15','15')";
    String s6 = "insert into test values('16','16')";
    String s7 = "insert into test values('17','17')";
    // save here returns the PDone produced by apply(JdbcIO.write()), so
    // getPipeline() hands back the owning pipeline; waitUntilFinish() blocks
    // until that pipeline is done, which is what guarantees the ordering on
    // runners whose run() is asynchronous
    save(pipeline, s1).getPipeline().run().waitUntilFinish();
    save(pipeline2, s2).getPipeline().run().waitUntilFinish();
    save(pipeline3, s3).getPipeline().run().waitUntilFinish();
    save(pipeline4, s4).getPipeline().run().waitUntilFinish();
    save(pipeline5, s5).getPipeline().run().waitUntilFinish();
    save(pipeline6, s6).getPipeline().run().waitUntilFinish();
    save(pipeline7, s7).getPipeline().run().waitUntilFinish();
}
This is really an application of a core idea I covered in another post, the splitter pattern from the four big-data processing design patterns: when you don't want to drop any of the data in a set but instead want to classify it into different categories for separate handling, you use the splitter pattern. When one pipeline can't do the job, split the work across multiple pipelines and run them one at a time, keeping them separate. Of course this costs some efficiency (worth thinking about why). These are some thoughts, pitfalls I've hit, and a working approach, all accumulated experience I'm sharing with you; if you have better ideas, let's discuss.
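The splitter pattern itself can be sketched in plain Java, with hypothetical category names and nothing Beam-specific: every record is kept and routed to a category, and each category then gets its own separate pass, which is exactly what splitting one pipeline into several does at a coarser grain:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SplitterDemo {
    // Splitter pattern: keep every record, route it to a category,
    // then give each category its own separate pass.
    static Map<String, List<String>> split(List<String> records) {
        return records.stream()
                .collect(Collectors.groupingBy(r -> r.split(":")[0]));
    }

    public static void main(String[] args) {
        Map<String, List<String>> byCategory =
                split(List.of("ok:1", "err:2", "ok:3", "err:4"));
        // Each category is then processed on its own, the way each SQL
        // statement above gets its own pipeline.
        byCategory.forEach((cat, rows) -> System.out.println(cat + " -> " + rows));
    }
}
```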
Beam-介紹: https://blog.csdn.net/qq_19968255/article/details/96158013