Introduction
I introduced Beam itself in another post of mine (Beam-介紹), so I won't repeat that here. Recently I ran into something interesting, so let's talk about Beam's pipeline chain. Simply put, between the functions you apply yourself there are transform components that get registered into the chain; I've attached a screenshot from the official site.
That is roughly what a simple chain looks like: the transforms strung together one after another. Of course, in practice things are never this smooth; you run into all kinds of situations, and I'll share a few of them below.
Collection registration
PipelineOptionsFactory.register(IndexerPipelineOptions.class);
Pipeline pipeline = Pipeline.create(options);
PCollection<String> p1 = pipeline
    .apply(TextIO.read().from(""))  // source path left empty in the original
    .apply(ParDo.of(new DoFn<String, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) throws Exception {
            System.out.println(c.element());
            c.output(c.element()); // pass the element on so p1 is not empty
        }
    }));
PCollectionList<String> plist = PCollectionList.empty(pipeline);
// PCollectionList is immutable: and() returns a new list, so keep the result
plist = plist.and(p1);
pipeline.run();
By funneling data into Beam's collections this way and continually applying transforms, you can form all kinds of chains. Splitting a collection into branches or merging collections back together is straightforward, so I won't go into it. Note that what these collections store is only the execution plan, not data; the core idea is to move computation, not data.
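The "plan, not data" point can be illustrated with a plain-Java sketch. None of the names below are Beam API; ToyPipeline is a hypothetical stand-in showing that apply() only records a transform and nothing runs until run() is called:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// A toy "pipeline" that, like Beam, only records transforms when apply()
// is called and executes nothing until run() -- moving computation, not data.
class ToyPipeline {
    private final List<Function<String, String>> plan = new ArrayList<>();

    ToyPipeline apply(Function<String, String> transform) {
        plan.add(transform);          // registered into the chain, not executed
        return this;
    }

    String run(String input) {
        String value = input;
        for (Function<String, String> t : plan) {
            value = t.apply(value);   // only now does data flow through
        }
        return value;
    }
}

public class PlanDemo {
    public static void main(String[] args) {
        ToyPipeline p = new ToyPipeline()
                .apply(s -> s + "-a")
                .apply(s -> s + "-b");
        // Nothing has happened yet; the two lambdas are just a stored plan.
        System.out.println(p.run("x")); // prints x-a-b
    }
}
```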
Error case 1
public static void main(String[] args) throws PropertyVetoException {
    IndexerPipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(IndexerPipelineOptions.class);
    PipelineOptionsFactory.register(IndexerPipelineOptions.class);
    Pipeline pipeline = Pipeline.create(options);
    String s1 = "insert into test values('11','11')";
    String s2 = "insert into test values('12','12')";
    String s3 = "insert into test values('13','13')";
    String s4 = "insert into test values('14','14')";
    String s5 = "insert into test values('15','15')";
    String s6 = "insert into test values('16','16')";
    String s7 = "insert into test values('17','17')";
    save(pipeline, s1);
    save(pipeline, s2);
    save(pipeline, s3);
    save(pipeline, s4);
    save(pipeline, s5);
    save(pipeline, s6);
    save(pipeline, s7);
    pipeline.run();
}

public static void save(Pipeline pipeline, String sql) throws PropertyVetoException {
    ComboPooledDataSource cpds = new ComboPooledDataSource();
    cpds.setDriverClass("com.mysql.jdbc.Driver");
    cpds.setJdbcUrl("jdbc:mysql://xxxx:3306/bigdata?characterEncoding=utf8&useSSL=true");
    cpds.setUser("root");
    cpds.setPassword("root");
    Schema type = Schema.builder().addStringField("sass").build();
    Row build = Row.withSchema(type).addValue("123").build();
    pipeline
        .apply(Create.of(build))
        .apply(
            JdbcIO.<Row>write()
                .withDataSourceConfiguration(
                    JdbcIO.DataSourceConfiguration.create(cpds))
                .withStatement(sql)
                // the values are inlined in the SQL above, so the setter is a no-op
                .withPreparedStatementSetter((element, statement) -> {}));
}
This is a simple multi-statement, multi-output job that emits several PDone (POutput) values. Because the different outputs are dispatched within the same pipeline, and Beam's collections are inherently unordered with no dependency created between them at registration, the dispatched tasks are not sequenced, so the results come out in arbitrary order. This situation takes many forms: returning several pipeline objects and registering them one after another is just as unordered; so is registering PCollections into a chain and then producing multiple outputs together; so is writing out a PCollectionList in registration order. I went through many failures like these.
I normally use JdbcIO to connect to Hive and other big-data stores, which is where Beam's real strengths show; these test cases use MySQL because it is more convenient, and the principle is the same.
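The root cause, that independent tasks with no registered dependency get scheduled in arbitrary order, is easy to reproduce with a plain Java thread pool, no Beam involved. The task bodies below are stand-ins for the JdbcIO writes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class UnorderedDemo {
    // Submit independent tasks with no dependency between them, the way the
    // seven JdbcIO writes are registered on one pipeline, and record the
    // order in which they actually complete.
    static List<String> runTasks(List<String> sqls) {
        Queue<String> completionOrder = new ConcurrentLinkedQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (String sql : sqls) {
            pool.submit(() -> completionOrder.add(sql));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new ArrayList<>(completionOrder);
    }

    public static void main(String[] args) {
        // Submission order is s1..s7, but nothing forces completion order:
        // all seven run, yet the printed order can differ from run to run.
        System.out.println(runTasks(List.of("s1", "s2", "s3", "s4", "s5", "s6", "s7")));
    }
}
```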
Error case 2
Schema type = Schema.builder().addStringField("test").build();
Row row = Row.withSchema(type).addValue("test").build();
PCollection<Row> r1 = pipeline.apply("r1", Create.of(row));
PCollection<Row> r2 = pipeline.apply("r2", Create.of(row));
PCollection<Row> r3 = pipeline.apply("r3", Create.of(row)); // name fixed: using "r2" twice would collide
PCollection<Row> r4 = pipeline.apply("r4", Create.of(row));
PCollection<Row> r5 = pipeline.apply("r5", Create.of(row));
PCollection<Row> r6 = pipeline.apply("r6", Create.of(row));
PCollection<Row> r7 = pipeline.apply("r7", Create.of(row));
PCollectionList<Row> pl = PCollectionList.of(r1).and(r2).and(r3).and(r4).and(r5).and(r6).and(r7);
List<PCollection<Row>> all = pl.getAll();
for (int i = 0; i < all.size(); i++) {
    // l is the matching list of SQL statements; save2 applies the same
    // JdbcIO write as save in case 1
    save2(all.get(i), l.get(i));
}
A chain registered this way still produces its output out of order.
The correct approach
public static void main(String[] args) throws PropertyVetoException {
    IndexerPipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(IndexerPipelineOptions.class);
    PipelineOptionsFactory.register(IndexerPipelineOptions.class);
    // One pipeline per statement, each run to completion before the next starts
    Pipeline pipeline = Pipeline.create(options);
    Pipeline pipeline2 = Pipeline.create(options);
    Pipeline pipeline3 = Pipeline.create(options);
    Pipeline pipeline4 = Pipeline.create(options);
    Pipeline pipeline5 = Pipeline.create(options);
    Pipeline pipeline6 = Pipeline.create(options);
    Pipeline pipeline7 = Pipeline.create(options);
    String s1 = "insert into test values('11','11')";
    String s2 = "insert into test values('12','12')";
    String s3 = "insert into test values('13','13')";
    String s4 = "insert into test values('14','14')";
    String s5 = "insert into test values('15','15')";
    String s6 = "insert into test values('16','16')";
    String s7 = "insert into test values('17','17')";
    // save here returns the PDone produced by apply(JdbcIO.write()), so
    // getPipeline() hands back the owning pipeline; waitUntilFinish() blocks
    // until that pipeline is done, which is what guarantees the ordering on
    // runners whose run() is asynchronous
    save(pipeline, s1).getPipeline().run().waitUntilFinish();
    save(pipeline2, s2).getPipeline().run().waitUntilFinish();
    save(pipeline3, s3).getPipeline().run().waitUntilFinish();
    save(pipeline4, s4).getPipeline().run().waitUntilFinish();
    save(pipeline5, s5).getPipeline().run().waitUntilFinish();
    save(pipeline6, s6).getPipeline().run().waitUntilFinish();
    save(pipeline7, s7).getPipeline().run().waitUntilFinish();
}
This is really an application of a core idea I covered in another post, the splitter pattern from the four big-data processing design patterns: when you don't want to drop any of the data in a set but instead want to classify it into different categories for separate handling, you use the splitter pattern. When one pipeline can't do the job, split the work across multiple pipelines and run them one at a time, keeping them separate. Of course this costs some efficiency (worth thinking about why). These are some thoughts, pitfalls I've hit, and a working approach, all accumulated experience I'm sharing with you; if you have better ideas, let's discuss.
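The splitter pattern itself can be sketched in plain Java, with hypothetical category names and nothing Beam-specific: every record is kept and routed to a category, and each category then gets its own separate pass, which is exactly what splitting one pipeline into several does at a coarser grain:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SplitterDemo {
    // Splitter pattern: keep every record, route it to a category,
    // then give each category its own separate pass.
    static Map<String, List<String>> split(List<String> records) {
        return records.stream()
                .collect(Collectors.groupingBy(r -> r.split(":")[0]));
    }

    public static void main(String[] args) {
        Map<String, List<String>> byCategory =
                split(List.of("ok:1", "err:2", "ok:3", "err:4"));
        // Each category is then processed on its own, the way each SQL
        // statement above gets its own pipeline.
        byCategory.forEach((cat, rows) -> System.out.println(cat + " -> " + rows));
    }
}
```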
Beam-介紹: https://blog.csdn.net/qq_19968255/article/details/96158013