Flink Technical Notes

First, pull the Flink quickstart sample project:

mvn archetype:generate                               \
      -DarchetypeGroupId=org.apache.flink              \
      -DarchetypeArtifactId=flink-quickstart-java      \
      -DarchetypeVersion=1.7.2                         \
      -DarchetypeCatalog=local

Batch processing that reads from a file

Create a hello.txt file with the following content:

hello world welcome
hello welcome

Count the word frequencies:

public class BatchJavaApp {
    public static void main(String[] args) throws Exception {
        String input = "/Users/admin/Downloads/flink/data/hello.txt";
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSource<String> text = env.readTextFile(input);
        text.print();
        text.flatMap(new FlatMapFunction<String, Tuple2<String,Integer>>() {
            @Override
            public void flatMap(String value, Collector<Tuple2<String, Integer>> collector) throws Exception {
                String[] tokens = value.toLowerCase().split(" ");
                for (String token : tokens) {
                    collector.collect(new Tuple2<>(token,1));
                }
            }
        }).groupBy(0).sum(1).print();
    }
}

Run result (logs omitted):

hello welcome
hello world welcome
(world,1)
(hello,2)
(welcome,2)
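
As an aside, the flatMap above can also be written as a lambda, but the DataSet API then needs an explicit result-type hint because the lambda's generic parameters are erased at compile time. A minimal sketch, assuming the Types helper from org.apache.flink.api.common.typeinfo (this slots into the same main method):

text.flatMap((FlatMapFunction<String, Tuple2<String, Integer>>) (value, collector) -> {
    for (String token : value.toLowerCase().split(" ")) {
        collector.collect(new Tuple2<>(token, 1));
    }
}).returns(Types.TUPLE(Types.STRING, Types.INT)) // tell Flink the erased output type
        .groupBy(0).sum(1).print();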

A plain-Java implementation

The file-reading class:

public class FileOperation {
    /**
     * Reads the content of the file named filename and puts every word it contains into words.
     * @param filename path of the file to read
     * @param words list that receives the extracted words
     * @return true on success, false if the arguments are null or the file cannot be opened
     */
    public static boolean readFile(String filename, List<String> words) {
        if (filename == null || words == null) {
            System.out.println("filename or words is null");
            return false;
        }
        Scanner scanner;

        try {
            File file = new File(filename);
            if (file.exists()) {
                FileInputStream fis = new FileInputStream(file);
                scanner = new Scanner(new BufferedInputStream(fis),"UTF-8");
                scanner.useLocale(Locale.ENGLISH);
            }else {
                return false;
            }
        } catch (FileNotFoundException e) {
            System.out.println("Cannot open " + filename);
            return false;
        }
        //simple tokenization
        if (scanner.hasNextLine()) {
            String contents = scanner.useDelimiter("\\A").next();
            int start = firstCharacterIndex(contents,0);
            for (int i = start + 1;i <= contents.length();) {
                if (i == contents.length() || !Character.isLetter(contents.charAt(i))) {
                    String word = contents.substring(start,i).toLowerCase();
                    words.add(word);
                    start = firstCharacterIndex(contents,i);
                    i = start + 1;
                }else {
                    i++;
                }
            }
        }
        return true;

    }

    private static int firstCharacterIndex(String s,int start) {
        for (int i = start;i < s.length();i++) {
            if (Character.isLetter(s.charAt(i))) {
                return i;
            }
        }
        return s.length();
    }
}
public class BatchJavaOnly {
    public static void main(String[] args) {
        String input = "/Users/admin/Downloads/flink/data/hello.txt";
        List<String> list = new ArrayList<>();
        FileOperation.readFile(input,list);
        System.out.println(list);
        Map<String,Integer> map = new HashMap<>();
        list.forEach(token -> {
            if (map.containsKey(token)) {
                map.put(token,map.get(token) + 1);
            }else {
                map.put(token,1);
            }
        });
        map.entrySet().stream().map(entry -> new Tuple2<>(entry.getKey(),entry.getValue()))
                .forEach(System.out::println);
    }
}

Run result:

[hello, world, welcome, hello, welcome]
(world,1)
(hello,2)
(welcome,2)
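
A small aside on the counting loop in BatchJavaOnly: since Java 8, Map#merge expresses the same put-or-increment logic in a single call, so the forEach body could be reduced to:

list.forEach(token -> map.merge(token, 1, Integer::sum)); // insert 1, or add 1 to the existing count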

Scala code

Pull the Scala sample project:

mvn archetype:generate                               \
      -DarchetypeGroupId=org.apache.flink              \
      -DarchetypeArtifactId=flink-quickstart-scala     \
      -DarchetypeVersion=1.7.2                         \
      -DarchetypeCatalog=local

After installing the Scala plugin and configuring the Scala SDK:

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._

object BatchScalaApp {
  def main(args: Array[String]): Unit = {
    val input = "/Users/admin/Downloads/flink/data/hello.txt"
    val env = ExecutionEnvironment.getExecutionEnvironment
    val text = env.readTextFile(input)
    text.flatMap(_.toLowerCase.split(" "))
      .filter(_.nonEmpty)
      .map((_,1))
      .groupBy(0)
      .sum(1)
      .print()
  }
}

Run result (logs omitted):

(world,1)
(hello,2)
(welcome,2)

For Scala basics, see the earlier introductory posts Scala入門篇 and Scala入門之面向對象.

Stream processing over a network socket

public class StreamingJavaApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> text = env.socketTextStream("127.0.0.1",9999);
        text.flatMap(new FlatMapFunction<String, Tuple2<String,Integer>>() {
            @Override
            public void flatMap(String value, Collector<Tuple2<String, Integer>> collector) throws Exception {
                String[] tokens = value.toLowerCase().split(" ");
                for (String token : tokens) {
                    collector.collect(new Tuple2<>(token,1));
                }
            }
        }).keyBy(0).timeWindow(Time.seconds(5))
                .sum(1).print();
        env.execute("StreamingJavaApp");
    }
}

Open the port before running:

nc -lk 9999

Run the code, then enter a a c d b c e e f a in the nc session.

Run result (logs omitted):

1> (e,2)
9> (d,1)
11> (a,3)
3> (b,1)
4> (f,1)
8> (c,2)
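
The "N>" prefixes are the indices of the print sink's parallel subtasks; the Scala version below removes them by setting the sink's parallelism to 1. Also, timeWindow(Time.seconds(5)) is shorthand for a tumbling window over the environment's time characteristic, which defaults to processing time. Spelled out, the keyed part is roughly the following sketch (assuming TumblingProcessingTimeWindows from org.apache.flink.streaming.api.windowing.assigners):

.keyBy(0)
.window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
.sum(1)
.print();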

Scala code

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.api.scala._

object StreamScalaApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("127.0.0.1",9999)
    text.flatMap(_.split(" "))
      .map((_,1))
      .keyBy(0)
      .timeWindow(Time.seconds(5))
      .sum(1)
      .print()
      .setParallelism(1)
    env.execute("StreamScalaApp")
  }
}

Run result (logs omitted):

(c,2)
(b,1)
(d,1)
(f,1)
(e,2)
(a,3)

Now let's replace the tuple with an entity class.

public class StreamObjJavaApp {
    @AllArgsConstructor
    @Data
    @ToString
    @NoArgsConstructor
    public static class WordCount {
        private String word;
        private int count;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> text = env.socketTextStream("127.0.0.1",9999);
        text.flatMap(new FlatMapFunction<String, WordCount>() {
            @Override
            public void flatMap(String value, Collector<WordCount> collector) throws Exception {
                String[] tokens = value.toLowerCase().split(" ");
                for (String token : tokens) {
                    collector.collect(new WordCount(token,1));
                }
            }
        }).keyBy("word").timeWindow(Time.seconds(5))
                .sum("count").print();
        env.execute("StreamingJavaApp");
    }
}

Run result:

4> StreamObjJavaApp.WordCount(word=f, count=1)
11> StreamObjJavaApp.WordCount(word=a, count=3)
8> StreamObjJavaApp.WordCount(word=c, count=2)
1> StreamObjJavaApp.WordCount(word=e, count=2)
9> StreamObjJavaApp.WordCount(word=d, count=1)
3> StreamObjJavaApp.WordCount(word=b, count=1)

Of course, we can also write it like this:

public class StreamObjJavaApp {
    @AllArgsConstructor
    @Data
    @ToString
    @NoArgsConstructor
    public static class WordCount {
        private String word;
        private int count;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> text = env.socketTextStream("127.0.0.1",9999);
        text.flatMap(new FlatMapFunction<String, WordCount>() {
            @Override
            public void flatMap(String value, Collector<WordCount> collector) throws Exception {
                String[] tokens = value.toLowerCase().split(" ");
                for (String token : tokens) {
                    collector.collect(new WordCount(token,1));
                }
            }
        }).keyBy(WordCount::getWord).timeWindow(Time.seconds(5))
                .sum("count").print().setParallelism(1);
        env.execute("StreamingJavaApp");
    }
}

The argument to keyBy is the functional interface KeySelector:

@Public
@FunctionalInterface
public interface KeySelector<IN, KEY> extends Function, Serializable {
   KEY getKey(IN value) throws Exception;
}
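
So keyBy(WordCount::getWord) simply passes a KeySelector whose getKey returns the word. Written out as a variable, a sketch (not from the original code):

KeySelector<WordCount, String> byWord = wordCount -> wordCount.getWord();
// keyBy(byWord) is equivalent to keyBy(WordCount::getWord) or keyBy("word")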

The lambda form of flatMap is fairly cumbersome and arguably no better than the anonymous class version:

public class StreamObjJavaApp {
    @AllArgsConstructor
    @Data
    @ToString
    @NoArgsConstructor
    public static class WordCount {
        private String word;
        private int count;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> text = env.socketTextStream("127.0.0.1",9999);
        text.flatMap((FlatMapFunction<String,WordCount>)(value,collector) -> {
            String[] tokens = value.toLowerCase().split(" ");
            for (String token : tokens) {
                collector.collect(new WordCount(token,1));
            }
        }).returns(WordCount.class)
                .keyBy(WordCount::getWord).timeWindow(Time.seconds(5))
                .sum("count").print().setParallelism(1);
        env.execute("StreamingJavaApp");
    }
}
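
The explicit cast plus the .returns(WordCount.class) call are what make the lambda version wordy: a lambda's generic parameters are erased, so Flink cannot infer the flatMap output type on its own. For a plain POJO the class token is enough; if the output type were itself generic, say Tuple2<String, Integer>, a TypeHint keeps the type parameters. A sketch of that variant, assuming TypeHint and TypeInformation from org.apache.flink.api.common.typeinfo:

text.flatMap((FlatMapFunction<String, Tuple2<String, Integer>>) (value, collector) -> {
    for (String token : value.toLowerCase().split(" ")) {
        collector.collect(new Tuple2<>(token, 1));
    }
}).returns(TypeInformation.of(new TypeHint<Tuple2<String, Integer>>() {})) // keeps the generic parameters
        .keyBy(0).timeWindow(Time.seconds(5)).sum(1).print();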

flatMap can also take the RichFlatMapFunction abstract class:

public class StreamObjJavaApp {
    @AllArgsConstructor
    @Data
    @ToString
    @NoArgsConstructor
    public static class WordCount {
        private String word;
        private int count;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> text = env.socketTextStream("127.0.0.1",9999);
        text.flatMap(new RichFlatMapFunction<String, WordCount>() {
            @Override
            public void flatMap(String value, Collector<WordCount> collector) throws Exception {
                String[] tokens = value.toLowerCase().split(" ");
                for (String token : tokens) {
                    collector.collect(new WordCount(token,1));
                }
            }
        }).keyBy(WordCount::getWord).timeWindow(Time.seconds(5))
                .sum("count").print().setParallelism(1);
        env.execute("StreamingJavaApp");
    }
}

Scala code

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.api.scala._

object StreamObjScalaApp {
  case class WordCount(word: String,count: Int)

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("127.0.0.1",9999)
    text.flatMap(_.split(" "))
      .map(WordCount(_,1))
      .keyBy("word")
      .timeWindow(Time.seconds(5))
      .sum("count")
      .print()
      .setParallelism(1)
    env.execute("StreamScalaApp")
  }
}

Run result (logs omitted):

WordCount(b,1)
WordCount(d,1)
WordCount(e,2)
WordCount(f,1)
WordCount(a,3)
WordCount(c,2)

Data sources

Getting data from a collection

public class DataSetDataSourceApp {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        fromCollection(env);
    }

    public static void fromCollection(ExecutionEnvironment env) throws Exception {
        List<Integer> list = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            list.add(i);
        }
        env.fromCollection(list).print();
    }
}

Run result (logs omitted):

1
2
3
4
5
6
7
8
9
10

Scala code

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._

object DataSetDataSourceApp {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    fromCollection(env)
  }

  def fromCollection(env: ExecutionEnvironment): Unit = {
    val data = 1 to 10
    env.fromCollection(data).print()
  }
}

Run result (logs omitted):

1
2
3
4
5
6
7
8
9
10

Getting data from a file

public class DataSetDataSourceApp {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//        fromCollection(env);
        textFile(env);
    }

    public static void textFile(ExecutionEnvironment env) throws Exception {
        String filePath = "/Users/admin/Downloads/flink/data/hello.txt";
        env.readTextFile(filePath).print();
    }

    public static void fromCollection(ExecutionEnvironment env) throws Exception {
        List<Integer> list = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            list.add(i);
        }
        env.fromCollection(list).print();
    }
}

Run result (logs omitted):

hello welcome
hello world welcome

Scala code

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._

object DataSetDataSourceApp {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
//    fromCollection(env)
    textFile(env)
  }

  def textFile(env: ExecutionEnvironment): Unit = {
    val filePath = "/Users/admin/Downloads/flink/data/hello.txt"
    env.readTextFile(filePath).print()
  }

  def fromCollection(env: ExecutionEnvironment): Unit = {
    val data = 1 to 10
    env.fromCollection(data).print()
  }
}

Run result (logs omitted):

hello welcome
hello world welcome

Getting data from a CSV file

Add a people.csv under the data directory with the following content:

name,age,job
Jorge,30,Developer
Bob,32,Developer

public class DataSetDataSourceApp {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//        fromCollection(env);
//        textFile(env);
        csvFile(env);
    }

    public static void csvFile(ExecutionEnvironment env) throws Exception {
        String filePath = "/Users/admin/Downloads/flink/data/people.csv";
        env.readCsvFile(filePath).ignoreFirstLine()
                .types(String.class,Integer.class,String.class)
                .print();
    }

    public static void textFile(ExecutionEnvironment env) throws Exception {
        String filePath = "/Users/admin/Downloads/flink/data/hello.txt";
        env.readTextFile(filePath).print();
    }

    public static void fromCollection(ExecutionEnvironment env) throws Exception {
        List<Integer> list = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            list.add(i);
        }
        env.fromCollection(list).print();
    }
}

Run result (logs omitted):

(Bob,32,Developer)
(Jorge,30,Developer)

Scala code

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._

object DataSetDataSourceApp {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
//    fromCollection(env)
//    textFile(env)
    csvFile(env)
  }

  def csvFile(env: ExecutionEnvironment): Unit = {
    val filePath = "/Users/admin/Downloads/flink/data/people.csv"
    env.readCsvFile[(String,Int,String)](filePath,ignoreFirstLine = true).print()
  }

  def textFile(env: ExecutionEnvironment): Unit = {
    val filePath = "/Users/admin/Downloads/flink/data/hello.txt"
    env.readTextFile(filePath).print()
  }

  def fromCollection(env: ExecutionEnvironment): Unit = {
    val data = 1 to 10
    env.fromCollection(data).print()
  }
}

Run result (logs omitted):

(Jorge,30,Developer)
(Bob,32,Developer)

Mapping the rows into an entity class

public class DataSetDataSourceApp {
    @AllArgsConstructor
    @Data
    @ToString
    @NoArgsConstructor
    public static class Case {
        private String name;
        private Integer age;
        private String job;
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//        fromCollection(env);
//        textFile(env);
        csvFile(env);
    }

    public static void csvFile(ExecutionEnvironment env) throws Exception {
        String filePath = "/Users/admin/Downloads/flink/data/people.csv";
//        env.readCsvFile(filePath).ignoreFirstLine()
//                .types(String.class,Integer.class,String.class)
//                .print();
        env.readCsvFile(filePath).ignoreFirstLine()
                .pojoType(Case.class,"name","age","job")
                .print();
    }

    public static void textFile(ExecutionEnvironment env) throws Exception {
        String filePath = "/Users/admin/Downloads/flink/data/hello.txt";
        env.readTextFile(filePath).print();
    }

    public static void fromCollection(ExecutionEnvironment env) throws Exception {
        List<Integer> list = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            list.add(i);
        }
        env.fromCollection(list).print();
    }
}

Run result (logs omitted):

DataSetDataSourceApp.Case(name=Bob, age=32, job=Developer)
DataSetDataSourceApp.Case(name=Jorge, age=30, job=Developer)
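
The Scala version below selects columns with includedFields; on the Java side the CsvReader has a matching includeFields mask. A sketch of the same read with an explicit mask (not from the original code):

env.readCsvFile(filePath).ignoreFirstLine()
        .includeFields("111")                       // field mask: keep all three columns
        .pojoType(Case.class, "name", "age", "job")
        .print();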

Scala code

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._

object DataSetDataSourceApp {
  case class Case(name: String,age: Int,job: String)

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
//    fromCollection(env)
//    textFile(env)
    csvFile(env)
  }

  def csvFile(env: ExecutionEnvironment): Unit = {
    val filePath = "/Users/admin/Downloads/flink/data/people.csv"
//    env.readCsvFile[(String,Int,String)](filePath,ignoreFirstLine = true).print()
    env.readCsvFile[Case](filePath,ignoreFirstLine = true,includedFields = Array(0,1,2))
      .print()
  }

  def textFile(env: ExecutionEnvironment): Unit = {
    val filePath = "/Users/admin/Downloads/flink/data/hello.txt"
    env.readTextFile(filePath).print()
  }

  def fromCollection(env: ExecutionEnvironment): Unit = {
    val data = 1 to 10
    env.fromCollection(data).print()
  }
}

Run result (logs omitted):

Case(Bob,32,Developer)
Case(Jorge,30,Developer)

Reading directories recursively

Create two folders, 1 and 2, under the data directory and copy hello.txt into each of them.

public class DataSetDataSourceApp {
    @AllArgsConstructor
    @Data
    @ToString
    @NoArgsConstructor
    public static class Case {
        private String name;
        private Integer age;
        private String job;
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//        fromCollection(env);
//        textFile(env);
//        csvFile(env);
        readRecursiveFile(env);
    }

    public static void readRecursiveFile(ExecutionEnvironment env) throws Exception {
        String filePath = "/Users/admin/Downloads/flink/data";
        Configuration parameters = new Configuration();
        parameters.setBoolean("recursive.file.enumeration",true);
        env.readTextFile(filePath).withParameters(parameters)
                .print();
    }

    public static void csvFile(ExecutionEnvironment env) throws Exception {
        String filePath = "/Users/admin/Downloads/flink/data/people.csv";
//        env.readCsvFile(filePath).ignoreFirstLine()
//                .types(String.class,Integer.class,String.class)
//                .print();
        env.readCsvFile(filePath).ignoreFirstLine()
                .pojoType(Case.class,"name","age","job")
                .print();
    }

    public static void textFile(ExecutionEnvironment env) throws Exception {
        String filePath = "/Users/admin/Downloads/flink/data/hello.txt";
        env.readTextFile(filePath).print();
    }

    public static void fromCollection(ExecutionEnvironment env) throws Exception {
        List<Integer> list = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            list.add(i);
        }
        env.fromCollection(list).print();
    }
}

Run result (logs omitted):

hello world welcome
hello world welcome
hello welcome
Jorge,30,Developer
name,age,job
hello world welcome
hello welcome
hello welcome
Bob,32,Developer

Scala code

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration

object DataSetDataSourceApp {
  case class Case(name: String,age: Int,job: String)

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
//    fromCollection(env)
//    textFile(env)
//    csvFile(env)
    readRecursiveFiles(env)
  }

  def readRecursiveFiles(env: ExecutionEnvironment): Unit = {
    val filePath = "/Users/admin/Downloads/flink/data"
    val parameters = new Configuration
    parameters.setBoolean("recursive.file.enumeration",true)
    env.readTextFile(filePath).withParameters(parameters).print()
  }

  def csvFile(env: ExecutionEnvironment): Unit = {
    val filePath = "/Users/admin/Downloads/flink/data/people.csv"
//    env.readCsvFile[(String,Int,String)](filePath,ignoreFirstLine = true).print()
    env.readCsvFile[Case](filePath,ignoreFirstLine = true,includedFields = Array(0,1,2))
      .print()
  }

  def textFile(env: ExecutionEnvironment): Unit = {
    val filePath = "/Users/admin/Downloads/flink/data/hello.txt"
    env.readTextFile(filePath).print()
  }

  def fromCollection(env: ExecutionEnvironment): Unit = {
    val data = 1 to 10
    env.fromCollection(data).print()
  }
}

Run result (logs omitted):

hello world welcome
hello world welcome
hello welcome
Jorge,30,Developer
name,age,job
hello world welcome
hello welcome
hello welcome
Bob,32,Developer

Reading compressed files

Create a folder 3 under the data directory and compress hello.txt:

gzip hello.txt

This produces a new file, hello.txt.gz; move that file into folder 3.

public class DataSetDataSourceApp {
    @AllArgsConstructor
    @Data
    @ToString
    @NoArgsConstructor
    public static class Case {
        private String name;
        private Integer age;
        private String job;
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//        fromCollection(env);
//        textFile(env);
//        csvFile(env);
//        readRecursiveFile(env);
        readCompresssionFile(env);
    }

    public static void readCompresssionFile(ExecutionEnvironment env) throws Exception {
        String filePath = "/Users/admin/Downloads/flink/data/3";
        env.readTextFile(filePath).print();
    }

    public static void readRecursiveFile(ExecutionEnvironment env) throws Exception {
        String filePath = "/Users/admin/Downloads/flink/data";
        Configuration parameters = new Configuration();
        parameters.setBoolean("recursive.file.enumeration",true);
        env.readTextFile(filePath).withParameters(parameters)
                .print();
    }

    public static void csvFile(ExecutionEnvironment env) throws Exception {
        String filePath = "/Users/admin/Downloads/flink/data/people.csv";
//        env.readCsvFile(filePath).ignoreFirstLine()
//                .types(String.class,Integer.class,String.class)
//                .print();
        env.readCsvFile(filePath).ignoreFirstLine()
                .pojoType(Case.class,"name","age","job")
                .print();
    }

    public static void textFile(ExecutionEnvironment env) throws Exception {
        String filePath = "/Users/admin/Downloads/flink/data/hello.txt";
        env.readTextFile(filePath).print();
    }

    public static void fromCollection(ExecutionEnvironment env) throws Exception {
        List<Integer> list = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            list.add(i);
        }
        env.fromCollection(list).print();
    }
}

Run result:

hello world welcome
hello welcome

The compression formats Flink supports are: .deflate, .gz, .gzip, .bz2, .xz
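
readTextFile also decompresses a single compressed file when pointed directly at it instead of at the directory, for example (path assumed from the layout above):

env.readTextFile("/Users/admin/Downloads/flink/data/3/hello.txt.gz").print();

Note that compressed inputs are not splittable, so each such file is read by a single subtask.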

Scala code

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration

object DataSetDataSourceApp {
  case class Case(name: String,age: Int,job: String)

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
//    fromCollection(env)
//    textFile(env)
//    csvFile(env)
//    readRecursiveFiles(env)
    readCompressionFiles(env)
  }

  def readCompressionFiles(env: ExecutionEnvironment): Unit = {
    val filePath = "/Users/admin/Downloads/flink/data/3"
    env.readTextFile(filePath).print()
  }

  def readRecursiveFiles(env: ExecutionEnvironment): Unit = {
    val filePath = "/Users/admin/Downloads/flink/data"
    val parameters = new Configuration
    parameters.setBoolean("recursive.file.enumeration",true)
    env.readTextFile(filePath).withParameters(parameters).print()
  }

  def csvFile(env: ExecutionEnvironment): Unit = {
    val filePath = "/Users/admin/Downloads/flink/data/people.csv"
//    env.readCsvFile[(String,Int,String)](filePath,ignoreFirstLine = true).print()
    env.readCsvFile[Case](filePath,ignoreFirstLine = true,includedFields = Array(0,1,2))
      .print()
  }

  def textFile(env: ExecutionEnvironment): Unit = {
    val filePath = "/Users/admin/Downloads/flink/data/hello.txt"
    env.readTextFile(filePath).print()
  }

  def fromCollection(env: ExecutionEnvironment): Unit = {
    val data = 1 to 10
    env.fromCollection(data).print()
  }
}

Run result:

hello world welcome
hello welcome

Operators

The map operator

public class DataSetTransformationApp {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        mapFunction(env);
    }

    public static void mapFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).print();
    }
}

Run result (logs omitted):

2
3
4
5
6
7
8
9
10
11

Scala code

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._

object DataSetTransformationApp {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    mapFunction(env)
  }

  def mapFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).print()
  }
}

Run result (logs omitted):

2
3
4
5
6
7
8
9
10
11

The filter operator

public class DataSetTransformationApp {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//        mapFunction(env);
        filterFunction(env);
    }

    public static void filterFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).filter(x -> x > 5).print();
    }

    public static void mapFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).print();
    }
}

Run result (logs omitted):

6
7
8
9
10
11

Scala code

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._

object DataSetTransformationApp {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
//    mapFunction(env)
    filterFunction(env)
  }

  def filterFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).filter(_ > 5).print()
  }

  def mapFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).print()
  }
}

Run result (logs omitted):

6
7
8
9
10
11

The mapPartition operator

It works on whole partitions, so its results are produced according to the parallelism rather than the number of records.

A utility class that simulates acquiring a database connection:

public class DBUntils {
    public static int getConnection() {
        return new Random().nextInt(10);
    }

    public static void returnConnection(int connection) {

    }
}
public class DataSetTransformationApp {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//        mapFunction(env);
//        filterFunction(env);
        mapPartitionFunction(env);
    }

    public static void mapPartitionFunction(ExecutionEnvironment env) throws Exception {
        List<String> students = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            students.add("student: " + i);
        }
        DataSource<String> data = env.fromCollection(students).setParallelism(4);
        //map here runs once per element of students
        data.map(student -> {
            int connection = DBUntils.getConnection();
            //TODO database operations
            DBUntils.returnConnection(connection);
            return connection;
        }).print();
        System.out.println("-----------");
        //mapPartition here runs once per partition, i.e. according to the parallelism
        data.mapPartition((MapPartitionFunction<String,Integer>)(student, collector) -> {
            int connection = DBUntils.getConnection();
            //TODO database operations
            DBUntils.returnConnection(connection);
            collector.collect(connection);
        }).returns(Integer.class).print();
    }

    public static void filterFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).filter(x -> x > 5).print();
    }

    public static void mapFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).print();
    }
}

Run result (logs omitted; the dots stand for many more numbers, 100 values in total above the dashed line):

.
.
.
.
5
4
0
3
-----------
5
5
0
3
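
What makes mapPartition useful is that the function body runs once per partition and receives all of that partition's records, so an expensive resource such as a database connection can be opened once and shared. A sketch of that pattern, reusing the DBUntils stubs from above:

data.mapPartition((MapPartitionFunction<String, Integer>) (studentsInPartition, collector) -> {
    int connection = DBUntils.getConnection();      // opened once per partition
    for (String student : studentsInPartition) {
        //TODO per-record database work over the same connection
    }
    DBUntils.returnConnection(connection);
    collector.collect(connection);                  // still one output per partition, as above
}).returns(Integer.class).print();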

Scala code

import scala.util.Random

object DBUntils {
  def getConnection(): Int = {
    new Random().nextInt(10)
  }

  def returnConnection(connection: Int): Unit = {

  }
}
import com.guanjian.flink.scala.until.DBUntils
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.util.Collector

import scala.collection.mutable.ListBuffer

object DataSetTransformationApp {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
//    mapFunction(env)
//    filterFunction(env)
    mapPartitionFunction(env)
  }

  def mapPartitionFunction(env: ExecutionEnvironment): Unit = {
    val students = new ListBuffer[String]
    for(i <- 1 to 100) {
      students.append("student: " + i)
    }
    val data = env.fromCollection(students).setParallelism(4)
    data.map(student => {
      val connection = DBUntils.getConnection()
      //TODO database operations
      DBUntils.returnConnection(connection)
      connection
    }).print()
    println("-----------")
    data.mapPartition((student,collector: Collector[Int]) => {
      val connection = DBUntils.getConnection()
      //TODO database operations
      DBUntils.returnConnection(connection)
      collector.collect(connection)
    }).print()
  }

  def filterFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).filter(_ > 5).print()
  }

  def mapFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).print()
  }
}

Run result:

.
.
.
.
5
4
0
3
-----------
5
5
0
3

The first operator

public class DataSetTransformationApp {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//        mapFunction(env);
//        filterFunction(env);
//        mapPartitionFunction(env);
        firstFunction(env);
    }

    public static void firstFunction(ExecutionEnvironment env) throws Exception {
        List<Tuple2<Integer,String>> info = Arrays.asList(new Tuple2<>(1,"Hadoop"),
                new Tuple2<>(1,"Spark"),
                new Tuple2<>(1,"Flink"),
                new Tuple2<>(2,"Java"),
                new Tuple2<>(2,"Spring boot"),
                new Tuple2<>(3,"Linux"),
                new Tuple2<>(4,"VUE"));
        DataSource<Tuple2<Integer, String>> data = env.fromCollection(info);
        data.first(3).print();
    }

    public static void mapPartitionFunction(ExecutionEnvironment env) throws Exception {
        List<String> students = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            students.add("student: " + i);
        }
        DataSource<String> data = env.fromCollection(students).setParallelism(4);
        //map here runs once per element of students
        data.map(student -> {
            int connection = DBUntils.getConnection();
            //TODO database operations
            DBUntils.returnConnection(connection);
            return connection;
        }).print();
        System.out.println("-----------");
        //mapPartition here runs once per partition, i.e. according to the parallelism
        data.mapPartition((MapPartitionFunction<String,Integer>)(student, collector) -> {
            int connection = DBUntils.getConnection();
            //TODO database operations
            DBUntils.returnConnection(connection);
            collector.collect(connection);
        }).returns(Integer.class).print();
    }

    public static void filterFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).filter(x -> x > 5).print();
    }

    public static void mapFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).print();
    }
}

Run result (logs omitted):

(1,Hadoop)
(1,Spark)
(1,Flink)

Scala code

import com.guanjian.flink.scala.until.DBUntils
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.util.Collector

import scala.collection.mutable.ListBuffer

object DataSetTransformationApp {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
//    mapFunction(env)
//    filterFunction(env)
//    mapPartitionFunction(env)
    firstFunction(env)
  }

  def firstFunction(env: ExecutionEnvironment): Unit = {
    val info = List((1,"Hadoop"),(1,"Spark"),(1,"Flink"),(2,"Java"),
      (2,"Spring boot"),(3,"Linux"),(4,"VUE"))
    val data = env.fromCollection(info)
    data.first(3).print()
  }

  def mapPartitionFunction(env: ExecutionEnvironment): Unit = {
    val students = new ListBuffer[String]
    for(i <- 1 to 100) {
      students.append("student: " + i)
    }
    val data = env.fromCollection(students).setParallelism(4)
    data.map(student => {
      val connection = DBUntils.getConnection()
      //TODO database operations
      DBUntils.returnConnection(connection)
      connection
    }).print()
    println("-----------")
    data.mapPartition((student,collector: Collector[Int]) => {
      val connection = DBUntils.getConnection()
      //TODO database operations
      DBUntils.returnConnection(connection)
      collector.collect(connection)
    }).print()
  }

  def filterFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).filter(_ > 5).print()
  }

  def mapFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).print()
  }
}

Run result (logs omitted):

(1,Hadoop)
(1,Spark)
(1,Flink)

Taking the first two records of each group

public class DataSetTransformationApp {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//        mapFunction(env);
//        filterFunction(env);
//        mapPartitionFunction(env);
        firstFunction(env);
    }

    public static void firstFunction(ExecutionEnvironment env) throws Exception {
        List<Tuple2<Integer,String>> info = Arrays.asList(new Tuple2<>(1,"Hadoop"),
                new Tuple2<>(1,"Spark"),
                new Tuple2<>(1,"Flink"),
                new Tuple2<>(2,"Java"),
                new Tuple2<>(2,"Spring boot"),
                new Tuple2<>(3,"Linux"),
                new Tuple2<>(4,"VUE"));
        DataSource<Tuple2<Integer, String>> data = env.fromCollection(info);
        data.groupBy(0).first(2).print();
    }

    public static void mapPartitionFunction(ExecutionEnvironment env) throws Exception {
        List<String> students = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            students.add("student: " + i);
        }
        DataSource<String> data = env.fromCollection(students).setParallelism(4);
        //map here runs once per element of students
        data.map(student -> {
            int connection = DBUntils.getConnection();
            //TODO database operations
            DBUntils.returnConnection(connection);
            return connection;
        }).print();
        System.out.println("-----------");
        //mapPartition here runs once per partition, i.e. according to the parallelism
        data.mapPartition((MapPartitionFunction<String,Integer>)(student, collector) -> {
            int connection = DBUntils.getConnection();
            //TODO database operations
            DBUntils.returnConnection(connection);
            collector.collect(connection);
        }).returns(Integer.class).print();
    }

    public static void filterFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).filter(x -> x > 5).print();
    }

    public static void mapFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).print();
    }
}

Run result (logs omitted):

(3,Linux)
(1,Hadoop)
(1,Spark)
(4,VUE)
(2,Java)
(2,Spring boot)

Scala code

import com.guanjian.flink.scala.until.DBUntils
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.util.Collector

import scala.collection.mutable.ListBuffer

object DataSetTransformationApp {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
//    mapFunction(env)
//    filterFunction(env)
//    mapPartitionFunction(env)
    firstFunction(env)
  }

  def firstFunction(env: ExecutionEnvironment): Unit = {
    val info = List((1,"Hadoop"),(1,"Spark"),(1,"Flink"),(2,"Java"),
      (2,"Spring boot"),(3,"Linux"),(4,"VUE"))
    val data = env.fromCollection(info)
    data.groupBy(0).first(2).print()
  }

  def mapPartitionFunction(env: ExecutionEnvironment): Unit = {
    val students = new ListBuffer[String]
    for(i <- 1 to 100) {
      students.append("student: " + i)
    }
    val data = env.fromCollection(students).setParallelism(4)
    data.map(student => {
      val connection = DBUntils.getConnection()
      //TODO database operations
      DBUntils.returnConnection(connection)
      connection
    }).print()
    println("-----------")
    data.mapPartition((student,collector: Collector[Int]) => {
      val connection = DBUntils.getConnection()
      //TODO database operations
      DBUntils.returnConnection(connection)
      collector.collect(connection)
    }).print()
  }

  def filterFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).filter(_ > 5).print()
  }

  def mapFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).print()
  }
}

Run result (logs omitted):

(3,Linux)
(1,Hadoop)
(1,Spark)
(4,VUE)
(2,Java)
(2,Spring boot)

Taking the first two records of each group in ascending order

public class DataSetTransformationApp {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//        mapFunction(env);
//        filterFunction(env);
//        mapPartitionFunction(env);
        firstFunction(env);
    }

    public static void firstFunction(ExecutionEnvironment env) throws Exception {
        List<Tuple2<Integer,String>> info = Arrays.asList(new Tuple2<>(1,"Hadoop"),
                new Tuple2<>(1,"Spark"),
                new Tuple2<>(1,"Flink"),
                new Tuple2<>(2,"Java"),
                new Tuple2<>(2,"Spring boot"),
                new Tuple2<>(3,"Linux"),
                new Tuple2<>(4,"VUE"));
        DataSource<Tuple2<Integer, String>> data = env.fromCollection(info);
        data.groupBy(0).sortGroup(1, Order.ASCENDING).first(2).print();
    }

    public static void mapPartitionFunction(ExecutionEnvironment env) throws Exception {
        List<String> students = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            students.add("student: " + i);
        }
        DataSource<String> data = env.fromCollection(students).setParallelism(4);
        //map here runs once per element of students
        data.map(student -> {
            int connection = DBUntils.getConnection();
            //TODO database operations
            DBUntils.returnConnection(connection);
            return connection;
        }).print();
        System.out.println("-----------");
        //mapPartition here runs once per partition, i.e. according to the parallelism
        data.mapPartition((MapPartitionFunction<String,Integer>)(student, collector) -> {
            int connection = DBUntils.getConnection();
            //TODO database operations
            DBUntils.returnConnection(connection);
            collector.collect(connection);
        }).returns(Integer.class).print();
    }

    public static void filterFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).filter(x -> x > 5).print();
    }

    public static void mapFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).print();
    }
}

Run result (logs omitted):

(3,Linux)
(1,Flink)
(1,Hadoop)
(4,VUE)
(2,Java)
(2,Spring boot)

Scala code

import com.guanjian.flink.scala.until.DBUntils
import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.util.Collector

import scala.collection.mutable.ListBuffer

object DataSetTransformationApp {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
//    mapFunction(env)
//    filterFunction(env)
//    mapPartitionFunction(env)
    firstFunction(env)
  }

  def firstFunction(env: ExecutionEnvironment): Unit = {
    val info = List((1,"Hadoop"),(1,"Spark"),(1,"Flink"),(2,"Java"),
      (2,"Spring boot"),(3,"Linux"),(4,"VUE"))
    val data = env.fromCollection(info)
    data.groupBy(0).sortGroup(1,Order.ASCENDING).first(2).print()
  }

  def mapPartitionFunction(env: ExecutionEnvironment): Unit = {
    val students = new ListBuffer[String]
    for(i <- 1 to 100) {
      students.append("student: " + i)
    }
    val data = env.fromCollection(students).setParallelism(4)
    data.map(student => {
      val connection = DBUntils.getConnection()
      //TODO database operations
      DBUntils.returnConnection(connection)
      connection
    }).print()
    println("-----------")
    data.mapPartition((student,collector: Collector[Int]) => {
      val connection = DBUntils.getConnection()
      //TODO database operations
      DBUntils.returnConnection(connection)
      collector.collect(connection)
    }).print()
  }

  def filterFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).filter(_ > 5).print()
  }

  def mapFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).print()
  }
}

Run result (logs omitted):

(3,Linux)
(1,Flink)
(1,Hadoop)
(4,VUE)
(2,Java)
(2,Spring boot)

Taking the first two records of each group in descending order

public class DataSetTransformationApp {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//        mapFunction(env);
//        filterFunction(env);
//        mapPartitionFunction(env);
        firstFunction(env);
    }

    public static void firstFunction(ExecutionEnvironment env) throws Exception {
        List<Tuple2<Integer,String>> info = Arrays.asList(new Tuple2<>(1,"Hadoop"),
                new Tuple2<>(1,"Spark"),
                new Tuple2<>(1,"Flink"),
                new Tuple2<>(2,"Java"),
                new Tuple2<>(2,"Spring boot"),
                new Tuple2<>(3,"Linux"),
                new Tuple2<>(4,"VUE"));
        DataSource<Tuple2<Integer, String>> data = env.fromCollection(info);
        data.groupBy(0).sortGroup(1, Order.DESCENDING).first(2).print();
    }

    public static void mapPartitionFunction(ExecutionEnvironment env) throws Exception {
        List<String> students = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            students.add("student: " + i);
        }
        DataSource<String> data = env.fromCollection(students).setParallelism(4);
        //map here runs once per element of students
        data.map(student -> {
            int connection = DBUntils.getConnection();
            //TODO database operations
            DBUntils.returnConnection(connection);
            return connection;
        }).print();
        System.out.println("-----------");
        //mapPartition here runs once per partition, i.e. according to the parallelism
        data.mapPartition((MapPartitionFunction<String,Integer>)(student, collector) -> {
            int connection = DBUntils.getConnection();
            //TODO database operations
            DBUntils.returnConnection(connection);
            collector.collect(connection);
        }).returns(Integer.class).print();
    }

    public static void filterFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).filter(x -> x > 5).print();
    }

    public static void mapFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).print();
    }
}

Run result (logs omitted):

(3,Linux)
(1,Spark)
(1,Hadoop)
(4,VUE)
(2,Spring boot)
(2,Java)

Scala code

import com.guanjian.flink.scala.until.DBUntils
import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.util.Collector

import scala.collection.mutable.ListBuffer

object DataSetTransformationApp {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
//    mapFunction(env)
//    filterFunction(env)
//    mapPartitionFunction(env)
    firstFunction(env)
  }

  def firstFunction(env: ExecutionEnvironment): Unit = {
    val info = List((1,"Hadoop"),(1,"Spark"),(1,"Flink"),(2,"Java"),
      (2,"Spring boot"),(3,"Linux"),(4,"VUE"))
    val data = env.fromCollection(info)
    data.groupBy(0).sortGroup(1,Order.DESCENDING).first(2).print()
  }

  def mapPartitionFunction(env: ExecutionEnvironment): Unit = {
    val students = new ListBuffer[String]
    for(i <- 1 to 100) {
      students.append("student: " + i)
    }
    val data = env.fromCollection(students).setParallelism(4)
    data.map(student => {
      val connection = DBUntils.getConnection()
      //TODO database operations
      DBUntils.returnConnection(connection)
      connection
    }).print()
    println("-----------")
    data.mapPartition((student,collector: Collector[Int]) => {
      val connection = DBUntils.getConnection()
      //TODO database operations
      DBUntils.returnConnection(connection)
      collector.collect(connection)
    }).print()
  }

  def filterFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).filter(_ > 5).print()
  }

  def mapFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).print()
  }
}

Run result (logs omitted):

(3,Linux)
(1,Spark)
(1,Hadoop)
(4,VUE)
(2,Spring boot)
(2,Java)

The flatMap operator

public class DataSetTransformationApp {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//        mapFunction(env);
//        filterFunction(env);
//        mapPartitionFunction(env);
//        firstFunction(env);
        flatMapFunction(env);
    }

    public static void flatMapFunction(ExecutionEnvironment env) throws Exception {
        List<String> info = Arrays.asList("hadoop,spark","hadoop,flink","flink,flink");
        DataSource<String> data = env.fromCollection(info);
        data.flatMap((FlatMapFunction<String,String>)(value, collector) -> {
            String tokens[] = value.split(",");
            Stream.of(tokens).forEach(collector::collect);
        }).returns(String.class).print();
    }

    public static void firstFunction(ExecutionEnvironment env) throws Exception {
        List<Tuple2<Integer,String>> info = Arrays.asList(new Tuple2<>(1,"Hadoop"),
                new Tuple2<>(1,"Spark"),
                new Tuple2<>(1,"Flink"),
                new Tuple2<>(2,"Java"),
                new Tuple2<>(2,"Spring boot"),
                new Tuple2<>(3,"Linux"),
                new Tuple2<>(4,"VUE"));
        DataSource<Tuple2<Integer, String>> data = env.fromCollection(info);
        data.groupBy(0).sortGroup(1, Order.DESCENDING).first(2).print();
    }

    public static void mapPartitionFunction(ExecutionEnvironment env) throws Exception {
        List<String> students = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            students.add("student: " + i);
        }
        DataSource<String> data = env.fromCollection(students).setParallelism(4);
        //map here runs once per element of students
        data.map(student -> {
            int connection = DBUntils.getConnection();
            //TODO database operations
            DBUntils.returnConnection(connection);
            return connection;
        }).print();
        System.out.println("-----------");
        //mapPartition here runs once per partition, i.e. according to the parallelism
        data.mapPartition((MapPartitionFunction<String,Integer>)(student, collector) -> {
            int connection = DBUntils.getConnection();
            //TODO database operations
            DBUntils.returnConnection(connection);
            collector.collect(connection);
        }).returns(Integer.class).print();
    }

    public static void filterFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).filter(x -> x > 5).print();
    }

    public static void mapFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).print();
    }
}

Run result (logs omitted):

hadoop
spark
hadoop
flink
flink
flink

Scala code

import com.guanjian.flink.scala.until.DBUntils
import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.util.Collector

import scala.collection.mutable.ListBuffer

object DataSetTransformationApp {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
//    mapFunction(env)
//    filterFunction(env)
//    mapPartitionFunction(env)
//    firstFunction(env)
    flatMapFunction(env)
  }

  def flatMapFunction(env: ExecutionEnvironment): Unit = {
    val info = List("hadoop,spark","hadoop,flink","flink,flink")
    val data = env.fromCollection(info)
    data.flatMap(_.split(",")).print()
  }

  def firstFunction(env: ExecutionEnvironment): Unit = {
    val info = List((1,"Hadoop"),(1,"Spark"),(1,"Flink"),(2,"Java"),
      (2,"Spring boot"),(3,"Linux"),(4,"VUE"))
    val data = env.fromCollection(info)
    data.groupBy(0).sortGroup(1,Order.DESCENDING).first(2).print()
  }

  def mapPartitionFunction(env: ExecutionEnvironment): Unit = {
    val students = new ListBuffer[String]
    for(i <- 1 to 100) {
      students.append("student: " + i)
    }
    val data = env.fromCollection(students).setParallelism(4)
    data.map(student => {
      val connection = DBUntils.getConnection()
      //TODO database operations
      DBUntils.returnConnection(connection)
      connection
    }).print()
    println("-----------")
    data.mapPartition((student,collector: Collector[Int]) => {
      val connection = DBUntils.getConnection()
      //TODO database operations
      DBUntils.returnConnection(connection)
      collector.collect(connection)
    }).print()
  }

  def filterFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).filter(_ > 5).print()
  }

  def mapFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).print()
  }
}

Run result (logs omitted):

hadoop
spark
hadoop
flink
flink
flink

Of course, Scala also supports the same style as the Java version:

def flatMapFunction(env: ExecutionEnvironment): Unit = {
  val info = List("hadoop,spark","hadoop,flink","flink,flink")
  val data = env.fromCollection(info)
  data.flatMap((value,collector: Collector[String]) => {
    val tokens = value.split(",")
    tokens.foreach(collector.collect(_))
  }).print()
}

Counting word occurrences

public class DataSetTransformationApp {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//        mapFunction(env);
//        filterFunction(env);
//        mapPartitionFunction(env);
//        firstFunction(env);
        flatMapFunction(env);
    }

    public static void flatMapFunction(ExecutionEnvironment env) throws Exception {
        List<String> info = Arrays.asList("hadoop,spark", "hadoop,flink", "flink,flink");
        DataSource<String> data = env.fromCollection(info);
        data.flatMap((FlatMapFunction<String, String>) (value, collector) -> {
            String tokens[] = value.split(",");
            Stream.of(tokens).forEach(collector::collect);
        }).returns(String.class)
                .map(new MapFunction<String, Tuple2<String,Integer>>() {
                    @Override
                    public Tuple2<String, Integer> map(String value) throws Exception {
                        return new Tuple2<>(value,1);
                    }
                })
            .groupBy(0).sum(1).print();
    }

    public static void firstFunction(ExecutionEnvironment env) throws Exception {
        List<Tuple2<Integer,String>> info = Arrays.asList(new Tuple2<>(1,"Hadoop"),
                new Tuple2<>(1,"Spark"),
                new Tuple2<>(1,"Flink"),
                new Tuple2<>(2,"Java"),
                new Tuple2<>(2,"Spring boot"),
                new Tuple2<>(3,"Linux"),
                new Tuple2<>(4,"VUE"));
        DataSource<Tuple2<Integer, String>> data = env.fromCollection(info);
        data.groupBy(0).sortGroup(1, Order.DESCENDING).first(2).print();
    }

    public static void mapPartitionFunction(ExecutionEnvironment env) throws Exception {
        List<String> students = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            students.add("student: " + i);
        }
        DataSource<String> data = env.fromCollection(students).setParallelism(4);
        //map here runs once per element of students
        data.map(student -> {
            int connection = DBUntils.getConnection();
            //TODO database operations
            DBUntils.returnConnection(connection);
            return connection;
        }).print();
        System.out.println("-----------");
        //mapPartition here runs once per partition, i.e. according to the parallelism
        data.mapPartition((MapPartitionFunction<String,Integer>)(student, collector) -> {
            int connection = DBUntils.getConnection();
            //TODO database operations
            DBUntils.returnConnection(connection);
            collector.collect(connection);
        }).returns(Integer.class).print();
    }

    public static void filterFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).filter(x -> x > 5).print();
    }

    public static void mapFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).print();
    }
}

Run result (logs omitted):

(hadoop,2)
(flink,3)
(spark,1)

Scala code

import com.guanjian.flink.scala.until.DBUntils
import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.util.Collector

import scala.collection.mutable.ListBuffer

object DataSetTransformationApp {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
//    mapFunction(env)
//    filterFunction(env)
//    mapPartitionFunction(env)
//    firstFunction(env)
    flatMapFunction(env)
  }

  def flatMapFunction(env: ExecutionEnvironment): Unit = {
    val info = List("hadoop,spark","hadoop,flink","flink,flink")
    val data = env.fromCollection(info)
    data.flatMap(_.split(",")).map((_,1)).groupBy(0)
      .sum(1).print()
  }

  def firstFunction(env: ExecutionEnvironment): Unit = {
    val info = List((1,"Hadoop"),(1,"Spark"),(1,"Flink"),(2,"Java"),
      (2,"Spring boot"),(3,"Linux"),(4,"VUE"))
    val data = env.fromCollection(info)
    data.groupBy(0).sortGroup(1,Order.DESCENDING).first(2).print()
  }

  def mapPartitionFunction(env: ExecutionEnvironment): Unit = {
    val students = new ListBuffer[String]
    for(i <- 1 to 100) {
      students.append("student: " + i)
    }
    val data = env.fromCollection(students).setParallelism(4)
    data.map(student => {
      val connection = DBUntils.getConnection()
      //TODO database operation
      DBUntils.returnConnection(connection)
      connection
    }).print()
    println("-----------")
    data.mapPartition((student,collector: Collector[Int]) => {
      val connection = DBUntils.getConnection()
      //TODO database operation
      DBUntils.returnConnection(connection)
      collector.collect(connection)
    }).print()
  }

  def filterFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).filter(_ > 5).print()
  }

  def mapFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).print()
  }
}

Run result (logs omitted)

(hadoop,2)
(flink,3)
(spark,1)

The distinct operator

public class DataSetTransformationApp {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//        mapFunction(env);
//        filterFunction(env);
//        mapPartitionFunction(env);
//        firstFunction(env);
//        flatMapFunction(env);
        distinctFunction(env);
    }

    public static void distinctFunction(ExecutionEnvironment env) throws Exception {
        List<String> info = Arrays.asList("hadoop,spark", "hadoop,flink", "flink,flink");
        DataSource<String> data = env.fromCollection(info);
        data.flatMap((FlatMapFunction<String, String>) (value, collector) -> {
            String tokens[] = value.split(",");
            Stream.of(tokens).forEach(collector::collect);
        }).returns(String.class)
                .distinct().print();
    }

    public static void flatMapFunction(ExecutionEnvironment env) throws Exception {
        List<String> info = Arrays.asList("hadoop,spark", "hadoop,flink", "flink,flink");
        DataSource<String> data = env.fromCollection(info);
        data.flatMap((FlatMapFunction<String, String>) (value, collector) -> {
            String tokens[] = value.split(",");
            Stream.of(tokens).forEach(collector::collect);
        }).returns(String.class)
                .map(new MapFunction<String, Tuple2<String,Integer>>() {
                    @Override
                    public Tuple2<String, Integer> map(String value) throws Exception {
                        return new Tuple2<>(value,1);
                    }
                })
            .groupBy(0).sum(1).print();
    }

    public static void firstFunction(ExecutionEnvironment env) throws Exception {
        List<Tuple2<Integer,String>> info = Arrays.asList(new Tuple2<>(1,"Hadoop"),
                new Tuple2<>(1,"Spark"),
                new Tuple2<>(1,"Flink"),
                new Tuple2<>(2,"Java"),
                new Tuple2<>(2,"Spring boot"),
                new Tuple2<>(3,"Linux"),
                new Tuple2<>(4,"VUE"));
        DataSource<Tuple2<Integer, String>> data = env.fromCollection(info);
        data.groupBy(0).sortGroup(1, Order.DESCENDING).first(2).print();
    }

    public static void mapPartitionFunction(ExecutionEnvironment env) throws Exception {
        List<String> students = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            students.add("student: " + i);
        }
        DataSource<String> data = env.fromCollection(students).setParallelism(4);
        //here the conversion runs once per element, i.e. once for every entry in students
        data.map(student -> {
            int connection = DBUntils.getConnection();
            //TODO database operation
            DBUntils.returnConnection(connection);
            return connection;
        }).print();
        System.out.println("-----------");
        //here the conversion runs once per partition, i.e. as many times as the degree of parallelism
        data.mapPartition((MapPartitionFunction<String,Integer>)(student, collector) -> {
            int connection = DBUntils.getConnection();
            //TODO database operation
            DBUntils.returnConnection(connection);
            collector.collect(connection);
        }).returns(Integer.class).print();
    }

    public static void filterFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).filter(x -> x > 5).print();
    }

    public static void mapFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).print();
    }
}

Run result (logs omitted)

hadoop
flink
spark

Scala code

import com.guanjian.flink.scala.until.DBUntils
import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.util.Collector

import scala.collection.mutable.ListBuffer

object DataSetTransformationApp {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
//    mapFunction(env)
//    filterFunction(env)
//    mapPartitionFunction(env)
//    firstFunction(env)
//    flatMapFunction(env)
    distinctFunction(env)
  }

  def distinctFunction(env: ExecutionEnvironment): Unit = {
    val info = List("hadoop,spark","hadoop,flink","flink,flink")
    val data = env.fromCollection(info)
    data.flatMap(_.split(",")).distinct().print()
  }

  def flatMapFunction(env: ExecutionEnvironment): Unit = {
    val info = List("hadoop,spark","hadoop,flink","flink,flink")
    val data = env.fromCollection(info)
    data.flatMap(_.split(",")).map((_,1)).groupBy(0)
      .sum(1).print()
  }

  def firstFunction(env: ExecutionEnvironment): Unit = {
    val info = List((1,"Hadoop"),(1,"Spark"),(1,"Flink"),(2,"Java"),
      (2,"Spring boot"),(3,"Linux"),(4,"VUE"))
    val data = env.fromCollection(info)
    data.groupBy(0).sortGroup(1,Order.DESCENDING).first(2).print()
  }

  def mapPartitionFunction(env: ExecutionEnvironment): Unit = {
    val students = new ListBuffer[String]
    for(i <- 1 to 100) {
      students.append("student: " + i)
    }
    val data = env.fromCollection(students).setParallelism(4)
    data.map(student => {
      val connection = DBUntils.getConnection()
      //TODO database operation
      DBUntils.returnConnection(connection)
      connection
    }).print()
    println("-----------")
    data.mapPartition((student,collector: Collector[Int]) => {
      val connection = DBUntils.getConnection()
      //TODO database operation
      DBUntils.returnConnection(connection)
      collector.collect(connection)
    }).print()
  }

  def filterFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).filter(_ > 5).print()
  }

  def mapFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).print()
  }
}

Run result (logs omitted)

hadoop
flink
spark

The join operator

public class DataSetTransformationApp {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//        mapFunction(env);
//        filterFunction(env);
//        mapPartitionFunction(env);
//        firstFunction(env);
//        flatMapFunction(env);
//        distinctFunction(env);
        joinFunction(env);
    }

    public static void joinFunction(ExecutionEnvironment env) throws Exception {
        List<Tuple2<Integer,String>> info1 = Arrays.asList(new Tuple2<>(1,"PK哥"),
                new Tuple2<>(2,"J哥"),
                new Tuple2<>(3,"小隊長"),
                new Tuple2<>(4,"天空藍"));
        List<Tuple2<Integer,String>> info2 = Arrays.asList(new Tuple2<>(1,"北京"),
                new Tuple2<>(2,"上海"),
                new Tuple2<>(3,"成都"),
                new Tuple2<>(4,"杭州"));
        DataSource<Tuple2<Integer, String>> data1 = env.fromCollection(info1);
        DataSource<Tuple2<Integer, String>> data2 = env.fromCollection(info2);
        data1.join(data2).where(0).equalTo(0)
                .with(new JoinFunction<Tuple2<Integer,String>, Tuple2<Integer,String>, Tuple3<Integer,String,String>>() {
                    @Override
                    public Tuple3<Integer, String, String> join(Tuple2<Integer, String> first, Tuple2<Integer, String> second) throws Exception {
                        return new Tuple3<>(first.getField(0),first.getField(1),second.getField(1));
                    }
                }).print();
    }

    public static void distinctFunction(ExecutionEnvironment env) throws Exception {
        List<String> info = Arrays.asList("hadoop,spark", "hadoop,flink", "flink,flink");
        DataSource<String> data = env.fromCollection(info);
        data.flatMap((FlatMapFunction<String, String>) (value, collector) -> {
            String tokens[] = value.split(",");
            Stream.of(tokens).forEach(collector::collect);
        }).returns(String.class)
                .distinct().print();
    }

    public static void flatMapFunction(ExecutionEnvironment env) throws Exception {
        List<String> info = Arrays.asList("hadoop,spark", "hadoop,flink", "flink,flink");
        DataSource<String> data = env.fromCollection(info);
        data.flatMap((FlatMapFunction<String, String>) (value, collector) -> {
            String tokens[] = value.split(",");
            Stream.of(tokens).forEach(collector::collect);
        }).returns(String.class)
                .map(new MapFunction<String, Tuple2<String,Integer>>() {
                    @Override
                    public Tuple2<String, Integer> map(String value) throws Exception {
                        return new Tuple2<>(value,1);
                    }
                })
            .groupBy(0).sum(1).print();
    }

    public static void firstFunction(ExecutionEnvironment env) throws Exception {
        List<Tuple2<Integer,String>> info = Arrays.asList(new Tuple2<>(1,"Hadoop"),
                new Tuple2<>(1,"Spark"),
                new Tuple2<>(1,"Flink"),
                new Tuple2<>(2,"Java"),
                new Tuple2<>(2,"Spring boot"),
                new Tuple2<>(3,"Linux"),
                new Tuple2<>(4,"VUE"));
        DataSource<Tuple2<Integer, String>> data = env.fromCollection(info);
        data.groupBy(0).sortGroup(1, Order.DESCENDING).first(2).print();
    }

    public static void mapPartitionFunction(ExecutionEnvironment env) throws Exception {
        List<String> students = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            students.add("student: " + i);
        }
        DataSource<String> data = env.fromCollection(students).setParallelism(4);
        //here the conversion runs once per element, i.e. once for every entry in students
        data.map(student -> {
            int connection = DBUntils.getConnection();
            //TODO database operation
            DBUntils.returnConnection(connection);
            return connection;
        }).print();
        System.out.println("-----------");
        //here the conversion runs once per partition, i.e. as many times as the degree of parallelism
        data.mapPartition((MapPartitionFunction<String,Integer>)(student, collector) -> {
            int connection = DBUntils.getConnection();
            //TODO database operation
            DBUntils.returnConnection(connection);
            collector.collect(connection);
        }).returns(Integer.class).print();
    }

    public static void filterFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).filter(x -> x > 5).print();
    }

    public static void mapFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).print();
    }
}

Run result (logs omitted)

(3,小隊長,成都)
(1,PK哥,北京)
(4,天空藍,杭州)
(2,J哥,上海)

Scala code

import com.guanjian.flink.scala.until.DBUntils
import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.util.Collector

import scala.collection.mutable.ListBuffer

object DataSetTransformationApp {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
//    mapFunction(env)
//    filterFunction(env)
//    mapPartitionFunction(env)
//    firstFunction(env)
//    flatMapFunction(env)
//    distinctFunction(env)
    joinFunction(env)
  }

  def joinFunction(env: ExecutionEnvironment): Unit = {
    val info1 = List((1,"PK哥"),(2,"J哥"),(3,"小隊長"),(4,"天空藍"))
    val info2 = List((1,"北京"),(2,"上海"),(3,"成都"),(4,"杭州"))
    val data1 = env.fromCollection(info1)
    val data2 = env.fromCollection(info2)
    data1.join(data2).where(0).equalTo(0).apply((first,second) =>
      (first._1,first._2,second._2)
    ).print()
  }

  def distinctFunction(env: ExecutionEnvironment): Unit = {
    val info = List("hadoop,spark","hadoop,flink","flink,flink")
    val data = env.fromCollection(info)
    data.flatMap(_.split(",")).distinct().print()
  }

  def flatMapFunction(env: ExecutionEnvironment): Unit = {
    val info = List("hadoop,spark","hadoop,flink","flink,flink")
    val data = env.fromCollection(info)
    data.flatMap(_.split(",")).map((_,1)).groupBy(0)
      .sum(1).print()
  }

  def firstFunction(env: ExecutionEnvironment): Unit = {
    val info = List((1,"Hadoop"),(1,"Spark"),(1,"Flink"),(2,"Java"),
      (2,"Spring boot"),(3,"Linux"),(4,"VUE"))
    val data = env.fromCollection(info)
    data.groupBy(0).sortGroup(1,Order.DESCENDING).first(2).print()
  }

  def mapPartitionFunction(env: ExecutionEnvironment): Unit = {
    val students = new ListBuffer[String]
    for(i <- 1 to 100) {
      students.append("student: " + i)
    }
    val data = env.fromCollection(students).setParallelism(4)
    data.map(student => {
      val connection = DBUntils.getConnection()
      //TODO database operation
      DBUntils.returnConnection(connection)
      connection
    }).print()
    println("-----------")
    data.mapPartition((student,collector: Collector[Int]) => {
      val connection = DBUntils.getConnection()
      //TODO database operation
      DBUntils.returnConnection(connection)
      collector.collect(connection)
    }).print()
  }

  def filterFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).filter(_ > 5).print()
  }

  def mapFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).print()
  }
}

Run result (logs omitted)

(3,小隊長,成都)
(1,PK哥,北京)
(4,天空藍,杭州)
(2,J哥,上海)

The outJoin operator

Left outer join

public class DataSetTransformationApp {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//        mapFunction(env);
//        filterFunction(env);
//        mapPartitionFunction(env);
//        firstFunction(env);
//        flatMapFunction(env);
//        distinctFunction(env);
//        joinFunction(env);
        outJoinFunction(env);
    }

    public static void outJoinFunction(ExecutionEnvironment env) throws Exception {
        List<Tuple2<Integer,String>> info1 = Arrays.asList(new Tuple2<>(1,"PK哥"),
                new Tuple2<>(2,"J哥"),
                new Tuple2<>(3,"小隊長"),
                new Tuple2<>(4,"天空藍"));
        List<Tuple2<Integer,String>> info2 = Arrays.asList(new Tuple2<>(1,"北京"),
                new Tuple2<>(2,"上海"),
                new Tuple2<>(3,"成都"),
                new Tuple2<>(5,"杭州"));
        DataSource<Tuple2<Integer, String>> data1 = env.fromCollection(info1);
        DataSource<Tuple2<Integer, String>> data2 = env.fromCollection(info2);
        data1.leftOuterJoin(data2).where(0).equalTo(0)
                .with(new JoinFunction<Tuple2<Integer,String>, Tuple2<Integer,String>, Tuple3<Integer,String,String>>() {
                    @Override
                    public Tuple3<Integer, String, String> join(Tuple2<Integer, String> first, Tuple2<Integer, String> second) throws Exception {
                        if (second == null) {
                            return new Tuple3<>(first.getField(0),first.getField(1),"-");
                        }
                        return new Tuple3<>(first.getField(0),first.getField(1),second.getField(1));
                    }
                }).print();
    }

    public static void joinFunction(ExecutionEnvironment env) throws Exception {
        List<Tuple2<Integer,String>> info1 = Arrays.asList(new Tuple2<>(1,"PK哥"),
                new Tuple2<>(2,"J哥"),
                new Tuple2<>(3,"小隊長"),
                new Tuple2<>(4,"天空藍"));
        List<Tuple2<Integer,String>> info2 = Arrays.asList(new Tuple2<>(1,"北京"),
                new Tuple2<>(2,"上海"),
                new Tuple2<>(3,"成都"),
                new Tuple2<>(4,"杭州"));
        DataSource<Tuple2<Integer, String>> data1 = env.fromCollection(info1);
        DataSource<Tuple2<Integer, String>> data2 = env.fromCollection(info2);
        data1.join(data2).where(0).equalTo(0)
                .with(new JoinFunction<Tuple2<Integer,String>, Tuple2<Integer,String>, Tuple3<Integer,String,String>>() {
                    @Override
                    public Tuple3<Integer, String, String> join(Tuple2<Integer, String> first, Tuple2<Integer, String> second) throws Exception {
                        return new Tuple3<>(first.getField(0),first.getField(1),second.getField(1));
                    }
                }).print();
    }

    public static void distinctFunction(ExecutionEnvironment env) throws Exception {
        List<String> info = Arrays.asList("hadoop,spark", "hadoop,flink", "flink,flink");
        DataSource<String> data = env.fromCollection(info);
        data.flatMap((FlatMapFunction<String, String>) (value, collector) -> {
            String tokens[] = value.split(",");
            Stream.of(tokens).forEach(collector::collect);
        }).returns(String.class)
                .distinct().print();
    }

    public static void flatMapFunction(ExecutionEnvironment env) throws Exception {
        List<String> info = Arrays.asList("hadoop,spark", "hadoop,flink", "flink,flink");
        DataSource<String> data = env.fromCollection(info);
        data.flatMap((FlatMapFunction<String, String>) (value, collector) -> {
            String tokens[] = value.split(",");
            Stream.of(tokens).forEach(collector::collect);
        }).returns(String.class)
                .map(new MapFunction<String, Tuple2<String,Integer>>() {
                    @Override
                    public Tuple2<String, Integer> map(String value) throws Exception {
                        return new Tuple2<>(value,1);
                    }
                })
            .groupBy(0).sum(1).print();
    }

    public static void firstFunction(ExecutionEnvironment env) throws Exception {
        List<Tuple2<Integer,String>> info = Arrays.asList(new Tuple2<>(1,"Hadoop"),
                new Tuple2<>(1,"Spark"),
                new Tuple2<>(1,"Flink"),
                new Tuple2<>(2,"Java"),
                new Tuple2<>(2,"Spring boot"),
                new Tuple2<>(3,"Linux"),
                new Tuple2<>(4,"VUE"));
        DataSource<Tuple2<Integer, String>> data = env.fromCollection(info);
        data.groupBy(0).sortGroup(1, Order.DESCENDING).first(2).print();
    }

    public static void mapPartitionFunction(ExecutionEnvironment env) throws Exception {
        List<String> students = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            students.add("student: " + i);
        }
        DataSource<String> data = env.fromCollection(students).setParallelism(4);
        //here the conversion runs once per element, i.e. once for every entry in students
        data.map(student -> {
            int connection = DBUntils.getConnection();
            //TODO database operation
            DBUntils.returnConnection(connection);
            return connection;
        }).print();
        System.out.println("-----------");
        //here the conversion runs once per partition, i.e. as many times as the degree of parallelism
        data.mapPartition((MapPartitionFunction<String,Integer>)(student, collector) -> {
            int connection = DBUntils.getConnection();
            //TODO database operation
            DBUntils.returnConnection(connection);
            collector.collect(connection);
        }).returns(Integer.class).print();
    }

    public static void filterFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).filter(x -> x > 5).print();
    }

    public static void mapFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).print();
    }
}

Run result (logs omitted)

(3,小隊長,成都)
(1,PK哥,北京)
(4,天空藍,-)
(2,J哥,上海)

Scala code

import com.guanjian.flink.scala.until.DBUntils
import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.util.Collector

import scala.collection.mutable.ListBuffer

object DataSetTransformationApp {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
//    mapFunction(env)
//    filterFunction(env)
//    mapPartitionFunction(env)
//    firstFunction(env)
//    flatMapFunction(env)
//    distinctFunction(env)
    joinFunction(env)
  }

  def joinFunction(env: ExecutionEnvironment): Unit = {
    val info1 = List((1,"PK哥"),(2,"J哥"),(3,"小隊長"),(4,"天空藍"))
    val info2 = List((1,"北京"),(2,"上海"),(3,"成都"),(5,"杭州"))
    val data1 = env.fromCollection(info1)
    val data2 = env.fromCollection(info2)
    data1.leftOuterJoin(data2).where(0).equalTo(0).apply((first,second) => {
      if (second == null) {
        (first._1,first._2,"-")
      }else {
        (first._1,first._2,second._2)
      }
    }).print()
  }

  def distinctFunction(env: ExecutionEnvironment): Unit = {
    val info = List("hadoop,spark","hadoop,flink","flink,flink")
    val data = env.fromCollection(info)
    data.flatMap(_.split(",")).distinct().print()
  }

  def flatMapFunction(env: ExecutionEnvironment): Unit = {
    val info = List("hadoop,spark","hadoop,flink","flink,flink")
    val data = env.fromCollection(info)
    data.flatMap(_.split(",")).map((_,1)).groupBy(0)
      .sum(1).print()
  }

  def firstFunction(env: ExecutionEnvironment): Unit = {
    val info = List((1,"Hadoop"),(1,"Spark"),(1,"Flink"),(2,"Java"),
      (2,"Spring boot"),(3,"Linux"),(4,"VUE"))
    val data = env.fromCollection(info)
    data.groupBy(0).sortGroup(1,Order.DESCENDING).first(2).print()
  }

  def mapPartitionFunction(env: ExecutionEnvironment): Unit = {
    val students = new ListBuffer[String]
    for(i <- 1 to 100) {
      students.append("student: " + i)
    }
    val data = env.fromCollection(students).setParallelism(4)
    data.map(student => {
      val connection = DBUntils.getConnection()
      //TODO database operation
      DBUntils.returnConnection(connection)
      connection
    }).print()
    println("-----------")
    data.mapPartition((student,collector: Collector[Int]) => {
      val connection = DBUntils.getConnection()
      //TODO database operation
      DBUntils.returnConnection(connection)
      collector.collect(connection)
    }).print()
  }

  def filterFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).filter(_ > 5).print()
  }

  def mapFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).print()
  }
}

Run result (logs omitted)

(3,小隊長,成都)
(1,PK哥,北京)
(4,天空藍,-)
(2,J哥,上海)

Right outer join

public class DataSetTransformationApp {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//        mapFunction(env);
//        filterFunction(env);
//        mapPartitionFunction(env);
//        firstFunction(env);
//        flatMapFunction(env);
//        distinctFunction(env);
//        joinFunction(env);
        outJoinFunction(env);
    }

    public static void outJoinFunction(ExecutionEnvironment env) throws Exception {
        List<Tuple2<Integer,String>> info1 = Arrays.asList(new Tuple2<>(1,"PK哥"),
                new Tuple2<>(2,"J哥"),
                new Tuple2<>(3,"小隊長"),
                new Tuple2<>(4,"天空藍"));
        List<Tuple2<Integer,String>> info2 = Arrays.asList(new Tuple2<>(1,"北京"),
                new Tuple2<>(2,"上海"),
                new Tuple2<>(3,"成都"),
                new Tuple2<>(5,"杭州"));
        DataSource<Tuple2<Integer, String>> data1 = env.fromCollection(info1);
        DataSource<Tuple2<Integer, String>> data2 = env.fromCollection(info2);
        data1.rightOuterJoin(data2).where(0).equalTo(0)
                .with(new JoinFunction<Tuple2<Integer,String>, Tuple2<Integer,String>, Tuple3<Integer,String,String>>() {
                    @Override
                    public Tuple3<Integer, String, String> join(Tuple2<Integer, String> first, Tuple2<Integer, String> second) throws Exception {
                        if (first == null) {
                            return new Tuple3<>(second.getField(0),"-",second.getField(1));
                        }
                        return new Tuple3<>(first.getField(0),first.getField(1),second.getField(1));
                    }
                }).print();
    }

    public static void joinFunction(ExecutionEnvironment env) throws Exception {
        List<Tuple2<Integer,String>> info1 = Arrays.asList(new Tuple2<>(1,"PK哥"),
                new Tuple2<>(2,"J哥"),
                new Tuple2<>(3,"小隊長"),
                new Tuple2<>(4,"天空藍"));
        List<Tuple2<Integer,String>> info2 = Arrays.asList(new Tuple2<>(1,"北京"),
                new Tuple2<>(2,"上海"),
                new Tuple2<>(3,"成都"),
                new Tuple2<>(4,"杭州"));
        DataSource<Tuple2<Integer, String>> data1 = env.fromCollection(info1);
        DataSource<Tuple2<Integer, String>> data2 = env.fromCollection(info2);
        data1.join(data2).where(0).equalTo(0)
                .with(new JoinFunction<Tuple2<Integer,String>, Tuple2<Integer,String>, Tuple3<Integer,String,String>>() {
                    @Override
                    public Tuple3<Integer, String, String> join(Tuple2<Integer, String> first, Tuple2<Integer, String> second) throws Exception {
                        return new Tuple3<>(first.getField(0),first.getField(1),second.getField(1));
                    }
                }).print();
    }

    public static void distinctFunction(ExecutionEnvironment env) throws Exception {
        List<String> info = Arrays.asList("hadoop,spark", "hadoop,flink", "flink,flink");
        DataSource<String> data = env.fromCollection(info);
        data.flatMap((FlatMapFunction<String, String>) (value, collector) -> {
            String tokens[] = value.split(",");
            Stream.of(tokens).forEach(collector::collect);
        }).returns(String.class)
                .distinct().print();
    }

    public static void flatMapFunction(ExecutionEnvironment env) throws Exception {
        List<String> info = Arrays.asList("hadoop,spark", "hadoop,flink", "flink,flink");
        DataSource<String> data = env.fromCollection(info);
        data.flatMap((FlatMapFunction<String, String>) (value, collector) -> {
            String tokens[] = value.split(",");
            Stream.of(tokens).forEach(collector::collect);
        }).returns(String.class)
                .map(new MapFunction<String, Tuple2<String,Integer>>() {
                    @Override
                    public Tuple2<String, Integer> map(String value) throws Exception {
                        return new Tuple2<>(value,1);
                    }
                })
            .groupBy(0).sum(1).print();
    }

    public static void firstFunction(ExecutionEnvironment env) throws Exception {
        List<Tuple2<Integer,String>> info = Arrays.asList(new Tuple2<>(1,"Hadoop"),
                new Tuple2<>(1,"Spark"),
                new Tuple2<>(1,"Flink"),
                new Tuple2<>(2,"Java"),
                new Tuple2<>(2,"Spring boot"),
                new Tuple2<>(3,"Linux"),
                new Tuple2<>(4,"VUE"));
        DataSource<Tuple2<Integer, String>> data = env.fromCollection(info);
        data.groupBy(0).sortGroup(1, Order.DESCENDING).first(2).print();
    }

    public static void mapPartitionFunction(ExecutionEnvironment env) throws Exception {
        List<String> students = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            students.add("student: " + i);
        }
        DataSource<String> data = env.fromCollection(students).setParallelism(4);
        //here the conversion runs once per element, i.e. once for every entry in students
        data.map(student -> {
            int connection = DBUntils.getConnection();
            //TODO database operation
            DBUntils.returnConnection(connection);
            return connection;
        }).print();
        System.out.println("-----------");
        //here the conversion runs once per partition, i.e. as many times as the degree of parallelism
        data.mapPartition((MapPartitionFunction<String,Integer>)(student, collector) -> {
            int connection = DBUntils.getConnection();
            //TODO database operation
            DBUntils.returnConnection(connection);
            collector.collect(connection);
        }).returns(Integer.class).print();
    }

    public static void filterFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).filter(x -> x > 5).print();
    }

    public static void mapFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).print();
    }
}

Run result (logs omitted)

(3,小隊長,成都)
(1,PK哥,北京)
(5,-,杭州)
(2,J哥,上海)

Scala code

import com.guanjian.flink.scala.until.DBUntils
import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.util.Collector

import scala.collection.mutable.ListBuffer

object DataSetTransformationApp {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
//    mapFunction(env)
//    filterFunction(env)
//    mapPartitionFunction(env)
//    firstFunction(env)
//    flatMapFunction(env)
//    distinctFunction(env)
    joinFunction(env)
  }

  def joinFunction(env: ExecutionEnvironment): Unit = {
    val info1 = List((1,"PK哥"),(2,"J哥"),(3,"小隊長"),(4,"天空藍"))
    val info2 = List((1,"北京"),(2,"上海"),(3,"成都"),(5,"杭州"))
    val data1 = env.fromCollection(info1)
    val data2 = env.fromCollection(info2)
    data1.rightOuterJoin(data2).where(0).equalTo(0).apply((first,second) => {
      if (first == null) {
        (second._1,"-",second._2)
      }else {
        (first._1,first._2,second._2)
      }
    }).print()
  }

  def distinctFunction(env: ExecutionEnvironment): Unit = {
    val info = List("hadoop,spark","hadoop,flink","flink,flink")
    val data = env.fromCollection(info)
    data.flatMap(_.split(",")).distinct().print()
  }

  def flatMapFunction(env: ExecutionEnvironment): Unit = {
    val info = List("hadoop,spark","hadoop,flink","flink,flink")
    val data = env.fromCollection(info)
    data.flatMap(_.split(",")).map((_,1)).groupBy(0)
      .sum(1).print()
  }

  def firstFunction(env: ExecutionEnvironment): Unit = {
    val info = List((1,"Hadoop"),(1,"Spark"),(1,"Flink"),(2,"Java"),
      (2,"Spring boot"),(3,"Linux"),(4,"VUE"))
    val data = env.fromCollection(info)
    data.groupBy(0).sortGroup(1,Order.DESCENDING).first(2).print()
  }

  def mapPartitionFunction(env: ExecutionEnvironment): Unit = {
    val students = new ListBuffer[String]
    for(i <- 1 to 100) {
      students.append("student: " + i)
    }
    val data = env.fromCollection(students).setParallelism(4)
    data.map(student => {
      val connection = DBUntils.getConnection()
      //TODO database operation
      DBUntils.returnConnection(connection)
      connection
    }).print()
    println("-----------")
    data.mapPartition((student,collector: Collector[Int]) => {
      val connection = DBUntils.getConnection()
      //TODO database operation
      DBUntils.returnConnection(connection)
      collector.collect(connection)
    }).print()
  }

  def filterFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).filter(_ > 5).print()
  }

  def mapFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).print()
  }
}

Run result (logs omitted)

(3,小隊長,成都)
(1,PK哥,北京)
(5,-,杭州)
(2,J哥,上海)

Full outer join

public class DataSetTransformationApp {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//        mapFunction(env);
//        filterFunction(env);
//        mapPartitionFunction(env);
//        firstFunction(env);
//        flatMapFunction(env);
//        distinctFunction(env);
//        joinFunction(env);
        outJoinFunction(env);
    }

    public static void outJoinFunction(ExecutionEnvironment env) throws Exception {
        List<Tuple2<Integer,String>> info1 = Arrays.asList(new Tuple2<>(1,"PK哥"),
                new Tuple2<>(2,"J哥"),
                new Tuple2<>(3,"小隊長"),
                new Tuple2<>(4,"天空藍"));
        List<Tuple2<Integer,String>> info2 = Arrays.asList(new Tuple2<>(1,"北京"),
                new Tuple2<>(2,"上海"),
                new Tuple2<>(3,"成都"),
                new Tuple2<>(5,"杭州"));
        DataSource<Tuple2<Integer, String>> data1 = env.fromCollection(info1);
        DataSource<Tuple2<Integer, String>> data2 = env.fromCollection(info2);
        data1.fullOuterJoin(data2).where(0).equalTo(0)
                .with(new JoinFunction<Tuple2<Integer,String>, Tuple2<Integer,String>, Tuple3<Integer,String,String>>() {
                    @Override
                    public Tuple3<Integer, String, String> join(Tuple2<Integer, String> first, Tuple2<Integer, String> second) throws Exception {
                        if (first == null) {
                            return new Tuple3<>(second.getField(0),"-",second.getField(1));
                        }else if (second == null) {
                            return new Tuple3<>(first.getField(0),first.getField(1),"-");
                        }
                        return new Tuple3<>(first.getField(0),first.getField(1),second.getField(1));
                    }
                }).print();
    }

    public static void joinFunction(ExecutionEnvironment env) throws Exception {
        List<Tuple2<Integer,String>> info1 = Arrays.asList(new Tuple2<>(1,"PK哥"),
                new Tuple2<>(2,"J哥"),
                new Tuple2<>(3,"小隊長"),
                new Tuple2<>(4,"天空藍"));
        List<Tuple2<Integer,String>> info2 = Arrays.asList(new Tuple2<>(1,"北京"),
                new Tuple2<>(2,"上海"),
                new Tuple2<>(3,"成都"),
                new Tuple2<>(4,"杭州"));
        DataSource<Tuple2<Integer, String>> data1 = env.fromCollection(info1);
        DataSource<Tuple2<Integer, String>> data2 = env.fromCollection(info2);
        data1.join(data2).where(0).equalTo(0)
                .with(new JoinFunction<Tuple2<Integer,String>, Tuple2<Integer,String>, Tuple3<Integer,String,String>>() {
                    @Override
                    public Tuple3<Integer, String, String> join(Tuple2<Integer, String> first, Tuple2<Integer, String> second) throws Exception {
                        return new Tuple3<>(first.getField(0),first.getField(1),second.getField(1));
                    }
                }).print();
    }

    public static void distinctFunction(ExecutionEnvironment env) throws Exception {
        List<String> info = Arrays.asList("hadoop,spark", "hadoop,flink", "flink,flink");
        DataSource<String> data = env.fromCollection(info);
        data.flatMap((FlatMapFunction<String, String>) (value, collector) -> {
            String tokens[] = value.split(",");
            Stream.of(tokens).forEach(collector::collect);
        }).returns(String.class)
                .distinct().print();
    }

    public static void flatMapFunction(ExecutionEnvironment env) throws Exception {
        List<String> info = Arrays.asList("hadoop,spark", "hadoop,flink", "flink,flink");
        DataSource<String> data = env.fromCollection(info);
        data.flatMap((FlatMapFunction<String, String>) (value, collector) -> {
            String tokens[] = value.split(",");
            Stream.of(tokens).forEach(collector::collect);
        }).returns(String.class)
                .map(new MapFunction<String, Tuple2<String,Integer>>() {
                    @Override
                    public Tuple2<String, Integer> map(String value) throws Exception {
                        return new Tuple2<>(value,1);
                    }
                })
            .groupBy(0).sum(1).print();
    }

    public static void firstFunction(ExecutionEnvironment env) throws Exception {
        List<Tuple2<Integer,String>> info = Arrays.asList(new Tuple2<>(1,"Hadoop"),
                new Tuple2<>(1,"Spark"),
                new Tuple2<>(1,"Flink"),
                new Tuple2<>(2,"Java"),
                new Tuple2<>(2,"Spring boot"),
                new Tuple2<>(3,"Linux"),
                new Tuple2<>(4,"VUE"));
        DataSource<Tuple2<Integer, String>> data = env.fromCollection(info);
        data.groupBy(0).sortGroup(1, Order.DESCENDING).first(2).print();
    }

    public static void mapPartitionFunction(ExecutionEnvironment env) throws Exception {
        List<String> students = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            students.add("student: " + i);
        }
        DataSource<String> data = env.fromCollection(students).setParallelism(4);
        //here the conversion runs once per element, i.e. once for every entry in students
        data.map(student -> {
            int connection = DBUntils.getConnection();
            //TODO database operation
            DBUntils.returnConnection(connection);
            return connection;
        }).print();
        System.out.println("-----------");
        //here the conversion runs once per partition, i.e. as many times as the degree of parallelism
        data.mapPartition((MapPartitionFunction<String,Integer>)(student, collector) -> {
            int connection = DBUntils.getConnection();
            //TODO database operation
            DBUntils.returnConnection(connection);
            collector.collect(connection);
        }).returns(Integer.class).print();
    }

    public static void filterFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).filter(x -> x > 5).print();
    }

    public static void mapFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).print();
    }
}

Run result (logs omitted)

(3,小隊長,成都)
(1,PK哥,北京)
(4,天空藍,-)
(5,-,杭州)
(2,J哥,上海)

Scala code

import com.guanjian.flink.scala.until.DBUntils
import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.util.Collector

import scala.collection.mutable.ListBuffer

object DataSetTransformationApp {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
//    mapFunction(env)
//    filterFunction(env)
//    mapPartitionFunction(env)
//    firstFunction(env)
//    flatMapFunction(env)
//    distinctFunction(env)
    joinFunction(env)
  }

  def joinFunction(env: ExecutionEnvironment): Unit = {
    val info1 = List((1,"PK哥"),(2,"J哥"),(3,"小隊長"),(4,"天空藍"))
    val info2 = List((1,"北京"),(2,"上海"),(3,"成都"),(5,"杭州"))
    val data1 = env.fromCollection(info1)
    val data2 = env.fromCollection(info2)
    data1.fullOuterJoin(data2).where(0).equalTo(0).apply((first,second) => {
      if (first == null) {
        (second._1,"-",second._2)
      }else if (second == null) {
        (first._1,first._2,"-")
      }else {
        (first._1,first._2,second._2)
      }
    }).print()
  }

  def distinctFunction(env: ExecutionEnvironment): Unit = {
    val info = List("hadoop,spark","hadoop,flink","flink,flink")
    val data = env.fromCollection(info)
    data.flatMap(_.split(",")).distinct().print()
  }

  def flatMapFunction(env: ExecutionEnvironment): Unit = {
    val info = List("hadoop,spark","hadoop,flink","flink,flink")
    val data = env.fromCollection(info)
    data.flatMap(_.split(",")).map((_,1)).groupBy(0)
      .sum(1).print()
  }

  def firstFunction(env: ExecutionEnvironment): Unit = {
    val info = List((1,"Hadoop"),(1,"Spark"),(1,"Flink"),(2,"Java"),
      (2,"Spring boot"),(3,"Linux"),(4,"VUE"))
    val data = env.fromCollection(info)
    data.groupBy(0).sortGroup(1,Order.DESCENDING).first(2).print()
  }

  def mapPartitionFunction(env: ExecutionEnvironment): Unit = {
    val students = new ListBuffer[String]
    for(i <- 1 to 100) {
      students.append("student: " + i)
    }
    val data = env.fromCollection(students).setParallelism(4)
    data.map(student => {
      val connection = DBUntils.getConnection()
      //TODO database operation
      DBUntils.returnConnection(connection)
      connection
    }).print()
    println("-----------")
    data.mapPartition((student,collector: Collector[Int]) => {
      val connection = DBUntils.getConnection()
      //TODO database operation
      DBUntils.returnConnection(connection)
      collector.collect(connection)
    }).print()
  }

  def filterFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).filter(_ > 5).print()
  }

  def mapFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).print()
  }
}

Run result (logs omitted)

(3,小隊長,成都)
(1,PK哥,北京)
(4,天空藍,-)
(5,-,杭州)
(2,J哥,上海)

The cross operator

Cartesian product

public class DataSetTransformationApp {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//        mapFunction(env);
//        filterFunction(env);
//        mapPartitionFunction(env);
//        firstFunction(env);
//        flatMapFunction(env);
//        distinctFunction(env);
//        joinFunction(env);
//        outJoinFunction(env);
        crossFunction(env);
    }

    public static void crossFunction(ExecutionEnvironment env) throws Exception {
        List<String> info1 = Arrays.asList("曼聯","曼城");
        List<Integer> info2 = Arrays.asList(3,1,0);
        DataSource<String> data1 = env.fromCollection(info1);
        DataSource<Integer> data2 = env.fromCollection(info2);
        data1.cross(data2).print();
    }

    public static void outJoinFunction(ExecutionEnvironment env) throws Exception {
        List<Tuple2<Integer,String>> info1 = Arrays.asList(new Tuple2<>(1,"PK哥"),
                new Tuple2<>(2,"J哥"),
                new Tuple2<>(3,"小隊長"),
                new Tuple2<>(4,"天空藍"));
        List<Tuple2<Integer,String>> info2 = Arrays.asList(new Tuple2<>(1,"北京"),
                new Tuple2<>(2,"上海"),
                new Tuple2<>(3,"成都"),
                new Tuple2<>(5,"杭州"));
        DataSource<Tuple2<Integer, String>> data1 = env.fromCollection(info1);
        DataSource<Tuple2<Integer, String>> data2 = env.fromCollection(info2);
        data1.fullOuterJoin(data2).where(0).equalTo(0)
                .with(new JoinFunction<Tuple2<Integer,String>, Tuple2<Integer,String>, Tuple3<Integer,String,String>>() {
                    @Override
                    public Tuple3<Integer, String, String> join(Tuple2<Integer, String> first, Tuple2<Integer, String> second) throws Exception {
                        if (first == null) {
                            return new Tuple3<>(second.getField(0),"-",second.getField(1));
                        }else if (second == null) {
                            return new Tuple3<>(first.getField(0),first.getField(1),"-");
                        }
                        return new Tuple3<>(first.getField(0),first.getField(1),second.getField(1));
                    }
                }).print();
    }

    public static void joinFunction(ExecutionEnvironment env) throws Exception {
        List<Tuple2<Integer,String>> info1 = Arrays.asList(new Tuple2<>(1,"PK哥"),
                new Tuple2<>(2,"J哥"),
                new Tuple2<>(3,"小隊長"),
                new Tuple2<>(4,"天空藍"));
        List<Tuple2<Integer,String>> info2 = Arrays.asList(new Tuple2<>(1,"北京"),
                new Tuple2<>(2,"上海"),
                new Tuple2<>(3,"成都"),
                new Tuple2<>(4,"杭州"));
        DataSource<Tuple2<Integer, String>> data1 = env.fromCollection(info1);
        DataSource<Tuple2<Integer, String>> data2 = env.fromCollection(info2);
        data1.join(data2).where(0).equalTo(0)
                .with(new JoinFunction<Tuple2<Integer,String>, Tuple2<Integer,String>, Tuple3<Integer,String,String>>() {
                    @Override
                    public Tuple3<Integer, String, String> join(Tuple2<Integer, String> first, Tuple2<Integer, String> second) throws Exception {
                        return new Tuple3<>(first.getField(0),first.getField(1),second.getField(1));
                    }
                }).print();
    }

    public static void distinctFunction(ExecutionEnvironment env) throws Exception {
        List<String> info = Arrays.asList("hadoop,spark", "hadoop,flink", "flink,flink");
        DataSource<String> data = env.fromCollection(info);
        data.flatMap((FlatMapFunction<String, String>) (value, collector) -> {
            String tokens[] = value.split(",");
            Stream.of(tokens).forEach(collector::collect);
        }).returns(String.class)
                .distinct().print();
    }

    public static void flatMapFunction(ExecutionEnvironment env) throws Exception {
        List<String> info = Arrays.asList("hadoop,spark", "hadoop,flink", "flink,flink");
        DataSource<String> data = env.fromCollection(info);
        data.flatMap((FlatMapFunction<String, String>) (value, collector) -> {
            String tokens[] = value.split(",");
            Stream.of(tokens).forEach(collector::collect);
        }).returns(String.class)
                .map(new MapFunction<String, Tuple2<String,Integer>>() {
                    @Override
                    public Tuple2<String, Integer> map(String value) throws Exception {
                        return new Tuple2<>(value,1);
                    }
                })
            .groupBy(0).sum(1).print();
    }

    public static void firstFunction(ExecutionEnvironment env) throws Exception {
        List<Tuple2<Integer,String>> info = Arrays.asList(new Tuple2<>(1,"Hadoop"),
                new Tuple2<>(1,"Spark"),
                new Tuple2<>(1,"Flink"),
                new Tuple2<>(2,"Java"),
                new Tuple2<>(2,"Spring boot"),
                new Tuple2<>(3,"Linux"),
                new Tuple2<>(4,"VUE"));
        DataSource<Tuple2<Integer, String>> data = env.fromCollection(info);
        data.groupBy(0).sortGroup(1, Order.DESCENDING).first(2).print();
    }

    public static void mapPartitionFunction(ExecutionEnvironment env) throws Exception {
        List<String> students = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            students.add("student: " + i);
        }
        DataSource<String> data = env.fromCollection(students).setParallelism(4);
        //here the conversion runs once per element, i.e. once for every entry in students
        data.map(student -> {
            int connection = DBUntils.getConnection();
            //TODO database operation
            DBUntils.returnConnection(connection);
            return connection;
        }).print();
        System.out.println("-----------");
        //here the conversion runs once per partition, i.e. as many times as the degree of parallelism
        data.mapPartition((MapPartitionFunction<String,Integer>)(student, collector) -> {
            int connection = DBUntils.getConnection();
            //TODO database operation
            DBUntils.returnConnection(connection);
            collector.collect(connection);
        }).returns(Integer.class).print();
    }

    public static void filterFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).filter(x -> x > 5).print();
    }

    public static void mapFunction(ExecutionEnvironment env) throws Exception {
        DataSource<Integer> data = env.fromCollection(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        data.map(x -> x + 1).print();
    }
}

Run result (logs omitted)

(曼聯,3)
(曼聯,1)
(曼聯,0)
(曼城,3)
(曼城,1)
(曼城,0)

Scala code

import com.guanjian.flink.scala.until.DBUntils
import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.util.Collector

import scala.collection.mutable.ListBuffer

object DataSetTransformationApp {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
//    mapFunction(env)
//    filterFunction(env)
//    mapPartitionFunction(env)
//    firstFunction(env)
//    flatMapFunction(env)
//    distinctFunction(env)
//    joinFunction(env)
    crossFunction(env)
  }

  def crossFunction(env: ExecutionEnvironment): Unit = {
    val info1 = List("曼聯","曼城")
    val info2 = List(3,1,0)
    val data1 = env.fromCollection(info1)
    val data2 = env.fromCollection(info2)
    data1.cross(data2).print()
  }

  def joinFunction(env: ExecutionEnvironment): Unit = {
    val info1 = List((1,"PK哥"),(2,"J哥"),(3,"小隊長"),(4,"天空藍"))
    val info2 = List((1,"北京"),(2,"上海"),(3,"成都"),(5,"杭州"))
    val data1 = env.fromCollection(info1)
    val data2 = env.fromCollection(info2)
    data1.fullOuterJoin(data2).where(0).equalTo(0).apply((first,second) => {
      if (first == null) {
        (second._1,"-",second._2)
      }else if (second == null) {
        (first._1,first._2,"-")
      }else {
        (first._1,first._2,second._2)
      }
    }).print()
  }

  def distinctFunction(env: ExecutionEnvironment): Unit = {
    val info = List("hadoop,spark","hadoop,flink","flink,flink")
    val data = env.fromCollection(info)
    data.flatMap(_.split(",")).distinct().print()
  }

  def flatMapFunction(env: ExecutionEnvironment): Unit = {
    val info = List("hadoop,spark","hadoop,flink","flink,flink")
    val data = env.fromCollection(info)
    data.flatMap(_.split(",")).map((_,1)).groupBy(0)
      .sum(1).print()
  }

  def firstFunction(env: ExecutionEnvironment): Unit = {
    val info = List((1,"Hadoop"),(1,"Spark"),(1,"Flink"),(2,"Java"),
      (2,"Spring boot"),(3,"Linux"),(4,"VUE"))
    val data = env.fromCollection(info)
    data.groupBy(0).sortGroup(1,Order.DESCENDING).first(2).print()
  }

  def mapPartitionFunction(env: ExecutionEnvironment): Unit = {
    val students = new ListBuffer[String]
    for(i <- 1 to 100) {
      students.append("student: " + i)
    }
    val data = env.fromCollection(students).setParallelism(4)
    data.map(student => {
      val connection = DBUntils.getConnection()
      //TODO database operation
      DBUntils.returnConnection(connection)
      connection
    }).print()
    println("-----------")
    data.mapPartition((student,collector: Collector[Int]) => {
      val connection = DBUntils.getConnection()
      //TODO database operation
      DBUntils.returnConnection(connection)
      collector.collect(connection)
    }).print()
  }

  def filterFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).filter(_ > 5).print()
  }

  def mapFunction(env: ExecutionEnvironment): Unit = {
    val data = env.fromCollection(List(1,2,3,4,5,6,7,8,9,10))
    data.map(_ + 1).print()
  }
}

Run result (logs omitted)

(曼聯,3)
(曼聯,1)
(曼聯,0)
(曼城,3)
(曼城,1)
(曼城,0)

Sink (output)

Create a new sink-out folder under the flink directory; at this point the folder is empty.

Writing out as a text file

public class DataSetSinkApp {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        List<Integer> data = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            data.add(i);
        }
        DataSource<Integer> text = env.fromCollection(data);
        String filePath = "/Users/admin/Downloads/flink/sink-out/sinkjava/";
        text.writeAsText(filePath);
        env.execute("DataSetSinkApp");
    }
}

Run result

Inside the sink-out folder:

Scala code

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._

object DataSetSinkApp {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val data = 1 to 10
    val text = env.fromCollection(data)
    val filePath = "/Users/admin/Downloads/flink/sink-out/sinkscala/"
    text.writeAsText(filePath)
    env.execute("DataSetSinkApp")
  }
}

Run result

If we run the code again at this point it fails, because the output file already exists. To overwrite the file, the code needs a small adjustment.

public class DataSetSinkApp {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        List<Integer> data = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            data.add(i);
        }
        DataSource<Integer> text = env.fromCollection(data);
        String filePath = "/Users/admin/Downloads/flink/sink-out/sinkjava/";
        text.writeAsText(filePath, FileSystem.WriteMode.OVERWRITE);
        env.execute("DataSetSinkApp");
    }
}

Scala code

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.core.fs.FileSystem.WriteMode

object DataSetSinkApp {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val data = 1 to 10
    val text = env.fromCollection(data)
    val filePath = "/Users/admin/Downloads/flink/sink-out/sinkscala/"
    text.writeAsText(filePath,WriteMode.OVERWRITE)
    env.execute("DataSetSinkApp")
  }
}

Increasing the parallelism to write multiple files

public class DataSetSinkApp {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        List<Integer> data = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            data.add(i);
        }
        DataSource<Integer> text = env.fromCollection(data);
        String filePath = "/Users/admin/Downloads/flink/sink-out/sinkjava/";
        text.writeAsText(filePath, FileSystem.WriteMode.OVERWRITE).setParallelism(4);
        env.execute("DataSetSinkApp");
    }
}

Run result

This time sinkjava has become a directory, and it contains 4 files.

Scala code

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.core.fs.FileSystem.WriteMode

object DataSetSinkApp {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val data = 1 to 10
    val text = env.fromCollection(data)
    val filePath = "/Users/admin/Downloads/flink/sink-out/sinkscala/"
    text.writeAsText(filePath,WriteMode.OVERWRITE).setParallelism(4)
    env.execute("DataSetSinkApp")
  }
}

Run result

Counters

public class CounterApp {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSource<String> data = env.fromElements("hadoop", "spark", "flink", "pyspark", "storm");
        String filePath = "/Users/admin/Downloads/flink/sink-out/sink-java-counter/";
        data.map(new RichMapFunction<String, String>() {
            LongCounter counter = new LongCounter();

            @Override
            public void open(Configuration parameters) throws Exception {
                super.open(parameters);
                getRuntimeContext().addAccumulator("ele-counts-java", counter);
            }

            @Override
            public String map(String value) throws Exception {
                counter.add(1);
                return value;
            }
        }).writeAsText(filePath, FileSystem.WriteMode.OVERWRITE).setParallelism(4);
        JobExecutionResult jobResult = env.execute("CounterApp");
        Long num = jobResult.getAccumulatorResult("ele-counts-java");
        System.out.println("num: " + num);
    }
}

運行結果(省略日誌)

num: 5

Scala代碼

import org.apache.flink.api.common.accumulators.LongCounter
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration
import org.apache.flink.core.fs.FileSystem.WriteMode

object CounterApp {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val data = env.fromElements("hadoop", "spark", "flink", "pyspark", "storm")
    val filePath = "/Users/admin/Downloads/flink/sink-out/sink-scala-counter/"
    data.map(new RichMapFunction[String,String]() {
      val counter = new LongCounter

      override def open(parameters: Configuration): Unit = {
        getRuntimeContext.addAccumulator("ele-counts-scala", counter)
      }

      override def map(value: String) = {
        counter.add(1)
        value
      }
    }).writeAsText(filePath,WriteMode.OVERWRITE).setParallelism(4)
    val jobResult = env.execute("CounterApp")
    val num = jobResult.getAccumulatorResult[Long]("ele-counts-scala")
    println("num: " + num)
  }
}

運行結果(省略日誌)

num: 5

分佈式緩存

分佈式緩存可以把一個文件註冊到執行環境,Flink會把它分發到各個TaskManager,算子中再通過RuntimeContext的getDistributedCache獲取本地副本

public class DistributedCacheApp {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        String filePath = "/Users/admin/Downloads/flink/data/hello.txt";
        env.registerCachedFile(filePath,"pk-java-dc");
        DataSource<String> data = env.fromElements("hadoop", "spark", "flink", "pyspark", "storm");
        data.map(new RichMapFunction<String, String>() {
            @Override
            public void open(Configuration parameters) throws Exception {
                super.open(parameters);
                File dcFile = getRuntimeContext().getDistributedCache().getFile("pk-java-dc");
                List<String> lines = FileUtils.readLines(dcFile);
                lines.forEach(System.out::println);
            }

            @Override
            public String map(String value) throws Exception {
                return value;
            }
        }).print();
    }
}

運行結果(省略日誌)

hello world welcome
hello welcome
hadoop
spark
flink
pyspark
storm

Scala代碼

import org.apache.commons.io.FileUtils
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration
import scala.collection.JavaConverters._

object DistributedCacheApp {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val filePath = "/Users/admin/Downloads/flink/data/hello.txt"
    env.registerCachedFile(filePath,"pk-scala-dc")
    val data = env.fromElements("hadoop", "spark", "flink", "pyspark", "storm")
    data.map(new RichMapFunction[String,String] {
      override def open(parameters: Configuration): Unit = {
        val dcFile = getRuntimeContext.getDistributedCache.getFile("pk-scala-dc")
        val lines = FileUtils.readLines(dcFile)
        lines.asScala.foreach(println(_))
      }

      override def map(value: String) = {
        value
      }
    }).print()
  }
}

運行結果(省略日誌)

hello world welcome
hello welcome
hadoop
spark
flink
pyspark
storm

流處理

socket

public class DataStreamSourceApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        socketFunction(env);
        env.execute("DataStreamSourceApp");
    }

    public static void socketFunction(StreamExecutionEnvironment env) {
        DataStreamSource<String> data = env.socketTextStream("127.0.0.1", 9999);
        data.print().setParallelism(1);
    }
}

運行前執行控制檯

nc -lk 9999

執行後,在控制檯輸入若干行內容(即下面運行結果中打印出來的那幾行)

運行結果(省略日誌)

hello world welcome
hello welcome
hadoop
spark
flink
pyspark
storm

Scala代碼

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

object DataStreamSourceApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    socketFunction(env)
    env.execute("DataStreamSourceApp")
  }

  def socketFunction(env: StreamExecutionEnvironment): Unit = {
    val data = env.socketTextStream("127.0.0.1",9999)
    data.print().setParallelism(1)
  }
}

運行結果(省略日誌)

hello world welcome
hello welcome
hadoop
spark
flink
pyspark
storm

自定義數據源

不可並行數據源

public class CustomNonParallelSourceFunction implements SourceFunction<Long> {
    private boolean isRunning = true;
    private long count = 1;

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        while (isRunning) {
            ctx.collect(count);
            count++;
            Thread.sleep(1000);
        }
    }

    @Override
    public void cancel() {
        isRunning = false;
    }
}
public class DataStreamSourceApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//        socketFunction(env);
        nonParallelSourceFunction(env);
        env.execute("DataStreamSourceApp");
    }

    public static void nonParallelSourceFunction(StreamExecutionEnvironment env) {
        DataStreamSource<Long> data = env.addSource(new CustomNonParallelSourceFunction());
        data.print().setParallelism(1);
    }

    public static void socketFunction(StreamExecutionEnvironment env) {
        DataStreamSource<String> data = env.socketTextStream("127.0.0.1", 9999);
        data.print().setParallelism(1);
    }
}

運行結果(省略日誌,每隔1秒打印一次)

1
2
3
4
5
6
.
.
.

因爲是不可並行,如果我們調大並行度則會報錯,如

public class DataStreamSourceApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//        socketFunction(env);
        nonParallelSourceFunction(env);
        env.execute("DataStreamSourceApp");
    }

    public static void nonParallelSourceFunction(StreamExecutionEnvironment env) {
        DataStreamSource<Long> data = env.addSource(new CustomNonParallelSourceFunction())
                .setParallelism(2);
        data.print().setParallelism(1);
    }

    public static void socketFunction(StreamExecutionEnvironment env) {
        DataStreamSource<String> data = env.socketTextStream("127.0.0.1", 9999);
        data.print().setParallelism(1);
    }
}

結果報錯

Exception in thread "main" java.lang.IllegalArgumentException: Source: 1 is not a parallel source
	at org.apache.flink.streaming.api.datastream.DataStreamSource.setParallelism(DataStreamSource.java:55)
	at com.guanjian.flink.java.test.DataStreamSourceApp.nonParallelSourceFunction(DataStreamSourceApp.java:17)
	at com.guanjian.flink.java.test.DataStreamSourceApp.main(DataStreamSourceApp.java:11)

Scala代碼

import org.apache.flink.streaming.api.functions.source.SourceFunction

class CustomNonParallelSourceFunction extends SourceFunction[Long] {
  private var isRunning = true
  private var count = 1L

  override def cancel(): Unit = {
    isRunning = false
  }

  override def run(ctx: SourceFunction.SourceContext[Long]): Unit = {
    while (isRunning) {
      ctx.collect(count)
      count += 1
      Thread.sleep(1000)
    }
  }
}
import com.guanjian.flink.scala.until.CustomNonParallelSourceFunction
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._

object DataStreamSourceApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
//    socketFunction(env)
    nonParallelSourceFunction(env)
    env.execute("DataStreamSourceApp")
  }

  def nonParallelSourceFunction(env: StreamExecutionEnvironment): Unit = {
    val data = env.addSource(new CustomNonParallelSourceFunction)
    data.print().setParallelism(1)
  }

  def socketFunction(env: StreamExecutionEnvironment): Unit = {
    val data = env.socketTextStream("127.0.0.1",9999)
    data.print().setParallelism(1)
  }
}

運行結果(省略日誌,每隔1秒打印一次)

1
2
3
4
5
6
.
.
.

可並行數據源

public class CustomParallelSourceFunction implements ParallelSourceFunction<Long> {
    private boolean isRunning = true;
    private long count = 1;

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        while (isRunning) {
            ctx.collect(count);
            count++;
            Thread.sleep(1000);
        }
    }

    @Override
    public void cancel() {
        isRunning = false;
    }
}
public class DataStreamSourceApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//        socketFunction(env);
//        nonParallelSourceFunction(env);
        parallelSourceFunction(env);
        env.execute("DataStreamSourceApp");
    }

    public static void parallelSourceFunction(StreamExecutionEnvironment env) {
        DataStreamSource<Long> data = env.addSource(new CustomParallelSourceFunction())
                .setParallelism(2);
        data.print().setParallelism(1);
    }

    public static void nonParallelSourceFunction(StreamExecutionEnvironment env) {
        DataStreamSource<Long> data = env.addSource(new CustomNonParallelSourceFunction());
        data.print().setParallelism(1);
    }

    public static void socketFunction(StreamExecutionEnvironment env) {
        DataStreamSource<String> data = env.socketTextStream("127.0.0.1", 9999);
        data.print().setParallelism(1);
    }
}

運行結果(省略日誌,每隔1秒打印一次,每次打印兩條)

1
1
2
2
3
3
4
4
5
5
.
.
.
.

Scala代碼

import org.apache.flink.streaming.api.functions.source.{ParallelSourceFunction, SourceFunction}

class CustomParallelSourceFunction extends ParallelSourceFunction[Long] {
  private var isRunning = true
  private var count = 1L

  override def cancel(): Unit = {
    isRunning = false
  }

  override def run(ctx: SourceFunction.SourceContext[Long]): Unit = {
    while (isRunning) {
      ctx.collect(count)
      count += 1
      Thread.sleep(1000)
    }
  }
}
import com.guanjian.flink.scala.until.{CustomNonParallelSourceFunction, CustomParallelSourceFunction}
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._

object DataStreamSourceApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
//    socketFunction(env)
//    nonParallelSourceFunction(env)
    parallelSourceFunction(env)
    env.execute("DataStreamSourceApp")
  }

  def parallelSourceFunction(env: StreamExecutionEnvironment): Unit = {
    val data = env.addSource(new CustomParallelSourceFunction)
      .setParallelism(2)
    data.print().setParallelism(1)
  }

  def nonParallelSourceFunction(env: StreamExecutionEnvironment): Unit = {
    val data = env.addSource(new CustomNonParallelSourceFunction)
    data.print().setParallelism(1)
  }

  def socketFunction(env: StreamExecutionEnvironment): Unit = {
    val data = env.socketTextStream("127.0.0.1",9999)
    data.print().setParallelism(1)
  }
}

運行結果(省略日誌,每隔1秒打印一次,每次打印兩條)

1
1
2
2
3
3
4
4
5
5
.
.
.
.

增強數據源

增強(Rich)數據源在可並行數據源的基礎上,額外提供了open/close生命週期方法,並且可以通過getRuntimeContext()獲取運行時上下文

public class CustomRichParallelSourceFunction extends RichParallelSourceFunction<Long> {
    private boolean isRunning = true;
    private long count = 1;

    /**
     * 可以在這裏面實現一些其他需求的代碼
     * @param parameters
     * @throws Exception
     */
    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
    }

    @Override
    public void close() throws Exception {
        super.close();
    }

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        while (isRunning) {
            ctx.collect(count);
            count++;
            Thread.sleep(1000);
        }
    }

    @Override
    public void cancel() {
        isRunning = false;
    }
}
public class DataStreamSourceApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//        socketFunction(env);
//        nonParallelSourceFunction(env);
//        parallelSourceFunction(env);
        richParallelSourceFunction(env);
        env.execute("DataStreamSourceApp");
    }

    public static void richParallelSourceFunction(StreamExecutionEnvironment env) {
        DataStreamSource<Long> data = env.addSource(new CustomRichParallelSourceFunction())
                .setParallelism(2);
        data.print().setParallelism(1);
    }

    public static void parallelSourceFunction(StreamExecutionEnvironment env) {
        DataStreamSource<Long> data = env.addSource(new CustomParallelSourceFunction())
                .setParallelism(2);
        data.print().setParallelism(1);
    }

    public static void nonParallelSourceFunction(StreamExecutionEnvironment env) {
        DataStreamSource<Long> data = env.addSource(new CustomNonParallelSourceFunction());
        data.print().setParallelism(1);
    }

    public static void socketFunction(StreamExecutionEnvironment env) {
        DataStreamSource<String> data = env.socketTextStream("127.0.0.1", 9999);
        data.print().setParallelism(1);
    }
}

運行結果與可並行數據源相同

Scala代碼

import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, SourceFunction}

class CustomRichParallelSourceFunction extends RichParallelSourceFunction[Long] {
  private var isRunning = true
  private var count = 1L

  override def open(parameters: Configuration): Unit = {

  }

  override def close(): Unit = {

  }

  override def cancel(): Unit = {
    isRunning = false
  }

  override def run(ctx: SourceFunction.SourceContext[Long]): Unit = {
    while (isRunning) {
      ctx.collect(count)
      count += 1
      Thread.sleep(1000)
    }
  }
}
import com.guanjian.flink.scala.until.{CustomNonParallelSourceFunction, CustomParallelSourceFunction, CustomRichParallelSourceFunction}
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._

object DataStreamSourceApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
//    socketFunction(env)
//    nonParallelSourceFunction(env)
//    parallelSourceFunction(env)
    richParallelSourceFunction(env)
    env.execute("DataStreamSourceApp")
  }

  def richParallelSourceFunction(env: StreamExecutionEnvironment): Unit = {
    val data = env.addSource(new CustomRichParallelSourceFunction)
      .setParallelism(2)
    data.print().setParallelism(1)
  }

  def parallelSourceFunction(env: StreamExecutionEnvironment): Unit = {
    val data = env.addSource(new CustomParallelSourceFunction)
      .setParallelism(2)
    data.print().setParallelism(1)
  }

  def nonParallelSourceFunction(env: StreamExecutionEnvironment): Unit = {
    val data = env.addSource(new CustomNonParallelSourceFunction)
    data.print().setParallelism(1)
  }

  def socketFunction(env: StreamExecutionEnvironment): Unit = {
    val data = env.socketTextStream("127.0.0.1",9999)
    data.print().setParallelism(1)
  }
}

運行結果與可並行數據源相同
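
作爲補充,下面給出一個示意性的Rich數據源寫法(假設的示例類,非上文工程中的代碼),說明open()裏可以通過getRuntimeContext()拿到子任務編號等運行時信息來做初始化,這正是Rich系列函數相對普通SourceFunction多出來的能力:

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;

public class CustomRichSourceWithOpen extends RichParallelSourceFunction<Long> {
    private volatile boolean isRunning = true;
    private long count = 1;

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        // getRuntimeContext()只有Rich系列函數才能使用,這裏打印當前子任務編號
        int subtaskIndex = getRuntimeContext().getIndexOfThisSubtask();
        System.out.println("啓動子任務: " + subtaskIndex);
    }

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        while (isRunning) {
            ctx.collect(count);
            count++;
            Thread.sleep(1000);
        }
    }

    @Override
    public void cancel() {
        isRunning = false;
    }
}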

流算子

map和filter

public class DataStreamTransformationApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        filterFunction(env);
        env.execute("DataStreamTransformationApp");
    }

    public static void filterFunction(StreamExecutionEnvironment env) {
        DataStreamSource<Long> data = env.addSource(new CustomNonParallelSourceFunction());
        data.map(x -> {
            System.out.println("接收到: " + x);
            return x;
        }).filter(x -> x % 2 == 0).print().setParallelism(1);
    }
}

 運行結果(省略日誌)

接收到: 1
接收到: 2
2
接收到: 3
接收到: 4
4
接收到: 5
接收到: 6
6
.
.

Scala代碼

import com.guanjian.flink.scala.until.CustomNonParallelSourceFunction
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._

object DataStreamTransformationApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    filterFunction(env)
    env.execute("DataStreamTransformationApp")
  }

  def filterFunction(env: StreamExecutionEnvironment): Unit = {
    val data = env.addSource(new CustomNonParallelSourceFunction)
    data.map(x => {
      println("接收到: " + x)
      x
    }).filter(_ % 2 == 0).print().setParallelism(1)
  }
}

 運行結果(省略日誌)

接收到: 1
接收到: 2
2
接收到: 3
接收到: 4
4
接收到: 5
接收到: 6
6
.
.

union

public class DataStreamTransformationApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//        filterFunction(env);
        unionFunction(env);
        env.execute("DataStreamTransformationApp");
    }

    public static void unionFunction(StreamExecutionEnvironment env) {
        DataStreamSource<Long> data1 = env.addSource(new CustomNonParallelSourceFunction());
        DataStreamSource<Long> data2 = env.addSource(new CustomNonParallelSourceFunction());
        data1.union(data2).print().setParallelism(1);
    }

    public static void filterFunction(StreamExecutionEnvironment env) {
        DataStreamSource<Long> data = env.addSource(new CustomNonParallelSourceFunction());
        data.map(x -> {
            System.out.println("接收到: " + x);
            return x;
        }).filter(x -> x % 2 == 0).print().setParallelism(1);
    }
}

運行結果(省略日誌)

1
1
2
2
3
3
4
4
5
5
.
.

Scala代碼

import com.guanjian.flink.scala.until.CustomNonParallelSourceFunction
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._

object DataStreamTransformationApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
//    filterFunction(env)
    unionFunction(env)
    env.execute("DataStreamTransformationApp")
  }

  def unionFunction(env: StreamExecutionEnvironment): Unit = {
    val data1 = env.addSource(new CustomNonParallelSourceFunction)
    val data2 = env.addSource(new CustomNonParallelSourceFunction)
    data1.union(data2).print().setParallelism(1)
  }

  def filterFunction(env: StreamExecutionEnvironment): Unit = {
    val data = env.addSource(new CustomNonParallelSourceFunction)
    data.map(x => {
      println("接收到: " + x)
      x
    }).filter(_ % 2 == 0).print().setParallelism(1)
  }
}

運行結果(省略日誌)

1
1
2
2
3
3
4
4
5
5
.
.

split和select

將一個流拆成多個流以及挑選其中一個流

public class DataStreamTransformationApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//        filterFunction(env);
//        unionFunction(env);
        splitSelectFunction(env);
        env.execute("DataStreamTransformationApp");
    }

    public static void splitSelectFunction(StreamExecutionEnvironment env) {
        DataStreamSource<Long> data = env.addSource(new CustomNonParallelSourceFunction());
        SplitStream<Long> splits = data.split(value -> {
            List<String> list = new ArrayList<>();
            if (value % 2 == 0) {
                list.add("偶數");
            } else {
                list.add("奇數");
            }
            return list;
        });
        splits.select("奇數").print().setParallelism(1);
    }

    public static void unionFunction(StreamExecutionEnvironment env) {
        DataStreamSource<Long> data1 = env.addSource(new CustomNonParallelSourceFunction());
        DataStreamSource<Long> data2 = env.addSource(new CustomNonParallelSourceFunction());
        data1.union(data2).print().setParallelism(1);
    }

    public static void filterFunction(StreamExecutionEnvironment env) {
        DataStreamSource<Long> data = env.addSource(new CustomNonParallelSourceFunction());
        data.map(x -> {
            System.out.println("接收到: " + x);
            return x;
        }).filter(x -> x % 2 == 0).print().setParallelism(1);
    }
}

運行結果(省略日誌)

1
3
5
7
9
11
.
.

Scala代碼

import com.guanjian.flink.scala.until.CustomNonParallelSourceFunction
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.collector.selector.OutputSelector
import java.util.ArrayList
import java.lang.Iterable

object DataStreamTransformationApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
//    filterFunction(env)
//    unionFunction(env)
    splitSelectFunction(env)
    env.execute("DataStreamTransformationApp")
  }

  def splitSelectFunction(env: StreamExecutionEnvironment): Unit = {
    val data = env.addSource(new CustomNonParallelSourceFunction)
    val splits = data.split(new OutputSelector[Long] {
      override def select(value: Long): Iterable[String] = {
        val list = new ArrayList[String]
        if (value % 2 == 0) {
          list.add("偶數")
        } else {
          list.add("奇數")
        }
        list
      }
    })
    splits.select("奇數").print().setParallelism(1)
  }

  def unionFunction(env: StreamExecutionEnvironment): Unit = {
    val data1 = env.addSource(new CustomNonParallelSourceFunction)
    val data2 = env.addSource(new CustomNonParallelSourceFunction)
    data1.union(data2).print().setParallelism(1)
  }

  def filterFunction(env: StreamExecutionEnvironment): Unit = {
    val data = env.addSource(new CustomNonParallelSourceFunction)
    data.map(x => {
      println("接收到: " + x)
      x
    }).filter(_ % 2 == 0).print().setParallelism(1)
  }
}

運行結果(省略日誌)

1
3
5
7
9
11
.
.

這裏需要說明的是split已經被設置爲不推薦使用的方法

@deprecated
def split(selector: OutputSelector[T]): SplitStream[T] = asScalaStream(stream.split(selector))

因爲OutputSelector函數式接口的返回類型爲一個Java專屬類型,對於Scala是不友好的,所以Scala這裏也無法使用lambda表達式
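
另外補充一個示意寫法:split/select被廢棄後,通常可以改用side output(旁路輸出)達到同樣的拆流效果。下面這段代碼是基於上文CustomNonParallelSourceFunction的假設示例(OutputTag、ProcessFunction在1.7中均可用),僅供參考:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class SideOutputApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<Long> data = env.addSource(new CustomNonParallelSourceFunction());
        // 奇數走旁路輸出,偶數留在主流
        OutputTag<Long> oddTag = new OutputTag<Long>("奇數") {};
        SingleOutputStreamOperator<Long> evenStream = data.process(new ProcessFunction<Long, Long>() {
            @Override
            public void processElement(Long value, Context ctx, Collector<Long> out) throws Exception {
                if (value % 2 == 0) {
                    out.collect(value);
                } else {
                    ctx.output(oddTag, value);
                }
            }
        });
        // 從主流中取出旁路輸出的奇數流並打印
        DataStream<Long> oddStream = evenStream.getSideOutput(oddTag);
        oddStream.print().setParallelism(1);
        env.execute("SideOutputApp");
    }
}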

當然select也可以選取多個流

public class DataStreamTransformationApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//        filterFunction(env);
//        unionFunction(env);
        splitSelectFunction(env);
        env.execute("DataStreamTransformationApp");
    }

    public static void splitSelectFunction(StreamExecutionEnvironment env) {
        DataStreamSource<Long> data = env.addSource(new CustomNonParallelSourceFunction());
        SplitStream<Long> splits = data.split(value -> {
            List<String> list = new ArrayList<>();
            if (value % 2 == 0) {
                list.add("偶數");
            } else {
                list.add("奇數");
            }
            return list;
        });
        splits.select("奇數","偶數").print().setParallelism(1);
    }

    public static void unionFunction(StreamExecutionEnvironment env) {
        DataStreamSource<Long> data1 = env.addSource(new CustomNonParallelSourceFunction());
        DataStreamSource<Long> data2 = env.addSource(new CustomNonParallelSourceFunction());
        data1.union(data2).print().setParallelism(1);
    }

    public static void filterFunction(StreamExecutionEnvironment env) {
        DataStreamSource<Long> data = env.addSource(new CustomNonParallelSourceFunction());
        data.map(x -> {
            System.out.println("接收到: " + x);
            return x;
        }).filter(x -> x % 2 == 0).print().setParallelism(1);
    }
}

運行結果(省略日誌)

1
2
3
4
5
6
.
.

Scala代碼修改是一樣的,這裏就不重複了

流Sink

自定義Sink

將socket中的數據傳入mysql中

@Data
@ToString
@AllArgsConstructor
@NoArgsConstructor
public class Student {
    private int id;
    private String name;
    private int age;
}
public class SinkToMySQL extends RichSinkFunction<Student> {
    private Connection connection;
    private PreparedStatement pstmt;

    private Connection getConnection() {
        Connection conn = null;
        try {
            Class.forName("com.mysql.cj.jdbc.Driver");
            String url = "jdbc:mysql://127.0.0.1:3306/flink";
            conn = DriverManager.getConnection(url,"root","abcd123");
        }catch (Exception e) {
            e.printStackTrace();
        }
        return conn;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        connection = getConnection();
        String sql = "insert into student(id,name,age) values (?,?,?)";
        pstmt = connection.prepareStatement(sql);
    }

    @Override
    public void invoke(Student value) throws Exception {
        System.out.println("invoke--------");
        pstmt.setInt(1,value.getId());
        pstmt.setString(2,value.getName());
        pstmt.setInt(3,value.getAge());
        pstmt.executeUpdate();
    }

    @Override
    public void close() throws Exception {
        super.close();
        if (pstmt != null) {
            pstmt.close();
        }
        if (connection != null) {
            connection.close();
        }
    }
}
public class CustomSinkToMySQL {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> source = env.socketTextStream("127.0.0.1", 9999);
        SingleOutputStreamOperator<Student> studentStream = source.map(value -> {
            String[] splits = value.split(",");
            Student stu = new Student(Integer.parseInt(splits[0]), splits[1], Integer.parseInt(splits[2]));
            return stu;
        }).returns(Student.class);
        studentStream.addSink(new SinkToMySQL());
        env.execute("CustomSinkToMySQL");
    }
}

代碼執行前執行

nc -lk 9999

執行代碼後,在nc窗口輸入以逗號分隔的 id,name,age 數據

執行結果

Scala代碼

class Student(var id: Int,var name: String,var age: Int) {
}
import java.sql.{Connection, DriverManager, PreparedStatement}

import com.guanjian.flink.scala.test.Student
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction

class SinkToMySQL extends RichSinkFunction[Student] {
  var connection: Connection = null
  var pstmt: PreparedStatement = null

  def getConnection:Connection = {
    var conn: Connection = null
    Class.forName("com.mysql.cj.jdbc.Driver")
    val url = "jdbc:mysql://127.0.0.1:3306/flink"
    conn = DriverManager.getConnection(url, "root", "abcd123")
    conn
  }

  override def open(parameters: Configuration): Unit = {
    connection = getConnection
    val sql = "insert into student(id,name,age) values (?,?,?)"
    pstmt = connection.prepareStatement(sql)
  }

  override def invoke(value: Student): Unit = {
    println("invoke--------")
    pstmt.setInt(1,value.id)
    pstmt.setString(2,value.name)
    pstmt.setInt(3,value.age)
    pstmt.executeUpdate()
  }

  override def close(): Unit = {
    if (pstmt != null) {
      pstmt.close()
    }
    if (connection != null) {
      connection.close()
    }
  }
}
import com.guanjian.flink.scala.until.SinkToMySQL
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._

object CustomSinkToMySQL {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val source = env.socketTextStream("127.0.0.1",9999)
    val studentStream = source.map(value => {
      val splits = value.split(",")
      val stu = new Student(splits(0).toInt, splits(1), splits(2).toInt)
      stu
    })
    studentStream.addSink(new SinkToMySQL)
    env.execute("CustomSinkToMySQL")
  }
}

控制檯輸入

運行結果

Table API以及SQL

要使用flink的Table API,Java工程需要添加Scala依賴庫

<dependency>
   <groupId>org.apache.flink</groupId>
   <artifactId>flink-scala_${scala.binary.version}</artifactId>
   <version>${flink.version}</version>
</dependency>
<dependency>
   <groupId>org.apache.flink</groupId>
   <artifactId>flink-streaming-scala_${scala.binary.version}</artifactId>
   <version>${flink.version}</version>
</dependency>
<dependency>
   <groupId>org.scala-lang</groupId>
   <artifactId>scala-library</artifactId>
   <version>${scala.version}</version>
</dependency>
<dependency>
   <groupId>org.apache.flink</groupId>
   <artifactId>flink-table_2.11</artifactId>
   <version>${flink.version}</version>
</dependency>

在data目錄下添加一個sales.csv文件,文件內容如下

transactionId,customerId,itemId,amountPaid
111,1,1,100.0
112,2,3,505.0
113,3,3,510.0
114,4,4,600.0
115,1,2,500.0
116,1,2,500.0
117,1,2,500.0
118,1,2,600.0
119,2,3,400.0
120,1,2,500.0
121,1,4,500.0
122,1,2,500.0
123,1,4,500.0
124,1,2,600.0
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
import lombok.ToString;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.BatchTableEnvironment;
import org.apache.flink.types.Row;

public class TableSQLAPI {
    @Data
    @ToString
    @AllArgsConstructor
    @NoArgsConstructor
    public static class SalesLog {
        private String transactionId;
        private String customerId;
        private String itemId;
        private Double amountPaid;
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        BatchTableEnvironment tableEnv = BatchTableEnvironment.getTableEnvironment(env);
        String filePath = "/Users/admin/Downloads/flink/data/sales.csv";
        DataSource<SalesLog> csv = env.readCsvFile(filePath).ignoreFirstLine()
                .pojoType(SalesLog.class, "transactionId", "customerId",
                        "itemId", "amountPaid");
        Table sales = tableEnv.fromDataSet(csv);
        tableEnv.registerTable("sales",sales);
        Table resultTable = tableEnv.sqlQuery("select customerId,sum(amountPaid) money from sales " +
                "group by customerId");
        DataSet<Row> result = tableEnv.toDataSet(resultTable, Row.class);
        result.print();
    }
}

運行結果(省略日誌)

3,510.0
4,600.0
1,4800.0
2,905.0
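
順帶一提,同樣的統計也可以不寫SQL,用Table API的鏈式寫法表達。下面是接着上面已註冊的sales表的一個示意片段(字符串表達式寫法,僅供參考):

// 與上面SQL等價的Table API寫法(示意)
Table apiResult = sales.groupBy("customerId")
        .select("customerId, amountPaid.sum as money");
tableEnv.toDataSet(apiResult, Row.class).print();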

Scala代碼

Scala項目同樣要放入依賴

<dependency>
   <groupId>org.apache.flink</groupId>
   <artifactId>flink-table_2.11</artifactId>
   <version>${flink.version}</version>
</dependency>
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.table.api.TableEnvironment
import org.apache.flink.types.Row
import org.apache.flink.api.scala._

object TableSQLAPI {
  case class SalesLog(transactionId: String,customerId: String,itemId: String,amountPaid: Double)

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val tableEnv = TableEnvironment.getTableEnvironment(env)
    val filePath = "/Users/admin/Downloads/flink/data/sales.csv"
    val csv = env.readCsvFile[SalesLog](filePath,ignoreFirstLine = true,includedFields = Array(0,1,2,3))
    val sales = tableEnv.fromDataSet(csv)
    tableEnv.registerTable("sales",sales)
    val resultTable = tableEnv.sqlQuery("select customerId,sum(amountPaid) money from sales " +
      "group by customerId")
    tableEnv.toDataSet[Row](resultTable).print()
  }
}

運行結果(省略日誌)

3,510.0
4,600.0
1,4800.0
2,905.0

時間和窗口

Flink中有三個時間是比較重要的,包括事件時間(Event Time),處理時間(Processing Time),進入Flink系統的時間(Ingestion Time)

通常我們都是使用事件時間來作爲基準。

設置時間的代碼

env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
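
作爲參考,三種時間語義對應的設置方式如下(基於1.7版本的API,默認值是處理時間):

// 事件時間:以數據自帶的時間戳爲準
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
// 處理時間:以算子所在機器的系統時間爲準(默認)
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);
// 進入時間:以數據進入Flink Source的時間爲準
env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);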

事件時間通常以時間戳的形式會包含在傳入的數據中的一個字段,通過提取,來決定窗口什麼時候來執行。

窗口(Window)主要用於流處理(無限流),它把流數據按照時間段或者大小切分成一個個數據桶再進行計算。窗口分爲兩種:一種是按key劃分的(Keyed Windows),一種是不分key的(Non-Keyed Windows)。它的處理過程如下

Keyed Windows

stream
       .keyBy(...)               <-  keyed versus non-keyed windows
       .window(...)              <-  required: "assigner"
      [.trigger(...)]            <-  optional: "trigger" (else default trigger)
      [.evictor(...)]            <-  optional: "evictor" (else no evictor)
      [.allowedLateness(...)]    <-  optional: "lateness" (else zero)
      [.sideOutputLateData(...)] <-  optional: "output tag" (else no side output for late data)
       .reduce/aggregate/fold/apply()      <-  required: "function"
      [.getSideOutput(...)]      <-  optional: "output tag"
Non-Keyed Windows

stream
       .windowAll(...)           <-  required: "assigner"
      [.trigger(...)]            <-  optional: "trigger" (else default trigger)
      [.evictor(...)]            <-  optional: "evictor" (else no evictor)
      [.allowedLateness(...)]    <-  optional: "lateness" (else zero)
      [.sideOutputLateData(...)] <-  optional: "output tag" (else no side output for late data)
       .reduce/aggregate/fold/apply()      <-  required: "function"
      [.getSideOutput(...)]      <-  optional: "output tag"

窗口的觸發可以有兩種條件,比如達到了一定的數據量,或者水印(watermark)達到了窗口的結束時間。

watermark是一種衡量Event Time進展的機制,它用於處理亂序事件;要正確地處理亂序事件,通常需要watermark機制結合window來實現。

我們知道,流處理從事件產生,到流經source,再到operator,中間是有一個過程和時間的。雖然大部分情況下,流到operator的數據都是按照事件產生的時間順序來的,但是也不排除由於網絡、背壓等原因,導致亂序的產生(out-of-order或者說late element)。

但是對於late element,我們又不能無限期的等下去,必須要有個機制來保證一個特定的時間後,必須觸發window去進行計算了。這個特別的機制,就是watermark。

我們從socket接收數據,經過map後立刻抽取timestamp並生成watermark,之後應用window,觀察watermark和event time如何變化並最終導致window被觸發。

public class WindowsApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        DataStreamSource<String> input = env.socketTextStream("127.0.0.1", 9999);
        //將數據流(key,時間戳組成的字符串)轉換成元組
        SingleOutputStreamOperator<Tuple2<String, Long>> inputMap = input.map(new MapFunction<String, Tuple2<String, Long>>() {
            @Override
            public Tuple2<String, Long> map(String value) throws Exception {
                String[] splits = value.split(",");
                return new Tuple2<>(splits[0], Long.parseLong(splits[1]));
            }
        });
        //提取時間戳,生成水印
        SingleOutputStreamOperator<Tuple2<String, Long>> watermarks = inputMap.setParallelism(1).assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks<Tuple2<String, Long>>() {
            private Long currentMaxTimestamp = 0L;
            //最大允許的亂序時間爲10秒
            private Long maxOutOfOrderness = 10000L;
            private Watermark watermark;
            private SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");

            @Nullable
            @Override
            public Watermark getCurrentWatermark() {
                watermark = new Watermark(currentMaxTimestamp - maxOutOfOrderness);
                return watermark;
            }

            @Override
            public long extractTimestamp(Tuple2<String, Long> element, long previousElementTimestamp) {
                Long timestamp = element.getField(1);
                currentMaxTimestamp = Math.max(timestamp, currentMaxTimestamp);
                System.out.println("timestamp:" + element.getField(0) + "," +
                        element.getField(1) + "|" + format.format(new Date((Long)element.getField(1))) +
                        "," + currentMaxTimestamp + "|" + format.format(new Date(currentMaxTimestamp)) +
                        "," + watermark.toString() + "|" + format.format(new Date(watermark.getTimestamp())));
                return timestamp;
            }
        });
        //根據水印的條件,來執行我們需要的方法
        //如果水印條件不滿足,該方法是永遠不會執行的
        watermarks.keyBy(x -> (String)x.getField(0)).timeWindow(Time.seconds(3))
                .apply(new WindowFunction<Tuple2<String,Long>, Tuple6<String,Integer,String,String,String,String>, String, TimeWindow>() {

                    @Override
                    public void apply(String key, TimeWindow window, Iterable<Tuple2<String, Long>> input, Collector<Tuple6<String, Integer, String, String, String, String>> out) throws Exception {
                        List<Tuple2<String,Long>> list = (List) input;
                        //將亂序進行有序整理
                        List<Long> collect = list.stream().map(x -> (Long)x.getField(1)).sorted().collect(Collectors.toList());
                        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
                        out.collect(new Tuple6<>(key,list.size(),
                                format.format(collect.get(0)),format.format(collect.get(collect.size() - 1)),
                                format.format(window.getStart()),format.format(window.getEnd())));
                    }
                }).print().setParallelism(1);


        env.execute("WindowsApp");
    }
}

在控制檯執行nc -lk 9999後,運行我們的程序,控制檯輸入

000001,1461756862000

打印

timestamp:000001,1461756862000|2016-04-27 19:34:22.000,1461756862000|2016-04-27 19:34:22.000,Watermark @ -10000|1970-01-01 07:59:50.000

由該執行結果watermark = -10000(即初始的currentMaxTimestamp 0減去允許的亂序時間10000),我們可以看出,水印是先(週期性地)生成的,之後才執行時間戳的提取。

控制檯繼續輸入

000001,1461756866000

打印

timestamp:000001,1461756866000|2016-04-27 19:34:26.000,1461756866000|2016-04-27 19:34:26.000,Watermark @ 1461756852000|2016-04-27 19:34:12.000

由於水印是先獲取的,則此時的水印1461756852000|2016-04-27 19:34:12.000是第一次輸入所產生的。

控制檯繼續輸入

000001,1461756872000

打印

timestamp:000001,1461756872000|2016-04-27 19:34:32.000,1461756872000|2016-04-27 19:34:32.000,Watermark @ 1461756856000|2016-04-27 19:34:16.000

此時我們的時間戳來到了32秒,比第一個數據的時間多出了10秒。

控制檯繼續輸入

000001,1461756873000

打印

timestamp:000001,1461756873000|2016-04-27 19:34:33.000,1461756873000|2016-04-27 19:34:33.000,Watermark @ 1461756862000|2016-04-27 19:34:22.000

此時我們的時間戳來到了33秒,比第一個數據的時間多出了11秒。此時依然沒有觸發Windows窗體執行代碼。

控制檯繼續輸入

000001,1461756874000

打印

timestamp:000001,1461756874000|2016-04-27 19:34:34.000,1461756874000|2016-04-27 19:34:34.000,Watermark @ 1461756863000|2016-04-27 19:34:23.000
(000001,1,2016-04-27 19:34:22.000,2016-04-27 19:34:22.000,2016-04-27 19:34:21.000,2016-04-27 19:34:24.000)

此時觸發了Windows窗體執行代碼,輸出了一個六元組:當前最大時間戳34秒減去10秒的亂序容忍後,watermark到達了19:34:24.000,恰好等於窗口[19:34:21.000, 19:34:24.000)的結束時間

控制檯繼續輸入

000001,1461756876000

打印

timestamp:000001,1461756876000|2016-04-27 19:34:36.000,1461756876000|2016-04-27 19:34:36.000,Watermark @ 1461756864000|2016-04-27 19:34:24.000

此時我們可以看到,這條打印出來的水印(2016-04-27 19:34:24.000)是由上一條數據產生的,它剛好達到了上一條數據所在窗口[2016-04-27 19:34:21.000, 2016-04-27 19:34:24.000)的結束時間,因此觸發了Windows執行代碼

總結起來,觸發條件爲(可參考列表後的示意代碼)

  1. watermark時間 >= window_end_time
  2. 在[window_start_time,window_end_time)中有數據存在
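
下面用一段與Flink無關的示意代碼(假設窗口按整3秒對齊、允許10秒亂序,與上面的例子一致)粗略描述這兩個條件,它不是Flink的內部實現,僅幫助理解:

public class WindowTriggerSketch {
    static final long MAX_OUT_OF_ORDERNESS = 10_000L; // 允許的亂序時間10秒
    static final long WINDOW_SIZE = 3_000L;           // 窗口大小3秒

    // 水印 = 當前見過的最大事件時間 - 允許的亂序時間
    static long watermark(long currentMaxTimestamp) {
        return currentMaxTimestamp - MAX_OUT_OF_ORDERNESS;
    }

    // 事件所屬窗口的結束時間(窗口按整3秒對齊)
    static long windowEnd(long eventTimestamp) {
        long windowStart = eventTimestamp - (eventTimestamp % WINDOW_SIZE);
        return windowStart + WINDOW_SIZE;
    }

    public static void main(String[] args) {
        long firstEvent = 1461756862000L; // 19:34:22,落入窗口[19:34:21, 19:34:24)
        long currentMax = 1461756874000L; // 19:34:34,當前見過的最大事件時間
        // 條件1:watermark >= window_end_time;條件2:窗口內有數據(firstEvent就是窗口內的數據)
        boolean fire = watermark(currentMax) >= windowEnd(firstEvent);
        System.out.println("watermark=" + watermark(currentMax)
                + ", windowEnd=" + windowEnd(firstEvent) + ", fire=" + fire); // fire=true
    }
}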

Scala代碼

import java.text.SimpleDateFormat

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector
import org.apache.flink.api.scala._

object WindowsApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    val input = env.socketTextStream("127.0.0.1",9999)
    val inputMap = input.map(f => {
      val splits = f.split(",")
      (splits(0), splits(1).toLong)
    })
    val watermarks = inputMap.setParallelism(1).assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks[(String, Long)] {
      var currentMaxTimestamp = 0L
      var maxOutofOrderness = 10000L
      var watermark: Watermark = null
      val format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS")

      override def getCurrentWatermark: Watermark = {
        watermark = new Watermark(currentMaxTimestamp - maxOutofOrderness)
        watermark
      }

      override def extractTimestamp(element: (String, Long), previousElementTimestamp: Long): Long = {
        val timestamp = element._2
        currentMaxTimestamp = Math.max(timestamp, currentMaxTimestamp)
        println("timestamp:" + element._1 + "," + element._2 + "|" + format.format(element._2) +
          "," + currentMaxTimestamp + "|" + format.format(currentMaxTimestamp) +
          "," + watermark.toString + "|" + format.format(watermark.getTimestamp))
        timestamp
      }
    })
    watermarks.keyBy(_._1).timeWindow(Time.seconds(3))
      .apply(new WindowFunction[(String,Long),(String,Int,String,String,String,String),String,TimeWindow] {
        override def apply(key: String, window: TimeWindow, input: Iterable[(String, Long)], out: Collector[(String, Int, String, String, String, String)]): Unit = {
          val list = input.toList.sortBy(_._2)
          val format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS")
          out.collect((key,input.size,format.format(list.head._2),format.format(list.last._2),format.format(window.getStart),format.format(window.getEnd)))
        }
      }).print().setParallelism(1)
    env.execute("WindowsApp")
  }
}

滾動窗口和滑動窗口

滾動窗口就是一個不重疊的時間分片,落入到該時間分片的數據都會被該窗口計算。上面的例子就是一個滾動窗口

代碼中可以寫成

.window(TumblingEventTimeWindows.of(Time.seconds(5)))

也可以簡寫成

.timeWindow(Time.seconds(5))

滑動窗口是一個可以重疊的時間分片,同樣的數據可以落入不同的窗口,不同的窗口都會計算落入自己時間分片的數據。

代碼可以寫成

.window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5)))

或者簡寫成

.timeWindow(Time.seconds(10),Time.seconds(5))
public class SliderWindowsApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> text = env.socketTextStream("127.0.0.1", 9999);
        text.flatMap(new FlatMapFunction<String, Tuple2<String,Integer>>() {
            @Override
            public void flatMap(String value, Collector<Tuple2<String, Integer>> out) throws Exception {
                String[] splits = value.split(",");
                Stream.of(splits).forEach(token -> out.collect(new Tuple2<>(token,1)));
            }
        }).keyBy(0).timeWindow(Time.seconds(10),Time.seconds(5))
                .sum(1).print().setParallelism(1);
        env.execute("SliderWindowsApp");
    }
}

控制檯輸入

a,b,c,d,e,f
a,b,c,d,e,f
a,b,c,d,e,f

運行結果

(d,3)
(a,3)
(e,3)
(f,3)
(c,3)
(b,3)
(c,3)
(f,3)
(b,3)
(d,3)
(e,3)
(a,3)

從結果我們可以看到,由於窗口長度爲10秒、滑動步長爲5秒,每條數據會落入兩個窗口,因此被運算了兩次

Scala代碼

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.api.scala._

object SliderWindowsApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("127.0.0.1",9999)
    text.flatMap(_.split(","))
      .map((_,1))
      .keyBy(0)
      .timeWindow(Time.seconds(10),Time.seconds(5))
      .sum(1)
      .print()
      .setParallelism(1)
    env.execute("SliderWindowsApp")
  }
}

Windows Functions

ReduceFunction

這是一個增量函數,即它不會把時間窗口內的所有數據統一處理,只會一條一條處理

public class ReduceWindowsApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> text = env.socketTextStream("127.0.0.1", 9999);
        text.flatMap((FlatMapFunction<String,String>)(f, collector) -> {
            String[] splits = f.split(",");
            Stream.of(splits).forEach(collector::collect);
        }).returns(String.class)
                .map(new MapFunction<String, Tuple2<Integer,Integer>>() {
                    @Override
                    public Tuple2<Integer, Integer> map(String value) throws Exception {
                        return new Tuple2<>(1,Integer.parseInt(value));
                    }
                })
                .keyBy(0)
                .timeWindow(Time.seconds(5))
                .reduce((x,y) -> new Tuple2<>(x.getField(0),(int)x.getField(1) + (int)y.getField(1)))
                .print().setParallelism(1);
        env.execute("ReduceWindowsApp");
    }
}

控制檯輸入

1,2,3,4,5
7,8,9

運行結果

(1,15)
(1,24)

Scala代碼

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.api.scala._

object ReduceWindowsApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("127.0.0.1",9999)
    text.flatMap(_.split(","))
      .map(x => (1,x.toInt))
      .keyBy(0)
      .timeWindow(Time.seconds(5))
      .reduce((x,y) => (x._1,x._2 + y._2))
      .print()
      .setParallelism(1)
    env.execute("ReduceWindowsApp")
  }
}

ProcessWindowFunction

這是一個全量函數,即它會把一個時間窗口內的所有數據一起處理

public class ProcessWindowsApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> text = env.socketTextStream("127.0.0.1", 9999);
        text.flatMap((FlatMapFunction<String,String>)(f, collector) -> {
            String[] splits = f.split(",");
            Stream.of(splits).forEach(collector::collect);
        }).returns(String.class)
                .map(new MapFunction<String, Tuple2<Integer,Integer>>() {
                    @Override
                    public Tuple2<Integer, Integer> map(String value) throws Exception {
                        return new Tuple2<>(1,Integer.parseInt(value));
                    }
                })
                .keyBy(0)
                .timeWindow(Time.seconds(5))
                .process(new ProcessWindowFunction<Tuple2<Integer,Integer>, Tuple2<Integer,Integer>, Tuple, TimeWindow>() {
                    @Override
                    public void process(Tuple tuple, Context context, Iterable<Tuple2<Integer, Integer>> elements, Collector<Tuple2<Integer, Integer>> out) throws Exception {
                        List<Tuple2<Integer,Integer>> list = (List) elements;
                        out.collect(list.stream().reduce((x, y) -> new Tuple2<>(x.getField(0), (int) x.getField(1) + (int) y.getField(1)))
                                .get());
                    }
                }).print().setParallelism(1);
        env.execute("ProcessWindowsApp");
    }
}

控制檯輸入

1,2,3,4,5
7,8,9

運行結果

(1,39)

Scala代碼

import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector
import org.apache.flink.api.scala._

object ProcessWindowsApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("127.0.0.1",9999)
    text.flatMap(_.split(","))
      .map(x => (1,x.toInt))
      .keyBy(0)
      .timeWindow(Time.seconds(5))
        .process(new ProcessWindowFunction[(Int,Int),(Int,Int),Tuple,TimeWindow] {
          override def process(key: Tuple, context: Context, elements: Iterable[(Int, Int)], out: Collector[(Int, Int)]): Unit = {
            val list = elements.toList
            out.collect(list.reduce((x,y) => (x._1,x._2 + y._2)))
          }
        })
      .print()
      .setParallelism(1)
    env.execute("ReduceWindowsApp")
  }
}

Connector

Flink提供了很多內置的數據源或者輸出的連接Connector,當前包括的有

Apache Kafka (source/sink)
Apache Cassandra (sink)
Amazon Kinesis Streams (source/sink)
Elasticsearch (sink)
Hadoop FileSystem (sink)
RabbitMQ (source/sink)
Apache NiFi (source/sink)
Twitter Streaming API (source)

HDFS Connector

這是一個把數據流輸出到Hadoop HDFS分佈式文件系統的連接,要使用該連接,需要添加以下依賴

<dependency>
   <groupId>org.apache.flink</groupId>
   <artifactId>flink-connector-filesystem_2.11</artifactId>
   <version>${flink.version}</version>
</dependency>
<dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-client</artifactId>
   <version>${hadoop.version}</version>
</dependency>

版本號根據自己實際情況來選擇,我這裏hadoop的版本號爲

<hadoop.version>2.8.1</hadoop.version>

在data下新建一個hdfssink的文件夾,此時文件夾內容爲空

public class FileSystemSinkApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> data = env.socketTextStream("127.0.0.1", 9999);
        String filePath = "/Users/admin/Downloads/flink/data/hdfssink";
        BucketingSink<String> sink = new BucketingSink<>(filePath);
        sink.setBucketer(new DateTimeBucketer<>("yyyy-MM-dd--HHmm"));
        sink.setWriter(new StringWriter<>());
//        sink.setBatchSize(1024 * 1024 * 400);
        sink.setBatchRolloverInterval(20);
        data.addSink(sink);
        env.execute("FileSystemSinkApp");
    }
}

我這裏並沒有真正使用hadoop的hdfs,hdfs的搭建可以參考Hadoop hdfs+Spark配置 。而是本地目錄,在控制檯隨便輸入

adf
dsdf
wfdgg

我們可以看到在hdfssink文件夾下面多了一個

2021-01-15--0627

的文件夾,進入該文件夾後可以看到3個文件

_part-4-0.pending    _part-5-0.pending    _part-6-0.pending

查看三個文件

(base) -bash-3.2$ cat _part-4-0.pending
adf

(base) -bash-3.2$ cat _part-5-0.pending
dsdf

(base) -bash-3.2$ cat _part-6-0.pending
wfdgg

BucketingSink其實是RichSinkFunction抽象類的子類,跟之前寫的自定義Sink的SinkToMySQL是一樣的。

Scala代碼

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.fs.StringWriter
import org.apache.flink.streaming.connectors.fs.bucketing.{BucketingSink, DateTimeBucketer}

object FileSystemSinkApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val data = env.socketTextStream("127.0.0.1",9999)
    val filePath = "/Users/admin/Downloads/flink/data/hdfssink"
    val sink = new BucketingSink[String](filePath)
    sink.setBucketer(new DateTimeBucketer[String]("yyyy-MM-dd--HHmm"))
    sink.setWriter(new StringWriter[String]())
//    sink.setBatchSize(1024 * 1024 * 400)
    sink.setBatchRolloverInterval(20)
    data.addSink(sink)
    env.execute("FileSystemSinkApp")
  }
}

Kafka Connector

要使用Kafka Connector,當然首先必須安裝Kafka。先安裝一個zookeeper 3.4.5,kafka 1.1.1

由於我的Kafka是安裝在阿里雲上面的,本地訪問需要配置一下,在kafka的config目錄下修改server.properties

advertised.listeners=PLAINTEXT://外網IP:9092
host.name=內網IP

同時阿里雲需要開放9092端口

kafka啓動

 ./kafka-server-start.sh -daemon ../config/server.properties 

創建topic

 ./kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic pktest

此時我們進入zookeeper可以看到該topic

[zk: localhost:2181(CONNECTED) 2] ls /brokers/topics
[pktest, __consumer_offsets]

查看topic

./kafka-topics.sh --list --zookeeper localhost:2181

該命令返回的結果爲

[bin]# ./kafka-topics.sh --list --zookeeper localhost:2181
__consumer_offsets
pktest

啓動生產者

./kafka-console-producer.sh --broker-list localhost:9092 --topic pktest

但由於我們是在阿里雲上面啓動,則啓動生產者需要更改爲

./kafka-console-producer.sh --broker-list 外網ip:9092 --topic pktest

啓動消費者

./kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic pktest

但由於我們是在阿里雲上面啓動,則啓動消費者需要更改爲

./kafka-console-consumer.sh --bootstrap-server 外網ip:9092 --topic pktest

此時我們在生產者窗口輸入,消費者窗口這邊就會獲取

Kafka作爲Source代碼,添加依賴

<dependency>
   <groupId>org.apache.flink</groupId>
   <artifactId>flink-connector-kafka_2.11</artifactId>
   <version>${flink.version}</version>
</dependency>
<dependency>
   <groupId>org.apache.kafka</groupId>
   <artifactId>kafka-clients</artifactId>
   <version>1.1.1</version>
</dependency>
public class KafkaConnectorConsumerApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(4000); // 每4秒觸發一次checkpoint
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE); // 精確一次語義
        env.getCheckpointConfig().setCheckpointTimeout(10000); // checkpoint超時時間10秒
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(1); // 同一時刻只允許一個checkpoint
        String topic = "pktest";
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers","外網ip:9092");
        properties.setProperty("group.id","test");
        DataStreamSource<String> data = env.addSource(new FlinkKafkaConsumer<>(topic, new SimpleStringSchema(), properties));
        data.print().setParallelism(1);
        env.execute("KafkaConnectorConsumerApp");
    }
}

運行結果

服務器輸入

[bin]# ./kafka-console-producer.sh --broker-list 外網ip:9092 --topic pktest       
>sdfa

打印

sdfa

Scala代碼

import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.CheckpointingMode

object KafkaConnectorConsumerApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.enableCheckpointing(4000)
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
    env.getCheckpointConfig.setCheckpointTimeout(10000)
    env.getCheckpointConfig.setMaxConcurrentCheckpoints(1)
    val topic = "pktest"
    val properties = new Properties
    properties.setProperty("bootstrap.servers", "外網ip:9092")
    properties.setProperty("group.id","test")
    val data = env.addSource(new FlinkKafkaConsumer[String](topic, new SimpleStringSchema, properties))
    data.print().setParallelism(1)
    env.execute("KafkaConnectorConsumerApp")
  }
}

Kafka作爲Sink

public class KafkaConnectorProducerApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(4000);
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
        env.getCheckpointConfig().setCheckpointTimeout(10000);
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
        DataStreamSource<String> data = env.socketTextStream("127.0.0.1", 9999);
        String topic = "pktest";
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers","外網ip:9092");
        FlinkKafkaProducer<String> kafkaSink = new FlinkKafkaProducer<>(topic,
                new KeyedSerializationSchemaWrapper<>(new SimpleStringSchema()),properties,
                FlinkKafkaProducer.Semantic.EXACTLY_ONCE);
        data.addSink(kafkaSink);
        env.execute("KafkaConnectorProducerApp");
    }
}

控制檯輸入

sdfae
dffe

服務器打印

[bin]# ./kafka-console-consumer.sh --bootstrap-server 外網ip:9092 --topic pktest      
sdfae
dffe

Scala代碼

import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer
import org.apache.flink.streaming.util.serialization.KeyedSerializationSchemaWrapper

object KafkaConnectorProducerApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.enableCheckpointing(4000)
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
    env.getCheckpointConfig.setCheckpointTimeout(10000)
    env.getCheckpointConfig.setMaxConcurrentCheckpoints(1)
    val data = env.socketTextStream("127.0.0.1",9999)
    val topic = "pktest"
    val properties = new Properties
    properties.setProperty("bootstrap.servers", "外網ip:9092")
    val kafkaSink = new FlinkKafkaProducer[String](topic,
      new KeyedSerializationSchemaWrapper[String](new SimpleStringSchema),properties,
      FlinkKafkaProducer.Semantic.EXACTLY_ONCE)
    data.addSink(kafkaSink)
    env.execute("KafkaConnectorProducerApp")
  }
}

部署

單機部署

下載地址:https://archive.apache.org/dist/flink/flink-1.7.2/flink-1.7.2-bin-hadoop28-scala_2.11.tgz

由於我這裏用的是1.7.2(當然你可以使用其他版本),下載解壓縮後,進入bin目錄,執行

./start-cluster.sh

進入web界面 http://外網ip:8081/

提交一個測試用例

nc -lk 9000

退出bin目錄,返回上級目錄,執行

./bin/flink run examples/streaming/SocketWindowWordCount.jar --port 9000

此時在web界面中可以看到這個運行的任務

點RUNNING按鈕見到的如下

雖然我們在nc中敲入一些字符,比如

a       f       g       r       a       d
a       f       g       r       a       d
a       f       g       r       a       d

但並沒有打印的地方,我們查看結果需要在log目錄下查看

[log]# cat flink-root-taskexecutor-0-iZ7xvi8yoh0wspvk6rjp7cZ.out
a : 6
d : 3
r : 3
g : 3
f : 3
 : 90

上傳我們自己的jar包

上傳之前,修改一下我們需要運行的main方法的類

<transformers>
   <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
      <mainClass>com.guanjian.flink.java.test.StreamingJavaApp</mainClass>
   </transformer>
</transformers>

由於我們代碼的端口爲9999,執行

nc -lk 9999

上傳後執行(上傳至flink的新建test目錄下)

 ./bin/flink run test/flink-train-java-1.0.jar

nc下輸入

a d e g a d g f

在log下執行

cat flink-root-taskexecutor-0-iZ7xvi8yoh0wspvk6rjp7cZ.out

可以看到除了之前的記錄,多出了幾條新的記錄

a : 6
d : 3
r : 3
g : 3
f : 3
 : 90
(a,2)
(f,1)
(g,2)
(e,1)
(d,2)

Yarn集羣部署

要進行Yarn集羣部署,得要先安裝Hadoop,我這裏Hadoop的版本爲2.8.1

進入Hadoop安裝目錄下的etc/hadoop文件夾

首先依然是hadoop-env.sh的配置,需要配置一下JAVA_HOME

export JAVA_HOME=/home/soft/jdk1.8.0_161

core-site.xml

<configuration>
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://host1:8020</value>
</property>
</configuration>

hdfs-site.xml

<configuration>
<property>
    <name>dfs.namenode.name.dir</name>
    <value>/opt/hadoop2/tmp/dfs/name</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>/opt/hadoop2/tmp/dfs/data</value>
</property>
</configuration>

此時需要新建這兩個目錄

mkdir -p /opt/hadoop2/tmp/dfs/name
mkdir -p /opt/hadoop2/tmp/dfs/data

yarn-site.xml

<configuration>
<!-- Site specific YARN configuration properties -->
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>host1</value>
</property>
<property> 
    <name>yarn.nodemanager.vmem-check-enabled</name> 
    <value>false</value> 
</property> 
</configuration>

mapred-site.xml

<configuration>
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>
</configuration>
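
With these files in place, format the NameNode once (only on first setup) and start HDFS and YARN; the usual Hadoop scripts, run from the Hadoop installation directory, are:

bin/hdfs namenode -format
sbin/start-dfs.sh
sbin/start-yarn.sh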

啓動後,可以看到50070的hdfs頁面以及8088的Yarn頁面

在進行Flink的Yarn部署前需要配置HADOOP_HOME,此處包括JAVA_HOME

vim /etc/profile

JAVA_HOME=/home/soft/jdk1.8.0_161
HADOOP_HOME=/home/soft/hadoop-2.8.1
JRE_HOME=${JAVA_HOME}/jre
CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin
PATH=${JAVA_HOME}/bin:${HADOOP_HOME}/bin:$PATH
export JAVA_HOME HADOOP_HOME PATH

保存後,source /etc/profile

第一種Yarn部署

在flink的bin目錄下

./yarn-session.sh -n 1 -jm 1024m -tm 1024m

-n :    taskManager的數量

-jm:   jobManager的內存

-tm:  taskManager的內存

此時在Yarn的Web頁面(8088端口)可以看到

在我們的訪問機上的/etc/hosts配置好host1的IP地址後,點擊ApplicationMaster進入Flink的管理頁面

提交代碼任務

上傳一份文件到hdfs的根目錄

hdfs dfs -put LICENSE-2.0.txt /

提交代碼任務,在flink的bin目錄下

./flink run ../examples/batch/WordCount.jar -input hdfs://host1:8020/LICENSE-2.0.txt -output hdfs://host1:8020/wordcount-result.txt

運算完成後,查看hdfs的文件

[bin]# hdfs dfs -ls /
Found 4 items
-rw-r--r--   3 root supergroup      11358 2021-01-24 14:47 /LICENSE-2.0.txt
drwxr-xr-x   - root supergroup          0 2021-01-24 09:43 /abcd
drwxr-xr-x   - root supergroup          0 2021-01-24 14:17 /user
-rw-r--r--   3 root supergroup       4499 2021-01-24 15:06 /wordcount-result.txt

在Flink的頁面也可以看到
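
To inspect the word counts themselves, the result file can be read straight from HDFS (the path matches the -output argument used above):

hdfs dfs -cat /wordcount-result.txt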

第二種Yarn部署

Before trying the second YARN deployment mode, first kill the yarn-session application started for the first mode:

yarn application -kill application_1611471412139_0001

在flink的bin目錄下

./flink run -m yarn-cluster -yn 1 ../examples/batch/WordCount.jar

-m :    yarn集羣,yarn-cluster爲常量

-yn:    taskManager的數量

此時在Yarn的Web界面也可以看到

提交我們自己的任務,將代碼socket的IP改成host1

public class StreamingJavaApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> text = env.socketTextStream("host1",9999);
        text.flatMap(new FlatMapFunction<String, Tuple2<String,Integer>>() {
            @Override
            public void flatMap(String value, Collector<Tuple2<String, Integer>> collector) throws Exception {
                String[] tokens = value.toLowerCase().split(" ");
                for (String token : tokens) {
                    collector.collect(new Tuple2<>(token,1));
                }
            }
        }).keyBy(0).timeWindow(Time.seconds(5))
                .sum(1).print();
        env.execute("StreamingJavaApp");
    }
}

啓動nc

nc -lk 9999

./flink run -m yarn-cluster -yn 1 ../test/flink-train-java-1.0.jar

輸入

[~]# nc -lk 9999              
a v d t e a d e f
a v d t e a d e f
a v d t e a d e f
a v d t e a d e f

查看結果

在Flink的Web界面上

State

  1. State is the state of one specific Task/Operator
  2. State data is kept in the JVM (heap memory) by default
  3. Two categories: Keyed State & Operator State (a minimal Operator State sketch follows this list; the remaining examples all use Keyed State)
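
Operator State is typically accessed through the CheckpointedFunction interface; below is a minimal, illustrative sketch in the spirit of the classic buffering-sink pattern (class name, threshold and the buffered element type are assumptions for illustration, not code from this write-up):

import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;

public class BufferingSink implements SinkFunction<Tuple2<String, Integer>>, CheckpointedFunction {

    private final int threshold;                                              // flush after this many elements
    private final List<Tuple2<String, Integer>> bufferedElements = new ArrayList<>();
    private transient ListState<Tuple2<String, Integer>> checkpointedState;   // the operator state handle

    public BufferingSink(int threshold) {
        this.threshold = threshold;
    }

    @Override
    public void invoke(Tuple2<String, Integer> value) {
        bufferedElements.add(value);
        if (bufferedElements.size() >= threshold) {
            // write the batch to the external system here, then clear the buffer
            bufferedElements.clear();
        }
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        // called on every checkpoint: copy the in-memory buffer into operator state
        checkpointedState.clear();
        for (Tuple2<String, Integer> element : bufferedElements) {
            checkpointedState.add(element);
        }
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        ListStateDescriptor<Tuple2<String, Integer>> descriptor = new ListStateDescriptor<>(
                "buffered-elements",
                TypeInformation.of(new TypeHint<Tuple2<String, Integer>>() {}));
        checkpointedState = context.getOperatorStateStore().getListState(descriptor);
        if (context.isRestored()) {
            // on recovery, refill the buffer from the last snapshot
            for (Tuple2<String, Integer> element : checkpointedState.get()) {
                bufferedElements.add(element);
            }
        }
    }
}

It would be attached like any other sink, for example stream.addSink(new BufferingSink(100)).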

Keyed State

/**
 * 從一組數據中,每兩個數據統計一次平均值
 */
public class KeyedStateApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromCollection(Arrays.asList(new Tuple2<>(1,3),
                new Tuple2<>(1,5),
                new Tuple2<>(1,7),
                new Tuple2<>(1,4),
                new Tuple2<>(1,2)))
                .keyBy(ele -> ele.getField(0))
                .flatMap(new RichFlatMapFunction<Tuple2<Integer,Integer>, Tuple2<Integer,Integer>>() {
                    private transient ValueState<Tuple2<Integer,Integer>> state;

                    @Override
                    public void open(Configuration parameters) throws Exception {
                        super.open(parameters);
                        state = getRuntimeContext().getState(new ValueStateDescriptor<>("avg",
                                TypeInformation.of(new TypeHint<Tuple2<Integer, Integer>>() {
                                })));
                    }

                    @Override
                    public void flatMap(Tuple2<Integer, Integer> value, Collector<Tuple2<Integer, Integer>> out) throws Exception {
                        Tuple2<Integer, Integer> tmpState = state.value();
                        Tuple2<Integer,Integer> currentState = tmpState == null ? Tuple2.of(0,0) : tmpState;
                        Tuple2<Integer,Integer> newState = new Tuple2<>((int) currentState.getField(0) + 1,(int) currentState.getField(1) + (int) value.getField(1));
                        state.update(newState);
                        if ((int) newState.getField(0) >= 2) {
                            out.collect(new Tuple2<>(value.getField(0),(int) newState.getField(1) / (int) newState.getField(0)));
                            state.clear();
                        }
                    }
                }).print().setParallelism(1);
        env.execute("KeyedStateApp");
    }
}

運行結果

(1,4)
(1,5)

Scala代碼

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.util.Collector
import org.apache.flink.api.scala.createTypeInformation

object KeyedStateApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.fromCollection(List((1,3),(1,5),(1,7),(1,4),(1,2)))
      .keyBy(_._1)
      .flatMap(new RichFlatMapFunction[(Int,Int),(Int,Int)] {
        var state: ValueState[(Int,Int)] = _

        override def open(parameters: Configuration): Unit = {
          state = getRuntimeContext.getState(new ValueStateDescriptor[(Int, Int)]("avg",
            createTypeInformation[(Int,Int)]))
        }

        override def flatMap(value: (Int, Int), out: Collector[(Int, Int)]) = {
          val tmpState = state.value()
          val currentState = if (tmpState != null) {
            tmpState
          } else {
            (0,0)
          }
          val newState = (currentState._1 + 1,currentState._2 + value._2)
          state.update(newState)
          if (newState._1 >= 2) {
            out.collect((value._1,newState._2 / newState._1))
            state.clear()
          }
        }
      }).print().setParallelism(1)
    env.execute("KeyedStateApp")
  }
}

運行結果

(1,4)
(1,5)

Reducing State 

/**
 * 統計數據條數,並加總
 */
public class ReducingStateApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromCollection(Arrays.asList(new Tuple2<>(1,3),
                new Tuple2<>(1,5),
                new Tuple2<>(1,7),
                new Tuple2<>(1,4),
                new Tuple2<>(1,2)))
                .keyBy(ele -> ele.getField(0))
                .flatMap(new RichFlatMapFunction<Tuple2<Integer,Integer>, Tuple2<Integer,Integer>>() {
                    private transient ReducingState<Tuple2<Integer,Integer>> state;

                    @Override
                    public void open(Configuration parameters) throws Exception {
                        super.open(parameters);
                        state = getRuntimeContext().getReducingState(new ReducingStateDescriptor<>("sum",
                                new ReduceFunction<Tuple2<Integer, Integer>>() {
                                    @Override
                                    public Tuple2<Integer, Integer> reduce(Tuple2<Integer, Integer> value1, Tuple2<Integer, Integer> value2) throws Exception {
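                                        // value1 carries the running state: field 0 works as a record counter
                                        // (it grows by 1 on every reduce), field 1 accumulates the sum of the values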
                                        Tuple2<Integer,Integer> tuple2 = new Tuple2<>((int) value1.getField(0) + 1,
                                                (int) value1.getField(1) + (int) value2.getField(1));
                                        return tuple2;
                                    }
                                }, TypeInformation.of(new TypeHint<Tuple2<Integer, Integer>>() {})));
                    }

                    @Override
                    public void flatMap(Tuple2<Integer, Integer> value, Collector<Tuple2<Integer, Integer>> out) throws Exception {
                        Tuple2<Integer,Integer> tuple2 = new Tuple2<>(value.getField(0),
                                value.getField(1));
                        state.add(tuple2);
                        out.collect(new Tuple2<>(state.get().getField(0),state.get().getField(1)));
                    }
                }).print().setParallelism(1);
        env.execute("ReducingStateApp");
    }
}

運行結果

(2,8)
(3,15)
(4,19)
(5,21)

Scala代碼

import org.apache.flink.api.common.functions.{ReduceFunction, RichFlatMapFunction}
import org.apache.flink.api.common.state.{ReducingState, ReducingStateDescriptor}
import org.apache.flink.api.scala.createTypeInformation
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.util.Collector

object ReducingStateApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.fromCollection(List((1,3),(1,5),(1,7),(1,4),(1,2)))
      .keyBy(_._1)
      .flatMap(new RichFlatMapFunction[(Int,Int),(Int,Int)] {
        var state: ReducingState[(Int,Int)] = _

        override def open(parameters: Configuration): Unit = {
          state = getRuntimeContext.getReducingState(new ReducingStateDescriptor[(Int, Int)]("sum",
            new ReduceFunction[(Int, Int)] {
              override def reduce(value1: (Int, Int), value2: (Int, Int)): (Int, Int) = {
                (value1._1 + 1,value1._2 + value2._2)
              }
            },
            createTypeInformation[(Int,Int)]))
        }

        override def flatMap(value: (Int, Int), out: Collector[(Int, Int)]) = {
          val tuple2 = (value._1,value._2)
          state.add(tuple2)
          out.collect((state.get()._1,state.get()._2))
        }
      }).print().setParallelism(1)
    env.execute("ReducingStateApp")
  }
}

運行結果

(2,8)
(3,15)
(4,19)
(5,21)

List State

/**
 * 獲取每一條所在的位置
 */
public class ListStateApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromCollection(Arrays.asList(new Tuple2<>(1,3),
                new Tuple2<>(1,5),
                new Tuple2<>(1,7),
                new Tuple2<>(1,4),
                new Tuple2<>(1,2)))
                .keyBy(ele -> ele.getField(0))
                .flatMap(new RichFlatMapFunction<Tuple2<Integer,Integer>, Tuple3<Integer,Integer,Integer>>() {
                    private transient ListState<Integer> state;

                    @Override
                    public void open(Configuration parameters) throws Exception {
                        super.open(parameters);
                        state = getRuntimeContext().getListState(new ListStateDescriptor<>("list",
                                TypeInformation.of(new TypeHint<Integer>() {})));
                    }

                    @Override
                    public void flatMap(Tuple2<Integer, Integer> value, Collector<Tuple3<Integer, Integer,Integer>> out) throws Exception {
                        state.add(value.getField(0));
                        Iterator<Integer> iterator = state.get().iterator();
                        Integer l = 0;
                        while (iterator.hasNext()) {
                            l += iterator.next();
                        }
                        Tuple3<Integer,Integer,Integer> tuple3 = new Tuple3<>(value.getField(0),value.getField(1),l);
                        out.collect(tuple3);
                    }
                }).print().setParallelism(1);
        env.execute("ListStateApp");
    }
}

運行結果

(1,3,1)
(1,5,2)
(1,7,3)
(1,4,4)
(1,2,5)

Scala代碼

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
import org.apache.flink.api.scala.createTypeInformation
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.util.Collector

object ListStateApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.fromCollection(List((1,3),(1,5),(1,7),(1,4),(1,2)))
      .keyBy(_._1)
      .flatMap(new RichFlatMapFunction[(Int,Int),(Int,Int,Int)] {
        var state: ListState[Int] = _

        override def open(parameters: Configuration): Unit = {
          state = getRuntimeContext.getListState(new ListStateDescriptor[Int]("list",
            createTypeInformation[Int]));
        }

        override def flatMap(value: (Int, Int), out: Collector[(Int, Int, Int)]) = {
          state.add(value._1)
          val iterator = state.get().iterator()
          var l: Int = 0
          while (iterator.hasNext) {
            l += iterator.next()
          }
          val tuple3 = (value._1,value._2,l)
          out.collect(tuple3)
        }
      }).print().setParallelism(1)
    env.execute("ListStateApp")
  }
}

運行結果

(1,3,1)
(1,5,2)
(1,7,3)
(1,4,4)
(1,2,5)

Fold State 

/**
 * 從某個初始值開始統計條數
 */
public class FoldStateApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromCollection(Arrays.asList(new Tuple2<>(1,3),
                new Tuple2<>(1,5),
                new Tuple2<>(1,7),
                new Tuple2<>(1,4),
                new Tuple2<>(1,2)))
                .keyBy(ele -> ele.getField(0))
                .flatMap(new RichFlatMapFunction<Tuple2<Integer,Integer>, Tuple3<Integer,Integer,Integer>>() {
                    private transient FoldingState<Integer,Integer> state;

                    @Override
                    public void open(Configuration parameters) throws Exception {
                        super.open(parameters);
                        state = getRuntimeContext().getFoldingState(new FoldingStateDescriptor<Integer, Integer>("fold",
                                1, (accumulator, value) -> accumulator + value,
                                TypeInformation.of(new TypeHint<Integer>() {})
                        ));
                    }

                    @Override
                    public void flatMap(Tuple2<Integer, Integer> value, Collector<Tuple3<Integer, Integer,Integer>> out) throws Exception {
                        state.add(value.getField(0));
                        out.collect(new Tuple3<>(value.getField(0),value.getField(1),state.get()));
                    }
                }).print().setParallelism(1);
        env.execute("FoldStateApp");
    }
}

運行結果

(1,3,2)
(1,5,3)
(1,7,4)
(1,4,5)
(1,2,6)

Scala代碼

import org.apache.flink.api.common.functions.{FoldFunction, RichFlatMapFunction}
import org.apache.flink.api.common.state.{FoldingState, FoldingStateDescriptor}
import org.apache.flink.api.scala.createTypeInformation
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.util.Collector

object FoldStateApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.fromCollection(List((1,3),(1,5),(1,7),(1,4),(1,2)))
      .keyBy(_._1)
      .flatMap(new RichFlatMapFunction[(Int,Int),(Int,Int,Int)] {
        var state: FoldingState[Int,Int] = _

        override def open(parameters: Configuration): Unit = {
          state = getRuntimeContext.getFoldingState(new FoldingStateDescriptor[Int,Int]("fold",
            1,new FoldFunction[Int,Int] {
              override def fold(accumulator: Int, value: Int) = {
                accumulator + value
              }
            }, createTypeInformation[Int]))
        }

        override def flatMap(value: (Int, Int), out: Collector[(Int, Int, Int)]) = {
          state.add(value._1)
          out.collect((value._1,value._2,state.get()))
        }
      }).print().setParallelism(1)
    env.execute("FoldStateApp")
  }
}

運行結果

(1,3,2)
(1,5,3)
(1,7,4)
(1,4,5)
(1,2,6)

Map State

/**
 * 將每一條的數據加上上一條的數據,第一條保持自身
 */
public class MapStateApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromCollection(Arrays.asList(new Tuple2<>(1,3),
                new Tuple2<>(1,5),
                new Tuple2<>(1,7),
                new Tuple2<>(1,4),
                new Tuple2<>(1,2)))
                .keyBy(ele -> ele.getField(0))
                .flatMap(new RichFlatMapFunction<Tuple2<Integer,Integer>, Tuple2<Integer,Integer>>() {
                    private transient MapState<Integer,Integer> state;

                    @Override
                    public void open(Configuration parameters) throws Exception {
                        super.open(parameters);
                        state = getRuntimeContext().getMapState(new MapStateDescriptor<>("map",
                                TypeInformation.of(new TypeHint<Integer>() {}),
                                TypeInformation.of(new TypeHint<Integer>() {})));
                    }

                    @Override
                    public void flatMap(Tuple2<Integer, Integer> value, Collector<Tuple2<Integer, Integer>> out) throws Exception {
                        Integer tmp = state.get(value.getField(0));
                        Integer current = tmp == null ? 0 : tmp;
                        state.put(value.getField(0),value.getField(1));
                        Tuple2<Integer,Integer> tuple2 = new Tuple2<>(value.getField(0),
                                current + (int) value.getField(1));
                        out.collect(tuple2);
                    }
                }).print().setParallelism(1);
        env.execute("MapStateApp");
    }
}

運行結果

(1,3)
(1,8)
(1,12)
(1,11)
(1,6)

Scala代碼

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{MapState, MapStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.util.Collector
import org.apache.flink.api.scala.createTypeInformation

object MapStateApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.fromCollection(List((1,3),(1,5),(1,7),(1,4),(1,2)))
      .keyBy(_._1)
      .flatMap(new RichFlatMapFunction[(Int,Int),(Int,Int)] {
        var state: MapState[Int,Int] = _

        override def open(parameters: Configuration): Unit = {
          state = getRuntimeContext.getMapState(new MapStateDescriptor[Int,Int]("map",
            createTypeInformation[Int],createTypeInformation[Int]))
        }

        override def flatMap(value: (Int, Int), out: Collector[(Int, Int)]) = {
          // MapState.get returns null for an unseen key; Scala unboxes that to 0
          // (and an Int can never equal null), so check contains() explicitly instead
          val current: Int = if (state.contains(value._1)) state.get(value._1) else 0
          state.put(value._1,value._2)
          val tuple2 = (value._1,current + value._2)
          out.collect(tuple2)
        }
      }).print().setParallelism(1)
    env.execute("MapStateApp")
  }
}

運行結果

(1,3)
(1,8)
(1,12)
(1,11)
(1,6)

Aggregating State

/**
 * 求每一條數據跟之前所有數據的平均值
 */
public class AggregatingStateApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromCollection(Arrays.asList(new Tuple2<>(1,3),
                new Tuple2<>(1,5),
                new Tuple2<>(1,7),
                new Tuple2<>(1,4),
                new Tuple2<>(1,2)))
                .keyBy(ele -> ele.getField(0))
                .flatMap(new RichFlatMapFunction<Tuple2<Integer,Integer>, Tuple2<Integer,Integer>>() {
                    private transient AggregatingState<Integer,Integer> state;

                    @Override
                    public void open(Configuration parameters) throws Exception {
                        super.open(parameters);
                        state = getRuntimeContext().getAggregatingState(new AggregatingStateDescriptor<>("agg",
                                new AggregateFunction<Integer, Tuple2<Integer,Integer>, Integer>() {
                                    @Override
                                    public Tuple2<Integer, Integer> createAccumulator() {
                                        return new Tuple2<>(0,0);
                                    }

                                    @Override
                                    public Tuple2<Integer, Integer> add(Integer value, Tuple2<Integer, Integer> accumulator) {
                                        return new Tuple2<>((int) accumulator.getField(0) + value,
                                                (int) accumulator.getField(1) + 1);
                                    }

                                    @Override
                                    public Integer getResult(Tuple2<Integer, Integer> accumulator) {
                                        return (int) accumulator.getField(0) / (int) accumulator.getField(1);
                                    }

                                    @Override
                                    public Tuple2<Integer, Integer> merge(Tuple2<Integer, Integer> a, Tuple2<Integer, Integer> b) {
                                        return new Tuple2<>((int) a.getField(0) + (int) b.getField(0),
                                                (int) a.getField(1) + (int) b.getField(1));
                                    }
                                }, TypeInformation.of(new TypeHint<Tuple2<Integer,Integer>>() {})));
                    }

                    @Override
                    public void flatMap(Tuple2<Integer, Integer> value, Collector<Tuple2<Integer, Integer>> out) throws Exception {
                        state.add(value.getField(1));
                        Tuple2<Integer,Integer> tuple2 = new Tuple2<>(value.getField(0),
                                state.get());
                        out.collect(tuple2);
                    }
                }).print().setParallelism(1);
        env.execute("AggregatingStateApp");
    }
}

運行結果

(1,3)
(1,4)
(1,5)
(1,4)
(1,4)

Scala代碼

import org.apache.flink.api.common.functions.{AggregateFunction, RichFlatMapFunction}
import org.apache.flink.api.common.state.{AggregatingState, AggregatingStateDescriptor}
import org.apache.flink.api.scala.createTypeInformation
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.util.Collector

object AggregatingStateApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.fromCollection(List((1,3),(1,5),(1,7),(1,4),(1,2)))
      .keyBy(_._1)
      .flatMap(new RichFlatMapFunction[(Int,Int),(Int,Int)] {
        var state: AggregatingState[Int,Int] = _

        override def open(parameters: Configuration): Unit = {
          state = getRuntimeContext.getAggregatingState(new AggregatingStateDescriptor[Int,(Int,Int),Int]("agg",
            new AggregateFunction[Int,(Int,Int),Int] {
              override def add(value: Int, accumulator: (Int, Int)) = {
                (accumulator._1 + value,accumulator._2 + 1)
              }

              override def createAccumulator() = {
                (0,0)
              }

              override def getResult(accumulator: (Int, Int)) = {
                accumulator._1 / accumulator._2
              }

              override def merge(a: (Int, Int), b: (Int, Int)) = {
                (a._1 + b._1,a._2 + b._2)
              }
            },createTypeInformation[(Int,Int)]))
        }

        override def flatMap(value: (Int, Int), out: Collector[(Int, Int)]) = {
          state.add(value._2)
          val tuple2 = (value._1,state.get())
          out.collect(tuple2)
        }
      }).print().setParallelism(1)
    env.execute("AggregatingStateApp")
  }
}

運行結果

(1,3)
(1,4)
(1,5)
(1,4)
(1,4)

Checkpoint機制

Flink中的每一個算子都能成爲有狀態的,爲了使得狀態能夠容錯,持久化狀態,就有了Checkpoint機制。Checkpoint能夠恢復狀態以及在流中消費的位置,提供一種無故障執行的方式。

默認情況下,checkpoint機制是禁用的,需要我們手動開啓。

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//開啓Checkpoint,間隔時間4秒進行一次Checkpoint
env.enableCheckpointing(4000);
//設置Checkpoint的模式,精準一次,也是Checkpoint默認的方式,適合大部分應用,
//還有一種CheckpointingMode.AT_LEAST_ONCE最少一次,一般用於超低延遲的場景
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
//設置Checkpoint的超時時間,這裏是10秒
env.getCheckpointConfig().setCheckpointTimeout(10000);
//設置Checkpoint的併發數,可以1個,可以多個
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);

A runnable example that only enables checkpointing:

public class CheckpointApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(5000);
        DataStreamSource<String> stream = env.socketTextStream("127.0.0.1", 9999);
        stream.map(x -> {
            if (x.contains("pk")) {
                throw new RuntimeException("出bug了...");
            }else {
                return x;
            }
        }).print().setParallelism(1);
        env.execute("CheckpointApp");
    }
}

Normally, if nc -lk 9999 is not listening, the job would simply fail on startup. Because checkpointing is enabled here, though, Flink's default restart strategy kicks in (fixed-delay restarts with Integer.MAX_VALUE attempts), so even with port 9999 closed the job keeps trying to reconnect instead of dying, and the log repeats endlessly:

java.net.ConnectException: Connection refused (Connection refused)
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:589)
	at org.apache.flink.streaming.api.functions.source.SocketTextStreamFunction.run(SocketTextStreamFunction.java:96)
	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:94)
	at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:58)
	at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:99)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
	at java.lang.Thread.run(Thread.java:748)

Scala代碼

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._

object CheckpointApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.enableCheckpointing(5000)
    val stream = env.socketTextStream("127.0.0.1",9999)
    stream.map(x => {
      if (x.contains("pk")) {
        throw new RuntimeException("出bug了...")
      } else {
        x
      }
    }).print().setParallelism(1)
    env.execute("CheckpointApp")
  }
}

重啓策略

As we just saw, if no restart strategy is configured but checkpointing is enabled, Flink falls back to a default fixed-delay restart strategy with Integer.MAX_VALUE attempts and a 1-second delay. If we only want the job restarted twice, we have to set a restart strategy ourselves, either in Flink's configuration file or in code.

如在flink的conf目錄下編輯flink-conf.yaml

restart-strategy.fixed-delay.attempts: 2
restart-strategy.fixed-delay.delay: 5 s

失敗後重啓次數2,延遲時間間隔5秒

代碼中設置

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(5000);
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(2, Time.of(5, TimeUnit.SECONDS)));

這裏需要注意的是使用重啓策略,必須開啓Checkpoint機制,否則無效

public class CheckpointApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(5000);
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(2, Time.of(5, TimeUnit.SECONDS)));
        DataStreamSource<String> stream = env.socketTextStream("127.0.0.1", 9999);
        stream.map(x -> {
            if (x.contains("pk")) {
                throw new RuntimeException("出bug了...");
            }else {
                return x;
            }
        }).print().setParallelism(1);
        env.execute("CheckpointApp");
    }
}

With nc -lk 9999 running, start the program again. After typing pk twice into nc, the program throws the exception each time

java.lang.RuntimeException: 出bug了...
	at com.guanjian.flink.java.test.CheckpointApp.lambda$main$95f17bfa$1(CheckpointApp.java:18)
	at org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:41)
	at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:202)
	at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
	at java.lang.Thread.run(Thread.java:748)

but the job does not die; only after the third pk does it fail for good, because the two configured restart attempts have been used up.

Scala代碼

import java.util.concurrent.TimeUnit

import org.apache.flink.api.common.restartstrategy.RestartStrategies
import org.apache.flink.api.common.time.Time
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._

object CheckpointApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.enableCheckpointing(5000)
    env.setRestartStrategy(RestartStrategies.fixedDelayRestart(2, Time.of(5, TimeUnit.SECONDS)))
    val stream = env.socketTextStream("127.0.0.1",9999)
    stream.map(x => {
      if (x.contains("pk")) {
        throw new RuntimeException("出bug了...")
      } else {
        x
      }
    }).print().setParallelism(1)
    env.execute("CheckpointApp")
  }
}

StateBackend

默認情況下,Checkpoint的State是存儲在內存中,一旦我們的程序掛掉了,重新啓動,那麼之前的狀態都會丟失,比方說之前我們在nc中輸入了

a,a,a

以之前的CheckpointApp來說,我們稍作修改

public class CheckpointApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(5000);
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(2, Time.of(5, TimeUnit.SECONDS)));
        DataStreamSource<String> stream = env.socketTextStream("127.0.0.1", 9999);
        stream.map(x -> {
            if (x.contains("pk")) {
                throw new RuntimeException("出bug了...");
            }else {
                return x;
            }
        }).flatMap(new FlatMapFunction<String, Tuple2<String,Integer>>() {
            @Override
            public void flatMap(String value, Collector<Tuple2<String,Integer>> out) throws Exception {
                String[] splits = value.split(",");
                Stream.of(splits).forEach(token -> out.collect(new Tuple2<>(token,1)));
            }
        }).keyBy(0).sum(1)
                .print().setParallelism(1);
        env.execute("CheckpointApp");
    }
}

運行結果爲

(a,1)
(a,2)
(a,3)

That works as expected. But once the program dies and is restarted, feeding the same input produces exactly the same output again: the count starts over at (a,1) because the state only lived in memory.

If instead we want the counters to survive a restart, i.e. a result like

(a,4)
(a,5)
(a,6)

that is, to keep the results from before the crash and keep accumulating on top of them. To do this we retain externalized checkpoints and store the state in a filesystem backend:

public class CheckpointApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(5000);
        //非內存的外部擴展
        env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
        //State以文件方式存儲
        env.setStateBackend(new FsStateBackend("hdfs://172.18.114.236:8020/backend"));
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(2, Time.of(5, TimeUnit.SECONDS)));
        DataStreamSource<String> stream = env.socketTextStream("host1", 9999);
        stream.map(x -> {
            if (x.contains("pk")) {
                throw new RuntimeException("出bug了...");
            }else {
                return x;
            }
        }).flatMap(new FlatMapFunction<String, Tuple2<String,Integer>>() {
            @Override
            public void flatMap(String value, Collector<Tuple2<String,Integer>> out) throws Exception {
                String[] splits = value.split(",");
                Stream.of(splits).forEach(token -> out.collect(new Tuple2<>(token,1)));
            }
        }).keyBy(0).sum(1)
                .print().setParallelism(1);
        env.execute("CheckpointApp");
    }
}

pom中調整運行的主類

<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
   <mainClass>com.guanjian.flink.java.test.CheckpointApp</mainClass>
</transformer>

打包上傳服務器flink的test的目錄

修改flink的conf目錄下的flink-conf.yaml,補充以下內容

state.backend: filesystem
state.checkpoints.dir: hdfs://172.18.114.236:8020/backend
state.savepoints.dir: hdfs://172.18.114.236:8020/backend

在HDFS中新建backend目錄

hdfs dfs -mkdir /backend

重啓Flink,開啓

nc -lk 9999

第一次提交方式不變

./flink run -m yarn-cluster -yn 1 ../test/flink-train-java-1.0.jar

繼續之前的輸入

a,a,a

Now stop the submitted Flink job. In HDFS a checkpoint directory named after the job ID (a long hex string) appears, containing chk-N subdirectories.

現在我們再次啓動程序,不過跟之前有些不同

./flink run -s hdfs://172.18.114.236:8020/backend/4db93b564e17b3806230f7c2d053121e/chk-5 -m yarn-cluster -yn 1 ../test/flink-train-java-1.0.jar 

此時在nc中繼續輸入

a,a,a

運行結果就達到了我們的預期

Scala代碼

import java.util.concurrent.TimeUnit

import org.apache.flink.api.common.restartstrategy.RestartStrategies
import org.apache.flink.api.common.time.Time
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.environment.CheckpointConfig

object CheckpointApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.enableCheckpointing(5000)
    env.getCheckpointConfig.enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)
    env.setStateBackend(new FsStateBackend("hdfs://172.18.114.236:8020/backend"))
    env.setRestartStrategy(RestartStrategies.fixedDelayRestart(2, Time.of(5, TimeUnit.SECONDS)))
    val stream = env.socketTextStream("host1",9999)
    stream.map(x => {
      if (x.contains("pk")) {
        throw new RuntimeException("出bug了...")
      } else {
        x
      }
    }).flatMap(_.split(","))
      .map((_,1))
      .keyBy(0)
      .sum(1)
      .print().setParallelism(1)
    env.execute("CheckpointApp")
  }
}
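
Restoring from a retained externalized checkpoint works, but for a planned stop it is usually cleaner to trigger an explicit savepoint and restart from that path. A sketch of the usual CLI steps, with <jobId> as a placeholder (when the job runs on YARN, the -yid <applicationId> option may also be needed to address the right cluster):

./flink savepoint <jobId> hdfs://172.18.114.236:8020/backend
./flink run -s <savepointPath> -m yarn-cluster -yn 1 ../test/flink-train-java-1.0.jar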

RocksDBStateBackend

要使用RocksDBBackend需要先添加依賴

<dependency>
   <groupId>org.apache.flink</groupId>
   <artifactId>flink-statebackend-rocksdb_2.11</artifactId>
   <version>${flink.version}</version>
</dependency>

With the dependency in place, the only change to CheckpointApp is the state backend:

public class CheckpointApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(5000);
        //非內存的外部擴展
        env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
        // State is kept locally in RocksDB and shipped to HDFS on checkpoints
        env.setStateBackend(new RocksDBStateBackend("hdfs://172.18.114.236:8020/backend/rocksDB",true));
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(2, Time.of(5, TimeUnit.SECONDS)));
        DataStreamSource<String> stream = env.socketTextStream("host1", 9999);
        stream.map(x -> {
            if (x.contains("pk")) {
                throw new RuntimeException("出bug了...");
            }else {
                return x;
            }
        }).flatMap(new FlatMapFunction<String, Tuple2<String,Integer>>() {
            @Override
            public void flatMap(String value, Collector<Tuple2<String,Integer>> out) throws Exception {
                String[] splits = value.split(",");
                Stream.of(splits).forEach(token -> out.collect(new Tuple2<>(token,1)));
            }
        }).keyBy(0).sum(1)
                .print().setParallelism(1);
        env.execute("CheckpointApp");
    }
}

打包上傳服務器flink的test目錄下

創建hdfs的目錄

hdfs dfs -mkdir /backend/rocksDB

配置flink的flink-conf.yaml,修改和添加以下內容

state.backend: rocksdb
state.checkpoints.dir: hdfs://172.18.114.236:8020/backend/rocksDB
state.savepoints.dir: hdfs://172.18.114.236:8020/backend/rocksDB
state.backend.incremental: true
state.backend.rocksdb.checkpoint.transfer.thread.num: 1
state.backend.rocksdb.localdir: /raid/db/flink/checkpoints
state.backend.rocksdb.timer-service.factory: HEAP

重啓Flink.執行

nc -lk 9999

第一次提交方式不變

./flink run -m yarn-cluster -yn 1 ../test/flink-train-java-1.0.jar

 繼續之前的輸入

a,a,a

Now stop the submitted Flink job again; as before, a job-ID-named checkpoint directory shows up in HDFS, this time under /backend/rocksDB.

On one of the cluster machines (whichever one the TaskManager happens to run on, not necessarily the one you submitted the job from) you can find RocksDB's local data files under the configured state.backend.rocksdb.localdir.

The RocksDB backend writes state to this local directory first and then ships the checkpoint data to HDFS.
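
The local directory and incremental mode can also be set in code instead of flink-conf.yaml; a small sketch that would slot into the CheckpointApp above (the paths are the environment-specific ones used in this setup):

// configure the RocksDB backend programmatically instead of via flink-conf.yaml
RocksDBStateBackend backend = new RocksDBStateBackend("hdfs://172.18.114.236:8020/backend/rocksDB", true); // true = incremental checkpoints
backend.setDbStoragePath("/raid/db/flink/checkpoints"); // local directory RocksDB writes to before checkpoints are copied to HDFS
env.setStateBackend(backend);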

再次啓動程序

./flink run -s hdfs://172.18.114.236:8020/backend/rocksDB/6277c8adfba91c72baa384a0d23581d9/chk-64 -m yarn-cluster -yn 1 ../test/flink-train-java-1.0.jar

此時輸入

a,a,a

此時我們去觀察結果跟之前相同

Scala代碼

import java.util.concurrent.TimeUnit

import org.apache.flink.api.common.restartstrategy.RestartStrategies
import org.apache.flink.api.common.time.Time
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
import org.apache.flink.streaming.api.environment.CheckpointConfig

object CheckpointApp {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.enableCheckpointing(5000)
    env.getCheckpointConfig.enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)
    env.setStateBackend(new RocksDBStateBackend("hdfs://172.18.114.236:8020/backend/rocksDB",true))
    env.setRestartStrategy(RestartStrategies.fixedDelayRestart(2, Time.of(5, TimeUnit.SECONDS)))
    val stream = env.socketTextStream("host1",9999)
    stream.map(x => {
      if (x.contains("pk")) {
        throw new RuntimeException("出bug了...")
      } else {
        x
      }
    }).flatMap(_.split(","))
      .map((_,1))
      .keyBy(0)
      .sum(1)
      .print().setParallelism(1)
    env.execute("CheckpointApp")
  }
}