Ways to read MySQL from Spark, and tuning read parallelism

Original

2019-07-08 06:25

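A while ago I used SparkSession to read a MySQL table and ran into long run times and frequent OOM errors. After searching online I found the cause: I was using the default JDBC read, which puts the whole load on a single task, hence the slowness and the OOMs. The fix is to raise the read parallelism. Here are some brief notes.
Looking at the source of SparkSession's DataFrameReader, there are three overloads of jdbc for reading over JDBC.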

1. Single partition, no parallelism
Function signature

  def jdbc(url: String, table: String, properties: Properties): DataFrame

Usage and verification

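    import java.util.Properties
    import org.apache.spark.sql.DataFrame

    val url: String = "jdbc:mysql://localhost:3306/testdb"
    val table = "students"
    // Connection properties; the JDBC key for the account name is "user", not "username"
    val properties: Properties = new Properties()
    properties.setProperty("user", "root")
    properties.setProperty("password", "123456")
    // Driver class for Connector/J 5.x; Connector/J 8.x uses com.mysql.cj.jdbc.Driver
    properties.setProperty("driver", "com.mysql.jdbc.Driver")
    val tb_table: DataFrame = sparkSession.read.jdbc(url, table, properties)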

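Check the parallelism: tb_table.rdd.getNumPartitions returns 1.
The parallelism of this read is 1: all the data lands in a single partition, so no matter how many resources you allocate, only one task does the work. You can imagine how slow that is, and on even a moderately large table it will OOM in no time.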
2. Partitioning on a Long (integer) column
Function signature

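  def jdbc(
      url: String,
      table: String,
      columnName: String,    // column to partition on; must be an integral type, e.g. id
      lowerBound: Long,      // lower bound, used to compute the partition stride
      upperBound: Long,      // upper bound, used to compute the partition stride
      numPartitions: Int,    // number of partitions
      connectionProperties: Properties): DataFrame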

Usage and verification

    val url: String = "jdbc:mysql://localhost:3306/testdb"
    val table = "students"
    val colName: String = "id"
    val lowerBound = 1L
    val upperBound = 10000L
    val numPartitions = 10
    // Connection properties; the JDBC key for the account name is "user", not "username"
    val properties: Properties = new Properties()
    properties.setProperty("user", "root")
    properties.setProperty("password", "123456")
    properties.setProperty("driver", "com.mysql.jdbc.Driver")
    val tb_table: DataFrame = sparkSession.read.jdbc(url, table, colName, lowerBound, upperBound, numPartitions, properties)

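Check the parallelism: tb_table.rdd.getNumPartitions returns 10.
This splits the values 1 to 10000 of the column colName across 10 partitions. Note that the bounds only set the partition stride and do not filter rows: values outside the range are still read, in the first or last partition. It is convenient to use, but the drawback is just as obvious: this overload can only partition on an integral column.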
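To make the partitioning concrete, here is a minimal sketch of how the bounds become one WHERE clause per partition. It approximates the stride logic of Spark 2.x and is a simplification, not Spark's actual code; the object StrideSketch and the method stridePredicates are hypothetical names used only for illustration.

```scala
object StrideSketch {
  // Hypothetical helper, not a Spark API: approximates how Spark 2.x derives
  // one WHERE clause per partition from lowerBound, upperBound and numPartitions.
  def stridePredicates(column: String, lower: Long, upper: Long, numPartitions: Int): Seq[String] = {
    require(numPartitions >= 2, "this sketch only covers the multi-partition case")
    require(lower < upper, "lower bound must be smaller than upper bound")
    // Integer stride between consecutive partition boundaries.
    val stride = upper / numPartitions - lower / numPartitions
    (0 until numPartitions).map { i =>
      val from = lower + i * stride        // inclusive start of partition i
      val until = lower + (i + 1) * stride // exclusive end of partition i
      if (i == 0) s"$column < $until or $column is null" // first partition is open below and takes NULLs
      else if (i == numPartitions - 1) s"$column >= $from" // last partition is open above
      else s"$column >= $from AND $column < $until"
    }
  }
}
```

Calling StrideSketch.stridePredicates("id", 1L, 10000L, 10) yields ten clauses. The first and last are open-ended, which is why rows outside the bounds are not dropped but pile up in the edge partitions.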
3. Partitioning on a column of any type
Function signature

  def jdbc(
      url: String,
      table: String,
      predicates: Array[String],    // one WHERE condition per partition
      connectionProperties: Properties): DataFrame

Usage and verification

    val url: String = "jdbc:mysql://localhost:3306/testdb"
    val table = "students"
    /**
     * Read the three months of data from Sep 16 to Dec 15,
     * split by date into 6 partitions.
     * To keep the example short, the dates are hard-coded.
     * sbirthday is the date column.
     */
    val predicates =
      Array(
        "2015-09-16" -> "2015-09-30",
        "2015-10-01" -> "2015-10-15",
        "2015-10-16" -> "2015-10-31",
        "2015-11-01" -> "2015-11-14",
        "2015-11-15" -> "2015-11-30",
        "2015-12-01" -> "2015-12-15"
      ).map {
        case (start, end) =>
          s"cast(sbirthday as date) >= date '$start' " + s"AND cast(sbirthday as date) <= date '$end'"
      }
    // Connection properties; the JDBC key for the account name is "user", not "username"
    val properties: Properties = new Properties()
    properties.setProperty("user", "root")
    properties.setProperty("password", "123456")
    properties.setProperty("driver", "com.mysql.jdbc.Driver")
    val tb_table: DataFrame = sparkSession.read.jdbc(url, table, predicates, properties)

Check the parallelism: tb_table.rdd.getNumPartitions returns 6.

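Each partition holds the rows whose sbirthday falls in the corresponding date range. Every predicate runs as its own WHERE clause, so the predicates should not overlap and should together cover all the rows you need. This approach fits most scenarios and is the one I recommend.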
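The date ranges in the example are hard-coded. As a minimal sketch of how they could be generated instead, the hypothetical helper below (DatePredicates.build is not a Spark API) splits an inclusive date range into contiguous, non-overlapping pieces. It assumes the column can be compared directly with 'yyyy-MM-dd' string literals, as MySQL date and datetime columns can, and it uses half-open ranges rather than the cast(...) form above.

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit

object DatePredicates {
  // Hypothetical helper: splits the inclusive range [start, end] into `parts`
  // contiguous, non-overlapping day ranges and renders one predicate per range.
  def build(column: String, start: LocalDate, end: LocalDate, parts: Int): Array[String] = {
    val totalDays = ChronoUnit.DAYS.between(start, end) + 1 // inclusive day count
    require(parts >= 1 && parts <= totalDays, "parts must be between 1 and the number of days")
    (0 until parts).map { i =>
      val from = start.plusDays(totalDays * i / parts)        // inclusive lower bound
      val until = start.plusDays(totalDays * (i + 1) / parts) // exclusive upper bound
      s"$column >= '$from' AND $column < '$until'"
    }.toArray
  }
}
```

The result can be passed as the predicates argument, for example DatePredicates.build("sbirthday", LocalDate.parse("2015-09-16"), LocalDate.parse("2015-12-15"), 6). Half-open ranges guarantee that no row is matched by two partitions, even when the column carries a time component.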
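Reading MySQL with a single partition, a large table very easily hangs for minutes and then fails with OOM.
Splitting the read into multiple partitions largely avoids this. But if the partition count is set too high, a large number of partitions hitting the database at once can bring the database down, so take care.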
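One way to keep the partition count in check is to cap it before the read. The sketch below is only an illustration under stated assumptions: JdbcReadOptions.build is a hypothetical helper, and maxConnections is an assumed budget of concurrent connections your database can tolerate. The option keys themselves (url, dbtable, partitionColumn, lowerBound, upperBound, numPartitions) are the ones Spark's JDBC data source reads.

```scala
object JdbcReadOptions {
  // Hypothetical helper, not part of Spark: builds the option map for a
  // partitioned JDBC read and caps the partition count at `maxConnections`,
  // an assumed limit on concurrent connections to the database.
  def build(url: String, table: String, column: String,
            lower: Long, upper: Long,
            requestedPartitions: Int, maxConnections: Int): Map[String, String] = {
    require(requestedPartitions >= 1 && maxConnections >= 1)
    val partitions = math.min(requestedPartitions, maxConnections)
    Map(
      "url" -> url,
      "dbtable" -> table,
      "partitionColumn" -> column,
      "lowerBound" -> lower.toString,
      "upperBound" -> upper.toString,
      "numPartitions" -> partitions.toString
    )
  }
}
```

The map can then be handed to the reader with sparkSession.read.format("jdbc").options(opts), adding user and password as further options before calling load(). The number of queries actually running at once is also limited by the executor cores available, so the cap is an upper bound rather than a guarantee.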

Reference: spark jdbc(mysql) 讀取併發度優化 (Spark JDBC (MySQL) read parallelism tuning)
