Spark Programming Model (9): Basic RDD Transformation Operations: mapPartitions and mapPartitionsWithIndex

mapPartitions():

def mapPartitions[U](f: (Iterator[T]) => Iterator[U], preservesPartitioning: Boolean = false)(implicit arg0: ClassTag[U]): RDD[U]
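This function is similar to map, except that the mapping function receives an iterator over each partition of the RDD instead of each individual element. When the mapping needs expensive setup, such as creating extra objects, mapPartitions can be more efficient than map because the setup runs once per partition rather than once per element.
For example, when writing all the data in an RDD to a database over JDBC, map might create a connection for every element, which is costly, whereas mapPartitions needs only one connection per partition.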
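
The once-per-partition setup described above can be sketched as follows. This is a minimal illustration, not a real JDBC writer: `FakeConnection` and `PerPartitionSetup` are hypothetical names, the "connection" only counts how often it is constructed, and the local `SparkContext` is an assumption made so the example is self-contained. The counter is only meaningful in local mode, where all tasks share one JVM.

```scala
import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical stand-in for an expensive resource such as a JDBC connection.
object FakeConnection {
  // Counts constructions; shared by all tasks only because local mode uses one JVM.
  val opened = new AtomicInteger(0)
}

class FakeConnection {
  FakeConnection.opened.incrementAndGet()
  // Pretends to write x and returns an acknowledgement value.
  def send(x: Int): Int = x * 10
  def close(): Unit = ()
}

object PerPartitionSetup {
  def sendAll(sc: SparkContext): Array[Int] = {
    val rdd = sc.makeRDD(1 to 5, 2)
    rdd.mapPartitions { iter =>
      // Created once per partition, not once per element.
      val conn = new FakeConnection
      // iter.map is lazy, so materialize the results before closing the resource.
      val acks = iter.map(conn.send).toList
      conn.close()
      acks.iterator
    }.collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("per-partition-setup")
    val sc = new SparkContext(conf)
    try {
      println(sendAll(sc).mkString(","))     // prints 10,20,30,40,50
      println(FakeConnection.opened.get())   // prints 2, one per partition
    } finally {
      sc.stop()
    }
  }
}
```

Materializing with `toList` keeps the sketch simple but holds a whole partition's results in memory. For large partitions, a wrapper iterator that closes the resource when it is exhausted would be the more careful design.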

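The preservesPartitioning parameter indicates whether the result keeps the parent RDD's partitioner. It should stay false unless the RDD is a pair RDD and the function does not modify the keys.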
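
As a rough illustration of what the flag changes, the sketch below hash-partitions a small pair RDD and checks whether the result of mapPartitions still reports a partitioner. The object name `PreservesPartitioningDemo`, the sample data, and the local `SparkContext` are assumptions made for this example only.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PreservesPartitioningDemo {
  // Returns whether a partitioner is present (without the flag, with the flag).
  def partitionerKept(sc: SparkContext): (Boolean, Boolean) = {
    val pairs = sc.makeRDD(Seq((1, "a"), (2, "b"), (3, "c")))
      .partitionBy(new HashPartitioner(2))

    // Only the values change; the keys are left untouched,
    // so claiming that partitioning is preserved is safe here.
    val upper: Iterator[(Int, String)] => Iterator[(Int, String)] =
      _.map { case (k, v) => (k, v.toUpperCase) }

    val dropped = pairs.mapPartitions(upper)                            // default: false
    val kept = pairs.mapPartitions(upper, preservesPartitioning = true)

    (dropped.partitioner.isDefined, kept.partitioner.isDefined)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("preserves-partitioning")
    val sc = new SparkContext(conf)
    try {
      println(partitionerKept(sc))   // prints (false,true)
    } finally {
      sc.stop()
    }
  }
}
```

Keeping the partitioner matters mainly for later key-based operations, which can skip a shuffle when the data is already partitioned the way they need.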

val rdd1 = sc.makeRDD(1 to 5,2)

// rdd1 has two partitions
scala> val rdd3 = rdd1.mapPartitions{ x => {
     |     // Sum every element in this partition
     |     var i = 0
     |     while(x.hasNext){
     |       i += x.next()
     |     }
     |     // Emit a single value per partition
     |     Iterator(i)
     | }}
rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[84] at mapPartitions at <console>:23

// rdd3 holds the sum of the values in each partition of rdd1
scala> rdd3.collect
res65: Array[Int] = Array(3, 12)

scala> rdd3.partitions.size
res66: Int = 2

mapPartitionsWithIndex():

def mapPartitionsWithIndex[U](f: (Int, Iterator[T]) => Iterator[U], preservesPartitioning: Boolean = false)(implicit arg0: ClassTag[U]): RDD[U]

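This function works like mapPartitions, except that the mapping function takes two arguments, the first of which is the index of the partition.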

val rdd1 = sc.makeRDD(1 to 5,2)

// rdd1 has two partitions
val rdd2 = rdd1.mapPartitionsWithIndex{
        (x,iter) => {
            // Sum every element in this partition
            var i = 0
            while(iter.hasNext){
              i += iter.next()
            }
            // Emit one "index|sum" string per partition
            Iterator(s"$x|$i")
        }
      }

// rdd2 holds the sum of the numbers in each partition of rdd1, prefixed with the partition index
scala> rdd2.collect
res13: Array[String] = Array(0|3, 1|12)
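
The partition index is also handy for inspecting how elements are laid out across partitions. The sketch below tags every element with the index of its partition instead of summing. The object name `PartitionLayout` and the local `SparkContext` are assumptions for illustration; the input is the same `1 to 5` range in two partitions used above.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionLayout {
  // Pairs each element with the index of the partition that holds it.
  def layout(sc: SparkContext): Array[(Int, Int)] =
    sc.makeRDD(1 to 5, 2)
      .mapPartitionsWithIndex((idx, iter) => iter.map(x => (idx, x)))
      .collect()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("partition-layout")
    val sc = new SparkContext(conf)
    try {
      println(layout(sc).mkString(" "))   // prints (0,1) (0,2) (1,3) (1,4) (1,5)
    } finally {
      sc.stop()
    }
  }
}
```

The layout agrees with the sums shown above: partition 0 holds 1 and 2 (sum 3), and partition 1 holds 3, 4 and 5 (sum 12).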
