CountOnce

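Problem statement: given an array in which exactly one value appears once and every other value appears twice, find the value that appears once.
1. Example description
The input consists of three files:
1.txt contains:
1,2,1,3,3
2.txt:
4,5,4,6,5
3.txt:
6,7,8,8,7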

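2. Design approach
   XOR all the IDs in the list together; the resulting value is the ID we are looking for, because XOR-ing a value with itself gives 0 and so every value that appears twice cancels out. First XOR the data within each partition, then XOR the per-partition results together.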
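The two-stage idea can be checked without Spark. The sketch below is only an illustration under assumptions: the object name `XorSketch` is hypothetical, and each file is treated as if it were exactly one partition, which Spark does not guarantee. It relies on XOR being commutative and associative with `a ^ a == 0` and `a ^ 0 == a`, so the grouping of the values does not affect the result. Note that in the sample data 6 appears once in 2.txt and once in 3.txt, so it still cancels in the second stage.

```scala
object XorSketch {
  // Sample input from section 1: one line per file.
  val files: Seq[String] = Seq("1,2,1,3,3", "4,5,4,6,5", "6,7,8,8,7")

  // XOR all integers on one comma-separated line.
  def lineXor(line: String): Int =
    line.split(",").map(_.trim.toInt).foldLeft(0)((acc, x) => acc ^ x)

  // Stage 1: one partial XOR per file (standing in for one per partition).
  val partials: Seq[Int] = files.map(lineXor)

  // Stage 2: XOR the partial results together.
  val unique: Int = partials.foldLeft(0)((acc, x) => acc ^ x)

  def main(args: Array[String]): Unit = {
    println(partials) // List(2, 6, 6)
    println(unique)   // 2
  }
}
```

The partial results 2, 6 and 6 combine to 2, which matches the value reported in the run result.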

3. Code example

package com.fly.spark

import org.apache.spark.{SparkConf, SparkContext}
// Pair-RDD implicits (reduceByKey); the explicit import is only needed on Spark versions before 1.3.
import org.apache.spark.SparkContext._

object MapPartitionDemo {
  // XOR all comma-separated integers on one line; a blank line contributes 0.
  def lineXor(line: String): Int =
    line.trim.split(",").filter(_.nonEmpty).foldLeft(0)((acc, s) => acc ^ s.trim.toInt)

  // XOR every line of one partition and emit a single (key, partialXor) pair.
  // Folding from 0 also covers an empty partition, where calling iter.next() would throw.
  def myfunc(iter: Iterator[String]): Iterator[(Int, Int)] =
    Iterator((1, iter.foldLeft(0)((acc, line) => acc ^ lineXor(line))))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MapPartitionDemo").setMaster("local[1]")
    val sc = new SparkContext(conf)
    val data = sc.textFile("hdfs://master:9000/xor")
    // Every partition emits the same key, so reduceByKey XORs all partial results together.
    val result = data.mapPartitions(myfunc).reduceByKey(_ ^ _)
    val lastResult = result.collect()
    println(lastResult(0))
    sc.stop()
  }
}

4. Run result
(1,2)

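5. Notes on the program
map could also be used here, but map and mapPartitions do differ in some respects. The following explanation can be found online: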
As a note, a presentation provided by a speaker at the 2013 San Francisco Spark Summit (goo.gl/JZXDCR) highlights that tasks with high per-record overhead perform better with a mapPartition than with a map transformation. This is, according to the presentation, due to the high cost of setting up a new task.
That said, not sure if there is a difference in parallel execution and memory usage between map and mapPartitions. For instance, map could work in parallel implicitly, mapPartitions forces you to iterate. Thus computation could be faster with map but if your execution on a single tuple uses a lot of temporary memory, mapPartitions could avoid GC and memory issues. No idea if this is the way it actually works, but my anecdotal evidence seems to imply this. Would love to have confirmation.

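Of course, the two clearly differ in usage: map applies its function to the elements of the RDD one at a time, across all partitions, whereas mapPartitions applies its function once to each partition of the RDD, passing it an iterator over that partition's elements.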
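The difference in how often the function is invoked can be modelled with plain Scala collections. This is a rough sketch rather than Spark code: `MapVsMapPartitionsSketch`, `mapStyle` and `mapPartitionsStyle` are hypothetical names, and the two-partition layout of the sample lines is an assumption made only for illustration. Each method returns the final XOR together with the number of times the user function was called.

```scala
object MapVsMapPartitionsSketch {
  // Hypothetical layout: the three sample lines spread over two partitions.
  val partitions: Seq[Seq[String]] =
    Seq(Seq("1,2,1,3,3", "4,5,4,6,5"), Seq("6,7,8,8,7"))

  // XOR all integers on one comma-separated line.
  def lineXor(line: String): Int =
    line.split(",").map(_.trim.toInt).foldLeft(0)((acc, x) => acc ^ x)

  // map-style: the user function is called once per element.
  def mapStyle(): (Int, Int) = {
    var calls = 0
    val perLine = partitions.flatten.map { line => calls += 1; lineXor(line) }
    (perLine.foldLeft(0)((acc, x) => acc ^ x), calls)
  }

  // mapPartitions-style: the user function is called once per partition
  // and receives an iterator over that partition's elements.
  def mapPartitionsStyle(): (Int, Int) = {
    var calls = 0
    val perPartition = partitions.map { part =>
      calls += 1
      part.iterator.foldLeft(0)((acc, line) => acc ^ lineXor(line))
    }
    (perPartition.foldLeft(0)((acc, x) => acc ^ x), calls)
  }

  def main(args: Array[String]): Unit = {
    println(mapStyle())           // (2,3)
    println(mapPartitionsStyle()) // (2,2)
  }
}
```

Both styles arrive at 2; the per-partition version simply calls the function fewer times, which is where any per-call setup cost would be saved.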