RDD Operators sample and takeSample: A Source Code Walkthrough

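I. sample

1. Description

Randomly selects a given fraction of the records in the RDD, using the given random seed, and creates a new RDD from them. Returns RDD[T]. The size of the result is not guaranteed to equal fraction × count exactly.

2. Source code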
// Returns a sampled subset of this RDD
def sample(withReplacement: Boolean, fraction: Double, seed: Long = Utils.random.nextLong): RDD[T] = {
  require(fraction >= 0, s"Fraction must be nonnegative, but got ${fraction}")
  withScope {
    require(fraction >= 0.0, "Negative fraction value: " + fraction)
    if (withReplacement) {
      new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), true, seed)
    } else {
      new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), true, seed)
    }
  }
}
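  • Parameters
withReplacement: whether to sample with replacement. true = with replacement, false = without replacement
fraction: the expected size of the sample as a fraction of the RDD's size
      When withReplacement=false, this is the probability that each element is selected; fraction must be in [0, 1]
      When withReplacement=true, this is the expected number of times each element is selected; fraction must be >= 0
seed: the seed for the random number generator. It defaults to a random value and is usually left unset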
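To make the two meanings of fraction concrete, here is a minimal plain-Scala sketch that does not use Spark. The object SamplerSketch and its methods are hypothetical names introduced for illustration; they only mimic the per-element behavior described above (a coin flip per element without replacement, a Poisson-distributed copy count per element with replacement) and are not Spark's BernoulliSampler or PoissonSampler implementations.

```scala
import scala.util.Random

// Hypothetical helper object for illustration only; not part of Spark.
object SamplerSketch {

  // Without replacement: each element is kept at most once,
  // independently, with probability `fraction`.
  def bernoulliSample[T](data: Seq[T], fraction: Double, seed: Long): Seq[T] = {
    require(fraction >= 0.0 && fraction <= 1.0, s"fraction must be in [0, 1], got $fraction")
    val rng = new Random(seed)
    data.filter(_ => rng.nextDouble() < fraction)
  }

  // Draws one Poisson(mean) value with Knuth's multiplication method.
  // Adequate for the small means used in this sketch.
  def poisson(mean: Double, rng: Random): Int = {
    val limit = math.exp(-mean)
    var k = 0
    var p = rng.nextDouble()
    while (p > limit) {
      k += 1
      p *= rng.nextDouble()
    }
    k
  }

  // With replacement: each element is emitted a Poisson(fraction) number
  // of times, so `fraction` is the expected number of copies per element.
  def poissonSample[T](data: Seq[T], fraction: Double, seed: Long): Seq[T] = {
    require(fraction >= 0.0, s"fraction must be nonnegative, got $fraction")
    val rng = new Random(seed)
    data.flatMap(x => Seq.fill(poisson(fraction, rng))(x))
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(2, 3, 7, 4, 8)
    // The result sizes vary with the seed; only their expectations are fixed.
    println(bernoulliSample(data, 0.5, seed = 42L))
    println(poissonSample(data, 2.0, seed = 42L))
  }
}
```

With fraction=0.5 the first call returns about half of the elements on average, never a duplicate. With fraction=2.0 the second call returns about ten values on average, and the same element can appear several times or not at all.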
3. Examples
  • Sampling without replacement, where each element is selected with probability 0.5: fraction=0.5
val rdd=sc.parallelize(List(2,3,7,4,8))
val sampleRdd=rdd.sample(false,0.5)
sampleRdd.foreach(println)
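  • Sampling with replacement, where each element is expected to be selected 2 times: fraction=2
// Arguments: (with/without replacement, sampling fraction, random seed)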
val rdd=sc.parallelize(List(2,3,7,4,8))
val sampleRdd=rdd.sample(true,2)
sampleRdd.foreach(println)

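II. takeSample

1. Description

Returns a fixed-size sampled subset of this RDD. Returns Array[T]
Note: use this method only when the resulting array is expected to be small, because all of the data is loaded into the driver's memory

2. Source code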
def takeSample(withReplacement: Boolean, num: Int, seed: Long = Utils.random.nextLong): Array[T] = withScope {
    val numStDev = 10.0
    require(num >= 0, "Negative number of elements requested")
    require(num <= (Int.MaxValue - (numStDev * math.sqrt(Int.MaxValue)).toInt),
      "Cannot support a sample size > Int.MaxValue - " +
      s"$numStDev * math.sqrt(Int.MaxValue)")

    if (num == 0) {
      new Array[T](0)
    } else {
      val initialCount = this.count()
      if (initialCount == 0) {
        new Array[T](0)
      } else {
        val rand = new Random(seed)
        if (!withReplacement && num >= initialCount) {
          Utils.randomizeInPlace(this.collect(), rand)
        } else {
          val fraction = SamplingUtils.computeFractionForSampleSize(num, initialCount,
            withReplacement)
          var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
          var numIters = 0
          while (samples.length < num) {
            logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
            samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
            numIters += 1
          }
          Utils.randomizeInPlace(samples, rand).take(num)
        }
      }
    }
  }
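  • Parameters
withReplacement: whether to sample with replacement. true = with replacement, false = without replacement
num: the size of the returned sample
seed: the seed for the random number generator. It defaults to a random value and is usually left unset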
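The control flow of the source above (oversample, retry if the pass comes up short, then shuffle and truncate) can be sketched on a local collection without Spark. Everything in TakeSampleSketch is a hypothetical name introduced for illustration. In particular, oversampledFraction is only a stand-in for SamplingUtils.computeFractionForSampleSize: the Binomial-style bound and the 1e-4 failure probability in it are assumptions, and only the without-replacement case is covered.

```scala
import scala.util.Random

// Hypothetical helper object for illustration only; not part of Spark.
object TakeSampleSketch {

  // Assumed stand-in for SamplingUtils.computeFractionForSampleSize
  // (without replacement). Pads num / total so that one sampling pass
  // returns at least `num` elements with high probability.
  def oversampledFraction(num: Int, total: Long, delta: Double = 1e-4): Double = {
    val p = num.toDouble / total
    val gamma = -math.log(delta) / total
    math.min(1.0, p + gamma + math.sqrt(gamma * gamma + 2 * gamma * p))
  }

  // Mirrors the branches of takeSample(withReplacement = false, num, seed).
  def takeSampleWithoutReplacement[T](data: Vector[T], num: Int, seed: Long): Vector[T] = {
    require(num >= 0, "Negative number of elements requested")
    val rand = new Random(seed)
    if (num == 0 || data.isEmpty) {
      Vector.empty
    } else if (num >= data.size) {
      // Asked for at least the whole data set: return all of it, shuffled.
      rand.shuffle(data)
    } else {
      val fraction = oversampledFraction(num, data.size.toLong)
      // One Bernoulli pass with a fresh seed, like sample(false, fraction, seed).
      def attempt(): Vector[T] = {
        val rng = new Random(rand.nextInt().toLong)
        data.filter(_ => rng.nextDouble() < fraction)
      }
      var samples = attempt()
      // Rare case: the oversampled pass still returned too few elements.
      while (samples.size < num) {
        samples = attempt()
      }
      // Shuffle so that truncating to `num` does not favor early elements.
      rand.shuffle(samples).take(num)
    }
  }
}
```

Calling TakeSampleSketch.takeSampleWithoutReplacement((1 to 1000).toVector, 10, 42L) always yields exactly 10 distinct values, which is the property that distinguishes takeSample from sample.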
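3. Examples
  • Sampling without replacement with a requested sample size > the number of elements in the RDD: returns as many elements as the RDD contains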
val rdd=sc.parallelize(List(2,3,7,4,8))
val sampleRdd=rdd.takeSample(false,10)
sampleRdd.foreach(println)
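  • Sampling without replacement with a requested sample size <= the number of elements in the RDD: returns the requested number of elements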
val rdd=sc.parallelize(List(2,3,7,4,8))
val sampleRdd=rdd.takeSample(false,3)
sampleRdd.foreach(println)
  • Sampling with replacement: returns the requested number of elements
val rdd=sc.parallelize(List(2,3,7,4,8))
val sampleRdd=rdd.takeSample(true,3)
sampleRdd.foreach(println)