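I. sample
1. Description
Randomly selects a given fraction of the records in an RDD, using the given random seed, and creates a new RDD from them. Returns RDD[T].
2. Source code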
// Return a sampled subset of this RDD
def sample(
    withReplacement: Boolean,
    fraction: Double,
    seed: Long = Utils.random.nextLong): RDD[T] = {
  require(fraction >= 0,
    s"Fraction must be nonnegative, but got ${fraction}")
  withScope {
    require(fraction >= 0.0, "Negative fraction value: " + fraction)
    if (withReplacement) {
      new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), true, seed)
    } else {
      new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), true, seed)
    }
  }
}
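- Parameters
withReplacement: whether to sample with replacement. true = with replacement, false = without replacement
fraction: the expected size of the sample, as a fraction of the RDD's size (the size of the result is not guaranteed to match it exactly)
When withReplacement=false, this is the probability of selecting each element; fraction must be in [0, 1]
When withReplacement=true, this is the expected number of times each element is selected; fraction must be greater than or equal to 0
seed: seed for the random number generator. Usually left at its default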
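To make the two meanings of fraction concrete, here is a minimal sketch on a plain Scala collection rather than an RDD. It is not Spark's implementation: the names SampleSketch, bernoulliSample, poissonSample and poisson are hypothetical, and the Poisson draw uses Knuth's multiplication method, which is only an assumption chosen for brevity.

```scala
import scala.util.Random

// Hypothetical local stand-ins for the two sampling modes; not Spark code.
object SampleSketch {
  // Without replacement: keep each element independently with probability `fraction`.
  def bernoulliSample[T](items: Seq[T], fraction: Double, seed: Long): Seq[T] = {
    require(fraction >= 0.0 && fraction <= 1.0, s"fraction must be in [0, 1], got $fraction")
    val rng = new Random(seed)
    items.filter(_ => rng.nextDouble() < fraction)
  }

  // Draw from a Poisson distribution with the given mean
  // (Knuth's multiplication method, adequate for small means).
  def poisson(mean: Double, rng: Random): Int = {
    val limit = math.exp(-mean)
    var k = 0
    var p = rng.nextDouble()
    while (p > limit) {
      k += 1
      p *= rng.nextDouble()
    }
    k
  }

  // With replacement: emit each element a Poisson(fraction)-distributed number of times,
  // so `fraction` is the expected number of copies per element and may exceed 1.
  def poissonSample[T](items: Seq[T], fraction: Double, seed: Long): Seq[T] = {
    require(fraction >= 0.0, s"fraction must be nonnegative, got $fraction")
    val rng = new Random(seed)
    items.flatMap(x => Seq.fill(poisson(fraction, rng))(x))
  }
}
```

Calling SampleSketch.bernoulliSample(List(2, 3, 7, 4, 8), 0.5, seed) never repeats an element, while SampleSketch.poissonSample(List(2, 3, 7, 4, 8), 2.0, seed) can return the same element several times. In both cases, reusing the same seed reproduces the same result.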
3. Examples
- Sampling without replacement, each element is selected with probability 0.5: fraction=0.5
val rdd=sc.parallelize(List(2,3,7,4,8))
val sampleRdd=rdd.sample(false,0.5)
sampleRdd.foreach(println)
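- Sampling with replacement, each element is expected to be selected 2 times: fraction=2
// Simple example 1: arguments are (with/without replacement, sampling fraction, random seed)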
val rdd=sc.parallelize(List(2,3,7,4,8))
val sampleRdd=rdd.sample(true,2)
sampleRdd.foreach(println)
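As a rough check on what the two examples above print, and assuming each record is drawn independently (which is what the Bernoulli and Poisson sampler names in the source suggest), the expected sample sizes for the 5-element list work out as follows. The symbols n (number of records), f (fraction) and N (size of the sampled RDD) are introduced only for this note.

```latex
\begin{align*}
\text{without replacement } (f = 0.5):\quad & N \sim \mathrm{Binomial}(n, f), & \mathbb{E}[N] &= n f = 5 \times 0.5 = 2.5 \\
\text{with replacement } (f = 2):\quad & N \sim \mathrm{Poisson}(n f), & \mathbb{E}[N] &= n f = 5 \times 2 = 10
\end{align*}
```

Individual runs can therefore print more or fewer elements than these averages. Passing an explicit seed makes a run repeatable.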
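II. takeSample
1. Description
Returns a fixed-size sampled subset of this RDD. Returns Array[T].
Note: use this method only when the resulting array is expected to be small, because all of the data is loaded into the driver's memory.
2. Source code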
def takeSample(
    withReplacement: Boolean,
    num: Int,
    seed: Long = Utils.random.nextLong): Array[T] = withScope {
  val numStDev = 10.0
  require(num >= 0, "Negative number of elements requested")
  require(num <= (Int.MaxValue - (numStDev * math.sqrt(Int.MaxValue)).toInt),
    "Cannot support a sample size > Int.MaxValue - " +
    s"$numStDev * math.sqrt(Int.MaxValue)")
  if (num == 0) {
    new Array[T](0)
  } else {
    val initialCount = this.count()
    if (initialCount == 0) {
      new Array[T](0)
    } else {
      val rand = new Random(seed)
      if (!withReplacement && num >= initialCount) {
        Utils.randomizeInPlace(this.collect(), rand)
      } else {
        val fraction = SamplingUtils.computeFractionForSampleSize(num, initialCount,
          withReplacement)
        var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
        var numIters = 0
        while (samples.length < num) {
          logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
          samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
          numIters += 1
        }
        Utils.randomizeInPlace(samples, rand).take(num)
      }
    }
  }
}
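The control flow of the source above (oversample, retry if the sample is too small, then shuffle and truncate) can be sketched on a local Vector for the without-replacement case. This is only an illustration: TakeSampleSketch, bernoulli and takeSampleWithoutReplacement are hypothetical names, and the oversampling rule of "about twice num" is an assumption standing in for SamplingUtils.computeFractionForSampleSize, whose actual formula is not reproduced here.

```scala
import scala.util.Random

// Hypothetical local sketch of the takeSample control flow; not Spark code.
object TakeSampleSketch {
  // Stand-in for RDD.sample(withReplacement = false, ...) on a local collection.
  private def bernoulli[T](items: Vector[T], fraction: Double, seed: Long): Vector[T] = {
    val rng = new Random(seed)
    items.filter(_ => rng.nextDouble() < fraction)
  }

  def takeSampleWithoutReplacement[T](items: Vector[T], num: Int, seed: Long): Vector[T] = {
    require(num >= 0, "Negative number of elements requested")
    val rand = new Random(seed)
    if (num == 0 || items.isEmpty) {
      Vector.empty
    } else if (num >= items.size) {
      // Asked for at least everything: return the whole collection in random order.
      rand.shuffle(items)
    } else {
      // Assumed oversampling rule (NOT Spark's formula): aim for about 2 * num, capped at 1.0.
      val fraction = math.min(1.0, 2.0 * num / items.size)
      var samples = bernoulli(items, fraction, rand.nextInt())
      // If the draw came up short, draw again with a fresh seed.
      while (samples.size < num) {
        samples = bernoulli(items, fraction, rand.nextInt())
      }
      // Shuffle so that truncation does not favor elements at the front.
      rand.shuffle(samples).take(num)
    }
  }
}
```

The oversampling is what lets the method return exactly num elements from a sampler that only controls the expected size. The final shuffle matters because the sampled elements come back in their original order.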
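- Parameters
withReplacement: whether to sample with replacement. true = with replacement, false = without replacement
num: size of the returned sample
seed: seed for the random number generator. Usually left at its default
3. Examples
- Sampling without replacement, requested sample size > number of elements in the RDD: returns all elements of the RDD, in random order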
val rdd=sc.parallelize(List(2,3,7,4,8))
val sampleRdd=rdd.takeSample(false,10)
sampleRdd.foreach(println)
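- Sampling without replacement, requested sample size <= number of elements in the RDD: returns exactly the requested number of elements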
val rdd=sc.parallelize(List(2,3,7,4,8))
val sampleRdd=rdd.takeSample(false,3)
sampleRdd.foreach(println)
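- Sampling with replacement: returns exactly the requested number of elements (the same element may appear more than once)
val rdd=sc.parallelize(List(2,3,7,4,8))
val sampleRdd=rdd.takeSample(true,3)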
sampleRdd.foreach(println)