Spark sample

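The sample operator is typically used to draw a random sample from an RDD. It is a transformation operator, so it is lazy and returns a new RDD instead of triggering a job.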

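Parameter: withReplacement = true means sampling with replacement, so the same element can be selected more than once; false means sampling without replacement.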

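Parameter: fraction is the sampling ratio. Without replacement it is the probability that each element is selected and must lie in [0, 1]; with replacement it is the expected number of times each element is selected and must be >= 0. The size of the result is only approximately fraction times the size of the RDD.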

Usage:

  data.sample(withReplacement = true, fraction = 0.5).collect().foreach(println(_))
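
For trying the call above end to end, here is a minimal sketch. It assumes Spark (spark-sql) is on the classpath and runs in local mode; the object name SampleDemo, the helper sampleBoth, the app name and the 1 to 100 input are all made up for illustration. It passes an explicit seed so that repeated runs give the same sample.

```scala
import org.apache.spark.sql.SparkSession

object SampleDemo {
  // Hypothetical helper: samples the numbers 1..100 both ways with the same seed.
  // Returns (sample without replacement, sample with replacement).
  def sampleBoth(spark: SparkSession, seed: Long): (Array[Int], Array[Int]) = {
    val data = spark.sparkContext.parallelize(1 to 100)
    // Without replacement: each element appears at most once in the result.
    val noRepl = data.sample(withReplacement = false, fraction = 0.5, seed = seed).collect()
    // With replacement: an element may appear several times in the result.
    val withRepl = data.sample(withReplacement = true, fraction = 0.5, seed = seed).collect()
    (noRepl, withRepl)
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only; master and app name are illustrative.
    val spark = SparkSession.builder().appName("sample-demo").master("local[2]").getOrCreate()
    try {
      val (noRepl, withRepl) = sampleBoth(spark, seed = 42L)
      println(s"without replacement: ${noRepl.length} elements")
      println(s"with replacement: ${withRepl.length} elements")
    } finally {
      spark.stop()
    }
  }
}
```

Both counts should land near 50 but are not guaranteed to equal it, because fraction is a per-element probability (or expected count), not an exact size.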

Source code:

def sample(
      withReplacement: Boolean,
      fraction: Double,
      seed: Long = Utils.random.nextLong): RDD[T] = {
    require(fraction >= 0,
      s"Fraction must be nonnegative, but got ${fraction}")

    withScope {
      require(fraction >= 0.0, "Negative fraction value: " + fraction)
      if (withReplacement) {
        new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), true, seed)
      } else {
        new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), true, seed)
      }
    }
  }
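
The source picks PoissonSampler when withReplacement is true and BernoulliSampler otherwise. The idea behind the two can be sketched in plain Scala without Spark. This is a simplified illustration, not Spark's implementation: the object SamplerSketch and its methods are hypothetical names, and the Poisson draw uses Knuth's multiplication method, which is only practical for small means.

```scala
import scala.util.Random

object SamplerSketch {
  // Bernoulli idea (without replacement): keep each element independently
  // with probability `fraction`, so no element can appear twice.
  def bernoulliSample[T](data: Seq[T], fraction: Double, seed: Long): Seq[T] = {
    require(fraction >= 0.0 && fraction <= 1.0, s"fraction must be in [0, 1], got $fraction")
    val rng = new Random(seed)
    data.filter(_ => rng.nextDouble() < fraction)
  }

  // Draws one value from Poisson(mean) with Knuth's multiplication method:
  // count how many extra uniforms are needed before the product drops to e^(-mean).
  def nextPoisson(mean: Double, rng: Random): Int = {
    val limit = math.exp(-mean)
    var k = 0
    var p = rng.nextDouble()
    while (p > limit) {
      k += 1
      p *= rng.nextDouble()
    }
    k
  }

  // Poisson idea (with replacement): emit each element k times, where k is
  // drawn from Poisson(fraction), so `fraction` is the expected number of copies.
  def poissonSample[T](data: Seq[T], fraction: Double, seed: Long): Seq[T] = {
    require(fraction >= 0.0, s"fraction must be nonnegative, got $fraction")
    val rng = new Random(seed)
    data.flatMap(x => Seq.fill(nextPoisson(fraction, rng))(x))
  }
}
```

This also shows why fraction may exceed 1 only with replacement: SamplerSketch.poissonSample(1 to 1000, 2.0, 7L) returns roughly 2000 elements with repeats, while a keep-probability above 1 makes no sense for the Bernoulli case.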
Original article: https://www.cnblogs.com/students/p/14230285.html