Spark MLlib Programming API Getting Started Series: Feature Selection with the R Model Formula (RFormula)

No preamble, straight to the practical part!

  The common feature selectors are VectorSlicer (vector slicing), RFormula (R model formula), and ChiSqSelector (chi-squared feature selection).

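  RFormula turns the columns of a dataset into features according to an R model formula. Its output is a feature vector column and a label column of type Double. For an introduction to R model formulae, see: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html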
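  To make the idea concrete, here is a minimal plain-Scala sketch of what a formula such as clicked ~ country + hour asks for. It is not Spark's implementation. The names FormulaEncodingSketch, ClickRow, parseFormula and encode are made up for illustration. The sketch assumes that string columns are one-hot encoded with one category dropped as the reference level and that numeric columns are cast to Double, which is how the Spark documentation describes RFormula. It also assumes the categories are indexed alphabetically; the order Spark actually assigns may differ, so the vectors it prints can differ from these.

```scala
// Plain-Scala illustration only; this is NOT Spark's RFormula implementation.
object FormulaEncodingSketch {

  // Hypothetical row type mirroring the columns used in this article.
  final case class ClickRow(id: Int, country: String, hour: Int, clicked: Double)

  // Hypothetical helper: split "label ~ term1 + term2" into (label, terms).
  // Only handles the "~" and "+" operators.
  def parseFormula(formula: String): (String, List[String]) = {
    val parts = formula.split("~").map(_.trim)
    (parts(0), parts(1).split("\\+").map(_.trim).toList)
  }

  // Encode rows for the formula "clicked ~ country + hour".
  // Assumption: categories are sorted alphabetically and the last one is the
  // dropped reference level. Spark's own index order may differ.
  def encode(rows: Seq[ClickRow]): Seq[(Vector[Double], Double)] = {
    val levels = rows.map(_.country).distinct.sorted // e.g. CA, NZ, US
    val kept = levels.init                           // drop the last level
    rows.map { r =>
      val oneHot = kept.map(level => if (level == r.country) 1.0 else 0.0)
      // Features: one-hot slots for country, then hour cast to Double.
      // Label: the clicked column, already a Double.
      ((oneHot :+ r.hour.toDouble).toVector, r.clicked)
    }
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      ClickRow(7, "US", 18, 1.0),
      ClickRow(8, "CA", 12, 0.0),
      ClickRow(9, "NZ", 15, 0.0)
    )
    println(parseFormula("clicked ~ country + hour"))
    encode(rows).foreach(println)
  }
}
```

  With the three sample rows used later in this article, the sketch produces a three-element feature vector per row: two one-hot slots for country, followed by hour as a Double. The id column does not appear in the formula, so it is left out of the features.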

Writing the code

  RFormula.scala

package zhouls.bigdata.DataFeatureSelection


import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.ml.feature.RFormula // the RFormula feature selector from the ml package

 
/**
 * By  zhouls
 */
object RFormula extends App {
  
    val conf = new SparkConf().setMaster("local").setAppName("RFormula")
    val sc = new SparkContext(conf)
    
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._
    
    // Build the dataset
    val dataset = sqlContext.createDataFrame(Seq(
      (7, "US", 18, 1.0),
      (8, "CA", 12, 0.0),
      (9, "NZ", 15, 0.0)
    )).toDF("id", "country", "hour", "clicked") // name the columns of the DataFrame
    dataset.select("id", "country", "hour", "clicked").show()
    
    // To predict clicked from country and hour,
    // build an RFormula with the formula expression clicked ~ country + hour
    val formula = new RFormula().setFormula("clicked ~ country + hour").setFeaturesCol("features").setLabelCol("label")
    // Fit on the dataset, then generate the feature vector and the label
    val output = formula.fit(dataset).transform(dataset)
    output.select("id", "country", "hour", "clicked", "features", "label").show()
 
}

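  The input DataFrame (screenshot of the first show() output) becomes the output DataFrame with the added features and label columns (screenshot of the second show() output).
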

Original article: https://www.cnblogs.com/zlslch/p/7396185.html