Data Mining: A Comparison of Python, Spark, and Flink Machine Learning Development Tools

Different Tools

Among the common machine-learning tools, R and Python are the usual choices for general data mining and statistical analysis, while Flink and Spark are used when the data volume is large.
Understanding these tools and being familiar with how to use them makes it easier to explore data and implement solutions.
This article takes the Python data-mining and machine-learning workflow as the baseline and compares it with the machine learning packages of Spark and Flink, so that familiarity with one of them carries over to the others.

Python

 Typical workflow: acquire data -> data preprocessing -> model training -> model evaluation -> prediction / classification
scikit-learn: built on NumPy, SciPy, and matplotlib
  The pipeline mechanism wraps all of these steps into one managed workflow (streaming workflows with pipelines).
      Many processing steps are chained together; for example, feature extraction, normalization, and classification are organized into a typical machine-learning workflow. This is an innovation in programming technique rather than in algorithms.
     Key concepts: Transformer, Estimator, Pipeline
  Details
     01. Transformer (e.g. StandardScaler, MinMaxScaler)
     02. Estimator (e.g. LinearRegression, LogisticRegression, LASSO, Ridge);
        all machine-learning algorithm models are called estimators
     03. Pipeline: combines Transformers and an Estimator into one larger model
        a Pipeline turns the preprocessing steps and the model into a single new composite model
        calling fit and predict once trains and predicts with all algorithm models in the pipeline
        it can be combined with grid search for parameter selection (a minimal sketch follows below)
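
  A minimal sketch of the three concepts working together, using toy data from make_classification (all parameter values are illustrative):

      from sklearn.datasets import make_classification
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import GridSearchCV, train_test_split
      from sklearn.pipeline import Pipeline
      from sklearn.preprocessing import StandardScaler

      X, y = make_classification(n_samples=500, n_features=10, random_state=0)
      X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

      # Transformer + Estimator chained into a Pipeline (which is itself an estimator)
      pipe = Pipeline([("scaler", StandardScaler()),
                       ("clf", LogisticRegression(max_iter=1000))])

      # grid search over the pipeline's parameters, using the "step__param" naming convention
      grid = GridSearchCV(pipe, param_grid={"clf__C": [0.1, 1.0, 10.0]}, cv=5)
      grid.fit(X_train, y_train)
      print(grid.best_params_, grid.score(X_test, y_test))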
 Example
     eg: from sklearn.pipeline import Pipeline
     Steps:
      Data normalization               from sklearn import preprocessing
      Feature selection                from sklearn.ensemble import ExtraTreesClassifier
      Applying an algorithm            from sklearn.linear_model import LogisticRegression
      Tuning algorithm parameters      from sklearn.model_selection import GridSearchCV   (sklearn.grid_search was removed in scikit-learn 0.20)
     One-hot encoding
     Dataset splitting
     Model:
      # fit the model
      model.fit(X_train, y_train)
      # make predictions with the model
      model.predict(X_test)
      # get the model's parameters
      model.get_params()
     Model saving and loading
      import joblib   # older scikit-learn versions used: from sklearn.externals import joblib (removed in 0.23)
      # save the model
      joblib.dump(model, 'model.pickle')
      # load the model
      model = joblib.load('model.pickle')
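
  Putting the outline above together, a hedged end-to-end sketch on synthetic data (every column name and parameter value is illustrative):

      import joblib
      import numpy as np
      import pandas as pd
      from sklearn import preprocessing
      from sklearn.ensemble import ExtraTreesClassifier
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import train_test_split

      # toy data: two numeric features, one categorical feature, a binary label
      rng = np.random.RandomState(0)
      df = pd.DataFrame({"age": rng.randint(18, 70, 300),
                         "income": rng.normal(5000, 1500, 300),
                         "city": rng.choice(["a", "b", "c"], 300)})
      y = rng.randint(0, 2, 300)

      # one-hot encoding of the categorical column, then normalization
      X = pd.get_dummies(df, columns=["city"])
      X = preprocessing.StandardScaler().fit_transform(X)

      # feature selection signal: importances from a tree ensemble
      print(ExtraTreesClassifier(n_estimators=100).fit(X, y).feature_importances_)

      # dataset split, fit, predict, inspect parameters
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
      model = LogisticRegression(max_iter=1000)
      model.fit(X_train, y_train)
      print(model.predict(X_test)[:10])
      print(model.get_params())

      # model saving and loading
      joblib.dump(model, "model.pickle")
      model = joblib.load("model.pickle")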

Spark

1. Basic concepts
org.apache.spark.ml 
PipelineStage
A stage in a pipeline, either an [[Estimator]] or a [[Transformer]].
Transformer
transform one dataset into another.
Estimator
estimators that fit models to data.
Model
A fitted model, i.e., a [[Transformer]] produced by an [[Estimator]].
Pipeline
A Pipeline consists of a sequence of stages, each of which is either an [[Estimator]] or a [[Transformer]]

PipelineModel
 object PipelineModel extends MLReadable[PipelineModel]
Parameter
 Used to set the parameters of a Transformer or an Estimator.
VectorAssembler
     A feature transformer that merges multiple columns into a single vector column.
CrossValidatorModel
     Params for [[CrossValidator]] and [[CrossValidatorModel]].
     Spark provides model selection tools in the org.apache.spark.ml.tuning package; they swap in different parameter values and compare the resulting model outputs.
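
 These building blocks are also available from the Python API (pyspark.ml). A minimal sketch, assuming an existing DataFrame df with illustrative numeric columns: a VectorAssembler (a Transformer) builds the single "features" vector column that estimators expect, and a stage's Params can be inspected through its param methods:

      from pyspark.ml.feature import VectorAssembler

      # df is an assumed, pre-existing DataFrame; the column names are illustrative
      assembler = VectorAssembler(inputCols=["age", "income", "clicks"], outputCol="features")
      assembled = assembler.transform(df)   # adds the "features" vector column
      print(assembler.explainParams())      # each Param with its doc and current value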

2. Spark Datasets

randomSplit
Randomly splits this Dataset with the provided weights.

 randomSplitAsList
 Returns a Java list that contains randomly split Dataset with the provided weights.
Input:   weights: Array[Double]
         weights: List[Double]
Returns: Array[Dataset] or a Java List
Example:
 Down-sampling the positive class (when there are too many positive samples):
                       double[] weights = {pos_rate, 1.0 - pos_rate};
                       Dataset<Row>[] arr = posSet.randomSplit(weights);
                       posSet = arr[0];
  Balancing the positive and negative samples (see the PySpark sketch below):
// merge the positive and negative sample data
                   Dataset<Row> dataUse = dataPos_sample.union(dataNeg_sample);
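
 The same down-sampling and re-balancing, sketched in PySpark (pos_set, neg_sample, and pos_rate are assumed to already exist; randomSplit returns one DataFrame per weight):

      # keep roughly a pos_rate fraction of the oversized positive class, then re-balance
      pos_sample, _ = pos_set.randomSplit([pos_rate, 1.0 - pos_rate], seed=42)
      data_use = pos_sample.union(neg_sample)   # union requires both DataFrames to have the same schema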
// Define each PipelineStage in the Pipeline, such as feature extraction, feature transformation, and model training.
  With these problem-specific Transformers and Estimators,
 we can organize the PipelineStages according to the concrete processing logic and create a Pipeline.
 Each stage is either a Transformer or an Estimator.
 The stages are executed in order, and the input DataFrame is transformed as it passes through each stage.
 Pipeline pipeline = new Pipeline().setStages(new PipelineStage[]{stage1, stage2, stage3, …});
 Then call the Pipeline instance's fit method with the training dataset as its argument to start processing the source training data in a streaming fashion.

// build the Pipeline from the array of stages
    Pipeline pipeline = new Pipeline().setStages(pipeArr);
    PipelineModel model = pipeline.fit(train_data);

    Loading a saved model: PipelineModel model2 = PipelineModel.load(path);
 Obtaining the best model parameters with CrossValidator -- model selection via cross-validation (a PySpark sketch follows the example below)
  CrossValidator rf_cv = new CrossValidator().setEstimator(pipeline);  // an evaluator and an estimator parameter grid must also be set
  CrossValidatorModel rf_model = rf_cv.fit(train_data);
    Loading a saved model: CrossValidatorModel rf_model2 = CrossValidatorModel.load(path);
	  
 eg: // Chain indexers and tree in a Pipeline.
 Pipeline pipeline = new Pipeline()
  .setStages(new PipelineStage[]{labelIndexer, featureIndexer, dt, labelConverter});
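
 A hedged PySpark sketch of the same flow: build a Pipeline, select a model by cross-validation, then persist and reload it. Here train_data, its "label"/"features" columns, and the save paths are assumptions, and the single LogisticRegression stage stands in for a longer chain of indexers and assemblers.

      from pyspark.ml import Pipeline, PipelineModel
      from pyspark.ml.classification import LogisticRegression
      from pyspark.ml.evaluation import BinaryClassificationEvaluator
      from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, ParamGridBuilder

      lr = LogisticRegression(featuresCol="features", labelCol="label")
      pipeline = Pipeline(stages=[lr])                 # any Transformers would go before the estimator

      # plain pipeline: fit, save, reload
      model = pipeline.fit(train_data)
      model.write().overwrite().save("pipeline_model_path")
      model2 = PipelineModel.load("pipeline_model_path")

      # model selection by cross-validation: a parameter grid and an evaluator are required
      grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
      cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                          evaluator=BinaryClassificationEvaluator(labelCol="label"),
                          numFolds=3)
      cv_model = cv.fit(train_data)
      cv_model.write().overwrite().save("cv_model_path")
      cv_model2 = CrossValidatorModel.load("cv_model_path")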

Flink

org.apache.flink.ml.api
PipelineStage
    Base class for a stage in a pipeline; it does not have any actual functionality of its own.
    Its subclasses must be either an Estimator or a Transformer.
Transformer
       A transformer is a {@link PipelineStage} that transforms an input {@link Table} to a result {@link Table}.
Estimator
        Estimators are {@link PipelineStage}s responsible for training and generating machine learning models.
Model
       A model is an ordinary {@link Transformer} except how it is created.   
 Pipeline
       A pipeline is a linear workflow which chains {@link Estimator}s and {@link Transformer}s to execute an algorithm.
     can also be used as a {@link PipelineStage} in another pipeline
   
 Parameter handling: Params, WithParams, ParamInfoFactory, ParamInfo
com.alibaba.alink.pipeline (Alink, Alibaba's machine learning library built on Flink)
 Pipeline
     A pipeline is a linear workflow which chains {@link EstimatorBase}s and {@link TransformerBase}s to execute an algorithm.
     public class Pipeline extends EstimatorBase<Pipeline, PipelineModel> 
 PipelineModel
      public class PipelineModel extends ModelBase<PipelineModel> implements LocalPredictable {
 PipelineStageBase
      The base class for a stage in a pipeline, either an [[EstimatorBase]] or a [[TransformerBase]].
 EstimatorBase
    public abstract class EstimatorBase<E extends EstimatorBase<E, M>, M extends ModelBase<M>> extends PipelineStageBase<E> implements Estimator<E, M>
 TransformerBase 
     public abstract class TransformerBase<T extends TransformerBase<T>>  extends PipelineStageBase<T> implements Transformer<T>
 VectorAssembler
     VectorAssembler is a transformer that combines a given list of columns into a single vector column.

References

Source code
Original article: https://www.cnblogs.com/ytwang/p/13854336.html