MLlib documentation notes, part 1

spark.mllib contains the original API built on top of RDDs.

spark.ml provides a higher-level API built on top of DataFrames for constructing ML pipelines.

MLlib supports local vectors and local matrices stored on a single machine.

1. Data types

1) Local vector

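Sparse vector: backed by two parallel arrays, the indices and the values of the nonzero entries.

Dense vector: backed by a single double array that holds every entry.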

import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

// Create a dense vector (1.0, 0.0, 3.0).
Vector dv = Vectors.dense(1.0, 0.0, 3.0);
// Create a sparse vector (1.0, 0.0, 3.0) by specifying the indices and values of its nonzero entries.
// Arguments: vector size, indices of the nonzero entries, values of the nonzero entries.
Vector sv = Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0});
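To make the sparse representation concrete, here is a minimal plain-Java sketch that expands a (size, indices, values) triple into a dense array. The class and method names (SparseVectorSketch, toDense) are hypothetical and only for illustration; this is not how MLlib implements its vectors.

```java
import java.util.Arrays;

// Plain-Java illustration only; not MLlib's SparseVector implementation.
final class SparseVectorSketch {
    // Expands (size, indices, values) into a dense array of length `size`.
    static double[] toDense(int size, int[] indices, double[] values) {
        if (indices.length != values.length) {
            throw new IllegalArgumentException("indices and values must have the same length");
        }
        double[] dense = new double[size]; // Java zero-initializes the array
        for (int k = 0; k < indices.length; k++) {
            dense[indices[k]] = values[k]; // place each nonzero entry at its index
        }
        return dense;
    }

    public static void main(String[] args) {
        // Same triple as the Vectors.sparse call above.
        double[] dense = toDense(3, new int[] {0, 2}, new double[] {1.0, 3.0});
        System.out.println(Arrays.toString(dense)); // prints [1.0, 0.0, 3.0]
    }
}
```

Only two of the three entries are stored, which is where the savings come from when most entries are zero.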

2) Labeled point

A labeled point is a local vector, either dense or sparse, associated with a label/response. In MLlib, labeled points are used in supervised learning algorithms. The label is stored as a double, so labeled points can be used in both regression and classification.

Sparse data

Sparse training data is commonly stored as text in LIBSVM format, with one labeled sparse feature vector per line:

label index1:value1 index2:value2 ...

where the indices are one-based and in ascending order. After loading, the feature indices are converted to zero-based.
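The index conversion is easy to get wrong, so here is a rough plain-Java sketch of parsing one such line. The class LibSvmLineSketch and its fields are hypothetical names for illustration; the real loader is MLUtils.loadLibSVMFile, and this sketch makes no claim about how that method is implemented.

```java
import java.util.Arrays;

// Plain-Java illustration only; MLUtils.loadLibSVMFile is the real loader.
final class LibSvmLineSketch {
    final double label;
    final int[] indices;   // zero-based after parsing
    final double[] values;

    LibSvmLineSketch(double label, int[] indices, double[] values) {
        this.label = label;
        this.indices = indices;
        this.values = values;
    }

    // Parses "label index1:value1 index2:value2 ..." with one-based, ascending indices.
    static LibSvmLineSketch parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        double label = Double.parseDouble(tokens[0]);
        int[] indices = new int[tokens.length - 1];
        double[] values = new double[tokens.length - 1];
        int previous = -1;
        for (int i = 1; i < tokens.length; i++) {
            String[] pair = tokens[i].split(":");
            int index = Integer.parseInt(pair[0]) - 1; // one-based -> zero-based
            // Rejects non-ascending indices, and also a file index of 0.
            if (index <= previous) {
                throw new IllegalArgumentException("indices must be one-based and ascending");
            }
            indices[i - 1] = index;
            values[i - 1] = Double.parseDouble(pair[1]);
            previous = index;
        }
        return new LibSvmLineSketch(label, indices, values);
    }

    public static void main(String[] args) {
        LibSvmLineSketch point = parse("1 1:0.5 3:2.0");
        System.out.println(point.label);                    // prints 1.0
        System.out.println(Arrays.toString(point.indices)); // prints [0, 2]
        System.out.println(Arrays.toString(point.values));  // prints [0.5, 2.0]
    }
}
```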

Example:

MLUtils.loadLibSVMFile reads training examples stored in LIBSVM format.

Refer to the MLUtils Java docs for details on the API.

import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;
import org.apache.spark.api.java.JavaRDD;

// jsc is assumed to be an existing JavaSparkContext.
JavaRDD<LabeledPoint> examples =
  MLUtils.loadLibSVMFile(jsc.sc(), "data/mllib/sample_libsvm_data.txt").toJavaRDD();

2. Local matrix

Dense matrix: the entry values are stored in a single one-dimensional double array in column-major order, together with the matrix size (number of rows and columns).

Sparse matrix: the nonzero entry values are stored in Compressed Sparse Column (CSC) format, also in column-major order.
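For the dense case, column-major order means entry (i, j) of a matrix with numRows rows sits at position i + numRows * j in the array. A small plain-Java sketch of that lookup follows; ColumnMajorSketch and get are hypothetical names, not part of MLlib.

```java
// Plain-Java illustration only; not MLlib's DenseMatrix implementation.
final class ColumnMajorSketch {
    // Returns entry (i, j) of a matrix with numRows rows stored column by column.
    static double get(int numRows, double[] values, int i, int j) {
        return values[i + numRows * j];
    }

    public static void main(String[] args) {
        // The 3 x 2 matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)):
        // first column (1, 3, 5), then second column (2, 4, 6).
        double[] values = {1.0, 3.0, 5.0, 2.0, 4.0, 6.0};
        System.out.println(get(3, values, 0, 1)); // prints 2.0
        System.out.println(get(3, values, 2, 0)); // prints 5.0
    }
}
```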

Example:

import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.Matrices;

// Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)).
// Matrices.dense(numRows, numCols, values), with values listed in column-major order.
Matrix dm = Matrices.dense(3, 2, new double[] {1.0, 3.0, 5.0, 2.0, 4.0, 6.0});

// Create a sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0)).
// Matrices.sparse(numRows, numCols, colPtrs, rowIndices, values)
Matrix sm = Matrices.sparse(3, 2, new int[] {0, 1, 3}, new int[] {0, 2, 1}, new double[] {9, 6, 8});
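The three CSC arrays are the hardest part to read, so here is a plain-Java sketch that expands them into a dense two-dimensional array. Column j owns the entries at positions colPtrs[j] up to (but not including) colPtrs[j + 1] in rowIndices and values. CscSketch and toDense are hypothetical names for illustration, not MLlib code.

```java
import java.util.Arrays;

// Plain-Java illustration only; not MLlib's SparseMatrix implementation.
final class CscSketch {
    // Expands CSC arrays into a dense numRows x numCols array.
    static double[][] toDense(int numRows, int numCols,
                              int[] colPtrs, int[] rowIndices, double[] values) {
        double[][] dense = new double[numRows][numCols];
        for (int j = 0; j < numCols; j++) {
            // Entries colPtrs[j] .. colPtrs[j + 1] - 1 belong to column j.
            for (int k = colPtrs[j]; k < colPtrs[j + 1]; k++) {
                dense[rowIndices[k]][j] = values[k];
            }
        }
        return dense;
    }

    public static void main(String[] args) {
        // Same arrays as the Matrices.sparse call above.
        double[][] dense = toDense(3, 2,
            new int[] {0, 1, 3}, new int[] {0, 2, 1}, new double[] {9, 6, 8});
        System.out.println(Arrays.deepToString(dense));
        // prints [[9.0, 0.0], [0.0, 8.0], [0.0, 6.0]]
    }
}
```

Reading the example: column 0 holds one entry (row 0, value 9), and column 1 holds two entries (row 2, value 6 and row 1, value 8).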