Computing with scikit-learn of sklearn

Computing with scikit-learn

https://scikit-learn.org/stable/computing.html

This chapter covers the computational-performance issues that come up when using sklearn.

Strategies to scale computationally: bigger data

For some applications, the sheer number of samples, the number of features, or strict speed requirements pose a challenge to traditional approaches.

For some applications the amount of examples, features (or both) and/or the speed at which they need to be processed are challenging for traditional approaches. In these cases scikit-learn has a number of options you can consider to make your system scale.

Scaling with instances using out-of-core learning

https://scikit-learn.org/stable/computing/scaling_strategies.html

For big-data scenarios, consider online / out-of-core learning; of course, the model has to support it.

A traditional workflow loads all samples into memory at once and then iterates over them during training.

Online / out-of-core learning needs:

(1) a way to stream instances into the system;

(2) feature extraction that also works in this setting: it must operate instance by instance and not depend on corpus-wide statistics; for example, inverse document frequency (IDF) does not fit, because it needs counts over the whole corpus;

(3) an incremental learning algorithm.

Out-of-core (or “external memory”) learning is a technique used to learn from data that cannot fit in a computer’s main memory (RAM).

Here is a sketch of a system designed to achieve this goal:

  1. a way to stream instances

  2. a way to extract features from instances

  3. an incremental algorithm

The following estimators support incremental learning through the partial_fit interface.

The scikit-learn page linked above gives the full list of incremental estimators for the different tasks (classification, regression, clustering, decomposition); all of them expose partial_fit.
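As a minimal end-to-end sketch (not taken from the scikit-learn docs; the toy documents and labels are invented for illustration), the three pieces can be wired together with HashingVectorizer, which extracts features statelessly per instance, and SGDClassifier, whose partial_fit updates the model one mini-batch at a time:

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# (1) a way to stream instances: a generator yielding small batches
def stream_batches():
    corpus = [("good movie", 1), ("bad film", 0),
              ("great plot", 1), ("awful acting", 0)]
    for i in range(0, len(corpus), 2):
        texts, labels = zip(*corpus[i:i + 2])
        yield list(texts), np.array(labels)

# (2) stateless, instance-oriented feature extraction (no corpus-wide statistics such as IDF)
vectorizer = HashingVectorizer(n_features=2 ** 18)

# (3) an incremental estimator exposing partial_fit
clf = SGDClassifier()
all_classes = np.array([0, 1])  # the full set of classes must be given on the first call

for texts, labels in stream_batches():
    X = vectorizer.transform(texts)
    clf.partial_fit(X, labels, classes=all_classes)

print(clf.predict(vectorizer.transform(["good acting"])))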

Computational Performance

https://scikit-learn.org/stable/computing/computational_performance.html

Model training is time-consuming, but it is the prediction performance of the model that is critical for the application.

Training performance also needs to be considered, but it does not take place in the production environment, so its impact on the application is small.

The two key performance metrics of a model in production are therefore:

(1) prediction latency

(2) prediction throughput

For some applications the performance (mainly latency and throughput at prediction time) of estimators is crucial. It may also be of interest to consider the training throughput but this is often less important in a production setup (where it often takes place offline).

We will review here the orders of magnitude you can expect from a number of scikit-learn estimators in different contexts and provide some tips and tricks for overcoming performance bottlenecks.

Prediction latency is measured as the elapsed time necessary to make a prediction (e.g. in micro-seconds). Latency is often viewed as a distribution and operations engineers often focus on the latency at a given percentile of this distribution (e.g. the 90th percentile).

Prediction throughput is defined as the number of predictions the software can deliver in a given amount of time (e.g. in predictions per second).
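As a small illustration (not from the docs; the Ridge model and the synthetic data below are arbitrary choices), both quantities can be estimated by timing predictions and looking at percentiles of the latency distribution:

import time
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=1000, n_features=50, random_state=0)
model = Ridge().fit(X, y)

# latency: time single-sample (atomic) predictions and report percentiles
latencies = []
for i in range(500):
    x = X[i % len(X)].reshape(1, -1)
    start = time.perf_counter()
    model.predict(x)
    latencies.append(time.perf_counter() - start)

latencies = np.array(latencies)
print("p50 latency: %.1f us" % (1e6 * np.percentile(latencies, 50)))
print("p90 latency: %.1f us" % (1e6 * np.percentile(latencies, 90)))

# throughput: predictions delivered per second in bulk mode
start = time.perf_counter()
model.predict(X)
print("throughput: %.0f predictions/s" % (len(X) / (time.perf_counter() - start)))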

An important aspect of performance optimization is also that it can hurt prediction accuracy. Indeed, simpler models (e.g. linear instead of non-linear, or with fewer parameters) often run faster but are not always able to take into account the same exact properties of the data as more complex ones.

Prediction latency depends on the following factors.

One of the most straight-forward concerns one may have when using/choosing a machine learning toolkit is the latency at which predictions can be made in a production environment.

The main factors that influence the prediction latency are
  1. Number of features

  2. Input data representation and sparsity

  3. Model complexity

  4. Feature extraction
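To illustrate the second factor (input data representation and sparsity), here is a hedged sketch, not taken from the docs: the same data is fed to a linear model once as a dense array and once as a scipy CSR matrix; when most values are zero, the sparse representation lets the dot products skip the zero entries.

import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X_dense = rng.rand(5000, 300)
X_dense[X_dense < 0.99] = 0.0        # make the data roughly 99% zeros
y = rng.randint(0, 2, size=5000)

X_sparse = sp.csr_matrix(X_dense)    # same values, CSR (sparse) representation

clf = SGDClassifier(random_state=0).fit(X_sparse, y)

# both representations are accepted at predict time
pred_dense = clf.predict(X_dense)
pred_sparse = clf.predict(X_sparse)
print("agreement between dense and sparse input:", (pred_dense == pred_sparse).mean())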

For prediction throughput, linear models are the fastest, support vector machines come next, and random forests are the slowest.

Another important metric to care about when sizing production systems is the throughput i.e. the number of predictions you can make in a given amount of time. Here is a benchmark from the Prediction Latency example that measures this quantity for a number of estimators on synthetic data:

[Figure: prediction throughput benchmark for several estimators on synthetic data]

These throughputs are achieved on a single process. An obvious way to increase the throughput of your application is to spawn additional instances (usually processes in Python because of the GIL) that share the same model. One might also add machines to spread the load. A detailed explanation on how to achieve this is beyond the scope of this documentation though.
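The docs do not show code for this, so here is a rough sketch under the stated assumption that each worker process ends up with its own (pickled) copy of the fitted model rather than truly sharing it; batches of inputs are spread over a process pool to raise throughput despite the GIL:

from concurrent.futures import ProcessPoolExecutor
from functools import partial

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def predict_batch(model, batch):
    # each worker receives a copy of the fitted model along with its batch
    return model.predict(batch)

if __name__ == "__main__":
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

    batches = np.array_split(X, 8)
    # additional processes work around the GIL; each handles a slice of the load
    with ProcessPoolExecutor(max_workers=4) as pool:
        predictions = np.concatenate(list(pool.map(partial(predict_batch, model), batches)))
    print(predictions.shape)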

Parallelism, resource management, and configuration

https://scikit-learn.org/stable/computing/parallelism.html

Speeding up model training calls for parallelism, i.e. spreading the expensive computation over multiple CPU cores. scikit-learn has the following two components:

(1) the joblib library: the number of processes or threads is controlled with the n_jobs parameter; this covers estimator-level parallelism such as model training.

(2) OpenMP: parallelism inside the low-level (numerical) routines written in C or Cython; in addition, numpy's linear-algebra routines may themselves run in parallel through the BLAS library they are built on.

Some scikit-learn estimators and utilities can parallelize costly operations using multiple CPU cores, thanks to the following components:

  • via the joblib library. In this case the number of threads or processes can be controlled with the n_jobs parameter.

  • via OpenMP, used in C or Cython code.

In addition, some of the numpy routines that are used internally by scikit-learn may also be parallelized if numpy is installed with specific numerical libraries such as MKL, OpenBLAS, or BLIS.

We describe these 3 scenarios in the following subsections.

Joblib-based parallelism

joblib supports both multi-processing and multi-threading.

Because of the GIL, multi-threading is the right choice only in the cases where the parallel code releases the GIL; otherwise multi-processing is used.

When the underlying implementation uses joblib, the number of workers (threads or processes) that are spawned in parallel can be controlled via the n_jobs parameter.

Joblib is able to support both multi-processing and multi-threading. Whether joblib chooses to spawn a thread or a process depends on the backend that it’s using.

Scikit-learn generally relies on the loky backend, which is joblib’s default backend. Loky is a multi-processing backend. When doing multi-processing, in order to avoid duplicating the memory in each process (which isn’t reasonable with big datasets), joblib will create a memmap that all processes can share, when the data is bigger than 1MB.

In some specific cases (when the code that is run in parallel releases the GIL), scikit-learn will indicate to joblib that a multi-threading backend is preferable.

As a user, you may control the backend that joblib will use (regardless of what scikit-learn recommends) by using a context manager:

The interface below controls the parallel backend (threads or processes) and the number of parallel workers.

from joblib import parallel_backend

with parallel_backend('threading', n_jobs=2):
    # Your scikit-learn code here
    ...

OpenMP-based parallelism

OpenMP (Open Multi-Processing) is a specification, supported on many platforms, that standardizes thread-level parallelism in a language at the compiler level.

OpenMP is used to parallelize code written in Cython or C, relying on multi-threading exclusively. By default (and unless joblib is trying to avoid oversubscription), the implementation will use as many threads as possible.
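As a hedged illustration (the KMeans estimator is an arbitrary choice, and threadpoolctl is a separate helper package used by recent scikit-learn releases), the number of OpenMP threads can be capped either globally via the OMP_NUM_THREADS environment variable, set before the compiled extensions are loaded, or locally at runtime with threadpoolctl:

import os

# option 1: cap OpenMP threads process-wide; this must happen before numpy /
# scikit-learn (and their compiled extensions) are imported
os.environ["OMP_NUM_THREADS"] = "2"

import numpy as np
from sklearn.cluster import KMeans
from threadpoolctl import threadpool_limits

X = np.random.RandomState(0).rand(10000, 20)

# option 2: limit the OpenMP (and BLAS) thread pools only inside this block
with threadpool_limits(limits=2):
    KMeans(n_clusters=8, n_init=10).fit(X)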

https://en.wikipedia.org/wiki/OpenMP

The application programming interface (API) OpenMP (Open Multi-Processing) supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran, on many platforms, instruction-set architectures and operating systems, including Solaris, AIX, HP-UX, Linux, macOS, and Windows. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior.

https://zhuanlan.zhihu.com/p/269612128

OpenMP is a set of compiler directives, proposed by the OpenMP Architecture Review Board and now widely adopted, for programming multiprocessor shared-memory parallel systems.

Languages that support OpenMP: C, C++, and Fortran; the quoted article only discusses C++.

Compilers that support OpenMP: GNU Compiler, Intel Compiler, and others.

OpenMP offers a high-level abstract description of parallel algorithms: the programmer inserts dedicated pragmas into the source code to state the intent, and the compiler can then parallelize the program automatically, adding synchronization, mutual exclusion, and communication where necessary. If these pragmas are ignored, or the compiler does not support OpenMP, the program falls back to an ordinary (usually serial) program; the code still runs correctly, it just cannot use multiple threads to speed up execution.

OpenMP execution model

[Figure: schematic of the OpenMP execution model]

dask as backend

https://examples.dask.org/machine-learning/scale-scikit-learn.html

Distributed computing is used to spread the backend workload across different machines.

That would use the default joblib backend (multiple processes) for parallelism. To use the Dask distributed backend, which will use a cluster of machines to train the model, perform the fit in a parallel_backend context.

import joblib
from dask.distributed import Client

client = Client()  # connect to a Dask cluster (a local one by default)

with joblib.parallel_backend('dask'):
    # grid_search and data are defined earlier in the linked Dask example
    grid_search.fit(data.data, data.target)

ray as backend

Distributed computing is used to spread the backend workload across different machines.

https://pypi.org/project/ray/

Ray provides a simple, universal API for building distributed applications.

Ray is packaged with the following libraries for accelerating machine learning workloads:

  • Tune: Scalable Hyperparameter Tuning
  • RLlib: Scalable Reinforcement Learning
  • RaySGD: Distributed Training Wrappers
  • Ray Serve: Scalable and Programmable Serving

There are also many community integrations with Ray, including Dask, MARS, Modin, Horovod, Hugging Face, Scikit-learn, and others. Check out the full list of Ray distributed libraries here.

Install Ray with: pip install ray. For nightly wheels, see the Installation page.

https://medium.com/distributed-computing-with-ray/easy-distributed-scikit-learn-training-with-ray-54ff8b643b33

Distributed Scikit-Learn with Ray

import numpy as np
from joblib import parallel_backend  # added line.
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

from ray.util.joblib import register_ray  # added line.
register_ray()  # added line.

param_space = {
    'C': np.logspace(-6, 6, 30),
    'gamma': np.logspace(-8, 8, 30),
    'tol': np.logspace(-4, -1, 30),
    'class_weight': [None, 'balanced'],
}

model = SVC(kernel='rbf')
search = RandomizedSearchCV(model, param_space, cv=5, n_iter=300, verbose=1)
digits = load_digits()
# ray.init(address="auto")  # uncomment to connect to an existing Ray cluster
with parallel_backend('ray'):  # added line.
    search.fit(digits.data, digits.target)