Out-of-core classification of text documents in scikit-learn

Strategies to scale computationally: bigger data

https://scikit-learn.org/stable/computing/scaling_strategies.html

      The sheer number of samples and the required processing speed challenge the traditional workflow (load the data into memory - train - apply).

For some applications the amount of examples, features (or both) and/or the speed at which they need to be processed are challenging for traditional approaches. In these cases scikit-learn has a number of options you can consider to make your system scale.

Scaling with instances using out-of-core learning

     Out-of-core learning is a technique for scaling up learning capacity.

    It does not require loading all the data into memory; instead, data is read in batches from external storage (e.g. a hard disk) and learned from as it is read.

    (1) Stream the data.

    (2) Extract features.

    (3) Learn incrementally.

Out-of-core (or “external memory”) learning is a technique used to learn from data that cannot fit in a computer’s main memory (RAM).

Here is a sketch of a system designed to achieve this goal:

  1. a way to stream instances

  2. a way to extract features from instances

  3. an incremental algorithm
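Putting the three pieces together, a minimal sketch might look like the following (the stream_batches reader and its toy data are purely illustrative and not part of the scikit-learn documentation):

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

def stream_batches():
    """Hypothetical reader: yields (texts, labels) mini-batches from disk or network."""
    yield ["spam spam spam", "meeting at noon"], np.array([1, 0])
    yield ["free money now", "quarterly report attached"], np.array([1, 0])

vectorizer = HashingVectorizer(n_features=2 ** 18)   # stateless feature extraction
clf = SGDClassifier()                                # incremental learner

for texts, y in stream_batches():                    # 1. stream instances
    X = vectorizer.transform(texts)                  # 2. extract features
    clf.partial_fit(X, y, classes=np.array([0, 1]))  # 3. learn incrementally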

Out-of-core

https://machinelearning.wtf/terms/out-of-core/

     Here "core" can be read as the classic computer core structure, CPU + RAM: a conventional program is loaded into memory and then scheduled for computation by the CPU.

    "Out-of-core" breaks through that boundary and brings external storage (or the network) into the program's structure as well.

The term out-of-core typically refers to processing data that is too large to fit into a computer’s main memory.

Typically, when a dataset fits neatly into a computer’s main memory, randomly accessing sections of data has a (relatively) small performance penalty.

When data must be stored in a medium like a large spinning hard drive or an external computer network, it becomes very expensive to randomly seek to an arbitrary section of data or to process the same data multiple times.

In such a case, an out-of-core algorithm would try to access all relevant data in one sequence.

However, modern computers have a deep memory hierarchy, and replacing random access with sequential access can increase performance even on datasets that fit within memory.
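As a hedged illustration of the sequential-access idea (the file name, its raw float64 layout and the chunk size are assumptions), a running statistic can be computed by streaming a large file in fixed-size chunks instead of loading it whole:

import numpy as np

chunk_bytes = 8 * 100_000              # 100k float64 values per read
total, count = 0.0, 0

# Hypothetical file of raw float64 values, far larger than RAM.
with open("huge_values.f64", "rb") as f:
    while True:
        buf = f.read(chunk_bytes)      # one sequential read, no random seeks
        if not buf:
            break
        chunk = np.frombuffer(buf, dtype=np.float64)
        total += chunk.sum()
        count += chunk.size

print("mean:", total / count)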

out-of-core algorithms

https://en.wikipedia.org/wiki/External_memory_algorithm

    Out-of-core algorithms, also called external memory algorithms, are algorithms designed for the case where massive data cannot be fully loaded into memory.

In computing, external memory algorithms or out-of-core algorithms are algorithms that are designed to process data that are too large to fit into a computer's main memory at once. Such algorithms must be optimized to efficiently fetch and access data stored in slow bulk memory (auxiliary memory) such as hard drives or tape drives, or when memory is on a computer network.[1][2] External memory algorithms are analyzed in the external memory model.

https://github.com/kalperen/MachineLearningGuide#learning-machine-learning

     Such systems handle vast amounts of data by chopping it into small mini-batches and using online learning techniques to learn from those mini-batches.

6. What is out-of-core learning?

Out-of-core learning systems are used to handle vast quantities of data that cannot fit in a computer's main memory. They work by chopping the data into mini-batches and using online learning techniques to learn from these mini-batches.
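A minimal sketch of the "chop into mini-batches" step, using only the standard library (the helper name minibatches is illustrative):

import itertools

def minibatches(instances, batch_size):
    """Yield lists of at most `batch_size` items from any (possibly huge) iterator."""
    iterator = iter(instances)
    while True:
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            return
        yield batch

for batch in minibatches(range(10), 4):
    print(batch)        # [0, 1, 2, 3], [4, 5, 6, 7], [8, 9]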

Streaming instances

    Design a data reader that reads instances one after another from files on disk, from a database, or from a network stream. Read a batch, consume a batch; no data stays resident in memory.

Basically, 1. may be a reader that yields instances from files on a hard drive, a database, from a network stream etc. However, details on how to achieve this are beyond the scope of this documentation.
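For illustration, such a reader can be a plain Python generator; the file layout (newline-delimited JSON) and field contents below are assumptions, not something prescribed by scikit-learn:

import glob
import json

def stream_instances(pattern="data/*.jsonl"):
    """Yield one instance (a dict) at a time from newline-delimited JSON files."""
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)

# Nothing is read until the generator is consumed, e.g.:
# for doc in stream_instances():
#     handle(doc)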

Extracting features

     Because the data volume is huge, the feature space cannot be known in advance.

     If the data allows multiple passes, a stateful vectorizer can be used to pin down the feature space first.

     If multiple passes are too expensive, a stateless vectorizer can build a feature space of fixed, finite dimension instead; for example the hashing-based FeatureHasher, or HashingVectorizer for extracting token-count features from text documents.

2. could be any relevant way to extract features among the different feature extraction methods supported by scikit-learn. However, when working with data that needs vectorization and where the set of features or values is not known in advance one should take explicit care. A good example is text classification where unknown terms are likely to be found during training. It is possible to use a stateful vectorizer if making multiple passes over the data is reasonable from an application point of view. Otherwise, one can turn up the difficulty by using a stateless feature extractor. Currently the preferred way to do this is to use the so-called hashing trick as implemented by sklearn.feature_extraction.FeatureHasher for datasets with categorical variables represented as list of Python dicts or sklearn.feature_extraction.text.HashingVectorizer for text documents.
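To make the stateful/stateless distinction concrete, here is a small comparison on a toy corpus: CountVectorizer needs a fitting pass over the data to build its vocabulary and silently drops unseen terms later, whereas HashingVectorizer can transform any batch directly into the same fixed-size space:

from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

train_batch = ["the cat sat", "the dog ran"]
later_batch = ["a completely new pangolin appeared"]   # unseen terms

# Stateful: needs a full pass to learn the vocabulary; unseen terms are dropped.
count_vec = CountVectorizer().fit(train_batch)
print(count_vec.transform(later_batch).sum())   # 0, no known term matched

# Stateless: no fit required, every batch lands in the same hashed space.
hash_vec = HashingVectorizer(n_features=2 ** 10)
print(hash_vec.transform(later_batch).shape)    # (1, 1024)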

FeatureHasher

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html#sklearn.feature_extraction.FeatureHasher

    It converts dict-like objects into hashed feature vectors.

Implements feature hashing, aka the hashing trick.

This class turns sequences of symbolic feature names (strings) into scipy.sparse matrices, using a hash function to compute the matrix column corresponding to a name. The hash function employed is the signed 32-bit version of Murmurhash3.

Feature names of type byte string are used as-is. Unicode strings are converted to UTF-8 first, but no Unicode normalization is done. Feature values must be (finite) numbers.

This class is a low-memory alternative to DictVectorizer and CountVectorizer, intended for large-scale (online) learning and situations where memory is tight, e.g. when running prediction code on embedded devices.

>>> from sklearn.feature_extraction import FeatureHasher
>>> h = FeatureHasher(n_features=10)
>>> D = [{'dog': 1, 'cat':2, 'elephant':4},{'dog': 2, 'run': 5}]
>>> f = h.transform(D)
>>> f.toarray()
array([[ 0.,  0., -4., -1.,  0.,  0.,  0.,  0.,  0.,  2.],
       [ 0.,  0.,  0., -2., -5.,  0.,  0.,  0.,  0.,  0.]])

In fact, the supported input types are dictionaries, (name, value) pairs, and plain strings.

input_type{“dict”, “pair”, “string”}, default=”dict”

Either “dict” (the default) to accept dictionaries over (feature_name, value); “pair” to accept pairs of (feature_name, value); or “string” to accept single strings. feature_name should be a string, while value should be a number. In the case of “string”, a value of 1 is implied. The feature_name is hashed to find the appropriate column for the feature. The value’s sign might be flipped in the output (but see non_negative, below).
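A brief sketch of the other two input types (the feature names and values are made up):

from sklearn.feature_extraction import FeatureHasher

# input_type="string": each sample is an iterable of feature names,
# each occurrence counting as a value of 1.
h_str = FeatureHasher(n_features=8, input_type="string")
X_str = h_str.transform([["dog", "cat", "dog"], ["run", "run"]])

# input_type="pair": each sample is an iterable of (feature_name, value) pairs.
h_pair = FeatureHasher(n_features=8, input_type="pair")
X_pair = h_pair.transform([[("dog", 1), ("cat", 2)], [("run", 5)]])

print(X_str.toarray())
print(X_pair.toarray())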

HashingVectorizer

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html#sklearn.feature_extraction.text.HashingVectorizer

     A hashing method with a fixed, finite number of dimensions, aimed specifically at token counting.

    Tokens are mapped into a finite feature space; because the mapping is a hash function, collisions are possible, but in practice a feature space of 2 ** 18 is large enough to make collisions a non-issue.

    The method is stateless: it can only count tokens within each document and cannot compute inverse document frequencies (IDF).

Convert a collection of text documents to a matrix of token occurrences

It turns a collection of text documents into a scipy.sparse matrix holding token occurrence counts (or binary occurrence information), possibly normalized as token frequencies if norm=’l1’ or projected on the euclidean unit sphere if norm=’l2’.

This text vectorizer implementation uses the hashing trick to find the token string name to feature integer index mapping.

This strategy has several advantages:

  • it is very low memory scalable to large datasets as there is no need to store a vocabulary dictionary in memory

  • it is fast to pickle and un-pickle as it holds no state besides the constructor parameters

  • it can be used in a streaming (partial fit) or parallel pipeline as there is no state computed during fit.

There are also a couple of cons (vs using a CountVectorizer with an in-memory vocabulary):

  • there is no way to compute the inverse transform (from feature indices to string feature names) which can be a problem when trying to introspect which features are most important to a model.

  • there can be collisions: distinct tokens can be mapped to the same feature index. However in practice this is rarely an issue if n_features is large enough (e.g. 2 ** 18 for text classification problems).

  • no IDF weighting as this would render the transformer stateful.

The hash function employed is the signed 32-bit version of Murmurhash3.

>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = HashingVectorizer(n_features=2**4)
>>> X = vectorizer.fit_transform(corpus)
>>> print(X.shape)
(4, 16)
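Because the token-to-column mapping is fully determined by the hash function, no fit is actually required; a later, previously unseen batch is projected into the same 16-dimensional space. A small sketch:

from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(n_features=2 ** 4)
# transform() can be called directly on each incoming batch; there is no
# vocabulary to learn, so unseen words simply hash into the same 16 columns.
X_batch1 = vectorizer.transform(['This is the first document.'])
X_batch2 = vectorizer.transform(['A completely new document with unseen words.'])
print(X_batch1.shape, X_batch2.shape)   # (1, 16) (1, 16)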

Incremental learning

    Not every algorithm supports incremental learning; all estimators that implement partial_fit do.

    In fact, the ability to learn incrementally from mini-batches is the key to out-of-core learning: it guarantees that at any given time only a small number of instances live in main memory.

Finally, for 3. we have a number of options inside scikit-learn. Although not all algorithms can learn incrementally (i.e. without seeing all the instances at once), all estimators implementing the partial_fit API are candidates. Actually, the ability to learn incrementally from a mini-batch of instances (sometimes called “online learning”) is key to out-of-core learning as it guarantees that at any given time there will be only a small amount of instances in the main memory. Choosing a good size for the mini-batch that balances relevancy and memory footprint could involve some tuning.

Estimators supporting incremental learning exist for classification, clustering, regression, dimensionality reduction / feature extraction, and data preprocessing.

The scaling_strategies page linked above lists the incremental estimators available for each of these tasks.

      All possible target classes must be declared in the first call to partial_fit.

For classification, a somewhat important thing to note is that although a stateless feature extraction routine may be able to cope with new/unseen attributes, the incremental learner itself may be unable to cope with new/unseen targets classes. In this case you have to pass all the possible classes to the first partial_fit call using the classes= parameter.
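A minimal sketch of that first partial_fit call (toy data, assumed here rather than taken from the docs):

import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()
all_classes = np.array([0, 1])

X_first = np.array([[0.0, 1.0], [1.0, 0.0]])
y_first = np.array([0, 0])          # first batch only contains class 0
# classes= must list every class that may ever appear ...
clf.partial_fit(X_first, y_first, classes=all_classes)

X_next = np.array([[1.0, 1.0]])
y_next = np.array([1])              # ... so later batches can introduce class 1
clf.partial_fit(X_next, y_next)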

Another aspect to consider when choosing a proper algorithm is that not all of them put the same importance on each example over time. Namely, the Perceptron is still sensitive to badly labeled examples even after many examples whereas the SGD* and PassiveAggressive* families are more robust to this kind of artifacts. Conversely, the latter also tend to give less importance to remarkably different, yet properly labeled examples when they come late in the stream as their learning rate decreases over time.

Out-of-core classification of text documents --- Examples

https://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html#sphx-glr-auto-examples-applications-plot-out-of-core-classification-py

     This example has the following parts:

    (1) Build a streaming data reader over the labeled Reuters news data for training, and set aside a test set. The task is binary prediction of whether a document carries the acq topic label.

    (2) Build several online, out-of-core learners and train them incrementally.

    (3) Report training statistics and prediction metrics.

This is an example showing how scikit-learn can be used for classification using an out-of-core approach: learning from data that doesn’t fit into main memory. We make use of an online classifier, i.e., one that supports the partial_fit method, that will be fed with batches of examples. To guarantee that the features space remains the same over time we leverage a HashingVectorizer that will project each example into the same feature space. This is especially useful in the case of text classification where new features (words) may appear in each batch.

Finally, we have a full-fledged example of Out-of-core classification of text documents. It is aimed at providing a starting point for people wanting to build out-of-core learning systems and demonstrates most of the notions discussed above.

Furthermore, it also shows the evolution of the performance of different algorithms with the number of processed examples.

[Figure: classification accuracy of each learner as a function of the number of examples seen]

Now looking at the computation time of the different parts, we see that the vectorization is much more expensive than learning itself. From the different algorithms, MultinomialNB is the most expensive, but its overhead can be mitigated by increasing the size of the mini-batches (exercise: change minibatch_size to 100 and 10000 in the program and compare).

[Figure: computation time of the different parts (vectorization and training of each classifier)]

 

Core code:

https://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html#sphx-glr-auto-examples-applications-plot-out-of-core-classification-py

A generator function is used to produce the streaming-read behavior.

# Excerpt from the scikit-learn example script; ReutersParser and
# _not_in_sphinx are helper definitions from earlier in that same script.
import os
import sys
import tarfile
from glob import glob
from urllib.request import urlretrieve

from sklearn.datasets import get_data_home


def stream_reuters_documents(data_path=None):
    """Iterate over documents of the Reuters dataset.

    The Reuters archive will automatically be downloaded and uncompressed if
    the `data_path` directory does not exist.

    Documents are represented as dictionaries with 'body' (str),
    'title' (str), 'topics' (list(str)) keys.

    """

    DOWNLOAD_URL = ('http://archive.ics.uci.edu/ml/machine-learning-databases/'
                    'reuters21578-mld/reuters21578.tar.gz')
    ARCHIVE_FILENAME = 'reuters21578.tar.gz'

    if data_path is None:
        data_path = os.path.join(get_data_home(), "reuters")
    if not os.path.exists(data_path):
        """Download the dataset."""
        print("downloading dataset (once and for all) into %s" %
              data_path)
        os.mkdir(data_path)

        def progress(blocknum, bs, size):
            total_sz_mb = '%.2f MB' % (size / 1e6)
            current_sz_mb = '%.2f MB' % ((blocknum * bs) / 1e6)
            if _not_in_sphinx():
                sys.stdout.write(
                    '\rdownloaded %s / %s' % (current_sz_mb, total_sz_mb))

        archive_path = os.path.join(data_path, ARCHIVE_FILENAME)
        urlretrieve(DOWNLOAD_URL, filename=archive_path,
                    reporthook=progress)
        if _not_in_sphinx():
            sys.stdout.write('\r')
        print("untarring Reuters dataset...")
        tarfile.open(archive_path, 'r:gz').extractall(data_path)
        print("done.")

    parser = ReutersParser()
    for filename in glob(os.path.join(data_path, "*.sgm")):
        for doc in parser.parse(open(filename, 'rb')):
            yield doc

The hashing vectorizer maps each document's token counts into a feature space of fixed dimensionality.

get_minibatch builds the feature texts and the target labels.
# Imports needed by this and the following snippets.
import itertools
import time

import numpy as np

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import (SGDClassifier, Perceptron,
                                  PassiveAggressiveClassifier)
from sklearn.naive_bayes import MultinomialNB

vectorizer = HashingVectorizer(decode_error='ignore', n_features=2 ** 18,
                               alternate_sign=False)


# Iterator over parsed Reuters SGML files.
data_stream = stream_reuters_documents()

# We learn a binary classification between the "acq" class and all the others.
# "acq" was chosen as it is more or less evenly distributed in the Reuters
# files. For other datasets, one should take care of creating a test set with
# a realistic portion of positive instances.
all_classes = np.array([0, 1])
positive_class = 'acq'

# Here are some classifiers that support the `partial_fit` method
partial_fit_classifiers = {
    'SGD': SGDClassifier(max_iter=5),
    'Perceptron': Perceptron(),
    'NB Multinomial': MultinomialNB(alpha=0.01),
    'Passive-Aggressive': PassiveAggressiveClassifier(),
}


def get_minibatch(doc_iter, size, pos_class=positive_class):
    """Extract a minibatch of examples, return a tuple X_text, y.

    Note: size is before excluding invalid docs with no topics assigned.

    """
    data = [('{title}\n\n{body}'.format(**doc), pos_class in doc['topics'])
            for doc in itertools.islice(doc_iter, size)
            if doc['topics']]
    if not len(data):
        return np.asarray([], dtype=int), np.asarray([], dtype=int)
    X_text, y = zip(*data)
    return X_text, np.asarray(y, dtype=int)


def iter_minibatches(doc_iter, minibatch_size):
    """Generator of minibatches."""
    X_text, y = get_minibatch(doc_iter, minibatch_size)
    while len(X_text):
        yield X_text, y
        X_text, y = get_minibatch(doc_iter, minibatch_size)

During training, the generator is consumed in a loop; inside the loop body, each model is updated incrementally.

# We will feed the classifier with mini-batches of 1000 documents; this means
# we have at most 1000 docs in memory at any time.  The smaller the document
# batch, the bigger the relative overhead of the partial fit methods.
minibatch_size = 1000

# Create the data_stream that parses Reuters SGML files and iterates on
# documents as a stream.
minibatch_iterators = iter_minibatches(data_stream, minibatch_size)
total_vect_time = 0.0

# Main loop : iterate on mini-batches of examples
for i, (X_train_text, y_train) in enumerate(minibatch_iterators):

    tick = time.time()
    X_train = vectorizer.transform(X_train_text)
    total_vect_time += time.time() - tick

    for cls_name, cls in partial_fit_classifiers.items():
        tick = time.time()
        # update estimator with examples in the current mini-batch
        cls.partial_fit(X_train, y_train, classes=all_classes)
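As a hedged follow-up, not shown in the excerpt above: the full example draws a test batch (e.g. 1000 documents) from the stream before the training loop, vectorizes it once, and scores every classifier after its incremental updates. Roughly:

# (In the full script these two lines appear *before* the training loop.)
X_test_text, y_test = get_minibatch(data_stream, 1000)
X_test = vectorizer.transform(X_test_text)

for cls_name, cls in partial_fit_classifiers.items():
    print("%-20s accuracy: %.3f" % (cls_name, cls.score(X_test, y_test)))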
Original post: https://www.cnblogs.com/lightsong/p/14282885.html