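Concurrent Programming: Process Pools and Thread Pools

I. Introduction

After learning about multiprocessing or multithreading, it is tempting to start building on them right away. Starting processes or threads without any limit, however, is dangerous.

If a server starts one process or thread per client, the number of processes or threads grows with the number of concurrent clients. This puts heavy pressure on the server host and can eventually overwhelm it.

The number of processes or threads a server starts therefore has to be capped, so that the machine stays within a load it can handle. That is the purpose of a process pool or thread pool. A process pool, for example, is a pool that holds processes: it is still multiprocessing underneath, only with a limit on how many processes are started.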

The concurrent.futures module provides a high-level interface for asynchronous calls:
ThreadPoolExecutor: a thread pool that provides asynchronous calls
ProcessPoolExecutor: a process pool that provides asynchronous calls
Both implement the same interface, which is defined by the abstract Executor class.

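Basic methods

1. submit(fn, *args, **kwargs)
Submits a task asynchronously and returns a Future object.

2. map(func, *iterables, timeout=None, chunksize=1)
Replaces a for loop of submit calls.

3. shutdown(wait=True)
Comparable to pool.close() + pool.join() on a multiprocessing.Pool.
wait=True: returns only after every task in the pool has finished and its resources have been reclaimed.
wait=False: returns immediately, without waiting for the tasks in the pool to finish.
Whatever the value of wait, the program as a whole does not exit until all pending tasks have finished.
submit and map must be called before shutdown.

4. result(timeout=None)
A method of the Future returned by submit. It returns the task's result, blocking until the result is ready or the timeout expires.

5. add_done_callback(fn)
Also a Future method. It registers a callback that is called with the Future as its only argument once the task finishes.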
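
The methods listed above fit together roughly as in the following minimal sketch. It uses ThreadPoolExecutor, and the helpers square and run_squares are hypothetical names introduced only for illustration. The with statement is used here because leaving the block calls shutdown(wait=True).

```python
from concurrent.futures import ThreadPoolExecutor


def square(n):
    # Hypothetical task used only for illustration.
    return n * n


def run_squares(values):
    done = []  # filled in by the callback as futures complete
    # Leaving the with block calls executor.shutdown(wait=True).
    with ThreadPoolExecutor(max_workers=2) as executor:
        futures = [executor.submit(square, v) for v in values]
        for future in futures:
            # The callback receives the Future itself, not the return value.
            future.add_done_callback(lambda f: done.append(f.result()))
    # Every future is finished here, so result() returns immediately.
    return [f.result() for f in futures], sorted(done)


if __name__ == "__main__":
    print(run_squares([1, 2, 3]))
```

The list built from result() follows submission order, while the callback list is sorted because callbacks fire in completion order.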

II. Process Pool

The ProcessPoolExecutor class is an Executor subclass that uses a pool of processes to execute calls asynchronously.

ProcessPoolExecutor uses the multiprocessing module, which allows it to side-step the Global Interpreter Lock but also means that only picklable objects can be executed and returned.
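
The picklability requirement can be checked ahead of time with a rough pre-check like the one below. This is only a sketch: is_picklable and double are hypothetical helpers, and a successful pickle.dumps call is an approximation of what the pool needs, not a guarantee.

```python
import pickle


def double(n):
    # A module-level function is pickled by reference (module + name),
    # so a worker process can look it up again.
    return n * 2


def is_picklable(obj):
    # Rough pre-check: try to serialize the object the way the pool would.
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


if __name__ == "__main__":
    print("module-level function:", is_picklable(double))
    # A lambda has no importable name, so it cannot be pickled.
    print("lambda:", is_picklable(lambda n: n * 2))
```

In practice this means that tasks submitted to a ProcessPoolExecutor are normally plain functions defined at module level, with arguments and return values made of ordinary data.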

class ProcessPoolExecutor(_base.Executor):
    def __init__(self, max_workers=None, mp_context=None,
                 initializer=None, initargs=()):
        """Initializes a new ProcessPoolExecutor instance.
        Args:
            max_workers: The maximum number of processes that can be used to
                execute the given calls. If None or not given then as many
                worker processes will be created as the machine has processors.
            mp_context: A multiprocessing context to launch the workers. This
                object should provide SimpleQueue, Queue and Process.
            initializer: A callable used to initialize worker processes.
            initargs: A tuple of arguments to pass to the initializer.
        """

An Executor subclass that executes calls asynchronously using a pool of at most max_workers processes.

If max_workers is None or not given, it will default to the number of processors on the machine.

If max_workers is lower or equal to 0, then a ValueError will be raised.
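
The max_workers rule can be seen with a small sketch such as the following. make_pool is a hypothetical wrapper written for this illustration; it only catches the ValueError described above and reports it.

```python
from concurrent.futures import ProcessPoolExecutor


def make_pool(max_workers):
    # Returns a pool, or None when the constructor rejects max_workers.
    try:
        return ProcessPoolExecutor(max_workers=max_workers)
    except ValueError as exc:
        print("rejected max_workers=%r: %s" % (max_workers, exc))
        return None


if __name__ == "__main__":
    for requested in (0, -1):
        # Both values are <= 0, so no pool is created.
        print(make_pool(requested))
```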

import os
import time
import random
from concurrent.futures import ProcessPoolExecutor


def task(n):
    print("n =", n, "on", os.getpid(), "is running.")
    time.sleep(random.randint(1, 3))
    return n ** 2


if __name__ == '__main__':
    executor = ProcessPoolExecutor(max_workers=3)

    futures = []
    for i in range(10):
        future = executor.submit(task, i)
        futures.append(future)
    executor.shutdown(True)
    for future in futures:
        print(future.result())

Output (process IDs vary from run to run):

n = 0 on 15988 is running.
n = 1 on 4492 is running.
n = 2 on 7772 is running.
n = 3 on 4492 is running.
n = 4 on 15988 is running.
n = 5 on 7772 is running.
n = 6 on 15988 is running.
n = 7 on 4492 is running.
n = 8 on 15988 is running.
n = 9 on 7772 is running.
0
1
4
9
16
25
36
49
64
81

III. Thread Pool

ThreadPoolExecutor is an Executor subclass that uses a pool of threads to execute calls asynchronously.

class ThreadPoolExecutor(_base.Executor):
    # Used to assign unique thread names when thread_name_prefix is not supplied.
    _counter = itertools.count().__next__

    def __init__(self, max_workers=None, thread_name_prefix='',
                 initializer=None, initargs=()):
        """Initializes a new ThreadPoolExecutor instance.
        Args:
            max_workers: The maximum number of threads that can be used to
                execute the given calls.
            thread_name_prefix: An optional name prefix to give our threads.
            initializer: A callable used to initialize worker threads.
            initargs: A tuple of arguments to pass to the initializer.
        """

An Executor subclass that uses a pool of at most max_workers threads to execute calls asynchronously.

Changed in version 3.5: If max_workers is None or not given, it will default to the number of processors on the machine, multiplied by 5, assuming that ThreadPoolExecutor is often used to overlap I/O instead of CPU work and the number of workers should be higher than the number of workers for ProcessPoolExecutor. Note that since version 3.8 the default is min(32, os.cpu_count() + 4) instead.

New in version 3.6: The thread_name_prefix argument was added to allow users to control the threading.Thread names for worker threads created by the pool for easier debugging.
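
The effect of thread_name_prefix can be seen in a short sketch like this one. The prefix "crawler" and the helpers worker_name and collect_worker_names are made up for the illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def worker_name(_):
    # Report the name of the pool thread that runs this task.
    return threading.current_thread().name


def collect_worker_names(prefix, n_tasks=4):
    # Worker threads are named "<prefix>_<index>".
    with ThreadPoolExecutor(max_workers=2, thread_name_prefix=prefix) as executor:
        return list(executor.map(worker_name, range(n_tasks)))


if __name__ == "__main__":
    print(collect_worker_names("crawler"))
```

Which of the two workers picks up each task is not deterministic, so only the prefix of each name is predictable.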

import time
import random
import threading
from concurrent.futures import ThreadPoolExecutor


def task(n):
    # threading.currentThread().getName() is deprecated; use current_thread().name.
    print("n =", n, "on", threading.current_thread().name, "is running.")
    time.sleep(random.randint(1, 3))
    return n ** 2


if __name__ == '__main__':
    executor = ThreadPoolExecutor(max_workers=3)

    futures = []
    for i in range(10):
        future = executor.submit(task, i)
        futures.append(future)
    executor.shutdown(True)
    for future in futures:
        print(future.result())

Output:

n = 0 on ThreadPoolExecutor-0_0 is running.
n = 1 on ThreadPoolExecutor-0_1 is running.
n = 2 on ThreadPoolExecutor-0_2 is running.
n = 3 on ThreadPoolExecutor-0_2 is running.
n = 4 on ThreadPoolExecutor-0_0 is running.
n = 5 on ThreadPoolExecutor-0_1 is running.
n = 6 on ThreadPoolExecutor-0_2 is running.
n = 7 on ThreadPoolExecutor-0_1 is running.
n = 8 on ThreadPoolExecutor-0_0 is running.
n = 9 on ThreadPoolExecutor-0_2 is running.
0
1
4
9
16
25
36
49
64
81

IV. The map Method

The map method exists to replace a for loop of submit calls.
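
As a minimal sketch of that equivalence, the two hypothetical functions below compute the same results, once with a submit loop and once with map. ThreadPoolExecutor and the helper power are choices made for this illustration only.

```python
from concurrent.futures import ThreadPoolExecutor


def power(base, exponent):
    return base ** exponent


def with_submit(bases, exponents):
    # One submit call per pair of arguments, then collect in submission order.
    with ThreadPoolExecutor(max_workers=3) as executor:
        futures = [executor.submit(power, b, e) for b, e in zip(bases, exponents)]
        return [f.result() for f in futures]


def with_map(bases, exponents):
    # map takes one iterable per positional argument of the function
    # and yields the results in input order.
    with ThreadPoolExecutor(max_workers=3) as executor:
        return list(executor.map(power, bases, exponents))


if __name__ == "__main__":
    print(with_submit([2, 3, 4], [3, 2, 1]))
    print(with_map([2, 3, 4], [3, 2, 1]))
```

Note that map hands back results directly rather than Future objects, so there is no result() call on the items it yields.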

Here is the source code of ProcessPoolExecutor.map:

    def map(self, fn, *iterables, timeout=None, chunksize=1):
        """Returns an iterator equivalent to map(fn, iter).

        Args:
            fn: A callable that will take as many arguments as there are
                passed iterables.
            timeout: The maximum number of seconds to wait. If None, then there
                is no limit on the wait time.
            chunksize: If greater than one, the iterables will be chopped into
                chunks of size chunksize and submitted to the process pool.
                If set to one, the items in the list will be sent one at a time.

        Returns:
            An iterator equivalent to: map(func, *iterables) but the calls may
            be evaluated out-of-order.

        Raises:
            TimeoutError: If the entire result iterator could not be generated
                before the given timeout.
            Exception: If fn(*args) raises for any values.
        """
        if chunksize < 1:
            raise ValueError("chunksize must be >= 1.")

        results = super().map(partial(_process_chunk, fn),
                              _get_chunks(*iterables, chunksize=chunksize),
                              timeout=timeout)
        return _chain_from_iterable_of_lists(results)

With map, the process pool example from earlier can be simplified:

import os
import time
import random
from concurrent.futures import ProcessPoolExecutor


def task(n):
    print("n =", n, "on", os.getpid(), "is running.")
    time.sleep(random.randint(1, 3))
    return n ** 2


if __name__ == '__main__':
    executor = ProcessPoolExecutor(max_workers=3)

    # map() submits one call per item and returns an iterator over the
    # results (not Future objects), in input order.
    results = executor.map(task, range(1, 10))
    executor.shutdown(True)
    for result in results:
        print(result)

Output (process IDs vary from run to run):

n = 1 on 9996 is running.
n = 2 on 10252 is running.
n = 3 on 5708 is running.
n = 4 on 10252 is running.
n = 5 on 9996 is running.
n = 6 on 10252 is running.
n = 7 on 5708 is running.
n = 8 on 9996 is running.
n = 9 on 5708 is running.
1
4
9
16
25
36
49
64
81

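V. Callback Functions

A function can be attached to each task submitted to a process pool or thread pool. It is triggered automatically once the task finishes and receives the task's Future object as its argument; the return value is then read with result(). Such a function is called a callback.

Put simply: as soon as any task in the pool finishes, it notifies the main process: "I'm done, you can handle my result now", and the main process calls a function to handle that result. That function is the callback.

Time-consuming (blocking) tasks can go into the pool with a callback attached. With a process pool the callback runs in the main process, which therefore skips the I/O wait and gets the task's result directly. With a thread pool the callback runs in the worker thread that finished the task, as the thread names in the crawler output at the end of this section show.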
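
The callback mechanism described above can be sketched without any network access as follows. fetch_length is a hypothetical stand-in for a slow I/O task, and run_with_callbacks and on_done are names invented for this example. The sketch also shows one way a callback might deal with a task that raised.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_length(text):
    # Stand-in for a slow I/O task; raises on empty input.
    if not text:
        raise ValueError("empty input")
    return len(text)


def run_with_callbacks(inputs):
    results, errors = [], []

    def on_done(future):
        # The callback is handed the finished Future.
        # future.exception() returns None when the task succeeded.
        exc = future.exception()
        if exc is None:
            results.append(future.result())
        else:
            errors.append(str(exc))

    with ThreadPoolExecutor(max_workers=2) as executor:
        for item in inputs:
            executor.submit(fetch_length, item).add_done_callback(on_done)
    # All callbacks have run by the time the with block exits.
    return sorted(results), errors


if __name__ == "__main__":
    print(run_with_callbacks(["ab", "", "abcd"]))
```

Checking exception() first keeps a failed task from raising inside the callback, where the error would only be logged rather than propagated to the caller.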

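A classic use case is a web crawler: several threads request URLs in parallel to cut down on network waiting time, and each page is processed as soon as it has been fetched: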

import requests
import threading
from concurrent.futures import ThreadPoolExecutor


def get_page(website):
    print(threading.current_thread().name, "get url:", website)
    response = requests.get(website)
    if response.status_code == 200:
        return {"url": website, "text": response.text}
    # Any other status code falls through and returns None.


def parse_page(future):
    # The callback receives the finished Future, not the return value.
    res = future.result()
    if res is None:
        return  # the request did not return 200, nothing to parse
    print(threading.current_thread().name, "parse", res['url'])
    parse_res = 'url:<%s> size:[%s]\n' % (res['url'], len(res['text']))
    with open('db.txt', 'a') as f:
        f.write(parse_res)


if __name__ == '__main__':
    urls = [
        "https://www.baidu.com/",
        "http://www.tust.edu.cn/",
        "https://www.bilibili.com/",
        "https://www.luffycity.com/home",
        "https://blog.csdn.net/weixin_43336281"
    ]
    executor = ThreadPoolExecutor(5)
    for url in urls:
        executor.submit(get_page, url).add_done_callback(parse_page)

Output:

ThreadPoolExecutor-0_0 get url: https://www.baidu.com/
ThreadPoolExecutor-0_1 get url: http://www.tust.edu.cn/
ThreadPoolExecutor-0_2 get url: https://www.bilibili.com/
ThreadPoolExecutor-0_3 get url: https://www.luffycity.com/home
ThreadPoolExecutor-0_4 get url: https://blog.csdn.net/weixin_43336281
ThreadPoolExecutor-0_1 parse http://www.tust.edu.cn/
ThreadPoolExecutor-0_0 parse https://www.baidu.com/
ThreadPoolExecutor-0_3 parse https://www.luffycity.com/home
ThreadPoolExecutor-0_4 parse https://blog.csdn.net/weixin_43336281
ThreadPoolExecutor-0_2 parse https://www.bilibili.com/