Python 9: Processes and Threads

In this section

　　The Python GIL (Global Interpreter Lock)

　　Threads

　　Processes

Python GIL(Global Interpreter Lock)

In CPython, the global interpreter lock, or GIL, is a mutex that prevents multiple native threads from executing Python bytecodes at once. This lock is necessary mainly because CPython’s memory management is not thread-safe. (However, since the GIL exists, other features have grown to depend on the guarantees that it enforces.)

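Roughly speaking, no matter how many threads you start or how many CPU cores you have, CPython calmly lets only one thread execute Python bytecode at any given moment. This is CPython's weakness: its multithreading is not truly parallel.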

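First, be clear that the GIL is not a feature of the Python language; it is a concept introduced by one implementation of the Python interpreter, CPython. By analogy, C++ is a language (syntax) standard, yet it can be compiled to executable code by different compilers, well-known ones being GCC, Intel C++, and Visual C++. Python is the same: a given piece of code can be run by different Python execution environments such as CPython, PyPy, and Psyco, and some environments, Jython for example, have no GIL. However, because CPython is the default Python execution environment almost everywhere, many people equate CPython with Python and take it for granted that the GIL is a defect of the language itself. So, to be clear from the start: the GIL is not a feature of Python, and Python does not have to depend on the GIL at all.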

Here is a link to a thorough analysis of how the GIL affects Python multithreading: http://www.dabeaz.com/python/UnderstandingGIL.pdf
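A rough way to see the effect described above is to time a CPU-bound function run twice in sequence and then in two threads. This is only an illustrative sketch: the function names and the loop size are made up for this example, and the actual timings depend on the machine and the interpreter build (a free-threaded CPython build would behave differently).

```python
import threading
import time

def count_down(n):
    # Pure-Python CPU-bound loop; it holds the GIL while it runs bytecode.
    while n > 0:
        n -= 1

def time_sequential(n):
    # Run the loop twice, one after the other, and return the elapsed seconds.
    start = time.perf_counter()
    count_down(n)
    count_down(n)
    return time.perf_counter() - start

def time_threaded(n):
    # Run the same two loops in two threads and return the elapsed seconds.
    threads = [threading.Thread(target=count_down, args=(n,)) for _ in range(2)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

if __name__ == "__main__":
    n = 5_000_000
    print("sequential:  %.2fs" % time_sequential(n))
    print("two threads: %.2fs" % time_threaded(n))
```

On a standard CPython build the two timings are usually close to each other, because the threads take turns holding the GIL instead of running in parallel.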

The Python threading module

There are two ways to create a thread:

Direct invocation

import threading
import time

def sayhi(num):
    print("test run ==>", num)
    time.sleep(2)

if __name__ == "__main__":
    t1 = threading.Thread(target=sayhi, args=(1,))
    t2 = threading.Thread(target=sayhi, args=(2,))

    t1.start()
    t2.start()
    print(t1.name)  # Thread.getName() is deprecated since Python 3.10
    print(t2.name)

Subclassing threading.Thread

import threading
import time

class MyThread(threading.Thread):
    def __init__(self, num):
        threading.Thread.__init__(self)
        self.num = num

    def run(self):
        print("test run ==>", self.num)
        time.sleep(3)

if __name__ == "__main__":
    t1 = MyThread(1)
    t2 = MyThread(2)
    t1.start()
    t2.start()

Join & Daemon

Some threads do background tasks, like sending keepalive packets, or performing periodic garbage collection, or whatever. These are only useful when the main program is running, and it's okay to kill them off once the other, non-daemon, threads have exited.

Without daemon threads, you'd have to keep track of them, and tell them to exit, before your program can completely quit. By setting them as daemon threads, you can let them run and forget about them, and when your program quits, any daemon threads are killed automatically.

import time
import threading

def run(n):
    print("----- run:%s -----" % n)
    time.sleep(2)
    print("----done----")

def main():
    for i in range(5):
        t = threading.Thread(target=run, args=(i, ))
        t.start()
        t.join(1)  # wait at most 1 second for this worker
        print("start threading", t.name)

m = threading.Thread(target=main, args=[])
m.daemon = True  # must be set before start(); replaces the deprecated setDaemon(True)
m.start()
m.join(timeout=2)
print("----main thread done----")

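join() waits for the thread it is called on to finish; here a timeout is given, so it waits at most that long. Setting the daemon flag (setDaemon(True), or daemon = True in newer code) marks a thread as a daemon thread. Informally, a daemon thread is a helper thread: it does not hold up the main thread or the other threads, and once all the other threads have finished, the program exits without regard to whether the daemon threads are still running.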
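The behaviour of join() with and without a timeout can be checked with is_alive(). The snippet below is a small sketch; the helper names and the sleep durations are arbitrary choices for illustration.

```python
import threading
import time

def worker(delay):
    time.sleep(delay)

def join_demo():
    t = threading.Thread(target=worker, args=(0.5,), daemon=True)
    t.start()
    t.join(timeout=0.1)              # give up waiting after 0.1 s
    alive_after_timeout = t.is_alive()
    t.join()                         # no timeout: wait until the thread finishes
    alive_after_join = t.is_alive()
    return t.daemon, alive_after_timeout, alive_after_join

if __name__ == "__main__":
    print(join_demo())
```

join(timeout=...) returns even if the thread is still running, so the only way to tell whether it actually finished is to call is_alive() afterwards.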

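Thread lock (mutex)

A process can start multiple threads, and those threads share the parent process's memory space, which means every thread can access the same data. What happens if two threads try to modify the same piece of data at the same time?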

import time
import threading

def addNum():
    global num
    print("--get num:", num)
    time.sleep(1)
    num +=1

num = 0
thread_list = []
for i in range(100):
    t =  threading.Thread(target=addNum)
    t.start()
    thread_list.append(t)
    
for t in thread_list:
    t.join()
    
print("ending ===>", num)

# In principle, running this several times on Python 2.7 should eventually give a result other than 100.

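This is where a thread lock comes in. Once a thread has acquired the lock, no other thread can enter the locked section and touch the data until that thread releases the lock.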

import time
import threading

def addNum():
    global num
    print("--get num:", num)
    time.sleep(1)
    lock.acquire()  # acquire the lock before touching the shared data
    num += 1
    lock.release()  # release the lock (mutex) once the update is done

num = 0  # global variable
thread_list = []
lock = threading.Lock()  # create the lock
for i in range(100):
    t =  threading.Thread(target=addNum)
    t.start()
    thread_list.append(t)

for t in thread_list:
    t.join()
# wait for all threads to finish
print("ending ===>", num)
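The acquire()/release() pair can also be written with a with statement, which releases the lock even if the protected code raises. The following is a minimal sketch of the same counter; the function name locked_count and its parameters are invented for this example.

```python
import threading

def locked_count(n_threads=50, per_thread=1000):
    lock = threading.Lock()
    total = 0

    def add():
        nonlocal total
        for _ in range(per_thread):
            with lock:           # acquired here, released when the block exits
                total += 1

    threads = [threading.Thread(target=add) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total

if __name__ == "__main__":
    print(locked_count())
```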

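At this point you may be a little puzzled: we said earlier that the GIL ensures only one thread executes at a time, so why is a mutex still needed here?

The lock here is a user-level lock and has nothing to do with the GIL. The details are as follows:

Threads are executed serially by the interpreter, which switches between their execution contexts. Suppose thread 1 reads count (0) but is switched out before it finishes. Thread 2 then runs, adds 1 to count, and writes the result back to the shared data. When thread 1 resumes, it still holds the value count = 0 that it read earlier, so it also adds 1 and writes count = 1. One increment is lost, and that is the problem.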
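The interleaving described above normally depends on timing, but it can be forced for demonstration. The sketch below uses two threading.Event objects purely to make thread one pause between its read and its write; the names are hypothetical and this is not how the interpreter schedules threads, only a reproduction of the sequence of reads and writes.

```python
import threading

def lost_update_demo():
    count = 0
    t1_has_read = threading.Event()
    t2_has_written = threading.Event()

    def thread_one():
        nonlocal count
        local = count            # reads 0
        t1_has_read.set()
        t2_has_written.wait()    # stands in for being switched out here
        count = local + 1        # writes 1, overwriting thread two's update

    def thread_two():
        nonlocal count
        t1_has_read.wait()       # run only after thread one has read
        count = count + 1        # reads 0, writes 1
        t2_has_written.set()

    a = threading.Thread(target=thread_one)
    b = threading.Thread(target=thread_two)
    a.start()
    b.start()
    a.join()
    b.join()
    return count

if __name__ == "__main__":
    print(lost_update_demo())
```

Two increments ran, yet the counter ends at 1 rather than 2, because thread one wrote back a value computed from stale data.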

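You may then ask: if user programs add their own mutexes anyway, why does CPython still need the GIL? The GIL was added mainly to reduce the complexity of development. For example, when writing Python you do not need to worry about reclaiming memory, because the interpreter does it for you automatically. You can picture the interpreter as having its own thread that wakes up at intervals and scans for memory that can be freed. That thread runs concurrently with the threads of your own program. Suppose one of your threads deletes a variable; while the interpreter's collector is in the middle of clearing it, another thread might assign a new value to that not-yet-cleared memory, and the newly assigned data would be wiped out. To avoid problems of this kind, Python took the blunt approach of adding one lock that allows only a single thread to run at a time: while one thread runs, no other thread can do anything. That solves the problem, and it can be regarded as a legacy of Python's early versions.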

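RLock (reentrant lock)

In essence, this is a smaller lock nested inside a larger one: a thread that already holds the RLock can acquire it again.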

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import threading, time

def run1():
    print("grab the first part data")
    lock.acquire()
    global num
    num += 1
    lock.release()
    return num

def run2():
    print("grab the second part data")
    lock.acquire()
    global num2
    num2 += 1
    lock.release()
    return num2

def run3():
    lock.acquire()
    res = run1()
    print('--------between run1 and run2-----')
    res2 = run2()
    lock.release()
    print(res, res2)

if __name__ == '__main__':
    num, num2 = 0, 0
    lock = threading.RLock()
    for i in range(10):
        t = threading.Thread(target=run3)
        t.start()
while threading.active_count() != 1:
    print(threading.active_count())
else:
    print('----all threads done---')
    print(num, num2)
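To see why an RLock is needed for this nesting, one can try to acquire the same lock twice from a single thread without blocking. This is a small sketch; can_reacquire is a hypothetical helper written for this illustration.

```python
import threading

def can_reacquire(lock):
    # Take the lock, then try to take it again from the same thread.
    lock.acquire()
    try:
        got_it_again = lock.acquire(blocking=False)
        if got_it_again:
            lock.release()       # undo the inner acquire
        return got_it_again
    finally:
        lock.release()           # undo the outer acquire

if __name__ == "__main__":
    print("Lock: ", can_reacquire(threading.Lock()))
    print("RLock:", can_reacquire(threading.RLock()))
```

With a plain Lock the second acquire fails, which is why a function that holds a plain Lock and then calls another function acquiring the same lock would block forever.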

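Semaphore

A mutex allows only one thread at a time to modify the data, whereas a Semaphore allows a fixed number of threads to do so at the same time. Think of an internet cafe with only 50 computers: at most 50 people can be online, and anyone else has to wait until someone leaves.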

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import threading,time

def run(n):
    semaphore.acquire()
    time.sleep(1)
    print("run ==> %s " % n)
    semaphore.release()

if __name__ == "__main__":
    semaphore = threading.BoundedSemaphore(5)  # maximum number of threads allowed to run at once
    for i in range(20):
        t = threading.Thread(target=run, args=(i, ))
        t.start()

while threading.active_count()!= 1:
    time.sleep(1)
    print(threading.active_count())
else:
    print("------All Done------")
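One way to check that the semaphore really caps concurrency is to record the highest number of threads that were inside the guarded section at once. The sketch below is an illustration only; peak_concurrency and its counters are invented names.

```python
import threading
import time

def peak_concurrency(limit=3, n_threads=10):
    semaphore = threading.BoundedSemaphore(limit)
    state_lock = threading.Lock()    # protects the two counters below
    running = 0
    peak = 0

    def task():
        nonlocal running, peak
        with semaphore:              # at most `limit` threads get past this line
            with state_lock:
                running += 1
                peak = max(peak, running)
            time.sleep(0.05)
            with state_lock:
                running -= 1

    threads = [threading.Thread(target=task) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak

if __name__ == "__main__":
    print("peak:", peak_concurrency())
```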

Timer

This class represents an action that should be run only after a certain amount of time has passed.

Timers are started, as with threads, by calling their start() method. The timer can be stopped (before its action has begun) by calling the cancel() method. The interval the timer will wait before executing its action may not be exactly the same as the interval specified by the user.

import threading

def hello():
    print("hello world")

t = threading.Timer(5, hello)
t.start()
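The cancel() method mentioned above can be tried out with two short timers, one of which is cancelled before it fires. This is a minimal sketch with made-up names and a short interval chosen for the example.

```python
import threading

def timer_demo():
    fired = []
    kept = threading.Timer(0.1, fired.append, args=("kept",))
    cancelled = threading.Timer(0.1, fired.append, args=("cancelled",))
    kept.start()
    cancelled.start()
    cancelled.cancel()   # cancelled before its interval elapses, so it never fires
    kept.join()          # Timer is a Thread subclass, so join() waits for it
    cancelled.join()
    return fired

if __name__ == "__main__":
    print(timer_demo())
```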

Events

An event is a simple synchronization object; the event represents an internal flag, and threads can wait for the flag to be set, or set or clear the flag themselves.

event = threading.Event()

# a client thread can wait for the flag to be set
event.wait()

# a server thread can set or reset it
event.set()
event.clear()

If the flag is set, the wait method doesn't do anything. If the flag is cleared, wait will block until it becomes set again. Any number of threads may wait for the same event.
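wait() also accepts a timeout and returns whether the flag was set, which makes the behaviour above easy to observe. The following is a small sketch; event_demo is a hypothetical name and a Timer is used only as a convenient way to set the flag from another thread.

```python
import threading

def event_demo():
    event = threading.Event()
    timed_out = event.wait(timeout=0.05)       # flag is clear: returns False after 0.05 s
    threading.Timer(0.05, event.set).start()   # another thread sets the flag shortly
    signalled = event.wait(timeout=2)          # returns True as soon as the flag is set
    return timed_out, signalled, event.is_set()

if __name__ == "__main__":
    print(event_demo())
```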

An Event can be used to coordinate two or more threads. The traffic light example below shows how several threads interact through one.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import threading,time


def light():
    i = 0
    event.set()
    while True:
        if i >= 10:
            event.set()  # a set event means the light is green
            i = 0
            print("\033[42;1mGreen light, go >>>>\033[0m")
        elif i >= 5 and i < 10:
            event.clear()
            print("\033[41;1mRed light, stop >>>>\033[0m")
        else:
            print("\033[42;1mGreen light, go >>>>\033[0m")
        time.sleep(1)
        i += 1


def car():
    while True:
        if event.is_set():
            print("car is running")
            time.sleep(1)
        else:
            print("car is waiting for green light")
            event.wait()

event = threading.Event()


_Light = threading.Thread(target=light)
_Car = threading.Thread(target=car)
_Light.start()
_Car.start()

Here is another example:

# -*- coding: utf-8 -*-
__author__ = 'Alex Li'
import threading
import time
import random

def door():
    door_open_time_counter = 0
    while True:
        if door_swiping_event.is_set():
            print("\033[32;1mdoor opening....\033[0m")
            door_open_time_counter +=1

        else:
            print("\033[31;1mdoor closed...., swipe to open.\033[0m")
            door_open_time_counter = 0  # reset the timer
            door_swiping_event.wait()


        if door_open_time_counter > 3:  # the door has been open for more than 3 checks, time to close it
            door_swiping_event.clear()

        time.sleep(0.5)


def staff(n):

    print("staff [%s] is coming..." % n )
    while True:
        if door_swiping_event.is_set():
            print("\033[34;1mdoor is opened, passing.....\033[0m")
            break
        else:
            print("staff [%s] sees door got closed, swiping the card....." % n)
            print(door_swiping_event.is_set())  # state before swiping
            door_swiping_event.set()
            print("after set ",door_swiping_event.is_set())
        time.sleep(0.5)
door_swiping_event  = threading.Event()  # create the event


door_thread = threading.Thread(target=door)
door_thread.start()



for i in range(5):
    p = threading.Thread(target=staff,args=(i,))
    time.sleep(random.randrange(3))
    p.start()

The queue module

queue is especially useful in threaded programming when information must be exchanged safely between multiple threads.

class queue.Queue(maxsize=0)  # first in, first out

class queue.LifoQueue(maxsize=0)  # last in, first out
class queue.PriorityQueue(maxsize=0)  # entries are retrieved in priority order

Constructor for a priority queue. maxsize is an integer that sets the upperbound limit on the number of items that can be placed in the queue. Insertion will block once this size has been reached, until queue items are consumed. If maxsize is less than or equal to zero, the queue size is infinite.

The lowest valued entries are retrieved first (the lowest valued entry is the one returned by sorted(list(entries))[0]). A typical pattern for entries is a tuple in the form: (priority_number, data).
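The three queue classes differ only in the order in which entries come back out. The sketch below puts the same (priority_number, data) tuples into each and drains them; the helper names and the sample entries are made up for illustration.

```python
import queue

def drain(q):
    # Remove and return everything currently in the queue, in retrieval order.
    items = []
    while not q.empty():
        items.append(q.get())
    return items

def ordering_demo():
    fifo, lifo, prio = queue.Queue(), queue.LifoQueue(), queue.PriorityQueue()
    for entry in [(2, "write"), (1, "read"), (3, "close")]:
        fifo.put(entry)
        lifo.put(entry)
        prio.put(entry)
    return drain(fifo), drain(lifo), drain(prio)

if __name__ == "__main__":
    for name, items in zip(("FIFO", "LIFO", "Priority"), ordering_demo()):
        print(name, items)
```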

exception queue.Empty: Exception raised when non-blocking get() (or get_nowait()) is called on a Queue object which is empty.

exception queue.Full: Exception raised when non-blocking put() (or put_nowait()) is called on a Queue object which is full.

Queue.qsize()

Queue.empty() #return True if empty

Queue.full() # return True if full

Queue.put(item, block=True, timeout=None): Put item into the queue. If optional args block is true and timeout is None (the default), block if necessary until a free slot is available. If timeout is a positive number, it blocks at most timeout seconds and raises the Full exception if no free slot was available within that time. Otherwise (block is false), put an item on the queue if a free slot is immediately available, else raise the Full exception (timeout is ignored in that case).

Queue.put_nowait(item): Equivalent to put(item, False).

Queue.get(block=True, timeout=None): Remove and return an item from the queue. If optional args block is true and timeout is None (the default), block if necessary until an item is available. If timeout is a positive number, it blocks at most timeout seconds and raises the Empty exception if no item was available within that time. Otherwise (block is false), return an item if one is immediately available, else raise the Empty exception (timeout is ignored in that case).

Queue.get_nowait(): Equivalent to get(False).
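The non-blocking variants and the two exceptions can be exercised on a queue with maxsize=1. This is a minimal sketch; nonblocking_demo is a name invented for the example.

```python
import queue

def nonblocking_demo():
    q = queue.Queue(maxsize=1)
    q.put_nowait("a")
    try:
        q.put_nowait("b")        # the queue already holds maxsize items
        put_failed = False
    except queue.Full:
        put_failed = True
    first = q.get_nowait()
    try:
        q.get_nowait()           # the queue is now empty
        get_failed = False
    except queue.Empty:
        get_failed = True
    return put_failed, first, get_failed

if __name__ == "__main__":
    print(nonblocking_demo())
```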

Two methods are offered to support tracking whether enqueued tasks have been fully processed by daemon consumer threads.

Queue.task_done()

Indicate that a formerly enqueued task is complete. Used by queue consumer threads. For each get() used to fetch a task, a subsequent call to task_done() tells the queue that the processing on the task is complete.

If a join() is currently blocking, it will resume when all items have been processed (meaning that a task_done() call was received for every item that had been put() into the queue).

Raises a ValueError if called more times than there were items placed in the queue.

Queue.join(): Blocks until all items in the queue have been gotten and processed.
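task_done() and join() are meant to be used together, as in the following sketch with one daemon consumer thread. The function process_all and the doubling "work" are placeholders chosen for this example.

```python
import queue
import threading

def process_all(items):
    q = queue.Queue()
    done = []

    def consumer():
        while True:
            item = q.get()
            done.append(item * 2)   # stand-in for real work
            q.task_done()           # one task_done() per get()

    threading.Thread(target=consumer, daemon=True).start()
    for item in items:
        q.put(item)
    q.join()                        # returns once every put() has a matching task_done()
    return done

if __name__ == "__main__":
    print(process_all([1, 2, 3]))
```

Because the consumer is a daemon thread, it does not need to be told to stop; it is simply abandoned when the program exits.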

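The producer-consumer model

Using the producer-consumer pattern in concurrent programming solves many common concurrency problems. The pattern raises a program's overall data-processing throughput by balancing the working capacity of producer threads and consumer threads.

Why use the producer-consumer pattern

In the world of threads, a producer is a thread that produces data and a consumer is a thread that consumes it. In multithreaded development, if the producer is fast and the consumer is slow, the producer has to wait for the consumer to finish before it can produce more data. By the same logic, if the consumer is faster than the producer, the consumer has to wait for the producer. The producer-consumer pattern was introduced to solve this problem.

What is the producer-consumer pattern

The producer-consumer pattern uses a container to remove the tight coupling between producer and consumer. Producers and consumers do not talk to each other directly; they communicate through a blocking queue. A producer does not wait for a consumer after producing data but drops it straight into the queue, and a consumer does not ask the producer for data but takes it straight from the queue. The blocking queue acts as a buffer that balances the processing capacity of the two sides.

For example: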

import threading
import queue
import time
q = queue.Queue()

def producer():
    count = 1
    while True:
        q.put("Pizza %s" % count)
        print("\033[42;1mPizza %s is ready\033[0m" % count )
        time.sleep(1)
        count += 1

def consumer(n):
    while True:
        print("%s got" % n, q.get())
        time.sleep(1)

p = threading.Thread(target=producer)
c = threading.Thread(target=consumer, args=("dandy",))
c1 = threading.Thread(target=consumer, args=("renee",))

p.start()
c.start()
c1.start()
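The example above runs forever. A variation that stops cleanly might use a bounded queue plus an end-of-stream marker, as sketched below. The SENTINEL value, the function names, and the buffer size of 3 are all assumptions made for this illustration, not part of the original example.

```python
import queue
import threading

SENTINEL = None  # hypothetical end-of-stream marker chosen for this sketch

def producer(q, n_items, n_consumers):
    for i in range(1, n_items + 1):
        q.put("Pizza %s" % i)        # blocks while the buffer is full
    for _ in range(n_consumers):
        q.put(SENTINEL)              # one marker per consumer

def consumer(q, eaten, lock):
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        with lock:
            eaten.append(item)

def run(n_items=10, n_consumers=2):
    q = queue.Queue(maxsize=3)       # small buffer between the two sides
    eaten = []
    lock = threading.Lock()
    consumers = [threading.Thread(target=consumer, args=(q, eaten, lock))
                 for _ in range(n_consumers)]
    for c in consumers:
        c.start()
    producer(q, n_items, n_consumers)
    for c in consumers:
        c.join()
    return eaten

if __name__ == "__main__":
    print(sorted(run()))
```

The bounded queue is what makes the buffer "blocking" in both directions: put() waits when it is full and get() waits when it is empty.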

Multiple processes: multiprocessing

multiprocessing is a package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads. Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine. It runs on both Unix and Windows.

from multiprocessing import Process
import time
def Foo(name):
    time.sleep(2)
    print("hello ", name)

if __name__ == "__main__":
    p = Process(target=Foo, args=("dandy", ))
    p.start()
    p.join()

Now let's look at the process IDs:

from multiprocessing import Process
import time
import os


def info(title):
    print(title)
    print("module name:", __name__)
    print("parent process:", os.getppid())
    print("process id:", os.getpid())


def Foo(name):
    info("Here is the title.")
    print("hello", name)

if __name__ == "__main__":
    info("\033[32;1mMain Process Line\033[0m")
    p = Process(target=Foo, args=("dandy", ))
    p.start()
    p.join()

Here is the output of the code above:

Main Process Line
module name: __main__
parent process: 8404
process id: 9136
Here is the title.
module name: __mp_main__
parent process: 9136
process id: 7300
hello dandy

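First the main program runs: it passes an argument to info and calls it, so the module name printed at this point is naturally __main__, the main program. Now look at the two process IDs: getppid means "get parent process ID" and getpid means "get process ID". The only thing that needs explaining is the order. 8404 can be looked up in Windows and turns out to be the process ID of PyCharm; 9136 is the ID of this .py file's main process. In other words, every child process is created by a parent process. The code then calls Process to instantiate a process, and that process is clearly a child started by the parent 9136, the file's main process; the child's ID is 7300. Seen this way it should be much easier to follow.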
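The parent-child relationship can also be checked in code instead of by reading the printed IDs. In the sketch below the child sends its own pid and parent pid back through a multiprocessing Queue; report is a hypothetical helper written for this illustration.

```python
import os
from multiprocessing import Process, Queue

def report(q):
    # Runs in whichever process calls it; sends back (own pid, parent pid).
    q.put((os.getpid(), os.getppid()))

if __name__ == "__main__":
    q = Queue()
    p = Process(target=report, args=(q,))
    p.start()
    child_pid, child_ppid = q.get()
    p.join()
    print("main pid:", os.getpid())
    print("child pid:", child_pid, "child's parent:", child_ppid)
    print("child's parent is the main process:", child_ppid == os.getpid())
```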

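Inter-process communication

We covered threads earlier. To recap: threads share the same data. Because of the GIL, only one thread can run at any moment, but since threads share data, two threads still need a lock to keep the data correct across context switches. Can processes share data in the same way?

For example, do QQ and WeChat share data? Can QQ modify data inside Alipay? Obviously not. Each process's data and memory are independent of every other process's. To exchange data, processes need a channel to communicate through, and that is what this part briefly introduces.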

Queues

Usage is much the same as the queue used with threading.

from multiprocessing import Process,Queue

def Foo(qq):
    qq.put("hello dandy")

if __name__ == "__main__":
    q = Queue()
    p = Process(target=Foo, args=(q, ))
    p.start()
    print(q.get())
    p.join()

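This is simple enough to read at a glance: one side calls put and the other calls get, and with two different processes on either side you have data exchange. Note that a process is instantiated in much the same way as a thread, and that both Process and Queue are imported from multiprocessing.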

Pipes

A Pipe is a channel between two ends; using it feels much like sending and receiving on a socket.

from multiprocessing import  Process,Pipe

def Foo(conn):
    conn.send("hello dandy")
    conn.close()

if __name__ == "__main__":
    parent_conn, child_conn = Pipe()

    p = Process(target=Foo, args=(child_conn, ))
    p.start()
    print(parent_conn.recv())
    p.join()

The Pipe() function returns a pair of connection objects connected by a pipe which by default is duplex (two-way).

The two connection objects returned by Pipe() represent the two ends of the pipe. Each connection object has send() and recv() methods (among others). Note that data in a pipe may become corrupted if two processes (or threads) try to read from or write to the same end of the pipe at the same time. Of course there is no risk of corruption from processes using different ends of the pipe at the same time.
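Since the pipe is duplex by default, each end can both send and receive. The sketch below has the parent send numbers and the child reply with their squares; square_server and the use of None as a stop signal are assumptions made for this example.

```python
from multiprocessing import Process, Pipe

def square_server(conn):
    # Receive numbers until None arrives, replying with each number squared.
    while True:
        value = conn.recv()
        if value is None:
            break
        conn.send(value * value)
    conn.close()

if __name__ == "__main__":
    parent_conn, child_conn = Pipe()          # duplex by default
    p = Process(target=square_server, args=(child_conn,))
    p.start()
    for n in (2, 3):
        parent_conn.send(n)                   # parent -> child
        print(n, "->", parent_conn.recv())    # child -> parent
    parent_conn.send(None)                    # tell the child to stop
    p.join()
```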

Managers

A manager object returned by Manager() controls a server process which holds Python objects and allows other processes to manipulate them using proxies.

A manager returned by Manager() will support types list, dict, Namespace, Lock, RLock, Semaphore, BoundedSemaphore, Condition, Event, Barrier, Queue, Value and Array. For example,

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from multiprocessing import Process,Manager
import os
def Foo(d, l):
    d[os.getpid()] = os.getpid()
    l.append(os.getpid())
if __name__ == "__main__":
    with Manager() as manager:  # like manager = Manager(), then d = manager.dict()
        d = manager.dict()  # a dict created through the manager, shared between processes
        l = manager.list(range(5))  # likewise, a list shared between processes
        p_list = []
        for i in range(10):
            p = Process(target=Foo, args=(d,l))
            p.start()
            p_list.append(p)
        for res in p_list:
            res.join()
        print(d)
        print(l)

Process synchronization

Without using the lock, output from the different processes is liable to get all mixed up.

from multiprocessing import Process, Lock

def Foo(l,i):
    l.acquire()
    try:
        print("hello world", i)
    finally:
        l.release()

if __name__ == "__main__":
    lock = Lock()

    for i in range(10):
        Process(target=Foo,args=(lock, i)).start()
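The same Lock can protect shared data as well as output. The sketch below is an assumption-laden illustration: it uses multiprocessing.Value, which this section does not otherwise cover, to hold a shared integer, and add_many is a made-up name.

```python
from multiprocessing import Process, Lock, Value

def add_many(lock, counter, n):
    for _ in range(n):
        with lock:                   # same as acquire() ... release()
            counter.value += 1

if __name__ == "__main__":
    lock = Lock()
    counter = Value("i", 0, lock=False)   # raw shared int; locking is done by hand
    workers = [Process(target=add_many, args=(lock, counter, 1000))
               for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(counter.value)
```

Passing lock=False to Value is only done here so that the explicit Lock is what provides the protection; by default a Value comes with its own lock.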

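Process pools

A process pool maintains a set of worker processes. When work is submitted, a process is taken from the pool; if no process is available, the program waits until one becomes free.

A process pool has two methods for submitting work:

apply & apply_async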

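#!/usr/bin/env python
# -*- coding: utf-8 -*-
from multiprocessing import Pool
import os,time

def Foo(i):
    time.sleep(2)
    print("In process Foo", os.getpid())
    return i +50

def bar(args):
    # the callback runs in the main process and receives Foo's return value
    print("===> Done:", args, os.getpid())

if __name__ == "__main__":
    pool = Pool(processes=3)  # maximum number of worker processes
    print("main process", os.getpid())

    for i in range(10):
        # pool.apply(func=Foo, args=(i, ))  # synchronous: calls run one at a time (apply takes no callback)
        pool.apply_async(func=Foo, args=(i, ), callback= bar)  # asynchronous
    print("Ending..")
    pool.close()
    pool.join()
# Note that the order of close() and join() here differs from everything seen so far: close() must come before join().
# If join() is commented out, the main process exits at once and the pool's worker processes are shut down with it.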
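When the return values are needed directly rather than through a callback, the AsyncResult returned by apply_async can be collected with get(), or map can be used. The following is a minimal sketch; add_fifty is a stand-in for the Foo function above without the sleep.

```python
from multiprocessing import Pool

def add_fifty(i):
    return i + 50

if __name__ == "__main__":
    with Pool(processes=3) as pool:
        # apply_async returns an AsyncResult; get() blocks until the value is ready.
        pending = [pool.apply_async(add_fifty, args=(i,)) for i in range(5)]
        print([r.get(timeout=10) for r in pending])
        # map splits the iterable across the workers and keeps the input order.
        print(pool.map(add_fifty, range(5)))
```

Leaving the with block terminates the pool, so all results are fetched inside it.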