04: Crawler Performance

1.1 Common Ways to Implement Concurrency

  1. Introduction

      1. When writing a crawler, most of the performance cost goes into IO requests: in single-process, single-thread mode every URL request blocks while waiting for its response, which slows down the whole crawl (a sequential baseline sketch follows this list).

      2. Processes: spawning a process per task is very expensive in resources

      3. Threads: you end up with many threads, and a thread that is blocked on IO still cannot do any other work

      4. Coroutines: gevent starts only one thread; once a request is sent, gevent does not sit waiting on it, so the single thread keeps working and whichever response comes back first is handled first
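
      A minimal sequential baseline (the URLs are just placeholders) shows the problem from point 1: each requests.get() blocks until its response arrives, so the total time is roughly the sum of the individual request times.

#! /usr/bin/env python
# -*- coding: utf-8 -*-
import requests

url_list = [
    'http://www.baidu.com',
    'http://www.bing.com',
]

# single process, single thread: every request blocks the whole program
for url in url_list:
    result = requests.get(url)
    print(result.status_code, len(result.content))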

  2. Comparing several ways to implement concurrency

    1) Concurrency with a thread pool

#! /usr/bin/env python
# -*- coding: utf-8 -*-
import requests
from concurrent.futures import ThreadPoolExecutor

def fetch_request(url):
    result = requests.get(url)
    print(result.content)

pool = ThreadPoolExecutor(10)       # create a thread pool with at most 10 worker threads
url_list = [
    'http://www.google.com',
    'http://www.baidu.com',
]

for url in url_list:
    # grab a thread from the pool
    # and let that thread run fetch_request
    pool.submit(fetch_request, url)

pool.shutdown(True)     # wait for all submitted tasks to finish before exiting
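
      The example above only prints inside the worker threads. If the results are needed back in the main thread, one minimal variation (same idea, placeholder URLs) keeps the Future objects returned by submit() and reads them with as_completed():

#! /usr/bin/env python
# -*- coding: utf-8 -*-
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_request(url):
    return requests.get(url)

url_list = ['http://www.baidu.com', 'http://www.bing.com']

with ThreadPoolExecutor(10) as pool:
    # submit() returns a Future for every task
    futures = {pool.submit(fetch_request, url): url for url in url_list}
    # as_completed() yields futures in the order they finish
    for future in as_completed(futures):
        print(futures[future], len(future.result().content))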

    2) Concurrency with a process pool

#! /usr/bin/env python
# -*- coding: utf-8 -*-
import requests
from concurrent.futures import ProcessPoolExecutor

def fetch_request(url):
    result = requests.get(url)
    print(result.text)

url_list = [
    'http://www.google.com',
    'http://www.bing.com',
]

if __name__ == '__main__':
    pool = ProcessPoolExecutor(10)  # process pool with at most 10 worker processes
    # downside: each worker is a full process, so this is the most resource-hungry option
    for url in url_list:
        # grab a process from the pool
        # and let that process run fetch_request
        pool.submit(fetch_request, url)
    pool.shutdown(True)
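
      As a sketch of the same process-pool idea, ProcessPoolExecutor.map() can both dispatch the URLs and collect the results in input order; returning only the status code keeps the data that crosses the process boundary simple and picklable:

#! /usr/bin/env python
# -*- coding: utf-8 -*-
import requests
from concurrent.futures import ProcessPoolExecutor

def fetch_status(url):
    # return simple, picklable data from the worker process
    return requests.get(url).status_code

url_list = ['http://www.baidu.com', 'http://www.bing.com']

if __name__ == '__main__':
    with ProcessPoolExecutor(4) as pool:
        # map() runs fetch_status in the workers; results come back in input order
        for url, status in zip(url_list, pool.map(fetch_status, url_list)):
            print(url, status)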

    3) Thread pool + callback function

#! /usr/bin/env python
# -*- coding: utf-8 -*-
from concurrent.futures import ThreadPoolExecutor
import requests

def fetch_async(url):
    response = requests.get(url)
    return response

def callback(future):
    print(future.result().content)

if __name__ == '__main__':
    url_list = ['http://www.github.com', 'http://www.bing.com']
    pool = ThreadPoolExecutor(5)
    for url in url_list:
        v = pool.submit(fetch_async, url)
        v.add_done_callback(callback)
    pool.shutdown(wait=True)
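
      Note that add_done_callback() does not guarantee which thread runs the callback: it typically runs in the worker thread that finished the request, or immediately in the submitting thread if the future is already done, so callbacks should stay lightweight and avoid blocking work.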

    4) Coroutines: asynchronous IO with micro-threads (gevent)

#! /usr/bin/env python
# -*- coding: utf-8 -*-
from gevent import monkey

monkey.patch_all()      # patch the standard library before requests is imported

import gevent
import requests

# whichever response comes back first gets processed first
def fetch_async(method, url, req_kwargs):
    print(method, url, req_kwargs)
    response = requests.request(method=method, url=url, **req_kwargs)
    print(response.url, response.content)


if __name__ == '__main__':
    ##### send the requests #####
    gevent.joinall([
        gevent.spawn(fetch_async, method='get', url='https://www.python.org/', req_kwargs={}),
        gevent.spawn(fetch_async, method='get', url='https://www.yahoo.com/', req_kwargs={}),
        gevent.spawn(fetch_async, method='get', url='https://github.com/', req_kwargs={}),
    ])
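
      In a real crawler the number of concurrent greenlets usually needs a cap; a minimal sketch using gevent.pool.Pool (the size of 5 is an arbitrary choice) looks like this:

#! /usr/bin/env python
# -*- coding: utf-8 -*-
from gevent import monkey

monkey.patch_all()

import requests
from gevent.pool import Pool

def fetch_async(url):
    response = requests.get(url)
    print(response.url, len(response.content))

pool = Pool(5)          # at most 5 greenlets run at the same time
pool.map(fetch_async, [
    'https://www.python.org/',
    'https://github.com/',
])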


Original post: https://www.cnblogs.com/xiaonq/p/10852365.html