Crawler basics, plus usage and core principles of the Scrapy framework

Crawlers

I. Asynchronous IO

Thread: the smallest unit of work the computer schedules.

For IO-bound work (lots of IO requests) multithreading is the better fit; for CPU-bound work, multiple processes are best. IO requests spend their time waiting and barely involve the CPU.

Custom thread pool

Process: a process has a main thread by default, can host many threads at once, and those threads share the process's resources.

Custom process

Coroutine: a single thread inside a process is used to complete multiple tasks; also called a micro-thread (pseudo-thread).

GIL: specific to (C)Python. It locks the threads inside a process so that only one thread can be scheduled onto the CPU at any given moment.

# Author:wylkjj
# Date:2020/2/24
# -*- coding:utf-8 -*-
import requests
# thread pool
from concurrent.futures import ThreadPoolExecutor
# process pool
from concurrent.futures import ProcessPoolExecutor


def async_url(url):
    try:
        response = requests.get(url)
        print('result:', response.url, response.content)
    except Exception as e:
        print('request failed:', url, e)


url_list = [
    'http://www.baidu.com',
    'http://www.chouti.com',
    'http://www.bing.com',
    'http://www.google.com',
]
# Thread pool: five worker threads; threads are the better fit for IO requests.
# The GIL only serializes CPU execution; it is released while a thread waits on IO.
pool = ThreadPoolExecutor(5)
# Process pool: five worker processes; for IO-bound work processes just waste resources.
pools = ProcessPoolExecutor(5)

for url in url_list:
    print('start request:', url)
    pool.submit(async_url, url)

pool.shutdown(wait=True)

# Callback: attach one with the Future's .add_done_callback(callback_function)
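As a small illustration of the callback hook above, the Future returned by pool.submit() can be handed a completion callback. A minimal sketch (the fetch/on_done names are illustrative, not from the original post):

# attach a completion callback to a submitted task
from concurrent.futures import ThreadPoolExecutor
import requests


def fetch(url):
    return requests.get(url)


def on_done(future):
    # runs once the task finishes; the result (or exception) lives on the future
    response = future.result()
    print('callback got:', response.url, response.status_code)


pool = ThreadPoolExecutor(5)
future = pool.submit(fetch, 'http://www.baidu.com')
future.add_done_callback(on_done)
pool.shutdown(wait=True)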

Async IO modules:

import asyncio. Limitation: it only provides TCP (plus sleep); there is no built-in HTTP support.

Event loop: get_event_loop()

@asyncio.coroutine and yield from are used together as a pair; this is the fixed (legacy) idiom.

Asynchronous IO options:

  • asyncio + aiohttp, or asyncio + requests
  • gevent + requests: combining the two gave rise to grequests
  • twisted
  • tornado: asynchronous, non-blocking IO
# Author:wylkjj
# Date:2020/2/24
# -*- coding:utf-8 -*-
# async IO module
import asyncio


@asyncio.coroutine
def func1():
    print('before...func1......')
    yield from asyncio.sleep(5)
    print('end...func1......')


tasks = [func1(), func1()]
loop = asyncio.get_event_loop()  # event loop
loop.run_until_complete(asyncio.gather(*tasks))  # pass the tasks in as a list
loop.close()
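The block above uses the legacy @asyncio.coroutine / yield from style, which was deprecated in Python 3.8 and removed in 3.11. A minimal sketch of the same program in the modern async def / await form:

import asyncio


async def func1():
    print('before...func1......')
    await asyncio.sleep(5)
    print('end...func1......')


async def main():
    await asyncio.gather(func1(), func1())


asyncio.run(main())  # asyncio.run() creates and closes the event loop (Python 3.7+)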

# Author:wylkjj
# Date:2020/2/25
# -*- coding:utf-8 -*-
import asyncio


@asyncio.coroutine
def fetch_async(host, url='/'):
    print(host, url)
    reader, writer = yield from asyncio.open_connection(host, 80)

    request_header_content = """GET %s HTTP/1.0
Host: %s

""" % (url, host,)
    request_header_content = bytes(request_header_content, encoding='utf-8')

    writer.write(request_header_content)
    yield from writer.drain()
    text = yield from reader.read()
    print(host, url, str(text, encoding='utf-8'))
    writer.close()

tasks = [
    fetch_async('www.cnblogs.com', '/eric/'),
    fetch_async('dig.chouti.com', '/pic/show?nid=4073644713430508&lid=10273091')
]

loop = asyncio.get_event_loop()
results = loop.run_until_complete(asyncio.gather(*tasks))
loop.close()

# Author:wylkjj
# Date:2020/2/25
# -*- coding:utf-8 -*-
# HTTP requests via aiohttp + asyncio (the aiohttp approach; this snippet targets an older aiohttp release in which aiohttp.request was a plain coroutine)
import aiohttp
import asyncio


@asyncio.coroutine
def fetch_async(url):
    print(url)
    response = yield from aiohttp.request('GET', url)
    # data = yield from response.read()
    # print(url, data)
    print(url, response)
    response.close()
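The snippet above is written against an older aiohttp interface and never actually starts an event loop. With a current aiohttp (3.x) release the same idea looks roughly like this; a sketch, reusing a URL from the earlier examples:

import asyncio
import aiohttp


async def fetch_async(url):
    # a ClientSession owns the connection pool; share one per crawl rather than one per request
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            data = await response.read()
            print(url, response.status, len(data))


asyncio.run(fetch_async('http://www.cnblogs.com/eric/'))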
 

# Author:wylkjj
# Date:2020/2/25
# -*- coding:utf-8 -*-
# asyncio can also drive HTTP through requests (the requests approach)
import asyncio
import requests


@asyncio.coroutine
def fetch_async(func, *args):
    print(args)
    # event loop: run the blocking requests call in the default thread-pool executor
    loop = asyncio.get_event_loop()
    future = loop.run_in_executor(None, func, *args)
    response = yield from future
    print(response.url, response.content)


tasks = [
    fetch_async(requests.get, 'http://www.cnblogs.com/eric/'),
    fetch_async(requests.get, 'http://dig.chouti.com/pic/show?nid=4073644713430508&lid=10273091')
]

loop = asyncio.get_event_loop()
results = loop.run_until_complete(asyncio.gather(*tasks))
loop.close()


# Author:wylkjj
# Date:2020/2/25
# -*- coding:utf-8 -*-
import gevent
from gevent import monkey
monkey.patch_all()

import requests


def fetch_async(method, url, req_kwargs):
    print(method, url, req_kwargs)
    response = requests.request(method=method, url=url, **req_kwargs)
    print(response.url, response.content)


# ##### send the requests #####
gevent.joinall([
    gevent.spawn(fetch_async, method='get', url='https://www.python.org/', req_kwargs={}),
    gevent.spawn(fetch_async, method='get', url='https://www.yahoo.com/', req_kwargs={}),
    gevent.spawn(fetch_async, method='get', url='https://github.com/', req_kwargs={}),
])
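The bullet list earlier notes that gevent + requests was also packaged up as grequests; a minimal sketch of that combination (assumes pip install grequests, URLs reused from the gevent example):

import grequests

url_list = ['https://www.python.org/', 'https://github.com/']
# build unsent request objects, then let gevent send them all concurrently
pending = (grequests.get(url) for url in url_list)
for response in grequests.map(pending):
    if response is not None:  # grequests yields None for requests that failed
        print(response.url, len(response.content))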
# pip3 install twisted
# pip3 install wheel
#       b. download Twisted from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
#       c. cd into the download directory and run: pip3 install Twisted-17.1.0-cp35-cp35m-win_amd64.whl


from twisted.web.client import getPage
from twisted.internet import reactor

REV_COUNTER = 0
REQ_COUNTER = 0

def callback(contents):
    print(contents,)

    global REV_COUNTER
    REV_COUNTER += 1
    if REV_COUNTER == REQ_COUNTER:
        reactor.stop()


url_list = ['http://www.bing.com', 'http://www.baidu.com', ]
REQ_COUNTER = len(url_list)
for url in url_list:
    print(url)
    deferred = getPage(bytes(url, encoding='utf8'))
    deferred.addCallback(callback)
reactor.run()

import socket: exposes the standard BSD Sockets API, giving access to all of the operating system's low-level socket methods.

How the Tornado framework works

Custom asynchronous IO:
    based on sockets with setblocking(False)
    IO multiplexing (which is itself still synchronous IO at the system-call level)
    while True:
        r, w, e = select.select([...], [...], [...], 1)

For more detail on IO models, see this post on the event-driven IO model: https://www.cnblogs.com/wylshkjj/p/10896994.html
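A minimal sketch of this "custom async IO" idea: non-blocking sockets driven by a select loop (the target hosts and the 1-second timeout are illustrative choices):

import select
import socket

hosts = ['www.baidu.com', 'www.bing.com']
host_by_sock = {}

for host in hosts:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setblocking(False)                    # never block on connect/send/recv
    try:
        s.connect((host, 80))
    except BlockingIOError:                 # expected for a non-blocking connect
        pass
    host_by_sock[s] = host

to_write = list(host_by_sock)               # sockets still completing their connect
to_read = []                                # sockets waiting for response data
responses = {}

while to_write or to_read:
    r, w, e = select.select(to_read, to_write, [], 1)
    for s in w:                             # connection established: send the request
        s.send(('GET / HTTP/1.0\r\nHost: %s\r\n\r\n' % host_by_sock[s]).encode())
        to_write.remove(s)
        to_read.append(s)
    for s in r:                             # response bytes are ready
        chunk = s.recv(8192)
        if chunk:
            responses[host_by_sock[s]] = responses.get(host_by_sock[s], b'') + chunk
        else:                               # server closed the connection: this one is done
            to_read.remove(s)
            s.close()

for host, body in responses.items():
    print(host, len(body), 'bytes received')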

II. The Scrapy framework

Installing Scrapy

​ Linux
    pip3 install scrapy

Windows
    1. pip3 install wheel
       Install Twisted (the file name below only shows the naming format, not necessarily the right version; pick the wheel that matches your Python build):
       a. from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted download: Twisted-19.1.0-cp37-cp37m-win_amd64.whl
       b. cd into the directory containing the downloaded file
       c. pip3 install Twisted-19.1.0-cp37-cp37m-win_amd64.whl
    2. pip3 install scrapy (this can conflict with the urllib3 module; if urllib3 is installed, uninstall it first)
    3. on Windows, Scrapy also depends on pywin32: https://sourceforge.net/projects/pywin32/files/

Creating and running a project

  1. Basic Scrapy usage
  2. Create a new project: scrapy startproject scy (run this in the directory where you want the project; scy is the project name)
  3. Create a spider: scrapy genspider example example.com (cd into the project directory first; example is the spider file name, example.com is the site to crawl)
  4. Run the project: scrapy crawl chouti (chouti is the name of the spider to run)
  5. Suppress log output: scrapy crawl chouti --nolog (runs the chouti spider without printing its crawl logs)
  6. List the spider templates: scrapy genspider --list (shows four templates: basic, crawl, csvfeed, xmlfeed)
  7. To keep the spider from being blocked by a site's robots.txt rules, open the project's settings.py and set ROBOTSTXT_OBEY = False (see the settings sketch after this list)
  8. project_name/
    • scrapy.cfg  the project's top-level configuration file
    • project_name/
      • __init__.py
      • items.py  templates for structured data storage, similar to Django's Model
      • pipelines.py  data-processing behaviour, e.g. persisting the structured data
      • settings.py  the real configuration file: recursion depth, concurrency, download delay, etc.
      • spiders/  the spider directory: create files here and write the crawling rules in them
        • __init__.py
        • spider1.py
        • spider2.py
  9. Note: spiders still have to be created from the command line, and both the project and individual spider files are run from the command line as well.
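A minimal settings.py excerpt covering the options mentioned in items 7 and 8 above (the numeric values are illustrative, not recommendations):

# settings.py (excerpt): illustrative values only
ROBOTSTXT_OBEY = False      # do not let robots.txt block the spider
DEPTH_LIMIT = 2             # recursion depth
CONCURRENT_REQUESTS = 16    # concurrency
DOWNLOAD_DELAY = 1          # delay (in seconds) between downloads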
# Partial project code: crawling images from the Umei gallery (umei.cc)
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from bs4 import BeautifulSoup


class UmeiSpider(scrapy.Spider):
    name = 'umei'
    allowed_domains = ['umei.cc']
    start_urls = ['https://www.umei.cc/meinvtupian/meinvxiezhen/1.htm']
    visited_set = set()

    def parse(self, response):
        self.visited_set.add(response.url)  # pages that have already been crawled
        # 1. pull all of the gallery image links off the current page
        # find the <a> tags whose class is TypeBigPics
        main_page = BeautifulSoup(response.text, "html.parser")
        item_list = main_page.find_all("a", attrs={'class': 'TypeBigPics'})
        for item in item_list:
            item = item.find_all("img",)
            print(item)

        # 2. collect the pagination links: https://www.umei.cc/meinvtupian/meinvxiezhen/(\d+).htm
        page_list = main_page.find_all("div", attrs={'class': 'NewPages'})
        a_urls = 'https://www.umei.cc/meinvtupian/meinvxiezhen/'
        a_list = page_list[0].find_all("a")
        a_href = set()
        for a in a_list:
            a = a.get('href')
            if a:
                a_href.add(a_urls+a)
            else:
                pass
        for i in a_href:
            if i in self.visited_set:
                pass
            else:
                obj = Request(url=i, method='GET', callback=self.parse)
                yield obj
                print("obj:", obj)
Original post: https://www.cnblogs.com/wylshkjj/p/12365770.html