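Middleware is a core concept in Scrapy. A middleware lets you modify data before a request is sent or after a response comes back, so the same crawler can be adapted to different situations.
The Chinese word for "middleware" (中间件) differs by only one character from the "man-in-the-middle" (中间人) covered in an earlier chapter, and the two do very similar things: both intercept data in transit, modify it, and pass it on. The difference is that a middleware is a component the developer adds deliberately, while a man-in-the-middle is a link inserted without the developer's involvement, usually with malicious intent. Middleware exists to help development; a man-in-the-middle is mostly used to steal data, forge it, or mount attacks.
Scrapy has two kinds of middleware: downloader middleware (Downloader Middleware) and spider middleware (Spider Middleware).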
1. Downloader middleware: setting the User-Agent, proxy, and cookies
# -*- coding: utf-8 -*-
import logging
import random

from scrapy.utils.project import get_project_settings

logger = logging.getLogger(__name__)


class ProxyMiddleware(object):
    """Attach a randomly chosen proxy to every request."""

    def __init__(self):
        self.settings = get_project_settings()

    def process_request(self, request, spider):
        proxy = random.choice(self.settings["PROXIES"])
        logger.info(f'use proxy {proxy}')
        request.meta['proxy'] = proxy


class UaMiddleware(object):
    """Set a randomly chosen User-Agent header on every request."""

    def __init__(self):
        self.settings = get_project_settings()

    def process_request(self, request, spider):
        ua = random.choice(self.settings['USER_AGENT_LIST'])
        logger.info(f'use ua : {ua}')
        request.headers['User-Agent'] = ua


class CookieMiddleware(object):
    """Attach a fixed set of login cookies to every request."""

    def process_request(self, request, spider):
        # Must run before Scrapy's built-in CookiesMiddleware so the cookies are sent.
        request.cookies = {
            '_octo': 'GH1.1.308160386.1573462117',
            'dotcom_user': 'zj008',
            'logged_in': 'yes',
            '__Host-user_session_same_site': '_RswyHk7fUP475BeR1pVow6qB0XNSq5cOCfw9tUINjraeRhU',
            '_device_id': '69c607831d178592b4c83dde16be0f22',
            '_gh_sess': 'akxkaDREcUpRZFErNmRWTXYwUGovK3QrMjN3aEp4RDY3K1QwK2g4NHVQSkdYV2o5ajl4czZ6Q2dRN0ZHSXVhU3N5S1NsT1haM1hUd281Rkp5eExSQ0FQZmhwbUhSNUFCVmFQbXB1MDhCS2sxL0lhWHp5a3VWZHdiNVZia1JKbjNVNy9zVjYwdWxNcDFTdnNwUHYxaDBMTUpTMGRpR0drOElvZ0J2U3c0bjZjdisvemYvK1NGaENkM2d5UEhLTTRxY1YyWW83b2o4amljK0ZiSG4zeXJtM21sODhXS3JxdG82bG9YbENIV1oyRXUxTDI3VkE2RlNTcmRYWFoxQVBOK0QyR2tUV0Jqd1lFV0VteG9WWWo5U2dxcVNDeHNYTENDRzFhL3pvMVE1SHc9LS1VU2FHTFEycWxWZlVBZWdNcitOUWJ3PT0%3D--38d0fe27f00d3b3c34ac8ec54f6f80c4db3c2a66',
            'has_recent_activity': '1',
            'ignored_unsupported_browser_notice': 'false',
            'user_session': '_RswyHk7fUP475BeR1pVow6qB0XNSq5cOCfw9tUINjraeRhU',
        }
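Here I implemented three middlewares. Each one modifies the request object in process_request: one sets a proxy, one sets the User-Agent, and one adds cookies.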
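The middlewares only take effect once they are registered in settings.py. The fragment below is a minimal sketch, assuming the project package is called `myproject` (a hypothetical name) and using placeholder proxy addresses and User-Agent strings; `PROXIES` and `USER_AGENT_LIST` are the setting names the middlewares above read. The priority numbers are illustrative. The one constraint worth keeping is that the cookie middleware runs before Scrapy's built-in CookiesMiddleware (priority 700 by default), otherwise the cookies it sets are not picked up.

```python
# settings.py (fragment). The package name "myproject" is hypothetical.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.ProxyMiddleware": 543,
    "myproject.middlewares.UaMiddleware": 544,
    # Lower than 700 so it runs before the built-in CookiesMiddleware.
    "myproject.middlewares.CookieMiddleware": 545,
}

# Placeholder values; replace with real proxies and User-Agent strings.
PROXIES = [
    "http://127.0.0.1:8888",
    "http://127.0.0.1:8889",
]
USER_AGENT_LIST = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:70.0) Gecko/20100101 Firefox/70.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0",
]
```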
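Besides process_request, a downloader middleware can define process_response, which handles the response object, and process_exception, which handles exceptions raised while the request is being downloaded.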
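To show the three hooks side by side, here is a minimal sketch of a middleware that only logs what passes through it. The class name `LoggingMiddleware` is hypothetical, and the return values follow the conventions as I understand them from the Scrapy documentation: returning None from process_request or process_exception lets processing continue, and process_response hands back the response it received.

```python
import logging

logger = logging.getLogger(__name__)


class LoggingMiddleware:
    """Hypothetical downloader middleware that only observes traffic."""

    def process_request(self, request, spider):
        # Returning None lets the request continue through the chain.
        logger.info("request: %s", request.url)
        return None

    def process_response(self, request, response, spider):
        # Return a Response to pass it on, or a Request to reschedule it.
        if response.status >= 400:
            logger.warning("got %s for %s", response.status, request.url)
        return response

    def process_exception(self, request, exception, spider):
        # Called when the download (or an earlier process_request) raises.
        # Returning None leaves the exception to other middlewares.
        logger.error("download failed for %s: %r", request.url, exception)
        return None
```

Because the hooks only rely on attributes such as `request.url` and `response.status`, the class can be exercised with simple stand-in objects before it is wired into a project.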
2. Using Selenium with Scrapy
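Let's first look at the spider I wrote. It scrapes Baidu's search endpoint. In settings I keep a list called KEYWORDS, which holds the search terms to crawl, and a MAXPAGE value that controls how many result pages are requested per keyword.
The start_requests method builds the request objects and sends them.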
# -*- coding: utf-8 -*-
import scrapy


class SelenSpider(scrapy.Spider):
    name = 'selen'
    # allowed_domains = ['www.xx.com']
    # start_urls = ['http://www.xx.com/']
    base_url = 'https://www.baidu.com/s?wd={}&pn={}'

    def start_requests(self):
        for keyword in self.settings.get('KEYWORDS'):
            for page in range(1, self.settings.get('MAXPAGE') + 1):
                url = self.base_url.format(keyword, page*10)
                yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        a_list = response.xpath('//div[@id="content_left"]//h3//a')
        for a in a_list:
            # print(a.xpath('./@href').extract_first())
            # print(a.xpath('.//text()').extract())
            item = dict()
            item['title'] = ''.join(a.xpath('.//text()').extract())
            item['url'] = a.xpath('./@href').extract_first()
            print(item)
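The URL generation in start_requests can be tried out on its own. The sketch below pulls it into a standalone function; `build_search_urls` is a hypothetical helper, not part of the spider. It keeps the spider's `page * 10` offset formula unchanged and, as an extra defensive step that the spider itself does not take, percent-encodes each keyword with `urllib.parse.quote` so that non-ASCII search terms are safe in the query string.

```python
from urllib.parse import quote

BASE_URL = "https://www.baidu.com/s?wd={}&pn={}"


def build_search_urls(keywords, max_page):
    """Return one search URL per (keyword, page) pair."""
    urls = []
    for keyword in keywords:
        for page in range(1, max_page + 1):
            # Same offset formula as the spider: pn = page * 10.
            urls.append(BASE_URL.format(quote(keyword), page * 10))
    return urls


print(build_search_urls(["scrapy"], 2))
# ['https://www.baidu.com/s?wd=scrapy&pn=10', 'https://www.baidu.com/s?wd=scrapy&pn=20']
```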
Below is the middleware code:
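import logging

from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

logger = logging.getLogger(__name__)


class SelenMiddleware(object):
    def __init__(self):
        self.timeout = 10
        # Selenium 3 style: the driver path is passed positionally.
        # Newer Selenium releases expect the path via a Service object instead.
        self.bro = webdriver.Chrome('/Users/aibyte/project/scrapy_project/chromedriver')
        self.bro.set_window_size(1400, 700)
        self.bro.set_page_load_timeout(self.timeout)
        self.wait = WebDriverWait(self.bro, self.timeout)

    def process_request(self, request, spider):
        # Only the 'selen' spider is rendered through the browser.
        if spider.name != 'selen':
            return
        logger.debug('Chrome is handling the request')
        try:
            self.bro.get(request.url)
            # Wait until the result container is present in the page.
            self.wait.until(EC.presence_of_element_located((By.ID, 'content_left')))
            logger.debug('page data captured')
            # Returning a response here skips Scrapy's own downloader.
            return HtmlResponse(url=request.url, body=self.bro.page_source, request=request, encoding='utf8', status=200)
        except TimeoutException:
            # Falling through returns None, so Scrapy downloads the URL itself.
            logger.debug('page load timed out, no data captured')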
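In the __init__ method we create a browser object.
When a request reaches process_request, we first check the spider name to decide whether Selenium should handle it.
If so, the browser object fetches the page.
The page source is then wrapped in an HtmlResponse object and returned.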
3. Retrying requests in a middleware
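For some requests there are several possible values for the request parameters, and sending just one of them may not return a result. When a response does not contain the correct result, a middleware can intercept it, rebuild the request parameters, and send the request again.
Here is the spider code:
The start_requests method builds POST requests. The date parameter may need to be today's date or yesterday's date, so we send yesterday's date first.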
# -*- coding: utf-8 -*-
import datetime
import json

import scrapy


class RetrytestSpider(scrapy.Spider):
    name = 'retrytest'
    # allowed_domains = ['www.xx.com']
    # start_urls = ['http://exercise.kingname.info/exercise_middleware_retry.html']
    headers = {
        'Content-Type': 'application/json; charset=utf-8',
        'Host': 'exercise.kingname.info',
        'Referer': 'http://exercise.kingname.info/exercise_middleware_retry.html',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:70.0) Gecko/20100101 Firefox/70.0',
    }

    def start_requests(self):
        base_url = 'http://exercise.kingname.info/exercise_middleware_retry_backend/para/'
        # Try yesterday's date first; the middleware retries with today's date.
        yesterday = datetime.date.today() - datetime.timedelta(days=1)
        data = {'date': yesterday.strftime('%Y-%m-%d')}
        for i in range(1, 11):
            url = base_url + str(i)
            yield scrapy.Request(url=url, method='POST', body=json.dumps(data), callback=self.parse, headers=self.headers)

    def parse(self, response):
        print(response.body.decode())
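Here is the middleware code:
The process_response method of the downloader middleware inspects the response body. If it contains a normal result, we return the response as-is. If not, we rebuild the request parameters and return a new request, which Scrapy schedules again.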
import datetime
import json
import logging

import scrapy

logger = logging.getLogger(__name__)


class RetryMiddleware(object):
    def process_response(self, request, response, spider):
        ret = response.body.decode('utf8')
        # '参数错误' ("parameter error") is what the server returns for a wrong date.
        if '参数错误' not in ret:
            logger.info('true value')
            return response
        # Wrong date: rebuild the body with today's date and resend.
        dt = datetime.date.today()
        data = json.dumps({'date': str(dt)})
        logger.warning('false value, retrying %s with %s', request.url, data)
        return scrapy.Request(method='POST', url=request.url, body=data, headers=request.headers, callback=spider.parse)
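One thing the middleware above does not guard against is a request that keeps failing: if today's date is also rejected, the same request is rebuilt and resent indefinitely. The sketch below shows one possible way to cap the retries by keeping a counter in a dict that stands in for request.meta. The helper `next_payload`, the constant `MAX_DATE_RETRIES`, and the meta key `date_retries` are all hypothetical names introduced for illustration.

```python
import datetime
import json

MAX_DATE_RETRIES = 1  # hypothetical cap: retry once with today's date
ERROR_MARKER = "参数错误"  # marker text the middleware checks for


def next_payload(body_text, meta, today=None):
    """Return a new JSON body to retry with, or None to accept the response.

    `meta` stands in for request.meta; the retry count is stored under
    the hypothetical key "date_retries".
    """
    if ERROR_MARKER not in body_text:
        return None  # the response is fine, nothing to retry
    if meta.get("date_retries", 0) >= MAX_DATE_RETRIES:
        return None  # give up instead of looping forever
    meta["date_retries"] = meta.get("date_retries", 0) + 1
    today = today or datetime.date.today()
    return json.dumps({"date": str(today)})
```

Inside process_response, a non-None payload would be used to build the follow-up request while carrying the same meta dict along, and a None result would mean returning the response unchanged.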