Passing data between requests, log levels, and spider optimization

Passing data between requests

In some cases the data we want to scrape does not live on a single page. For example, when scraping a movie site, the movie title and rating sit on the first-level (list) page, while the remaining details sit on each movie's second-level (detail) page. In this situation we need to pass data between requests.
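As a minimal sketch of the idea (the URLs and field names below are placeholders, not taken from the site used in the example that follows): attach the partially filled item to the outgoing request via the meta dict, then read it back in the callback from response.meta.

import scrapy


class SketchSpider(scrapy.Spider):
    name = 'sketch'
    start_urls = ['http://example.com/list']  # placeholder URL

    def parse(self, response):
        # Fields scraped from the first-level (list) page
        item = {'title': 'some title from the list page'}
        # Hand the half-built item to the detail-page request via meta
        yield scrapy.Request(url='http://example.com/detail/1',
                             callback=self.parse_detail,
                             meta={'item': item})

    def parse_detail(self, response):
        item = response.meta['item']  # retrieve the item attached in parse()
        item['detail'] = 'value from the detail page'  # fields from the second-level page
        yield item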

 Case study: scrape the movie site http://www.55xia.com, collecting the movie title and rating from the first-level page, and the director/actors and description from the second-level (detail) page.

Spider file

# -*- coding: utf-8 -*-
import scrapy

from moviePro.items import MovieproItem
class MovieSpider(scrapy.Spider):
    name = 'movie'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.55xia.com/']

    def parse(self, response):
        div_list = response.xpath('//div[@class="col-xs-1-5 movie-item"]')
        for div in div_list:
            item = MovieproItem()
            item['name'] = div.xpath('.//div[@class="meta"]/h1/a/text()').extract_first()
            item['score'] = div.xpath('.//div[@class="meta"]/h1/em/text()').extract_first()
            if item['score'] is None:
                item['score'] = '0'
            detail_url = 'https:'+div.xpath('.//div[@class="meta"]/h1/a/@href').extract_first()

            # Send a request for the detail page
            # Use the meta argument to pass the item along with the request
            yield scrapy.Request(url=detail_url, callback=self.getDetailPage, meta={'item': item})

    def getDetailPage(self,response):
        item = response.meta['item']
        # Absolute XPaths copied from browser dev tools; the <tbody> they contain may not
        # exist in the raw HTML, so relative XPaths are usually more robust.
        deactor = response.xpath('/html/body/div[1]/div/div/div[1]/div[1]/div[2]/table/tbody/tr[1]/td[2]/a/text()').extract_first()
        desc = response.xpath('/html/body/div[1]/div/div/div[1]/div[2]/div[2]/p/text()').extract_first()
        item['desc'] = desc
        item['deactor'] =deactor

        yield item


        # Summary: when scraping with Scrapy, if the data you need is not stored on a single page,
        # you must pass data between requests (via meta) in order to persist a complete item.

Items file

import scrapy


class MovieproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    score = scrapy.Field()
    deactor = scrapy.Field()
    desc = scrapy.Field()

Pipeline file

class MovieproPipeline(object):
    fp = None

    def open_spider(self, spider):
        self.fp = open('./movie.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write(item['name'] + ':' + item['score'] + ':' + item['deactor'] + ':' + item['desc'] + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()

settings file

# UA
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'


ROBOTSTXT_OBEY = False
# Pipelines
ITEM_PIPELINES = {
   'moviePro.pipelines.MovieproPipeline': 300,
}

Improving crawl efficiency

1. Increase concurrency:
    By default Scrapy allows 16 concurrent requests, and this can be raised. In the settings file, set CONCURRENT_REQUESTS = 100 to allow 100 concurrent requests.

2. Lower the log level:
    Running Scrapy produces a large amount of log output. To reduce CPU usage, set the log level to INFO or ERROR. In the settings file: LOG_LEVEL = 'INFO'

3. Disable cookies:
    If you do not actually need cookies, disable them during the crawl to reduce CPU usage and improve efficiency. In the settings file: COOKIES_ENABLED = False

4. Disable retries:
    Re-requesting (retrying) failed HTTP requests slows the crawl down, so retries can be disabled. In the settings file: RETRY_ENABLED = False

5. Reduce the download timeout:
    When crawling very slow links, lowering the download timeout lets stuck requests be abandoned quickly, improving efficiency. In the settings file: DOWNLOAD_TIMEOUT = 10 sets a 10-second timeout.
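Taken together, a minimal settings.py sketch for these options might look like the following (the exact values are illustrative, not prescriptive):

# settings.py -- illustrative values; tune them for the target site
CONCURRENT_REQUESTS = 100    # raise concurrency (Scrapy's default is 16)
LOG_LEVEL = 'ERROR'          # only log errors
COOKIES_ENABLED = False      # disable cookies
RETRY_ENABLED = False        # disable retries on failed requests
DOWNLOAD_TIMEOUT = 10        # give up on a download after 10 seconds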

Example: crawling the 彼岸图网 wallpaper site (pic.netbian.com)

Spider file

# -*- coding: utf-8 -*-
import scrapy
from picPro.items import PicproItem


class PicSpider(scrapy.Spider):
    name = 'pic'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://pic.netbian.com/']

    def parse(self, response):
        li_list = response.xpath('//div[@class="slist"]/ul/li')  # list of image entries
        for li in li_list:
            img_url = 'http://pic.netbian.com' + li.xpath('./a/span/img/@src').extract_first()  # image URL
            img_name = img_url.split('/')[-1]  # image file name
            item = PicproItem()
            item['name'] = img_name

            yield scrapy.Request(url=img_url, callback=self.getImgData, meta={'item': item})

    def getImgData(self, response):
        item = response.meta['item']
        item['img_data'] = response.body

        yield item

Pipeline file

import os


class PicproPipeline(object):
    def open_spider(self, spider):
        if not os.path.exists('picLib'):
            os.mkdir('./picLib')

    def process_item(self, item, spider):
        imgPath = './picLib/' + item['name']
        with open(imgPath, 'wb') as fp:
            fp.write(item['img_data'])
            print(imgPath + ' downloaded successfully!')
        return item

Items file

import scrapy


class PicproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    img_data = scrapy.Field()

settings file

ROBOTSTXT_OBEY = False
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
CONCURRENT_REQUESTS = 30  # number of concurrent requests
LOG_LEVEL = 'ERROR'  # lower the log level
COOKIES_ENABLED = False  # disable cookies
RETRY_ENABLED = False  # disable retries
DOWNLOAD_TIMEOUT = 5  # download timeout (seconds)

UA pool and IP proxy pool

First, take a look at the middlewares.py file.

It contains two main kinds of middleware:

Spider Middleware 

Its main purpose is to do extra processing while the spider is running; it is generally not needed. Its hooks are listed below (a minimal skeleton of these hooks follows the list):

       - process_spider_input: receives a response object and processes it;

         it sits at Downloader --> process_spider_input --> Spiders (Downloader and Spiders are components in Scrapy's official architecture diagram)

       - process_spider_exception: called when the spider raises an exception

       - process_spider_output: called when the Spider returns results after processing a response

       - process_start_requests: called when the spider sends out its start requests;

    it sits at Spiders --> process_start_requests --> Scrapy Engine (the Scrapy Engine is a component in Scrapy's official architecture diagram)
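
To make the hook signatures concrete, here is a minimal pass-through spider middleware sketch (the class name is made up for illustration; each method simply forwards data unchanged):

class NoopSpiderMiddleware(object):
    # Hypothetical do-nothing spider middleware, shown only to illustrate the hook signatures.

    def process_spider_input(self, response, spider):
        # Called for each response going into the spider; return None to continue processing.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the items/requests the spider produced; must return an iterable.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when the spider (or another middleware) raises an exception.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the spider's start requests; must return an iterable of requests.
        for r in start_requests:
            yield r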

Downloader Middleware

Its main purpose is to do extra processing as pages are requested and downloaded.

Usage

To add the UA pool and IP pool, simply add them in the process_request method of the ProxyproDownloaderMiddleware class:

import random


class ProxyproDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    # http proxy pool and https proxy pool
    PROXY_http = [
        '58.45.195.51:9000',
        '111.230.113.238:9999',
    ]
    PROXY_https = [
        '120.83.49.90:9000',
        '106.14.162.110:8080',
    ]

    # UA pool
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]

    # Intercept requests: the request argument is the intercepted request
    def process_request(self, request, spider):
        # Called for each request that goes through the downloader middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called

        # Attach a random proxy, chosen from the pool that matches the URL scheme
        print('Downloader middleware:', request)
        if request.url.split(':')[0] == 'http':  # the target URL starts with http
            request.meta['proxy'] = random.choice(self.PROXY_http)
        else:
            request.meta['proxy'] = random.choice(self.PROXY_https)
        # Apply a random User-Agent
        request.headers['User-Agent'] = random.choice(self.user_agent_list)
        print(request.headers['User-Agent'])
        return None

 Note: be sure to enable the middleware in settings!!!

# Enable the downloader middleware
DOWNLOADER_MIDDLEWARES = {
    'proxyPro.middlewares.ProxyproDownloaderMiddleware': 543,
}
Original post: https://www.cnblogs.com/clbao/p/10269384.html