Passing data between requests, log levels, and spider optimization

Passing data between requests

In some cases the data we want to scrape does not live on a single page. For example, when scraping a movie site, the movie title and rating sit on the first-level (list) page, while the remaining details sit on each movie's second-level (detail) page. In this situation we need to pass data between requests.
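As a minimal sketch of the idea (the URLs and field names below are placeholders, not taken from the site used in the example that follows): attach the partially filled item to the outgoing request via the meta dict, then read it back in the callback from response.meta.

import scrapy


class SketchSpider(scrapy.Spider):
    name = 'sketch'
    start_urls = ['http://example.com/list']  # placeholder URL

    def parse(self, response):
        # Fields scraped from the first-level (list) page
        item = {'title': 'some title from the list page'}
        # Hand the half-built item to the detail-page request via meta
        yield scrapy.Request(url='http://example.com/detail/1',
                             callback=self.parse_detail,
                             meta={'item': item})

    def parse_detail(self, response):
        item = response.meta['item']  # retrieve the item attached in parse()
        item['detail'] = 'value from the detail page'  # fields from the second-level page
        yield item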

 Case study: scrape the movie site http://www.55xia.com, collecting the movie title and rating from the first-level page, and the director/actors and description from the second-level (detail) page.

Spider file

# -*- coding: utf-8 -*-
import scrapy

from moviePro.items import MovieproItem
class MovieSpider(scrapy.Spider):
    name = 'movie'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.55xia.com/']

    def parse(self, response):
        div_list = response.xpath('//div[@class="col-xs-1-5 movie-item"]')
        for div in div_list:
            item = MovieproItem()
            item['name'] = div.xpath('.//div[@class="meta"]/h1/a/text()').extract_first()
            item['score'] = div.xpath('.//div[@class="meta"]/h1/em/text()').extract_first()
            if item['score'] is None:
                item['score'] = '0'
            detail_url = 'https:'+div.xpath('.//div[@class="meta"]/h1/a/@href').extract_first()

            # Send a request for the detail page
            # Use the meta argument to pass the item along with the request
            yield scrapy.Request(url=detail_url, callback=self.getDetailPage, meta={'item': item})

    def getDetailPage(self,response):
        item = response.meta['item']
        # Absolute XPaths copied from browser dev tools; the <tbody> they contain may not
        # exist in the raw HTML, so relative XPaths are usually more robust.
        deactor = response.xpath('/html/body/div[1]/div/div/div[1]/div[1]/div[2]/table/tbody/tr[1]/td[2]/a/text()').extract_first()
        desc = response.xpath('/html/body/div[1]/div/div/div[1]/div[2]/div[2]/p/text()').extract_first()
        item['desc'] = desc
        item['deactor'] =deactor

        yield item


        # Summary: when scraping with Scrapy, if the data you need is not stored on a single page,
        # you must pass data between requests (via meta) in order to persist a complete item.

Items file

import scrapy


class MovieproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    score = scrapy.Field()
    deactor = scrapy.Field()
    desc = scrapy.Field()

Pipeline file

class MovieproPipeline(object):
    fp = None

    def open_spider(self, spider):
        self.fp = open('./movie.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write(item['name'] + ':' + item['score'] + ':' + item['deactor'] + ':' + item['desc'] + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()

settings file

# UA
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'


ROBOTSTXT_OBEY = False
# Pipelines
ITEM_PIPELINES = {
   'moviePro.pipelines.MovieproPipeline': 300,
}

Improving crawl efficiency

1. Increase concurrency:
    By default Scrapy allows 16 concurrent requests, and this can be raised. In the settings file, set CONCURRENT_REQUESTS = 100 to allow 100 concurrent requests.

2. Lower the log level:
    Running Scrapy produces a large amount of log output. To reduce CPU usage, set the log level to INFO or ERROR. In the settings file: LOG_LEVEL = 'INFO'

3. Disable cookies:
    If you do not actually need cookies, disable them during the crawl to reduce CPU usage and improve efficiency. In the settings file: COOKIES_ENABLED = False

4. Disable retries:
    Re-requesting (retrying) failed HTTP requests slows the crawl down, so retries can be disabled. In the settings file: RETRY_ENABLED = False

5. Reduce the download timeout:
    When crawling very slow links, lowering the download timeout lets stuck requests be abandoned quickly, improving efficiency. In the settings file: DOWNLOAD_TIMEOUT = 10 sets a 10-second timeout.
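Taken together, a minimal settings.py sketch for these options might look like the following (the exact values are illustrative, not prescriptive):

# settings.py -- illustrative values; tune them for the target site
CONCURRENT_REQUESTS = 100    # raise concurrency (Scrapy's default is 16)
LOG_LEVEL = 'ERROR'          # only log errors
COOKIES_ENABLED = False      # disable cookies
RETRY_ENABLED = False        # disable retries on failed requests
DOWNLOAD_TIMEOUT = 10        # give up on a download after 10 seconds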

Example: crawling the 彼岸图网 wallpaper site (pic.netbian.com)

Spider file

# -*- coding: utf-8 -*-
import scrapy
from picPro.items import PicproItem


class PicSpider(scrapy.Spider):
    name = 'pic'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://pic.netbian.com/']

    def parse(self, response):
        li_list = response.xpath('//div[@class="slist"]/ul/li')  # list of image entries
        for li in li_list:
            img_url = 'http://pic.netbian.com' + li.xpath('./a/span/img/@src').extract_first()  # image URL
            img_name = img_url.split('/')[-1]  # image file name
            item = PicproItem()
            item['name'] = img_name

            yield scrapy.Request(url=img_url, callback=self.getImgData, meta={'item': item})

    def getImgData(self, response):
        item = response.meta['item']
        item['img_data'] = response.body

        yield item

Pipeline file

import os


class PicproPipeline(object):
    def open_spider(self, spider):
        if not os.path.exists('picLib'):
            os.mkdir('./picLib')

    def process_item(self, item, spider):
        imgPath = './picLib/' + item['name']
        with open(imgPath, 'wb') as fp:
            fp.write(item['img_data'])
            print(imgPath + ' downloaded successfully!')
        return item

Items file

import scrapy


class PicproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    img_data = scrapy.Field()

settings file

ROBOTSTXT_OBEY = False
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
CONCURRENT_REQUESTS = 30  # number of concurrent requests
LOG_LEVEL = 'ERROR'  # lower the log level
COOKIES_ENABLED = False  # disable cookies
RETRY_ENABLED = False  # disable retries
DOWNLOAD_TIMEOUT = 5  # download timeout (seconds)

UA pool and IP proxy pool

First, take a look at the middlewares.py file.

It contains two main kinds of middleware:

Spider Middleware 

Its main purpose is to do extra processing while the spider is running; it is generally not needed. Its hooks are listed below (a minimal skeleton of these hooks follows the list):

       - process_spider_input: receives a response object and processes it;

         it sits at Downloader --> process_spider_input --> Spiders (Downloader and Spiders are components in Scrapy's official architecture diagram)

       - process_spider_exception: called when the spider raises an exception

       - process_spider_output: called when the Spider returns results after processing a response

       - process_start_requests: called when the spider sends out its start requests;

    it sits at Spiders --> process_start_requests --> Scrapy Engine (the Scrapy Engine is a component in Scrapy's official architecture diagram)
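
To make the hook signatures concrete, here is a minimal pass-through spider middleware sketch (the class name is made up for illustration; each method simply forwards data unchanged):

class NoopSpiderMiddleware(object):
    # Hypothetical do-nothing spider middleware, shown only to illustrate the hook signatures.

    def process_spider_input(self, response, spider):
        # Called for each response going into the spider; return None to continue processing.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the items/requests the spider produced; must return an iterable.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when the spider (or another middleware) raises an exception.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the spider's start requests; must return an iterable of requests.
        for r in start_requests:
            yield r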

Downloader Middleware

Its main purpose is to do extra processing as pages are requested and downloaded.

Usage

To add the UA pool and IP pool, simply add them in the process_request method of the ProxyproDownloaderMiddleware class:

import random


class ProxyproDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    # http proxy pool and https proxy pool
    PROXY_http = [
        '58.45.195.51:9000',
        '111.230.113.238:9999',
    ]
    PROXY_https = [
        '120.83.49.90:9000',
        '106.14.162.110:8080',
    ]

    # UA pool
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]

    # Intercept requests: the request argument is the intercepted request
    def process_request(self, request, spider):
        # Called for each request that goes through the downloader middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called

        # Attach a random proxy, chosen from the pool that matches the URL scheme
        print('Downloader middleware:', request)
        if request.url.split(':')[0] == 'http':  # the target URL starts with http
            request.meta['proxy'] = random.choice(self.PROXY_http)
        else:
            request.meta['proxy'] = random.choice(self.PROXY_https)
        # Apply a random User-Agent
        request.headers['User-Agent'] = random.choice(self.user_agent_list)
        print(request.headers['User-Agent'])
        return None

 Note: be sure to enable the middleware in settings!!!

# Enable the downloader middleware
DOWNLOADER_MIDDLEWARES = {
    'proxyPro.middlewares.ProxyproDownloaderMiddleware': 543,
}
Original post: https://www.cnblogs.com/clbao/p/10269384.html