srcapy爬虫框架

一.什么是Srcapy?

　　Srcapy是为了爬取网站数据,提取结构性数据而编写的应用框架,非常出名,非常强悍.他就是一个已经被集成各种功能包括高性能异步下载,队列,分布式,解析,持久化等的强大通用性项目模板(超级武器霸王).主要学习它的特性,各个功能用法.

二.安装

　　Linux:pip3 install scrapy

　　Windows:

　　　　　　1.pip3 install wheel

　　　　　　2.下载twisted http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted

　　　　　　3.进入下载目录,执行 pip3 install Twisted‑17.1.0‑cp35‑cp35m‑win_amd64.whl

　　　　　　4.pip3 install pywin32

　　　　　　5.pip3 install scrapy

三.基础使用

　　1.创建项目:scrapy startproject +项目名称

　　　　项目结构:

project_name/
   scrapy.cfg：
   project_name/
       __init__.py
       items.py
       pipelines.py
       settings.py
       spiders/
           __init__.py

scrapy.cfg   项目的主配置信息。（真正爬虫相关的配置信息在settings.py文件中）
items.py    # 设置数据存储模板，用于结构化数据，如：Django的Model
pipelines   # 数据持久化处理
settings.py # 配置文件，如：递归的层数、并发数，延迟下载等
spiders     # 爬虫目录，如：创建文件，编写爬虫解析规则

　　2.创建爬虫应用程序:

cd project_name  #进入项目目录 
scrapy genspider # 应用名称 爬取网页的起始url  （例如：scrapy genspider qiubai www.qiushibaike.com）

　　3.编写爬虫文件:在步骤2执行完毕后,会在项目的spiders中生成一个应用名的爬虫文件

import scrapy

class QiubaiSpider(scrapy.Spider):
    name = 'qiubai' #应用名称
    #允许爬取的域名（如果遇到非该域名的url则爬取不到数据）
    allowed_domains = ['https://www.qiushibaike.com/']
    #起始爬取的url
    start_urls = ['https://www.qiushibaike.com/']

     #访问起始URL并获取结果后的回调函数，该函数的response参数就是向起始的url发送请求后，获取的响应对象.该函数返回值必须为可迭代对象或者NUll 
     def parse(self, response):
            pass

　　4.设置修改settings.py配置文件配置

　　　　添加自己电脑的用户代理信息;

　　　　将遵守robts协议改为False.

　　5.执行爬虫程序:scrapy crawl 应用名称,该种执行形式会显示执行的日志信息

　　　　　　　　　scrapy crawl 应用名称 --nolog:执行时不返回日志信息

　　爱之初体验:

# -*- coding: utf-8 -*-
import scrapy


class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
   # allowed_domains = ['www.qiushibaike.com']
    start_urls = ['http://www.qiushibaike.com/']

    def parse(self, response):
        info_list = []
        li_list = response.xpath('//*[@class="recmd-right"]')
        for li in li_list:
            content = li.xpath('./a/text()')[0].extract()
            author = li.xpath('./div/a/span/text()').extract_first()
            info_dic = {'content':content,
                        'author':author}
            print(info_dic)
            info_list.append(info_dic)
        return info_list

四.scrapy的持久化存储

　　1.基于终端指令的持久化存储

　　　　保证爬虫文件的parse方法中有可迭代对象列表或字典的返回,该返回值可以通过终端指令的形式指定格式的文件中进行持久化操作.但是只可以存储一下格式文件:

　　　　执行输出指定格式进行存储

    scrapy crawl 爬虫名称 -o xxx.json
    scrapy crawl 爬虫名称 -o xxx.xml
    scrapy crawl 爬虫名称 -o xxx.csv

　　2.基于管道的持久化存储

　　　　scrapy框架中已经封装好了高效便捷的持久化操作功能,我们直接使用即可.使用前先了解以下文件:items.py: 数据结构模板文件,定义数据属性

pipelines.py: 管道文件,接受数据(items),进行持久化操作

持久化流程:
    1.数据解析
    2.爬虫文件爬去到数据后,需要将数据封装到items对象中.
    3.使用yield关键字将items对象提交给pipelines管道进行持久化.
    4.在管道文件中的process_item方法中接受爬虫文件提交过来的items对象,然后编写持久化存储的代码将item对象中存储的数据进行持久化存储.
    5.settings.py配置文件中开启管道

　　　　爱之初体验:持久化存储

爬虫文件:qiubai.py

# -*- coding: utf-8 -*-
import scrapy
from papa.items import PapaItem

class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
   # allowed_domains = ['www.qiushibaike.com']
    start_urls = ['http://www.qiushibaike.com/']

    def parse(self, response):
        li_list = response.xpath('//*[@class="recmd-right"]')
        for li in li_list:
            content = li.xpath('./a/text()')[0].extract()
            author = li.xpath('./div/a/span/text()').extract_first()

            item = PapaItem()
            item['content'] = content
            item['author'] = author

            yield item

items文件:items.py

import scrapy


class PapaItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    author = scrapy.Field()
    content = scrapy.Field()

　管道文件:pipelines.py

class PapaPipeline(object):

    #构造方法
    def __init__(self):
        self.f = None   #定义一个文件描述属性

    #下列都是在重写父类的方法:
    #开始爬虫时,执行一次
    def open_spider(self,spider):
        print('开始爬')
        self.f = open('./data.txt','w') #因为该方法会被执行调用很多次,所以文件的开启和关闭操作写在了另外两个各自执行一次的方法中

    def process_item(self, item, spider):
        #将爬虫程序提交的item进行持久化存储
        self.f.write(item['author'] + ':' + item['content'] + '
')
        return item

    #结束爬虫时,关闭一次文件
    def close_spider(self,spider):
        self.f.close()
        print('结束爬')

　　配置文件:settings.py

ITEM_PIPELINES = {
    'papa.pipelines.PapaPipeline': 300, #300表示为优先级,值越小优先级越高
}

　　为爱鼓掌:基于redis的管道存储,在上述案例中,在管道文件里将item对象中的数据存储到磁盘中,如果将上述案例中的item数据写入redis数据库的话,只需要将上述的管道文件修改成如下形式:

pipelines.py文件

from redis import Redis
class PapaPipeline(object):
    conn = None

    def open_spider(self,spider):
        print('开始爬')
        #创建连接对象
        self.conn = Redis(host='127.0.0.1',port=6379)

    def process_item(self, item, spider):
        #将爬虫程序提交的item进行持久化存储
        dict = {
            'author':item['author'],
            'content':item['content']
        }
        self.conn.lpush('data',dict)
        return item

五.scrapy的post请求

　　默认情况下,scrapy采用的是get请求,如果想发起post请求,则需要重写start_requests方法,使其发起post请求.

import scrapy


class BbbSpider(scrapy.Spider):
    name = 'bbb'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['https://baike.baidu.com/sug/']

    def start_requests(self):
        #post请求参数
        data = {
            'kw':'dog',
        }

        for url in self.start_urls:
            yield  scrapy.FormRequest(url=url,formdata=data,callback=self.parse)

    def parse(self, response):
       print(response.text)

六.scrapy的请求传参

　　在平时的情况下,我们只爬取特定的一个页面并不能满足我们的需求,比如在电影网站中不仅要看简介,还要爬取其二级网页中的详情信息,这时我们就需要用到传参.

　　spider文件

# -*- coding: utf-8 -*-
import scrapy
from zhaogongzuo.items import ZhaogongzuoItem

class A51jobSpider(scrapy.Spider):
    name = '51job'
    #allowed_domains = ['www.baidu.com']
    start_urls = ['https://search.51job.com/list/010000,000000,0000,00,9,99,%25E7%2588%25AC%25E8%2599%25AB,2,1.html?lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=4&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=']

    def parse(self, response):
       div_list = response.xpath('//div[@class="el"]')
       for div in div_list:
           item = ZhaogongzuoItem()
           item['job'] = div.xpath('./p/span/a/text()').extract_first()

           item['detail_url'] = div.xpath('./p/span/a/@href').extract_first()

           item['company'] = div.xpath('./span[@class="t2"]/a/text()').extract_first()

           item['salary'] = div.xpath('./span[@class="t4"]/text()').extract_first()

           print(item)
            #meta参数:请求参数.meta字典就会传递给回调函数的response
           yield scrapy.Request(url=item['detail_url'], callback=self.parse_detail, meta={'item': item})

    def parse_detail(self,response):
       #response.meta 返回接受到的meta字典
       item = response.meta['item']
       item['detail'] = response.xpath('/html/body/div[3]/div[2]/div[3]/div[1]/div/text()').extract_first()
       print(123)
       yield item

　　　　items文件

import scrapy


class ZhaogongzuoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    job = scrapy.Field()
    company = scrapy.Field()
    salary = scrapy.Field()
    detail = scrapy.Field()
    detail_url = scrapy.Field()

　　　　pipielines文件

import json

class ZhaogongzuoPipeline(object):
    def __init__(self):
        self.f = open('data.txt','w')
    def process_item(self, item, spider):
        dic = dict(item)
        print(dic)
        json.dump(dic,self.f,ensure_ascii=False)
        return item
    def close_spider(self,spider):
        self.f.close()

　　settings文件更改的还是那些......

七.五大核心组件工作流程

引擎(Scrapy):用来处理整个系统的数据流处理,出发事务(框架核心)

调度器(Scheduler):用来接受引擎发过来的请求,压入队列中,并在引擎再次请求的时候返回,可以想象成一个url的优先队列,由它来决定下一个要抓取的网址是什么,同时去除重复的网址.

下载器(Downloader):是所有组件中负担最大的，它用于高速地下载网络上的资源。Scrapy的下载器代码不会太复杂，但效率高，主要的原因是Scrapy下载器是建立在twisted这个高效的异步模型上的(其实整个框架都在建立在这个模型上的)。

爬虫(Spiders):用户定制自己的爬虫,用于从特定的网页中提取自己需要的信息，即所谓的实体(Item)。用户也可以从中提取出链接,让Scrapy继续抓取下一个页面。

项目管道(Pipeline):负责处理爬虫从网页中抽取的实体,主要的功能是持久化实体,验证实体的有效性,清除不需要的信息.当页面被爬虫解析后,将被发送到项目管道,并经过几个特定的次序处理数据.

八.scrapy的日志等级

　　日志:就是在终端运行项目时,在小黑框中打印输出信息.

　　日志信息的种类:

　　　　　　　　WARNING:警告;

　　　　　　　　INFO:一般的信息;

　　　　　　　　ERROR:一般错误;

　　　　　　　　DEBUG:调式信息

　　设置使用日志信息:

　　　　在settings.py配置文件中加入:LOG_LEVEL= '指定信息种类'

LOG_FILE = 'log.txt'  #表示将日志信息写入log.txt文件中

九.scrapy中selenium的使用

　　在使用scrapy框架爬取某些网站时,会碰见数据懒加载的的情况.结合selenium创建浏览器对象,通过此浏览器发送请求会迎刃而解.

　　原理:当引擎将网站url对应的请求提交给下载器后,下载器进行网页数据的下载,然后将下载到的页面数据封装到response,提交给引擎,引擎将response再转交给spiders.spiders接受到的response对象中存储的页面数据是没有动态加载的数据的.要想获取动态数据,则需要在下载中间件对下载器提交给引擎的response响应对象进行拦截,且对其内部存储的页面数据进行篡改,修改成携带了动态加载出的新闻数据,燃火将被篡改的response对象最终交给spiders进行解析操作.

　　使用流程:

　　　　　　重写爬虫文件的构造方法,在该方法中使用selenium实例化一个浏览器对象(浏览器对象只需实例化一次);

　　　　　　重写爬虫文件的closed(self,spider)方法,在其内部关闭浏览器对象.该方法是在爬虫结束时被调用;

　　　　　　在下载中间件类的process_response方法中接收spider中的浏览器对象;

　　　　　　处理执行相关自动化操作(发起请求,获取页面数据);

　　　　　　实例化一个新的响应对象(from scrapy.http import HtmlResponse),且将页面数据存储到该对象中;

　　　　　　返回新的响应对象;

　　　　　　在配置文件中开启下载中间件.

　　spider文件

import scrapy
from selenium import webdriver


class WangyiSpider(scrapy.Spider):
    name = 'wangyi'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://news.163.com/air/']
    def __init__(self):
        self.drv = webdriver.Chrome(executable_path=r'C:UsersAirAnaconda3爬虫chromedriver.exe')

    def parse(self, response):
        div_list = response.xpath('//div[@class="data_row news_article clearfix "]')
        for div in div_list:
            title = div.xpath('.//div[@class="news_title"]/h3/a/text()').extract_first()
            print(title)

    def closed(self,spider):
        print('关闭浏览器对象!')
        self.drv.quit()

　　　　middlewas文件

from scrapy.http import HtmlResponse
from time import sleep
    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        print('即将返回一个新的响应对象')
        drv = spider.drv
        drv.get(url=request.url)
        sleep(3)
        #包含了动态加载出来的新闻数据
        page_text = drv.page_source
        sleep(3)

        return HtmlResponse(url=spider.drv.current_url,body=page_text,encoding='utf-8',request=request)

十.如何增加爬虫效率

增加并发:默认scrapy开启的并发线程为32个,可以适当进行增加.在settings配置文件中修改CONCURRENT_REQUESTS = 100,这样就设置成100啦.
降低日志级别:在运行scrapy时,会有大量日志信息的输出,为了减少cpu的使用频率,可以设置log输出信息为INFO或者ERROR即可.在配置文件中编写:LOG_LEVEL= 'INFO'.
禁止cookie:如果不是真的需要cookie,则在scrapy爬取数据时可以禁止cookie从而减少cpu的使用率,提升爬取效率.在配置文件中编写:COOKIES_ENABLED = False.
禁止重试:对失败的HTTP进行重试新请求(重试)会减慢爬取速度,因此可以禁止重试.在settings文件中编写:RETRY_ENABLED = False.
减少下载超时:如果对一个非常慢的链接进行爬取,减少下载超时可以能让卡主的链接快速被放弃,从而提升效率.在配置文件中编写:DOWNLOAD_TIMEOUT = 10,超过时间为10s.

十一.分布式爬虫

　　原生的的scrapy框架不可以实现分布式:因为调度器不能被共享;管道无法被共享

　　使用到的scrapy-redis组件作用:提供了可以被共享的调度器和管道.

分布式爬虫实现流程:

1.环境安装:pip install scrapy-redis
2.创建工程
3.创建爬虫文件:RedisCrawlSpider RedisSpider
        -scrapy genspider -t crawl xxx www.xxx.com
4.对爬虫文件中的相关属性进行修改:
    导包:from scrapy_redis.spiders import RedisCrawlSpider
    将当前爬虫文件的父类设置成RedisCrawlSpider
    将起始url列表替换成redis_key = 'xxx' (调度器队列的名称)
5.在配置文件中进行配置
    使用组件中封装好的可以被共享的管道类:
    ITEM_PIPELINES = {
                'scrapy_redis.pipelines.RedisPipeline':400
            }  
    配置调度器(使用组件中封装好的可以被共享的调度器):
    #增加了一个去重容器类的配置,作用使用Redis的set集合来储存请求的指纹数据,从而实现请求去重的持久化.
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFDupeFilter"
    #使用scrapy-redis组件自己的调度器
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    #配置调度器是否要持久化,也就是当爬虫结束了要不要清空redis中请求列队和去重指纹的set.如果是True,就表示需要持久化存储,就不清空数据,否则清空数据
    SCHEDULER_PERSITE= True
    #指定存储数据的redis
    REDIS_HOST = 'redis服务的ip地址'
    REDIS_POST = 6379
    #配置redis数据库的配置文件
       -取消保护模式:protected-mode no
       -bind 绑定:#bind 127.0.0.1
    -启动redis
6.执行分布式程序
　　scrapy runspider xxx.py
7.向调度器队列中写入一个起始url:
　　在redis-cli中执行

十二.增量式爬虫

　　什么是增量式爬虫:通过爬虫程序检测某网站数据更新的情况,以变爬到该网站更新出的新数据.

　　如何进行:(1) 在发送请求之前判断这个url是不是之前爬过

　　　　　　 (2) 在解析内容后判断这部分内容是不是之前爬过.

　　　　　　 (3) 写入存储介质时判断内容是不是已经在介质中存在.

　　　　　　　　　　分析:可以发现,其实增量爬取的核心就是去重,至于去重的操作在哪步起作用,只能说各有利弊.建议(1)(2)思路根据实际情况取一个(也可能都用).(1)思路适合不断有新页面出现的网站,比如说小说的新章节,每天的最新新闻等等;(2)思路则适合页面内容会更新的网站;(3)思路相当于是最后一道防线,这样做可以最大程度上达到去重的目的.

　　去重方法:

　　　　　　将爬取过程中产生的url进行存储,存储在redis中的set中.当下次进行数据爬取时,首先对即将发起的请求对应url在存储的url的set中做判断,如果存在则不进行请求,否则才请求;

　　　　　　对爬取到的网页内容进行唯一标识的制定,然后将该唯一标识存储至redis的set中,当下次爬到网页数据的时候,在进行持久化存储之前,首先可以先判断该数据的唯一标识在redis的set中是否存在,在决定是否进行持久化存储.

爬虫文件:

from redis import Redis
class IncrementproPipeline(object):
    conn = None
    def open_spider(self,spider):
        self.conn = Redis(host='127.0.0.1',port=6379)
    def process_item(self, item, spider):
        dic = {
            'name':item['name'],
            'kind':item['kind']
        }
        print(dic)
        self.conn.lpush('movieData',dic)
        return item
- 需求：爬取糗事百科中的段子和作者数据。

爬虫文件：

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from incrementByDataPro.items import IncrementbydataproItem
from redis import Redis
import hashlib
class QiubaiSpider(CrawlSpider):
    name = 'qiubai'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    rules = (
        Rule(LinkExtractor(allow=r'/text/page/d+/'), callback='parse_item', follow=True),
        Rule(LinkExtractor(allow=r'/text/$'), callback='parse_item', follow=True),
    )
    #创建redis链接对象
    conn = Redis(host='127.0.0.1',port=6379)
    def parse_item(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')

        for div in div_list:
            item = IncrementbydataproItem()
            item['author'] = div.xpath('./div[1]/a[2]/h2/text() | ./div[1]/span[2]/h2/text()').extract_first()
            item['content'] = div.xpath('.//div[@class="content"]/span/text()').extract_first()

            #将解析到的数据值生成一个唯一的标识进行redis存储
            source = item['author']+item['content']
            source_id = hashlib.sha256(source.encode()).hexdigest()
            #将解析内容的唯一表示存储到redis的data_id中
            ex = self.conn.sadd('data_id',source_id)

            if ex == 1:
                print('该条数据没有爬取过，可以爬取......')
                yield item
            else:
                print('该条数据已经爬取过了，不需要再次爬取了!!!')

　　管道文件:

from redis import Redis
class IncrementbydataproPipeline(object):
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        dic = {
            'author': item['author'],
            'content': item['content']
        }
        # print(dic)
        self.conn.lpush('qiubaiData', dic)
        return item