Crawlers: Distributed Crawlers and Incremental Crawlers

I. Two Forms of Distributed Crawlers Based on scrapy-redis

 
1. Can the scrapy framework implement distributed crawling by itself?
    - No. There are two reasons.
      First: each machine that scrapy is deployed on has its own scheduler, so the URLs in the start_urls list cannot be divided among the machines. (Multiple machines cannot share the same scheduler.)
      Second: the data crawled by the different machines cannot be persisted in a unified way through the same pipeline. (Multiple machines cannot share the same pipeline.)
2. Distributed crawling based on the scrapy-redis component
        - The scrapy-redis component provides a scheduler and a pipeline that can be shared by multiple machines; we can use them directly to implement distributed data crawling.
        - Implementation approaches:
            1. Based on the component's RedisSpider class
            2. Based on the component's RedisCrawlSpider class
3. Distributed implementation flow (the flow is the same for both approaches above):
  (1) Spider file configuration
     1. Import the class: from scrapy_redis.spiders import RedisSpider (or RedisCrawlSpider)
     2. Make the spider class in the spider file inherit from the imported RedisSpider (or RedisCrawlSpider) class
     3. Delete allowed_domains and start_urls
     4. Add an attribute redis_key = 'xxx'; this attribute is the name of the queue in the shared scheduler (a minimal sketch of such a spider file follows this list)
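A minimal sketch of a spider file configured as in steps 1-4 above (the class name, spider name, redis_key value, and parsing logic here are placeholders, not part of the original project):

from scrapy_redis.spiders import RedisSpider

class ExampleSpider(RedisSpider):
    name = 'example'
    # allowed_domains and start_urls are removed; start URLs are pushed into
    # the shared Redis queue named by redis_key instead
    redis_key = 'example'

    def parse(self, response):
        # parse the response and yield items exactly as in a normal scrapy spider
        yield {'url': response.url}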
  (2) Configuration in settings.py
     1. Make sure the requests submitted by the spider file all go into the shared scheduler queue
            Use the scrapy-redis dedup filter. This adds a dedup container class that stores request fingerprints in a Redis set, so request deduplication is persisted:
      DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
            Use the scheduler provided by the scrapy-redis component:
      SCHEDULER = "scrapy_redis.scheduler.Scheduler"
            Decide whether the scheduler state should be persisted, i.e. whether to clear the request queue and the dedup fingerprint set in Redis when the crawl ends. True means persist (do not clear the data); False means clear it:
       SCHEDULER_PERSIST = True
     2. Make sure the items yielded by the spider file are stored through the shared pipeline
        ITEM_PIPELINES = {
         'scrapy_redis.pipelines.RedisPipeline': 400
        }
     3. Configure the Redis instance where the crawled data will ultimately be stored
            REDIS_HOST = 'ip address of the redis server'
            REDIS_PORT = 6379
            REDIS_ENCODING = 'utf-8'
     4. Redis configuration file (redis.conf): turn off protected mode and set bind to 0.0.0.0 so other machines can connect
     5. Start the Redis server and a Redis client
     6. Run the spider file on each machine: scrapy runspider xxx.py, then push the start URL into the redis_key queue with lpush (a sketch of the full run workflow follows)
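Below is a sketch of the run workflow described in steps 4-6, assuming the Redis server is the machine configured as REDIS_HOST and using the redis_key = 'chouti' from the project example that follows; the start URL shown is only illustrative and should be replaced with the real entry page of the target site:

# redis.conf: allow remote connections and turn off protected mode
bind 0.0.0.0
protected-mode no

# start the Redis server with that configuration, then open a client
redis-server ./redis.conf
redis-cli

# on every crawler machine, run the spider file; it will wait for URLs to appear in the queue
scrapy runspider chouti.py

# in the Redis client, push the start URL into the shared queue named by redis_key
lpush chouti https://dig.chouti.com/r/scoff/hot/1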
 
Project example (based on RedisCrawlSpider)
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_redis.spiders import RedisCrawlSpider
from choutiDemo.items import ChoutidemoItem
 
class ChoutiSpider(RedisCrawlSpider):
    name = 'chouti'
    # allowed_domains = ['www.xxx.com']
    # start_urls = ['http://www.xxx.com/']
    redis_key = 'chouti'  # name of the queue in the shared scheduler
    link = LinkExtractor(allow=r'/r/scoff/hot/\d+')
    rules = (
        Rule(link, callback='parse_item', follow=True),
    )
 
    def parse_item(self, response):
        div_list = response.xpath('//*[@id="content-list"]/div')
        for div in div_list:
            item = ChoutidemoItem()
            title = div.xpath('./div[3]/div[1]/a/text() | ./div[2]/div[1]/a/text()').extract_first().strip().replace(
                ' ', '')
            author = div.xpath(
                './div[3]/div[2]/a[4]/b/text()| ./div[2]/div[2]/a[4]/b/text() | ./div[3]/div[3]/a[4]/b/text()').extract_first()
            print(title)
            print(author)
            item['title'] = title
            item['author'] = author
            yield item
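The ChoutidemoItem class imported above is not shown in the original post; a minimal sketch of what choutiDemo/items.py would need to contain, based on the fields assigned in parse_item (only the field names come from the spider, the rest is assumed):

import scrapy

class ChoutidemoItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()

The MoviedemoItem (title, kind) and QiubaidemoItem (title, author) classes used in the incremental examples below follow the same pattern.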

II. Incremental Crawlers

 
1. Implementation ideas for an incremental crawler:
    1. Store each URL in Redis and use the return value of the insert to decide whether the URL has been seen: if it already exists, skip it; if not, crawl it
    2. Store a fingerprint of each crawled record in Redis and use the return value to decide whether the record has been seen: if it already exists, skip it; if not, keep it (a sketch of this check follows)
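Both ideas rely on the same Redis primitive: sadd returns 1 when the value is newly added to the set and 0 when it is already present, so the return value doubles as a "seen before?" check. A minimal sketch of that check (the key name and URL are illustrative):

from redis import Redis

conn = Redis(host='127.0.0.1', port=6379)

# sadd returns 1 the first time a value is added, 0 if it is already in the set
if conn.sadd('urls', 'http://example.com/detail/1'):
    print('new url, crawl it')
else:
    print('url already seen, skip it')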
 
1. Code example (URL-based incremental crawling)
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from movieDemo.items import MoviedemoItem
from redis import Redis
class MovieSpider(CrawlSpider):
    name = 'movie'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.4567tv.tv/frim/index1.html']
 
    rules = (
        Rule(LinkExtractor(allow=r'/frim/index1-\d+\.html'), callback='parse_item', follow=False),
    )
    conn = Redis(host='127.0.0.1', port=6380)  # Redis connection used to record the urls that have been crawled
    def parse_item(self, response):
        li_list = response.xpath('//ul/li[@class="p1 m1"]')
        for li in li_list:
            detail_url = 'http://www.4567tv.tv' + li.xpath('./a/@href').extract_first()
            # sadd returns 1 if the url is new, 0 if it has already been recorded
            ex = self.conn.sadd('urls', detail_url)
            if ex:
                yield scrapy.Request(url=detail_url, callback=self.parse_detail)
            else:
                print('this url has already been crawled')
 
    def parse_detail(self, response):
        item = MoviedemoItem()
        title = response.xpath('//div[3]/h1/a[3]/text()').extract_first()
        kind = response.xpath('//div[3]/div[1]/div[2]/dl/dt[4]/a/text()').extract_first()
        item['title'] = title
        item['kind'] = kind
        print(title + ':' + kind)
        yield item
 
2. Code example (data-fingerprint-based incremental crawling)
 
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from redis import Redis
from qiubaiDemo.items import QiubaidemoItem
import hashlib
 
class QiubaiSpider(CrawlSpider):
    name = 'qiubai'
    # allowed_domains = ['www.xx.com']
    start_urls = ['https://www.qiushibaike.com/text/']
    conn = Redis(host='127.0.0.1', port=6380)  # Redis connection used to store data fingerprints
    rules = (
        Rule(LinkExtractor(allow=r'/text/page/\d+/'), callback='parse_item', follow=False),
        # Rule(LinkExtractor(allow=r'/text/$'), callback='parse_item', follow=False),
    )
 
    def parse_item(self, response):
        print(response)
        div_list = response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            item = QiubaidemoItem()
            title = div.xpath('./a[1]/div/span/text()').extract_first()
            author = div.xpath('./div[1]/a[2]/h2/text() | ./div[1]/span[2]/h2/text()').extract_first()
            print(title)
            print(author)
            item['title'] = title
            item['author'] = author
            source = item['title'] + item['author']
            # hash the record to produce a fixed-length fingerprint
            sha3 = hashlib.sha3_256()
            sha3.update(source.encode('utf-8'))
            s = sha3.hexdigest()
            # sadd returns 1 if the fingerprint is new, 0 if it was already stored
            ex = self.conn.sadd('data', s)
            if ex:
                yield item
            else:
                print('this record has already been crawled, no need to crawl it again!')
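Because the URLs and fingerprints persist in Redis, re-running either spider only yields records that were not seen before. A small sketch for inspecting or resetting that recorded state with redis-py (key names and port match the examples above):

from redis import Redis

conn = Redis(host='127.0.0.1', port=6380)
print(conn.scard('urls'))   # number of detail urls recorded by the movie spider
print(conn.scard('data'))   # number of data fingerprints recorded by the qiubai spider
# conn.delete('urls', 'data')  # uncomment to clear the recorded state and recrawl everything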
Source: https://www.cnblogs.com/hu13/p/9300010.html