CrawlSpider usage (extracting and following links in a page, e.g. the "next page" link)

Create a spider file based on CrawlSpider

  scrapy genspider -t crawl <spider_name> <url>

Pay attention to the follow parameter

Example 1: follow = False

spider/chouti.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ChoutiSpider(CrawlSpider):
    name = 'chouti'
    allowed_domains = ['dig.chouti.com']
    start_urls = ['https://dig.chouti.com/']
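    # Instantiate a link extractor object
    # Link extractor: extracts the specified links (URLs) from a page
    # allow parameter: takes a regular expression
    # The link extractor picks out the links in the page that match the regular expression
    # All extracted links are handed to the rule (Rule object)
    link = LinkExtractor(allow=r'/all/hot/recent/\d+')
    rules = (
        # Instantiate a rule
        # After receiving the links from the link extractor, the rule requests each link and fetches the corresponding page
        # callback: the parsing method (given by name) that handles each response
        # follow: whether the link extractor is applied again to the pages reached through the extracted links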
        Rule(link, callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        print(response)
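
The allow argument above is a regular expression, so the digit class has to be written as \d. The sketch below uses only the standard re module to show which URLs such a pattern selects. It assumes the pattern is searched for anywhere in each URL, and the URL list is made up for illustration.

```python
import re

# The pattern from the spider, with the digit class written as \d
ALLOW = re.compile(r'/all/hot/recent/\d+')
# The same pattern without the backslash: "d+" means one or more letters "d"
BROKEN = re.compile(r'/all/hot/recent/d+')

# Hypothetical links, similar to what a listing page might contain
urls = [
    'https://dig.chouti.com/all/hot/recent/1',
    'https://dig.chouti.com/all/hot/recent/25',
    'https://dig.chouti.com/r/news/hot/1',
    'https://dig.chouti.com/all/hot/recent/',
]


def filter_links(pattern, candidates):
    """Keep the URLs in which the pattern is found anywhere (re.search)."""
    return [u for u in candidates if pattern.search(u)]


matched = filter_links(ALLOW, urls)
missed = filter_links(BROKEN, urls)
print(matched)  # the two paginated URLs
print(missed)   # empty: no URL contains the letter "d" after /recent/
```

Only the two paginated URLs pass the \d+ pattern. Without the backslash nothing matches, and the spider would not request any page.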

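Output: with follow=False the link extractor is not applied again to the pages reached through the extracted links, so only the links found on the start page are requested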

C:\Users\Administrator\PycharmProjects\newCrawlspiderPro>scrapy crawl chouti --nolog
<200 https://dig.chouti.com/all/hot/recent/1>
<200 https://dig.chouti.com/all/hot/recent/3>
<200 https://dig.chouti.com/all/hot/recent/9>
<200 https://dig.chouti.com/all/hot/recent/6>
<200 https://dig.chouti.com/all/hot/recent/2>
<200 https://dig.chouti.com/all/hot/recent/4>
<200 https://dig.chouti.com/all/hot/recent/10>
<200 https://dig.chouti.com/all/hot/recent/7>
<200 https://dig.chouti.com/all/hot/recent/8>
<200 https://dig.chouti.com/all/hot/recent/5>

Example 2: follow = True

spider/chouti.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ChoutiSpider(CrawlSpider):
    name = 'chouti'
    allowed_domains = ['dig.chouti.com']
    start_urls = ['https://dig.chouti.com/']
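    # Instantiate a link extractor object
    # Link extractor: extracts the specified links (URLs) from a page
    # allow parameter: takes a regular expression
    # The link extractor picks out the links in the page that match the regular expression
    # All extracted links are handed to the rule (Rule object)
    link = LinkExtractor(allow=r'/all/hot/recent/\d+')
    rules = (
        # Instantiate a rule
        # After receiving the links from the link extractor, the rule requests each link and fetches the corresponding page
        # callback: the parsing method (given by name) that handles each response
        # follow: whether the link extractor is applied again to the pages reached through the extracted links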
        Rule(link, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print(response)
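
To see what the follow flag changes, here is a toy simulation in plain Python. It is not Scrapy code. The in-memory site and the crawl function are hypothetical, and the sketch assumes that a URL which has already been seen is not requested a second time.

```python
from collections import deque


def crawl(site, start, follow):
    """Return the pages requested, in order, for a toy site.

    site maps a page to the matching links found on it.
    Links on the start page are always requested. Links on the
    requested pages are only queued when follow is True.
    """
    seen = set()
    queue = deque()
    order = []
    for link in site[start]:
        if link not in seen:
            seen.add(link)
            queue.append(link)
    while queue:
        page = queue.popleft()
        order.append(page)
        if follow:
            # Apply the "link extractor" again to the page just fetched
            for link in site.get(page, []):
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
    return order


# Hypothetical site: the start page shows pages 1-3, each page n links to n + 1
site = {'start': [1, 2, 3], 1: [2], 2: [3], 3: [4], 4: [5], 5: [6], 6: []}

shallow = crawl(site, 'start', follow=False)
deep = crawl(site, 'start', follow=True)
print(shallow)  # [1, 2, 3]
print(deep)     # [1, 2, 3, 4, 5, 6]
```

With follow=False only the pages linked from the start page are fetched. With follow=True each fetched page contributes its own pagination links until no new ones appear.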

Output (the link extractor keeps being applied to each page it reaches):

C:\Users\Administrator\PycharmProjects\newCrawlspiderPro>scrapy crawl chouti --nolog
<200 https://dig.chouti.com/all/hot/recent/1>
<200 https://dig.chouti.com/all/hot/recent/3>
<200 https://dig.chouti.com/all/hot/recent/5>
<200 https://dig.chouti.com/all/hot/recent/2>
<200 https://dig.chouti.com/all/hot/recent/10>
<200 https://dig.chouti.com/all/hot/recent/4>
<200 https://dig.chouti.com/all/hot/recent/6>
<200 https://dig.chouti.com/all/hot/recent/7>
<200 https://dig.chouti.com/all/hot/recent/8>
<200 https://dig.chouti.com/all/hot/recent/9>
<200 https://dig.chouti.com/all/hot/recent/13>
<200 https://dig.chouti.com/all/hot/recent/14>
<200 https://dig.chouti.com/all/hot/recent/11>
<200 https://dig.chouti.com/all/hot/recent/12>
<200 https://dig.chouti.com/all/hot/recent/16>
<200 https://dig.chouti.com/all/hot/recent/17>
<200 https://dig.chouti.com/all/hot/recent/15>
<200 https://dig.chouti.com/all/hot/recent/18>
<200 https://dig.chouti.com/all/hot/recent/20>
<200 https://dig.chouti.com/all/hot/recent/19>
<200 https://dig.chouti.com/all/hot/recent/22>
<200 https://dig.chouti.com/all/hot/recent/21>
<200 https://dig.chouti.com/all/hot/recent/24>
<200 https://dig.chouti.com/all/hot/recent/23>
<200 https://dig.chouti.com/all/hot/recent/26>
<200 https://dig.chouti.com/all/hot/recent/25>
<200 https://dig.chouti.com/all/hot/recent/28>
<200 https://dig.chouti.com/all/hot/recent/27>
<200 https://dig.chouti.com/all/hot/recent/30>
<200 https://dig.chouti.com/all/hot/recent/29>
<200 https://dig.chouti.com/all/hot/recent/31>
<200 https://dig.chouti.com/all/hot/recent/32>
<200 https://dig.chouti.com/all/hot/recent/33>
<200 https://dig.chouti.com/all/hot/recent/34>
<200 https://dig.chouti.com/all/hot/recent/37>
<200 https://dig.chouti.com/all/hot/recent/36>
<200 https://dig.chouti.com/all/hot/recent/38>
<200 https://dig.chouti.com/all/hot/recent/35>
<200 https://dig.chouti.com/all/hot/recent/40>
<200 https://dig.chouti.com/all/hot/recent/41>
<200 https://dig.chouti.com/all/hot/recent/39>
<200 https://dig.chouti.com/all/hot/recent/42>
<200 https://dig.chouti.com/all/hot/recent/45>
<200 https://dig.chouti.com/all/hot/recent/43>
<200 https://dig.chouti.com/all/hot/recent/44>
<200 https://dig.chouti.com/all/hot/recent/46>
<200 https://dig.chouti.com/all/hot/recent/49>
<200 https://dig.chouti.com/all/hot/recent/48>
<200 https://dig.chouti.com/all/hot/recent/47>
<200 https://dig.chouti.com/all/hot/recent/50>
<200 https://dig.chouti.com/all/hot/recent/51>
<200 https://dig.chouti.com/all/hot/recent/52>
<200 https://dig.chouti.com/all/hot/recent/53>
<200 https://dig.chouti.com/all/hot/recent/54>
<200 https://dig.chouti.com/all/hot/recent/55>
<200 https://dig.chouti.com/all/hot/recent/56>
<200 https://dig.chouti.com/all/hot/recent/58>
<200 https://dig.chouti.com/all/hot/recent/57>
<200 https://dig.chouti.com/all/hot/recent/60>
<200 https://dig.chouti.com/all/hot/recent/59>
<200 https://dig.chouti.com/all/hot/recent/61>
<200 https://dig.chouti.com/all/hot/recent/62>
<200 https://dig.chouti.com/all/hot/recent/64>
<200 https://dig.chouti.com/all/hot/recent/63>
<200 https://dig.chouti.com/all/hot/recent/65>
<200 https://dig.chouti.com/all/hot/recent/66>
<200 https://dig.chouti.com/all/hot/recent/68>
<200 https://dig.chouti.com/all/hot/recent/67>
<200 https://dig.chouti.com/all/hot/recent/69>
<200 https://dig.chouti.com/all/hot/recent/70>
<200 https://dig.chouti.com/all/hot/recent/71>
<200 https://dig.chouti.com/all/hot/recent/72>
<200 https://dig.chouti.com/all/hot/recent/73>
<200 https://dig.chouti.com/all/hot/recent/74>
<200 https://dig.chouti.com/all/hot/recent/75>
<200 https://dig.chouti.com/all/hot/recent/76>
<200 https://dig.chouti.com/all/hot/recent/78>
<200 https://dig.chouti.com/all/hot/recent/77>
<200 https://dig.chouti.com/all/hot/recent/79>
<200 https://dig.chouti.com/all/hot/recent/80>
<200 https://dig.chouti.com/all/hot/recent/82>
<200 https://dig.chouti.com/all/hot/recent/81>
<200 https://dig.chouti.com/all/hot/recent/84>
<200 https://dig.chouti.com/all/hot/recent/83>
<200 https://dig.chouti.com/all/hot/recent/85>
<200 https://dig.chouti.com/all/hot/recent/86>
<200 https://dig.chouti.com/all/hot/recent/87>
<200 https://dig.chouti.com/all/hot/recent/88>
<200 https://dig.chouti.com/all/hot/recent/89>
<200 https://dig.chouti.com/all/hot/recent/90>
<200 https://dig.chouti.com/all/hot/recent/91>
<200 https://dig.chouti.com/all/hot/recent/92>
<200 https://dig.chouti.com/all/hot/recent/94>
<200 https://dig.chouti.com/all/hot/recent/93>
<200 https://dig.chouti.com/all/hot/recent/96>
<200 https://dig.chouti.com/all/hot/recent/95>
<200 https://dig.chouti.com/all/hot/recent/98>
<200 https://dig.chouti.com/all/hot/recent/97>
<200 https://dig.chouti.com/all/hot/recent/100>
<200 https://dig.chouti.com/all/hot/recent/99>
<200 https://dig.chouti.com/all/hot/recent/102>
<200 https://dig.chouti.com/all/hot/recent/101>
<200 https://dig.chouti.com/all/hot/recent/103>
<200 https://dig.chouti.com/all/hot/recent/104>
<200 https://dig.chouti.com/all/hot/recent/105>
<200 https://dig.chouti.com/all/hot/recent/106>
<200 https://dig.chouti.com/all/hot/recent/107>
<200 https://dig.chouti.com/all/hot/recent/108>
<200 https://dig.chouti.com/all/hot/recent/109>
<200 https://dig.chouti.com/all/hot/recent/110>
<200 https://dig.chouti.com/all/hot/recent/111>
<200 https://dig.chouti.com/all/hot/recent/112>
<200 https://dig.chouti.com/all/hot/recent/113>
<200 https://dig.chouti.com/all/hot/recent/114>
<200 https://dig.chouti.com/all/hot/recent/115>
<200 https://dig.chouti.com/all/hot/recent/116>
<200 https://dig.chouti.com/all/hot/recent/118>
<200 https://dig.chouti.com/all/hot/recent/117>
<200 https://dig.chouti.com/all/hot/recent/119>
<200 https://dig.chouti.com/all/hot/recent/120>

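Note:

  To process the crawled pages further, extract the data with XPath in the callback and yield the items to an item pipeline, which then handles storage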
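
A minimal sketch of that flow follows, with several assumptions: the XPath expressions, the field names, the JsonLinesPipeline class and the output file name are all hypothetical and need to be adapted to the real page markup and project. The parse_item function is written as it would appear inside the spider class, and the pipeline is a plain class with the hook methods Scrapy calls on item pipelines.

```python
import json


def parse_item(self, response):
    """Callback sketch: yield one dict per entry found on a listing page."""
    # Hypothetical XPath expressions; adjust them to the actual HTML
    for entry in response.xpath('//div[@class="link-item"]'):
        yield {
            'title': entry.xpath('.//a[@class="link-title"]/text()').get(),
            'page': response.url,
        }


class JsonLinesPipeline:
    """Item pipeline sketch: write every item as one JSON line."""

    path = 'chouti_items.jl'  # hypothetical output file

    def open_spider(self, spider):
        # Called once when the spider starts
        self.file = open(self.path, 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Called for every item yielded by the callback
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item  # hand the item on to any later pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.file.close()
```

For the pipeline to run it would have to be listed in the ITEM_PIPELINES setting in settings.py, for example with a priority such as 300. The module path depends on the project layout.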

Original article: https://www.cnblogs.com/cjj-zyj/p/10144860.html