CrawlSpider

 - CrawlSpider inherits from Spider. The Spider class is designed to crawl only the pages in the start_urls list, while CrawlSpider defines rules (Rule) that provide a convenient mechanism for following links: it extracts links from the crawled pages and continues crawling them.
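 - The link harvesting itself is done by LinkExtractor. A minimal sketch of using one on its own (the allow pattern here is just an illustration, and response is assumed to be a Scrapy HtmlResponse):

from scrapy.linkextractors import LinkExtractor

# Illustrative pattern only; any regex matched against hrefs in the page works
le = LinkExtractor(allow=r'/page/\d+')
links = le.extract_links(response)  # returns a list of Link objects
for link in links:
    print(link.url, link.text)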

 - Creating the project differs from before: the spider is generated from the crawl template.

scrapy startproject ct
cd ct
scrapy genspider -t crawl chouti www.xxx.com
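 - The -t crawl flag tells genspider to use the crawl template rather than the basic one. The generated file looks roughly like this (exact contents vary by Scrapy version):

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ChoutiSpider(CrawlSpider):
    name = 'chouti'
    allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        return item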

 - A simple crawl of all the pagination URLs on the Chouti site:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CtSpider(CrawlSpider):
    name = 'ct'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://dig.chouti.com/all/hot/recent/1']

    # Link extractor:
    # allow is the rule (a regex) the extractor uses to pick out links
    link = LinkExtractor(allow=r'/all/hot/recent/\d+')

    rules = (
        # Rule parser: the pages behind the extracted links are parsed by the given callback
        Rule(link, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print(response)
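 - parse_item above only prints the response object. A hedged sketch of doing real extraction in the callback (the XPath is hypothetical; it assumes each Chouti entry sits in a div with class news-content):

def parse_item(self, response):
    # Hypothetical selectors; adjust to the actual page structure
    for div in response.xpath('//div[@class="news-content"]'):
        title = div.xpath('.//a/text()').extract_first()
        if title:
            yield {'title': title.strip()}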

 - Qiushibaike (crawling its picture section with two overlapping rules):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class QiubaiSpider(CrawlSpider):
    name = 'qiubai'

    start_urls = ['https://www.qiushibaike.com/pic/']

    # One extractor for the paginated URLs, another for the section's first page
    link = LinkExtractor(allow=r'/pic/page/\d+\?s=\d+')
    link1 = LinkExtractor(allow=r'/pic/$')
    rules = (
        Rule(link, callback='parse_item', follow=True),
        Rule(link1, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print(response)
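 - Pointing both rules at the same callback is safe: Scrapy's scheduler deduplicates requests by URL fingerprint, so a page matched by both extractors is still only fetched once. A hedged sketch of pulling the image URLs out of each matched page (the XPath is an assumption about the page layout):

def parse_item(self, response):
    # Hypothetical selector; assumes each picture sits inside a div with class thumb
    for src in response.xpath('//div[@class="thumb"]//img/@src').extract():
        yield {'img_url': response.urljoin(src)}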

  
