Scrapy-01-Following Links

  • Goal: use Scrapy to scrape the novel 盗墓笔记 (Daomu Biji)
  • Create the project:
    • scrapy startproject books
    • cd books
    • scrapy genspider dmbj www.cread.com  (genspider takes both a spider name and a domain)
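      These commands scaffold the project; the result is roughly the standard Scrapy layout sketched below (abridged, not output copied from the original post):

        books/
        ├── scrapy.cfg          # deploy configuration
        └── books/
            ├── items.py
            ├── middlewares.py
            ├── pipelines.py
            ├── settings.py
            └── spiders/
                └── dmbj.py     # created by genspider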
  • Write the parse method:
    •  # -*- coding: utf-8 -*-
       import scrapy


       class DmbjSpider(scrapy.Spider):
           name = 'dmbj'
           allowed_domains = ['www.cread.com/chapter/811400395/69162457.html']
           start_urls = ['http://www.cread.com/chapter/811400395/69162457.html']

           def parse(self, response):
               # The chapter title is the page's <h1>
               title = response.xpath('//h1/text()').extract_first()
               # The chapter body spans several text nodes; join them all
               content = ''.join(
                   response.xpath('//div[@class="chapter_con"]/text()').extract())
               with open('{}.txt'.format(title), 'w', encoding='utf-8') as f:
                   f.write(content)

      Inspect the page source, extract the title and body with XPath, then write them to a .txt file.
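      The XPath expressions can be verified interactively before they go into the spider; a quick sketch with scrapy shell (the selectors assume the page structure described above):

        scrapy shell 'http://www.cread.com/chapter/811400395/69162457.html'
        >>> response.xpath('//h1/text()').extract_first()                     # chapter title
        >>> response.xpath('//div[@class="chapter_con"]/text()').extract()   # body text nodes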

    • Following links: with single-page scraping working, the next step is to crawl the whole novel
    • First, analyze the page:
      • Single-page scraping is done; to fetch the next chapter we need its URL
      • At the bottom of the page there is a "下一章" (next chapter) button; we just need the value of its href attribute
      • Note that the href holds a relative URL, so we have to assemble the full URL ourselves
      • response.urljoin(your_relative_url) does exactly that, resolving the relative URL against the current page's URL
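      For instance, if the button carried the relative href below (a made-up value for illustration), urljoin would resolve it against response.url:

        # inside parse(); the href value here is hypothetical
        next_href = '/chapter/811400395/69162458.html'
        next_url = response.urljoin(next_href)
        # next_url -> 'http://www.cread.com/chapter/811400395/69162458.html'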
    • Once we have the absolute URL of the next page, issue a scrapy.Request for it; with no explicit callback, the response goes back through parse, so the spider keeps following chapters
    • allowed_domains must also be changed to "www.cread.com": the value generated earlier was a full page URL, and with it every follow-up request would be filtered out as off-site
    •  # -*- coding: utf-8 -*-
       import scrapy


       class DmbjSpider(scrapy.Spider):
           name = 'dmbj'
           allowed_domains = ['www.cread.com']
           start_urls = ['http://www.cread.com/chapter/811400395/69162457.html']

           def parse(self, response):
               title = response.xpath('//h1/text()').extract_first()
               content = ''.join(
                   response.xpath('//div[@class="chapter_con"]/text()').extract())
               with open('{}.txt'.format(title), 'w', encoding='utf-8') as f:
                   f.write(content)
               # Follow the "next chapter" link; its href is relative
               next_url = response.xpath('//a[@id="go_next"]/@href').extract_first()
               if next_url:  # the last chapter has no next link; stop there
                   yield scrapy.Request(response.urljoin(next_url))
    • Finally, run scrapy crawl dmbj to start the crawl
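    • Since the spider now walks chapter after chapter, it may be worth adding a small download delay in books/books/settings.py so the site is not hammered (an illustrative value, not part of the original post):

        # books/books/settings.py
        DOWNLOAD_DELAY = 1  # pause one second between requests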
Original post: https://www.cnblogs.com/ivy-blogs/p/10884286.html