Scrapy-02: Item Pipelines, Shell, and Selectors

Item pipelines:

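Scrapy provides Item objects for storing scraped data. An Item is used much like a dictionary, but it adds a safeguard that a dictionary lacks: assigning to a field that was never declared raises a KeyError, which catches misspelled and undefined field names.
An item class must inherit from scrapy.Item and declare its fields with Field(). (We are scraping the novel 盗墓笔记, so there are only two fields: the chapter title and the content.)
Define the item by editing items.py: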

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class BooksItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
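
To make the protection mechanism concrete, the sketch below models it in plain Python. `CheckedItem`, `BookRecord`, and `try_assign` are hypothetical stand-ins written for this illustration; they are not Scrapy classes and do not reflect how scrapy.Item is implemented internally. The point is only the observable behavior, which scrapy.Item shares: assigning to a declared field works like a dictionary, while assigning to an undeclared (for example misspelled) field raises KeyError.

```python
class CheckedItem(dict):
    """Hypothetical stand-in: a dict that only accepts declared field names."""

    fields = ()  # subclasses list their allowed keys here

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{type(self).__name__} does not support field: {key}")
        super().__setitem__(key, value)


class BookRecord(CheckedItem):
    # Same two fields as BooksItem.
    fields = ("title", "content")


def try_assign(item, key, value):
    """Return True if the assignment is accepted, False if it is rejected."""
    try:
        item[key] = value
    except KeyError:
        return False
    return True


record = BookRecord()
accepted = try_assign(record, "title", "Chapter 1")  # declared field
rejected = try_assign(record, "titel", "Chapter 1")  # misspelled field
print(accepted, rejected, dict(record))
```

With a plain dict the misspelled key would have been stored silently, and the mistake would only surface later when the data is read back.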

Parsing the response and using the item:

# -*- coding: utf-8 -*-
import scrapy
from ..items import BooksItem


class DmbjSpider(scrapy.Spider):
    name = 'dmbj'
    allowed_domains = ['www.cread.com']
    start_urls = ['http://www.cread.com/chapter/811400395/69162457.html/']

    def parse(self, response):
        item = BooksItem()
        # extract_first() returns the first matching text node, or None if nothing matches.
        item['title'] = response.xpath('//h1/text()').extract_first()
        item['content'] = response.xpath('//div[@class="chapter_con"]/text()').extract_first()
        yield item

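Processing the item in pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import os


class BooksPipeline(object):
    def process_item(self, item, spider):
        # Write each chapter to files/<title>.txt; utf-8 keeps the Chinese text intact.
        with open('files/{}.txt'.format(item['title']), 'w', encoding='utf-8') as f:
            # content is None when the XPath matched nothing, so fall back to ''.
            f.write(item['content'] or '')
        return item

    def open_spider(self, spider):
        # Called when the spider starts: make sure the output directory exists.
        os.makedirs('files', exist_ok=True)

    def close_spider(self, spider):
        # Called when the spider closes.
        pass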

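In the spider, import the item class defined in items.py and instantiate it inside the parse method. The instance is then used like a dictionary: assign values to it directly, and each key must match the name of a field declared on the class.

Then yield the item.
pipelines.py defines three methods:
- process_item:
  - Processes each item yielded by parse, then returns it so that any later pipeline can receive it.
- open_spider:
  - Called automatically when the spider starts.
- close_spider:
  - Called when the spider closes.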
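
The three methods form a lifecycle: open_spider runs once, process_item runs once per item, and close_spider runs once at the end. The sketch below is a hypothetical `JsonLinesPipeline` (not part of the original project) that opens one output file in open_spider, appends each item as a JSON line in process_item, and closes the file in close_spider. The default file name `items.jl` and the `None` passed in place of a spider are assumptions; the last lines call the methods by hand only to imitate the order Scrapy uses during a crawl.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical pipeline: writes every item as one JSON line."""

    def __init__(self, path="items.jl"):  # path is an assumed default
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Runs once when the spider starts: acquire resources here.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Runs for every item; returns the item for any later pipeline.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Runs once when the spider finishes: release resources here.
        self.file.close()


# Imitate the call order of a crawl (the spider argument is unused here).
out_path = os.path.join(tempfile.mkdtemp(), "items.jl")
pipeline = JsonLinesPipeline(out_path)
pipeline.open_spider(None)
returned = pipeline.process_item({"title": "Chapter 1", "content": "text"}, None)
pipeline.close_spider(None)

with open(out_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

Opening the file once in open_spider, rather than on every process_item call, is the usual reason to implement the two lifecycle methods at all.
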
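A pipeline defined in pipelines.py only takes effect once it is enabled in the ITEM_PIPELINES dictionary in settings.py. The number assigned to each pipeline sets its order: lower values run first.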

ITEM_PIPELINES = {
   'books.pipelines.BooksPipeline': 300,
}
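
The integer attached to each entry in ITEM_PIPELINES decides the order in which pipelines run: lower values run first, and values are conventionally chosen in the 0 to 1000 range. The sketch below imitates that ordering in plain Python. `StripTitlePipeline`, `RecordingPipeline`, and `run_pipelines` are hypothetical names for this illustration, and the loop is a simplified model rather than Scrapy's internal code.

```python
class StripTitlePipeline:
    """Hypothetical pipeline: trims whitespace around the title."""

    def process_item(self, item, spider):
        item["title"] = item["title"].strip()
        return item


class RecordingPipeline:
    """Hypothetical pipeline: remembers the titles it receives."""

    def __init__(self):
        self.seen = []

    def process_item(self, item, spider):
        self.seen.append(item["title"])
        return item


def run_pipelines(item, pipelines):
    """Pass the item through the pipelines in ascending order of their number."""
    for pipeline, _order in sorted(pipelines.items(), key=lambda kv: kv[1]):
        item = pipeline.process_item(item, spider=None)
    return item


recorder = RecordingPipeline()
# Same shape as ITEM_PIPELINES, but keyed by instances instead of import paths.
pipelines = {recorder: 800, StripTitlePipeline(): 300}
result = run_pipelines({"title": "  Chapter 1  ", "content": "text"}, pipelines)
print(result["title"], recorder.seen)
```

Because 300 is lower than 800, the recorder only ever sees the title after it has been stripped, even though it appears first in the dictionary.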

shell
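- scrapy shell is an interactive debugging tool provided by Scrapy. If IPython is installed in the current environment, it is used by default; the shell can also be set in the [settings] section of scrapy.cfg: shell = ipython
- Using scrapy shell:
  - In a terminal, run: scrapy shell [url]  (url is the address you want to scrape and is optional; it can also be a local file given as a path, in which case a relative path should start with ./)
- fetch:
  - fetch accepts a URL, builds a new request object from it, and replaces the current response with the newly returned one.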
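
For the scrapy.cfg option mentioned in the shell notes above, the sketch below shows where the `shell` key sits in the file and reads it back with Python's standard configparser module. The project name `books` is taken from the ITEM_PIPELINES path; the `default = books.settings` line is what a generated project normally contains, and the sample text as a whole is an assumption about this project's file.

```python
import configparser

# Assumed contents of scrapy.cfg in the project root.
SCRAPY_CFG = """
[settings]
default = books.settings
shell = ipython
"""

config = configparser.ConfigParser()
config.read_string(SCRAPY_CFG)

# The shell option lives in the [settings] section.
shell_name = config.get("settings", "shell")
print(shell_name)
```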