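Basic Usage of the Scrapy Framework

PyCharm + Scrapy

It has been more than half a year since I last used Scrapy, so it is time to pick it back up.

Straight to the crawl target:

Start URL: http://quotes.toscrape.com/

Goal: parse the quote text, author, and tags of every entry on every page, and save them to a JSON file or a MongoDB database.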

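Open a command line and run:

scrapy startproject quotetutorial  # generates a project named quotetutorial in the current directory

Then run cd quotetutorial, followed by:

scrapy genspider quotes quotes.toscrape.com  # creates a spider for the target site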

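At this point the project skeleton has been generated. A brief explanation of its parts:

items: defines the Item classes that hold the scraped data

settings: the project's configuration variables

pipelines: process the Items extracted by the Spider. Typical uses: cleaning HTML data; validating scraped data, for example checking that an Item contains certain fields; detecting and dropping duplicates; saving the results to a file or a database

middlewares: the middleware classes

spiders > quotes: the spider module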
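
To make the "detect and drop duplicates" use case concrete, here is a minimal sketch of such a pipeline. The class name DuplicatesPipeline and the choice of (text, author) as the duplicate key are my own assumptions, not part of the generated project; the text and author fields are the ones defined in items.py later in this post. The fallback DropItem class is only there so the sketch can run without Scrapy installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so the sketch also runs where Scrapy is not installed.
    class DropItem(Exception):
        pass


class DuplicatesPipeline:
    """Drop any item whose (text, author) pair has been seen before."""

    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        key = (item['text'], item['author'])  # assumed duplicate key
        if key in self.seen:
            # Raising DropItem stops the item from reaching later pipelines.
            raise DropItem('Duplicate quote: %r' % (key,))
        self.seen.add(key)
        return item
```

Like any pipeline, it would only take effect after being added to ITEM_PIPELINES in settings.py.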

Next, edit quotes.py:

# -*- coding: utf-8 -*-
import scrapy
from quotetutorial.items import QuotetutorialItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuotetutorialItem()
            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            tags = quote.css('.tags .tag::text').extract()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            yield item

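        next_page = response.css('.pager .next a::attr(href)').extract_first()  # relative URL of the next page
        if next_page:  # the last page has no "next" link
            url = response.urljoin(next_page)  # build the absolute URL
            yield scrapy.Request(url=url, callback=self.parse)  # handle the next page with parse as well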
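
The URL-joining step can be tried on its own with urllib.parse.urljoin from the standard library, which behaves the same way for this case as far as I can tell. This is only a sketch: the helper name next_page_url is made up for illustration, and the hrefs are of the form the site uses for its pager links.

```python
from urllib.parse import urljoin


def next_page_url(current_url, href):
    """Return the absolute URL of the next page, or None when there is no link."""
    if not href:
        # Check the extracted href itself: joining a missing href would
        # still produce a non-empty URL, so testing the result is not enough.
        return None
    return urljoin(current_url, href)


print(next_page_url('http://quotes.toscrape.com/', '/page/2/'))
print(next_page_url('http://quotes.toscrape.com/page/2/', '/page/3/'))
print(next_page_url('http://quotes.toscrape.com/page/10/', None))
```

The first two calls print the absolute URLs of page 2 and page 3; the last one prints None, which is the signal to stop paginating.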

Next is pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.exceptions import DropItem
from pymongo import MongoClient

class TextPipeline(object):  # post-process items: cap the length of the text field
    def __init__(self):
        self.limit = 50

    def process_item(self, item, spider):
        if item['text']:
            if len(item['text']) > self.limit:
                item['text'] = item['text'][0:self.limit].rstrip() + '...'
            return item
        else:
            raise DropItem('Missing Text')  # DropItem is an exception, so it must be raised, not returned

class MongoPipeline(object):  # save items to a MongoDB database

    def __init__(self,mongo_uri,mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls,crawler):
        return cls(
            mongo_uri = crawler.settings.get('MONGO_URI'),
            mongo_db = crawler.settings.get('MONGO_DB')
        )

    def open_spider(self,spider):
        self.client = MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self,item,spider):
        name = item.__class__.__name__
        self.db[name].insert_one(dict(item))  # insert() was removed in PyMongo 4; insert_one() replaces it
        return item

    def close_spider(self,spider):
        self.client.close()
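
The goal at the top also mentions saving to a JSON file, which the two pipelines above do not cover. One possible approach, sketched here under assumptions, is a small pipeline that writes one JSON object per line. The class name JsonLinesPipeline and the default file name quotes.jl are made up for this example.

```python
import json


class JsonLinesPipeline:
    """Write every item to a file as one JSON object per line."""

    def __init__(self, path='quotes.jl'):  # hypothetical default file name
        self.path = path

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # dict(item) works for scrapy.Item objects and plain dicts alike.
        line = json.dumps(dict(item), ensure_ascii=False)
        self.file.write(line + '\n')
        return item  # pass the item on to the next pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()
```

To use it, it would need its own entry in ITEM_PIPELINES. For a quick export without writing any pipeline, Scrapy's built-in feed export also works: scrapy crawl quotes -o quotes.json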

Next is items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class QuotetutorialItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

Then edit settings.py:

SPIDER_MODULES = ['quotetutorial.spiders']
NEWSPIDER_MODULE = 'quotetutorial.spiders'

MONGO_URI = 'localhost'
MONGO_DB = 'quotestutorial'

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'quotetutorial.pipelines.TextPipeline': 300,  # lower numbers mean higher priority and run first
    'quotetutorial.pipelines.MongoPipeline': 400,
}
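
To make the ordering concrete: the enabled pipelines run in ascending order of these numbers. The snippet below only imitates that ordering with a plain sort over the same dictionary; it is an illustration, not how Scrapy is implemented internally.

```python
ITEM_PIPELINES = {
    'quotetutorial.pipelines.MongoPipeline': 400,
    'quotetutorial.pipelines.TextPipeline': 300,
}

# Sort the class paths by their number; the position in the dict is irrelevant.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)

for path in run_order:
    print(ITEM_PIPELINES[path], path)
```

With these values TextPipeline comes first, so the text has already been truncated by the time MongoPipeline stores the item.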

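A few things to note here:

Scrapy has its own data-extraction mechanism, called Selectors, which parses HTML through XPath or CSS expressions. They are used the same way as ordinary selectors.

Replacing the CSS selectors with XPath gives: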

    def parse(self, response):
        quotes = response.xpath(".//*[@class='quote']")
        for quote in quotes:
            item = QuotetutorialItem()
            # text = quote.css('.text::text').extract_first()
            # author = quote.css('.author::text').extract_first()
            # tags = quote.css('.tags .tag::text').extract()
            text = quote.xpath(".//span[@class='text']/text()").extract_first()  # None instead of IndexError when missing
            author = quote.xpath(".//span/small[@class='author']/text()").extract_first()
            tags = quote.xpath(".//div[@class='tags']/a/text()").extract()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            yield item
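
The XPath expressions can be sanity-checked without running a crawl. The sketch below uses xml.etree.ElementTree from the standard library as a stand-in for Scrapy's Selector, so it is not the same engine: ElementTree supports only a subset of XPath and has no text() step, which is why the text is read from the .text attribute instead. The HTML fragment is a simplified, made-up imitation of one quote block, not the site's real markup.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical imitation of a single quote block.
SAMPLE = """
<div class="quote">
  <span class="text">A sample quote.</span>
  <span>by <small class="author">Sample Author</small></span>
  <div class="tags">
    <a class="tag">life</a>
    <a class="tag">books</a>
  </div>
</div>
"""

quote = ET.fromstring(SAMPLE)

# Same paths as in the spider, minus the trailing /text() step.
text = quote.find(".//span[@class='text']").text
author = quote.find(".//span/small[@class='author']").text
tags = [a.text for a in quote.findall(".//div[@class='tags']/a")]

print(text, author, tags)
```

In the spider itself the Selector does the same job directly on the response, and the /text() step returns the strings.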

Life is short, so why not use Python?