Scrapy Study Notes 2

This chapter covers two spider topics:

Callbacks and following links
Using arguments

Callbacks and following links

Here is another spider for the site used in the previous post; this one scrapes author information.

# -*- coding: utf-8 -*-
import scrapy

class MyspiderAuthorSpider(scrapy.Spider):
    name = 'myspider_author'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.xpath('//div[@class="quote"]/span/a/@href'):
            yield response.follow(href, self.parse_author)

        # follow the link to the next page
        for href in response.xpath('//li[@class="next"]/a/@href'):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        yield {
            'name': response.xpath('//h3[@class="author-title"]/text()').extract_first(),
            'birthdate': response.xpath('//span[@class="author-born-date"]/text()').extract_first(),
        }

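This spider starts from the main page. It follows every link to an author page with the parse_author callback, and follows the pagination links with the parse callback.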

Here we pass the callback directly to response.follow as an argument, which keeps the code short. The same callback can also be passed to scrapy.Request.
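One practical difference between the two: response.follow accepts a relative href, while scrapy.Request needs an absolute URL, which is why the later spider in these notes calls response.urljoin first. The sketch below uses only the standard library (Python 3) to show that resolution step; the two href values are illustrative assumptions about what the XPath expressions above return.

```python
from urllib.parse import urljoin

# URL of the page being parsed, and two relative hrefs as they might
# appear in its HTML (illustrative values, not captured from a real run).
page_url = 'http://quotes.toscrape.com/page/1/'
author_href = '/author/Albert-Einstein'
next_href = '/page/2/'

# Resolve each relative href against the page URL; a scrapy.Request
# would need these absolute forms.
author_url = urljoin(page_url, author_href)
next_url = urljoin(page_url, next_href)

print(author_url)  # http://quotes.toscrape.com/author/Albert-Einstein
print(next_url)    # http://quotes.toscrape.com/page/2/
```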

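Another interesting thing this spider demonstrates: even when the same author has many quotes, we need not worry about visiting that author's page more than once. By default, Scrapy filters out duplicate requests, which avoids hitting the server too often because of a programming mistake. If you really do want a request to be repeated, pass dont_filter=True: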

yield response.follow(href, self.parse_author, dont_filter=True)
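To make the filtering idea concrete, here is a deliberately simplified sketch in plain Python. It is not Scrapy's implementation (Scrapy compares request fingerprints rather than bare URL strings), and the class name SeenFilter and its method are hypothetical names chosen for illustration.

```python
class SeenFilter(object):
    """Toy duplicate filter: remembers which URLs were already scheduled."""

    def __init__(self):
        self.seen = set()

    def should_schedule(self, url, dont_filter=False):
        # dont_filter=True bypasses the check entirely.
        if dont_filter:
            return True
        # A URL seen before is dropped.
        if url in self.seen:
            return False
        self.seen.add(url)
        return True


# The same author linked from several quotes (illustrative values).
links = [
    '/author/Albert-Einstein',
    '/author/J-K-Rowling',
    '/author/Albert-Einstein',
]

dedup = SeenFilter()
scheduled = [u for u in links if dedup.should_schedule(u)]

forced_filter = SeenFilter()
forced = [u for u in links if forced_filter.should_schedule(u, dont_filter=True)]

print(scheduled)    # the duplicate Einstein link is dropped
print(len(forced))  # all three links are kept
```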

With a spider like this, we have effectively built a map of the site, visited each linked page in turn, and extracted information from it.

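The most basic spider in the previous post only followed the "next page" link over and over, so the chain could break partway through. Note the difference between the two approaches.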

Passing arguments to the Spider class

When running a spider, you can supply command-line arguments to it with the -a option:

dahu@dahu-OptiPlex-3046:~/PycharmProjects/SpiderLearning/quotesbot$ scrapy crawl toscrape-xpath-tag -a tag=humor -o t1.jl

By default, these arguments are passed to the Spider's __init__ method and become attributes of the spider.

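In this example, the value of the tag argument from the command line is available as self.tag. You can build the URL from the command-line argument so that the spider only scrapes quotes with a specific tag: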

# -*- coding: utf-8 -*-
import scrapy

class ToScrapeSpiderXPath(scrapy.Spider):
    name = 'toscrape-xpath-tag'
    start_urls = [
        'http://quotes.toscrape.com/',
    ]

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            yield {
                'text': quote.xpath('./span[@class="text"]/text()').extract_first(),
                'author': quote.xpath('.//small[@class="author"]/text()').extract_first(),
                'tag': quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').extract()
            }

        next_page_url = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page_url is not None:
            yield scrapy.Request(response.urljoin(next_page_url))

Of course, the spider also runs normally if you leave out the -a argument. This approach works by overriding the start_requests() method.
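The argument handling can be sketched without Scrapy at all. The stand-in class below (SketchSpider and first_url are hypothetical names) stores keyword arguments as attributes, which is roughly what the default Spider __init__ does with -a values, and then applies the same getattr fallback as start_requests(). Note that values passed with -a always arrive as strings.

```python
class SketchSpider(object):
    """Stand-in for scrapy.Spider, reduced to argument handling only."""

    base_url = 'http://quotes.toscrape.com/'

    def __init__(self, **kwargs):
        # Each `-a key=value` pair arrives as a keyword argument
        # and is stored as an instance attribute.
        self.__dict__.update(kwargs)

    def first_url(self):
        # Same logic as start_requests(): fall back to the home page
        # when no tag was supplied.
        tag = getattr(self, 'tag', None)
        if tag is not None:
            return self.base_url + 'tag/' + tag
        return self.base_url


with_tag = SketchSpider(tag='humor').first_url()
without_tag = SketchSpider().first_url()

print(with_tag)     # http://quotes.toscrape.com/tag/humor
print(without_tag)  # http://quotes.toscrape.com/
```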

Another example, this time overriding the __init__() method directly:

# -*- coding: utf-8 -*-
import scrapy

class Dahu2Spider(scrapy.Spider):
    name = 'dahu2'
    allowed_domains = ['www.sina.com.cn']
    start_urls = ['http://slide.news.sina.com.cn/s/slide_1_2841_197495.html']

    def __init__(self, myurl=None, *args, **kwargs):
        super(Dahu2Spider, self).__init__(*args, **kwargs)
        # fall back to the class-level default when -a myurl=... is not given
        if myurl is None:
            myurl = Dahu2Spider.start_urls[0]
        # the message means "URL to crawl: ..."
        print("要爬取的网址为:%s" % myurl)
        self.start_urls = [myurl]

    def parse(self, response):
        title = response.xpath('//title/text()').extract_first()
        yield {'title': title}
        # print() with parentheses works on both Python 2 and Python 3
        print(title)

Run it:

dahu@dahu-OptiPlex-3046:~/PycharmProjects/SpiderLearning/quotesbot$ scrapy crawl dahu2 --nolog
要爬取的网址为:http://slide.news.sina.com.cn/s/slide_1_2841_197495.html
沈阳：男子养猪养出新花样 天天逼“二师兄”跳水锻炼_高清图集_新浪网
dahu@dahu-OptiPlex-3046:~/PycharmProjects/SpiderLearning/quotesbot$ scrapy crawl dahu2 -a myurl=http://www.sina.com.cn --nolog
要爬取的网址为:http://www.sina.com.cn
新浪首页

Note that the callback here yields a dict. I tried yielding other types, and only the four kinds named in this error are accepted:

2017-08-16 21:36:39 [scrapy.core.scraper] ERROR: Spider must return Request, BaseItem, dict or None, got 'unicode' in <GET http://www.sina.com.cn>

Printing with print is crude, of course. The item produced by yield looks like this:

{'title': u'\u65b0\u6d6a\u9996\u9875'}

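The escaped characters are an encoding display issue. It can be handled with the json library when writing the content to a file, so I won't go into detail here.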
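As a minimal sketch of that json approach (Python 3, standard library only; the temporary file name is an assumption made for illustration): json.dumps escapes non-ASCII characters by default, and ensure_ascii=False keeps them readable.

```python
import io
import json
import os
import tempfile

# The item from above; the escapes spell the Chinese page title.
item = {'title': u'\u65b0\u6d6a\u9996\u9875'}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(item)

# ensure_ascii=False keeps the original characters.
readable = json.dumps(item, ensure_ascii=False)

# Write one JSON object per line (the .jl convention), encoded as UTF-8.
out_path = os.path.join(tempfile.mkdtemp(), 'titles.jl')
with io.open(out_path, 'w', encoding='utf-8') as f:
    f.write(readable + u'\n')

print(escaped)
print(readable)
```

For files written by Scrapy's own -o feed export, setting FEED_EXPORT_ENCODING = 'utf-8' in settings.py should have a similar effect, though I have not verified it against the Scrapy version used in these notes.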

skill:

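A way to pass data between Requests at different crawl levels in Scrapy: in the example below, parse_item hands item to parse_details through meta, so that parse_details can return the item once, after all of its data has been collected.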

import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = (
        'http://example.com/page1',
        'http://example.com/page2',
    )

    def parse(self, response):
        # collect `item_urls` from the response (placeholder, not filled in here)
        for item_url in item_urls:
            yield scrapy.Request(url=item_url, callback=self.parse_item)

    def parse_item(self, response):
        item = MyItem()  # MyItem is an Item class defined elsewhere in the project
        # populate `item` fields and work out `item_details_url` (placeholder)
        yield scrapy.Request(url=item_details_url, meta={'item': item},
                             callback=self.parse_details)

    def parse_details(self, response):
        # retrieve the partially filled item carried along in meta
        item = response.meta['item']
        # populate more `item` fields
        return item