scrapy


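XPath: get the text and attribute values of a given tag.
You need to be able to pinpoint a specific item in a list (by id or class),
and to filter by tag or by attribute value.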

response.xpath('//*[@id="resultList"]/div[4]/span[1]/a/@href').extract_first()

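This finds the element whose id is resultList, goes down to its 4th div child, then to that div's 1st span child, then to the a element under it, and returns that a element's href attribute.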
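
To see how each step of that path narrows the selection, here is a minimal sketch that uses only the standard library instead of Scrapy. The markup is invented for illustration (five result rows under a resultList container), and ElementTree supports only a subset of XPath, so the attribute is read with .get() rather than an @href step.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup shaped like the page the XPath above targets.
rows = "".join(
    f'<div><span><a href="/job/{i}">Job {i}</a></span><span>Company {i}</span></div>'
    for i in range(1, 6)
)
root = ET.fromstring(f'<html><body><div id="resultList">{rows}</div></body></html>')

# id lookup -> 4th div child -> 1st span child -> a element.
link = root.find(".//*[@id='resultList']/div[4]/span[1]/a")

# ElementTree has no @href step, so read the attribute from the element.
href = link.get("href") if link is not None else None
print(href)
```

In Scrapy the same lookup is a single expression ending in /@href, and extract_first() returns None instead of raising when a step matches nothing.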




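CSS: select a specific element, or a list of elements, by CSS selector.
Get the text and attribute values of a given tag.
You need to be able to pinpoint a specific item in a list.
Open question: how to write the selector when a tag carries several CSS classes.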
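
On the multiple-class question: in a CSS selector the class names are chained without spaces, for example response.css('div.quote.featured') (hypothetical class names). In XPath, @class="..." compares against the whole attribute string, so each class needs a token-style test. The helper below is a sketch that only builds such an XPath string; its name and arguments are made up for this note.

```python
def xpath_for_classes(tag, *classes):
    """Build an XPath that matches `tag` elements carrying every given class."""
    tests = "".join(
        # Pad with spaces so "quote" does not also match "quotes".
        f"[contains(concat(' ', normalize-space(@class), ' '), ' {name} ')]"
        for name in classes
    )
    return f"//{tag}{tests}"


print(xpath_for_classes("div", "quote", "featured"))
```

The resulting string can be passed to response.xpath(); the chained CSS form is usually shorter and easier to read.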





Scrapy xpath

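Expression  Description
nodename  Selects all nodes with the name nodename.
/  Selects from the root node.
//  Selects nodes anywhere in the document below the current node, regardless of their position.
.  Selects the current node.
..  Selects the parent of the current node.
@  Selects attributes.
Path expression  Result
/bookstore/book[1]  Selects the first book element that is a child of bookstore.
/bookstore/book[last()]  Selects the last book element that is a child of bookstore.
/bookstore/book[last()-1]  Selects the second-to-last book element that is a child of bookstore.
/bookstore/book[position()<3]  Selects the first two book elements that are children of bookstore.
//title[@lang]  Selects all title elements that have an attribute named lang.
//title[@lang='eng']  Selects all title elements whose lang attribute has the value eng.
/bookstore/book[price>35.00]  Selects all book elements of bookstore that have a price element with a value greater than 35.00.
/bookstore/book[price>35.00]/title  Selects the title elements of all book elements of bookstore that have a price element with a value greater than 35.00.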
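
Several of these expressions can be tried without Scrapy. The sketch below uses the standard library's ElementTree on an invented bookstore document. ElementTree implements only part of XPath (positions, last(), attribute tests), so the price comparison is done in Python instead; Scrapy selectors, which are built on lxml, accept the full expressions from the table.

```python
import xml.etree.ElementTree as ET

# Invented sample data; paths are relative to the <bookstore> root element.
doc = ET.fromstring("""
<bookstore>
  <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
  <book><title lang="eng">Learning XML</title><price>39.95</price></book>
  <book><title>Untitled Draft</title><price>45.00</price></book>
</bookstore>
""")

first = doc.find("book[1]/title").text               # /bookstore/book[1]
last = doc.find("book[last()]/title").text           # /bookstore/book[last()]
second_last = doc.find("book[last()-1]/title").text  # /bookstore/book[last()-1]

with_lang = [t.text for t in doc.findall(".//title[@lang]")]
eng = [t.text for t in doc.findall(".//title[@lang='eng']")]

# book[price>35.00]/title is not supported by ElementTree, so filter in Python.
expensive = [
    book.findtext("title")
    for book in doc.findall("book")
    if float(book.findtext("price")) > 35.00
]
```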

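A few simple examples:

/html/head/title: selects the <title> element inside the <head> element of an HTML document. Alternative: response.xpath('//title') also returns the page title (less efficient, not recommended).
/html/head/title/text(): selects the text inside the <title> element mentioned above.
//td: selects all <td> elements.
//div[@class="mine"]: selects all div elements whose class attribute is exactly "mine".

Scrapy locates elements with CSS and XPath selectors. Selectors have four basic methods:
xpath(): returns a list of selectors, each representing a node selected by the XPath expression.
css(): returns a list of selectors, each representing a node selected by the CSS expression.
extract(): returns the selected data as unicode strings.
re(): returns a list of unicode strings extracted by applying a regular expression.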

>>> response.xpath('//title/text()')  
[<Selector (text) xpath=//title/text()>]  
>>> response.css('title::text')  
[<Selector (text) xpath=//title/text()>]  

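Problem: Scrapy did not follow the pagination requests as expected.
Cause: the domain in allowed_domains was wrong and did not match the URLs being crawled.
Fix: change the domain in allowed_domains so that it matches the crawled URLs.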
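
Why a wrong domain silently stops the crawl: requests whose host is not covered by allowed_domains are dropped as offsite. The function below is a simplified standard-library approximation of that check, not Scrapy's actual implementation; the name is_onsite is made up for this note.

```python
from urllib.parse import urlparse


def is_onsite(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return any(host == domain or host.endswith("." + domain)
               for domain in allowed_domains)


next_page = "http://sou.zhaopin.com/jobs/searchresult.ashx?sm=0&p=2"
print(is_onsite(next_page, ["zhaopin.com"]))   # matching domain: request is kept
print(is_onsite(next_page, ["zhilian.com"]))   # hypothetical typo: request is dropped
```

With a mismatched domain only the start_urls page is fetched, which matches the symptom described above.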

The corrected code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy_demo7.items import ScrapyDemo7Item
from scrapy.http import Request


class ZhilianSpider(scrapy.Spider):
    name = 'zhilian'
    allowed_domains = ['zhaopin.com']
    start_urls = ['http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E4%B8%8A%E6%B5%B7&sm=0&p=1']

    def parse(self, response):
        tables = response.xpath('//*[@id="newlist_list_content_table"]/table')
        for table in tables:
            item = ScrapyDemo7Item()
            first = table.xpath('./tbody/tr[1]/td[1]/div/a/@href').extract_first()
            print("first", first)
            tableRecord = table.xpath("./tr[1]")
            jobInfo = tableRecord.xpath("./td[@class='zwmc']/div/a")
            item["job_name"] = jobInfo.xpath("./text()").extract_first()
            item["company_name"] = tableRecord.xpath("./td[@class='gsmc']/a[@target='_blank']/text()").extract_first()
            item["job_provide_salary"] = tableRecord.xpath("./td[@class='zwyx']/text()").extract_first()
            item["job_location"] = tableRecord.xpath("./td[@class='gzdd']/text()").extract_first()
            item["job_release_date"] = tableRecord.xpath("./td[@class='gxsj']/span/text()").extract_first()
            item["job_url"] = jobInfo.xpath("./@href").extract_first()
            yield item
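        # Queue result pages 1-20. Every parsed page yields the same 20 URLs;
        # Scrapy's default duplicate filter drops the ones already requested.
        for i in range(1, 21):
            url = "http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E4%B8%8A%E6%B5%B7&sm=0&p=" + str(i)
            print(url)
            yield Request(url, callback=self.parse)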




C:\Users\user>pip3 install scrapy
Collecting scrapy
  Using cached Scrapy-1.4.0-py2.py3-none-any.whl
Collecting parsel>=1.1 (from scrapy)
  Using cached parsel-1.2.0-py2.py3-none-any.whl
Requirement already satisfied: service-identity in d:\python362\lib\site-packages (from scrapy)
Requirement already satisfied: w3lib>=1.17.0 in d:\python362\lib\site-packages (from scrapy)
Requirement already satisfied: cssselect>=0.9 in d:\python362\lib\site-packages (from scrapy)
Requirement already satisfied: queuelib in d:\python362\lib\site-packages (from scrapy)
Requirement already satisfied: lxml in d:\python362\lib\site-packages (from scrapy)
Requirement already satisfied: PyDispatcher>=2.0.5 in d:\python362\lib\site-packages (from scrapy)
Requirement already satisfied: six>=1.5.2 in d:\python362\lib\site-packages (from scrapy)
Collecting Twisted>=13.1.0 (from scrapy)
  Using cached Twisted-17.9.0.tar.bz2
Requirement already satisfied: pyOpenSSL in d:\python362\lib\site-packages (from scrapy)
Requirement already satisfied: attrs in d:\python362\lib\site-packages (from service-identity->scrapy)
Requirement already satisfied: pyasn1-modules in d:\python362\lib\site-packages (from service-identity->scrapy)
Requirement already satisfied: pyasn1 in d:\python362\lib\site-packages (from service-identity->scrapy)
Requirement already satisfied: zope.interface>=4.0.2 in d:\python362\lib\site-packages (from Twisted>=13.1.0->scrapy)
Requirement already satisfied: constantly>=15.1 in d:\python362\lib\site-packages (from Twisted>=13.1.0->scrapy)
Requirement already satisfied: incremental>=16.10.1 in d:\python362\lib\site-packages (from Twisted>=13.1.0->scrapy)
Requirement already satisfied: Automat>=0.3.0 in d:\python362\lib\site-packages (from Twisted>=13.1.0->scrapy)
Requirement already satisfied: hyperlink>=17.1.1 in d:\python362\lib\site-packages (from Twisted>=13.1.0->scrapy)
Requirement already satisfied: cryptography>=2.1.4 in d:\python362\lib\site-packages (from pyOpenSSL->scrapy)
Requirement already satisfied: setuptools in d:\python362\lib\site-packages (from zope.interface>=4.0.2->Twisted>=13.1.0->scrapy)
Requirement already satisfied: cffi>=1.7; platform_python_implementation != "PyPy" in d:\python362\lib\site-packages (from cryptography>=2.1.4->pyOpenSSL->scrapy)
Requirement already satisfied: asn1crypto>=0.21.0 in d:\python362\lib\site-packages (from cryptography>=2.1.4->pyOpenSSL->scrapy)
Requirement already satisfied: idna>=2.1 in d:\python362\lib\site-packages (from cryptography>=2.1.4->pyOpenSSL->scrapy)
Requirement already satisfied: pycparser in d:\python362\lib\site-packages (from cffi>=1.7; platform_python_implementation != "PyPy"->cryptography>=2.1.4->pyOpenSSL->scrapy)
Installing collected packages: parsel, Twisted, scrapy
  Running setup.py install for Twisted ... done
Successfully installed Twisted-17.9.0 parsel-1.2.0 scrapy-1.4.0

C:\Users\user>scrapy
Scrapy 1.4.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

C:\Users\user>scrapy bench
2017-12-13 15:41:49 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2017-12-13 15:41:49 [scrapy.utils.log] INFO: Overridden settings: {'CLOSESPIDER_TIMEOUT': 10, 'LOGSTATS_INTERVAL': 1, 'LOG_LEVEL': 'INFO'}
2017-12-13 15:41:50 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.closespider.CloseSpider',
 'scrapy.extensions.logstats.LogStats']
Unhandled error in Deferred:
2017-12-13 15:41:50 [twisted] CRITICAL: Unhandled error in Deferred:

2017-12-13 15:41:50 [twisted] CRITICAL:
Traceback (most recent call last):
  File "d:\python362\lib\site-packages\twisted\internet\defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "d:\python362\lib\site-packages\scrapy\crawler.py", line 77, in crawl
    self.engine = self._create_engine()
  File "d:\python362\lib\site-packages\scrapy\crawler.py", line 102, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "d:\python362\lib\site-packages\scrapy\core\engine.py", line 69, in __init__
    self.downloader = downloader_cls(crawler)
  File "d:\python362\lib\site-packages\scrapy\core\downloader\__init__.py", line 88, in __init__
    self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
  File "d:\python362\lib\site-packages\scrapy\middleware.py", line 58, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "d:\python362\lib\site-packages\scrapy\middleware.py", line 34, in from_settings
    mwcls = load_object(clspath)
  File "d:\python362\lib\site-packages\scrapy\utils\misc.py", line 44, in load_object
    mod = import_module(module)
  File "d:\python362\lib\importlib\__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 978, in _gcd_import
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load
  File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 655, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 205, in _call_with_frames_removed
  File "d:\python362\lib\site-packages\scrapy\downloadermiddlewares\retry.py", line 20, in <module>
    from twisted.web.client import ResponseFailed
  File "d:\python362\lib\site-packages\twisted\web\client.py", line 42, in <module>
    from twisted.internet.endpoints import HostnameEndpoint, wrapClientTLS
  File "d:\python362\lib\site-packages\twisted\internet\endpoints.py", line 41, in <module>
    from twisted.internet.stdio import StandardIO, PipeAddress
  File "d:\python362\lib\site-packages\twisted\internet\stdio.py", line 30, in <module>
    from twisted.internet import _win32stdio
  File "d:\python362\lib\site-packages\twisted\internet\_win32stdio.py", line 9, in <module>
    import win32api
ModuleNotFoundError: No module named 'win32api'

C:\Users\user>pip3 install pypiwin32
Collecting pypiwin32
  Downloading pypiwin32-220-cp36-none-win32.whl (8.3MB)
    100% |████████████████████████████████| 8.3MB 34kB/s
Installing collected packages: pypiwin32
Successfully installed pypiwin32-220

C:\Users\user>
C:\Users\user>scrapy bench
2017-12-13 15:49:05 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2017-12-13 15:49:05 [scrapy.utils.log] INFO: Overridden settings: {'CLOSESPIDER_TIMEOUT': 10, 'LOGSTATS_INTERVAL': 1, 'LOG_LEVEL': 'INFO'}
2017-12-13 15:49:06 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.closespider.CloseSpider',
 'scrapy.extensions.logstats.LogStats']
2017-12-13 15:49:06 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-12-13 15:49:06 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-12-13 15:49:06 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-12-13 15:49:06 [scrapy.core.engine] INFO: Spider opened
2017-12-13 15:49:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-12-13 15:49:07 [scrapy.extensions.logstats] INFO: Crawled 85 pages (at 5100 pages/min), scraped 0 items (at 0 items/min)
2017-12-13 15:49:08 [scrapy.extensions.logstats] INFO: Crawled 157 pages (at 4320 pages/min), scraped 0 items (at 0 items/min)
2017-12-13 15:49:09 [scrapy.extensions.logstats] INFO: Crawled 229 pages (at 4320 pages/min), scraped 0 items (at 0 items/min)
2017-12-13 15:49:10 [scrapy.extensions.logstats] INFO: Crawled 293 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
2017-12-13 15:49:11 [scrapy.extensions.logstats] INFO: Crawled 357 pages (at 3840 pages/min), scraped 0 items (at 0 items/min)
2017-12-13 15:49:12 [scrapy.extensions.logstats] INFO: Crawled 413 pages (at 3360 pages/min), scraped 0 items (at 0 items/min)
2017-12-13 15:49:13 [scrapy.extensions.logstats] INFO: Crawled 469 pages (at 3360 pages/min), scraped 0 items (at 0 items/min)
2017-12-13 15:49:14 [scrapy.extensions.logstats] INFO: Crawled 517 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
2017-12-13 15:49:15 [scrapy.extensions.logstats] INFO: Crawled 573 pages (at 3360 pages/min), scraped 0 items (at 0 items/min)
2017-12-13 15:49:16 [scrapy.core.engine] INFO: Closing spider (closespider_timeout)
2017-12-13 15:49:16 [scrapy.extensions.logstats] INFO: Crawled 621 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
2017-12-13 15:49:17 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 284168,
 'downloader/request_count': 629,
 'downloader/request_method_count/GET': 629,
 'downloader/response_bytes': 1976557,
 'downloader/response_count': 629,
 'downloader/response_status_count/200': 629,
 'finish_reason': 'closespider_timeout',
 'finish_time': datetime.datetime(2017, 12, 13, 7, 49, 17, 78107),
 'log_count/INFO': 17,
 'request_depth_max': 21,
 'response_received_count': 629,
 'scheduler/dequeued': 629,
 'scheduler/dequeued/memory': 629,
 'scheduler/enqueued': 12581,
 'scheduler/enqueued/memory': 12581,
 'start_time': datetime.datetime(2017, 12, 13, 7, 49, 6, 563037)}
2017-12-13 15:49:17 [scrapy.core.engine] INFO: Spider closed (closespider_timeout)

C:\Users\user>

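In this tutorial, we'll assume that Scrapy is already installed on your system. If that's not the case, see the installation guide.

We are going to scrape quotes.toscrape.com, a website that lists quotes from famous authors.

This tutorial will walk you through these tasks:

  1. Creating a new Scrapy project
  2. Writing a spider to crawl a site and extract data
  3. Exporting the scraped data using the command line
  4. Changing the spider to recursively follow links
  5. Using spider arguments

Scrapy is written in Python. If you're new to the language, you might want to get an idea of what it is like first, to get the most out of Scrapy.

If you're already familiar with other languages and want to learn Python quickly, we recommend reading Dive Into Python 3. Alternatively, you can follow the Python Tutorial.

If you're new to programming and want to start with Python, the online book Learn Python The Hard Way is very useful. You can also take a look at the list of Python resources for non-programmers.

Creating a project

Before you start scraping, you have to set up a new Scrapy project. Enter a directory where you'd like to store your code and run:

scrapy startproject tutorial

This will create a tutorial directory with the following contents: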

tutorial/
    scrapy.cfg            # project configuration file
    tutorial/             # the project's Python module; your code goes here
        __init__.py
        items.py          # project item definitions
        pipelines.py      # project pipelines
        settings.py       # project settings
        spiders/          # directory where you will later put your spiders
            __init__.py

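Our first Spider

Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass scrapy.Spider and define the initial requests to make, optionally how to follow links in the pages, and how to parse the downloaded page content to extract data.

This is the code for our first Spider. Save it in a file named quotes_spider.py under the tutorial/spiders directory in your project: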

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

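As you can see, our Spider subclasses scrapy.Spider and defines some attributes and methods:

  • name: identifies the Spider. It must be unique within a project, that is, you can't set the same name for different Spiders.
  • start_requests(): must return an iterable of Requests (you can return a list of requests or write a generator function) which the Spider will begin to crawl from. Subsequent requests will be generated successively from these initial requests.
  • parse(): a method that will be called to handle the response downloaded for each of the requests made. The response parameter is an instance of TextResponse that holds the page content and has further helpful methods to handle it.

The parse() method usually parses the response, extracting the scraped data as dicts, and also finds new URLs to follow and creates new requests (Request) from them.

How to run our spider

To put our spider to work, go to the project's top-level directory and run:

scrapy crawl quotes

This command runs the spider named quotes that we've just added, which will send some requests to quotes.toscrape.com. You will get an output similar to this: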

... (omitted for brevity)
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Spider opened
2016-12-16 21:24:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-16 21:24:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-1.html
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-2.html
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Closing spider (finished)
...

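Now, check the files in the current directory. You should notice that two new files have been created: quotes-1.html and quotes-2.html, with the content for the respective URLs, as our parse method instructs.

Note

If you are wondering why we haven't parsed the HTML yet, hold on, we will cover that soon.

What just happened under the hood?

Scrapy schedules the scrapy.Request objects returned by the start_requests method of the Spider. Upon receiving a response for each one, it instantiates a Response object and calls the callback method associated with the request (in this case, the parse method), passing the response as argument.

A shortcut to the start_requests method

Instead of implementing a start_requests() method that generates scrapy.Request objects from URLs, you can just define a start_urls class attribute with a list of URLs. This list will then be used by the default implementation of start_requests() to create the initial requests for your spider: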

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)

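The parse() method will be called to handle each of the requests for those URLs, even though we haven't explicitly told Scrapy to do so. This happens because parse() is Scrapy's default callback method, which is called for requests without an explicitly assigned callback.

Extracting data

The best way to learn how to extract data with Scrapy is trying selectors using the Scrapy shell. Run:

scrapy shell 'http://quotes.toscrape.com/page/1/'

Note

Remember to always enclose URLs in quotes when running the Scrapy shell from the command line, otherwise URLs containing arguments (such as the & character) will not work.

On Windows, use double quotes instead:

scrapy shell "http://quotes.toscrape.com/page/1/"

You will see something like: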

[ ... Scrapy log here ... ]
2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fa91d888c90>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x7fa91d888c10>
[s]   spider     <DefaultSpider 'default' at 0x7fa91c8af990>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>>>

Using the shell, you can try selecting elements using CSS selectors:

>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

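The result of running response.css('title') is a list-like object called SelectorList, which represents a list of Selector objects that wrap around XML/HTML elements and allow you to run further queries to refine the selection or extract the data.

To extract the text from the title above, you can do: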

>>> response.css('title::text').extract()
['Quotes to Scrape']

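There are two things to note here. One is that we've added ::text to the CSS query, which means we want to select only the text inside the <title> element. If we don't specify ::text, we get the full title element, including its tags: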

>>> response.css('title').extract()
['<title>Quotes to Scrape</title>']

The other thing is that the result of calling .extract() is a list, because we're dealing with a SelectorList. When you know you just want the first result, you can do:

>>> response.css('title::text').extract_first()
'Quotes to Scrape'

As an alternative, you could have written:

>>> response.css('title::text')[0].extract()
'Quotes to Scrape'

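However, .extract_first() returns None when no element matches the selection, which avoids an IndexError.

There's a lesson here: for most scraping code, you want it to be resilient to errors caused by things not being found on a page, so that even if some parts fail to be scraped, you can at least get some data.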
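
The difference between indexing and a "first or default" lookup can be shown without Scrapy. The helper below is a plain-Python sketch of the idea; its name is made up and it is not how extract_first() is implemented.

```python
def first_or_default(values, default=None):
    """Return the first value, or `default` when there is none."""
    for value in values:
        return value
    return default


matches = []  # stand-in for a selection that matched nothing

# Indexing an empty result raises, which would abort the whole callback.
try:
    title = matches[0]
except IndexError:
    title = "<IndexError>"

# The tolerant lookup lets the rest of the item still be scraped.
safe_title = first_or_default(matches)
fallback_title = first_or_default(matches, default="unknown")
```

A field that comes back as None can be handled later, for instance in an item pipeline, without losing the other fields of the item.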

Besides the extract() and extract_first() methods, you can also use the re() method to extract using regular expressions:

>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']
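
The three results above can be reproduced with the standard re module. This is a sketch of the observable behaviour only (every match, with capture groups flattened into one list), under the assumption that re() works on the extracted strings; the function name is hypothetical.

```python
import re


def regex_extract(pattern, texts):
    """Apply `pattern` to each string and return all matches as one flat list."""
    results = []
    for text in texts:
        for match in re.finditer(pattern, text):
            groups = match.groups()
            # With capture groups, keep the groups; otherwise keep the whole match.
            results.extend(groups if groups else [match.group(0)])
    return results


titles = ["Quotes to Scrape"]
whole = regex_extract(r"Quotes.*", titles)
word = regex_extract(r"Q\w+", titles)
pair = regex_extract(r"(\w+) to (\w+)", titles)
```

Note the backslash in \w: without it the pattern Qw+ looks for a literal "Qw" and finds nothing in this title.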

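To find the right CSS selectors, you can open the page in your browser and view its source. You can also use your browser's developer tools or extensions such as Firebug (see the sections about Using Firebug for scraping and Using Firefox for scraping).

Selector Gadget is also a nice tool for quickly finding the CSS selector of an element, and it works in many browsers.

XPath: a brief intro

Besides CSS, Scrapy selectors also support using XPath expressions: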

>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').extract_first()
'Quotes to Scrape'

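XPath expressions are very powerful, and are the foundation of Scrapy Selectors. In fact, if you look at the relevant source code, you will find that CSS selectors are converted to XPath.

While perhaps not as popular as CSS selectors, XPath expressions offer more power because, besides navigating the structure, they can also look at the content. Using XPath, you're able to select things like: the link that contains the text "Next Page". This makes XPath very fitting for the task of scraping, and we encourage you to learn XPath even if you already know how to construct CSS selectors; it will make scraping much easier.

We won't cover much of XPath here, but you can read more in using XPath with Scrapy Selectors. To learn more about XPath, we recommend the tutorial to learn XPath through examples, and the tutorial "how to think in XPath".

Extracting quotes and authors

Now that you know a bit about selection and extraction, let's complete our spider by writing the code to extract the quotes from the web page.

Each quote in http://quotes.toscrape.com is represented by HTML elements that look like this: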

<div class="quote">
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>

Let's open up the Scrapy shell and play a bit to find out how to extract the data we want:

$ scrapy shell 'http://quotes.toscrape.com'

We get a list of selectors for the quote elements with:

>>> response.css("div.quote")

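Each of the selectors returned by the query above allows us to run further queries over its sub-elements. Let's assign the first selector to a variable, so that we can run our CSS selectors directly on a particular quote: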

>>> quote = response.css("div.quote")[0]

Now, let's extract title, author and tags from that quote using the quote object we just created:

>>> title = quote.css("span.text::text").extract_first()
>>> title
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> author = quote.css("small.author::text").extract_first()
>>> author
'Albert Einstein'

Given that the tags are a list of strings, we can use the .extract() method to get all of them:

>>> tags = quote.css("div.tags a.tag::text").extract()
>>> tags
['change', 'deep-thoughts', 'thinking', 'world']

Having figured out how to extract each bit, we can now iterate over all the quote elements and put them together into a Python dictionary:

>>> for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").extract_first()
...     author = quote.css("small.author::text").extract_first()
...     tags = quote.css("div.tags a.tag::text").extract()
...     print(dict(text=text, author=author, tags=tags))
{'tags': ['change', 'deep-thoughts', 'thinking', 'world'], 'author': 'Albert Einstein', 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
{'tags': ['abilities', 'choices'], 'author': 'J.K. Rowling', 'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'}
    ... a few more of these, omitted for brevity
>>>

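Extracting data in our spider

Let's get back to our spider. Until now, it doesn't extract any data in particular; it just saves the whole HTML page to a local file. Let's integrate the extraction logic above into our spider.

A Scrapy spider typically generates many dictionaries containing the data extracted from the page. To do that, we use the yield Python keyword in the callback, as you can see below: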

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

If you run this spider, it will output the extracted data with the log:

2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}
2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}

Storing the scraped data

The simplest way to store the scraped data is by using Feed exports, with the following command:

scrapy crawl quotes -o quotes.json

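That will generate a quotes.json file containing all scraped items, serialized in JSON.

For historic reasons, Scrapy appends to a given file instead of overwriting its contents. If you run this command twice without removing the file before the second time, you'll end up with a broken JSON file. You can also use other formats, like JSON Lines: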

scrapy crawl quotes -o quotes.jl

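The JSON Lines format is useful because it's stream-like: you can easily append new records to it. It doesn't have the same problem as JSON when you run the command twice. Also, as each record is a separate line, you can process big files without having to fit everything in memory, and there are tools like JQ to help do that at the command line.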
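
The append problem is easy to reproduce with the standard json module. The sketch below simulates two crawl runs writing to the same file, using in-memory strings and invented records rather than real feed exports.

```python
import json

run1 = [{"author": "Albert Einstein"}, {"author": "J.K. Rowling"}]
run2 = [{"author": "André Gide"}]

# JSON: each run writes one array, so appending leaves two top-level values.
json_file = json.dumps(run1) + json.dumps(run2)
try:
    json.loads(json_file)
    json_is_valid = True
except json.JSONDecodeError:
    json_is_valid = False

# JSON Lines: one object per line, so appending just adds more lines.
jl_file = "".join(json.dumps(item) + "\n" for item in run1)
jl_file += "".join(json.dumps(item) + "\n" for item in run2)

# Records can be decoded one line at a time, without loading the whole file.
records = [json.loads(line) for line in jl_file.splitlines()]
```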

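In small projects (like the one in this tutorial), that should be enough. However, if you want to perform more complex things with the scraped items, you can write an Item Pipeline. A placeholder file for Item Pipelines was set up for you when the project was created, in tutorial/pipelines.py. You don't need to implement any item pipelines, though, if you just want to store the scraped items.

Following links

Let's say that, instead of just scraping the first two pages of http://quotes.toscrape.com, you want quotes from all the pages on the website.

Now that you know how to extract data from pages, let's see how to follow links from them.

The first thing is to extract the link to the page we want to follow. Examining our page, we can see there is a link to the next page with the following markup: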

<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>

We can try extracting it in the shell:

>>> response.css('li.next a').extract_first()
'<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'

This gets the anchor element, but we want its href attribute. For that, Scrapy supports a CSS extension that lets you select the attribute contents, like this:

>>> response.css('li.next a::attr(href)').extract_first()
'/page/2/'

Now let's modify our spider to recursively follow the link to the next page and extract data from it:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

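Now, after extracting the data, the parse() method looks for the link to the next page, builds a full absolute URL using the urljoin() method (since the links can be relative) and yields a new request to the next page, registering parse as its callback.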
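
What "building an absolute URL" means can be tried with the standard library's urljoin, which resolves a link against the URL of the page it was found on. This is a sketch of the resolution rules only; the third link is an invented example of an already absolute URL.

```python
from urllib.parse import urljoin

page_url = "http://quotes.toscrape.com/page/1/"

# Root-relative link, as in the pager markup shown earlier.
root_relative = urljoin(page_url, "/page/2/")

# Path-relative link: resolved against the current "directory".
path_relative = urljoin(page_url, "../2/")

# Absolute links pass through unchanged.
absolute = urljoin(page_url, "http://example.com/other/")
```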

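What you see here is Scrapy's mechanism of following links: when you yield a Request in a callback method, Scrapy will schedule that request to be sent and register a callback method to be executed when that request finishes.

Using this, you can build complex crawlers that follow links according to rules you define, and extract different kinds of data depending on the page being visited.

In our example, it creates a sort of loop, following all the links to the next page until it doesn't find one, which is handy for crawling blogs, forums and other sites with pagination.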
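
The schedule-then-callback loop can be illustrated with a toy model in plain Python. Everything here is hypothetical: the in-memory FAKE_SITE replaces real downloads, a (url, callback) tuple replaces a Request, and the single queue is far simpler than Scrapy's asynchronous engine.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (quotes on the page, next-page URL or None).
FAKE_SITE = {
    "/page/1/": (["quote A", "quote B"], "/page/2/"),
    "/page/2/": (["quote C"], "/page/3/"),
    "/page/3/": (["quote D"], None),
}


def parse(url, page):
    """Callback: yield scraped items, then a follow-up "request" if there is one."""
    quotes, next_page = page
    for text in quotes:
        yield {"text": text, "page": url}
    if next_page is not None:
        yield (next_page, parse)  # URL plus the callback to run on its response


def crawl(start_url, callback):
    """Toy engine: pop a request, "download" it, and route what the callback yields."""
    queue = deque([(start_url, callback)])
    items = []
    while queue:
        url, cb = queue.popleft()
        for result in cb(url, FAKE_SITE[url]):
            if isinstance(result, dict):
                items.append(result)   # scraped data
            else:
                queue.append(result)   # new request to schedule
    return items


items = crawl("/page/1/", parse)
```

The loop ends on its own once a page has no next link, which is the same stopping condition as in the spider above.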

A shortcut for creating Requests

As a shortcut for creating Request objects you can use response.follow:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('span small::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

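Unlike scrapy.Request, response.follow supports relative URLs directly, so there is no need to call urljoin. Note that response.follow just returns a Request instance; you still have to yield this Request.

You can also pass a selector to response.follow instead of a string; this selector should extract the necessary attributes: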

for href in response.css('li.next a::attr(href)'):
    yield response.follow(href, callback=self.parse)

For <a> elements there is a shortcut: response.follow uses their href attribute automatically. So the code can be shortened further:

for a in response.css('li.next a'):
    yield response.follow(a, callback=self.parse)

Note

response.follow(response.css('li.next a')) is not valid, because response.css returns a list-like object with selectors for all results, not a single selector. A for loop like the one above, or response.follow(response.css('li.next a')[0]), works fine.

More examples and patterns

Here is another spider that illustrates callbacks and following links, this time for scraping author information:

import scrapy

class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)'):
            yield response.follow(href, self.parse_author)

        # follow pagination links
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            # default='' avoids an AttributeError when the query matches nothing
            return response.css(query).extract_first(default='').strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

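This spider will start from the main page. It will follow all the links to the author pages, calling the parse_author callback for each of them, and also the pagination links with the parse callback.

Here we're passing callbacks to response.follow as positional arguments to make the code shorter; it also works for scrapy.Request.

The parse_author callback defines a helper function to extract and clean up the data from a CSS query, and yields a Python dict with the author data.

Another interesting thing this spider demonstrates is that, even if there are many quotes from the same author, we don't need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicated requests to URLs already visited, avoiding the problem of hitting servers too much because of a programming mistake. This can be configured with the DUPEFILTER_CLASS setting.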
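
The idea behind duplicate filtering can be sketched with a fingerprint per request and a set of fingerprints already seen. The code below is a simplified standard-library illustration under that assumption, not Scrapy's own fingerprinting; the function names and the normalisation steps (lower-casing the host, sorting query parameters, dropping the fragment) are choices made for this sketch.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def fingerprint(method, url):
    """Hash the method plus a normalised form of the URL."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    canonical = urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path or "/", query, "")
    )
    return hashlib.sha1(f"{method.upper()} {canonical}".encode("utf-8")).hexdigest()


def filter_duplicates(urls):
    """Keep only the first occurrence of each distinct GET request."""
    seen = set()
    unique = []
    for url in urls:
        fp = fingerprint("GET", url)
        if fp not in seen:
            seen.add(fp)
            unique.append(url)
    return unique


author_links = [
    "http://quotes.toscrape.com/author/Albert-Einstein",
    "http://quotes.toscrape.com/author/J-K-Rowling",
    "http://quotes.toscrape.com/author/Albert-Einstein",
]
to_fetch = filter_duplicates(author_links)
```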

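Hopefully by now you have a good understanding of how to use the mechanism of following links and callbacks with Scrapy.

As yet another example spider that leverages the mechanism of following links, check out the CrawlSpider class: a generic spider that implements a small rules engine that you can build your own crawlers on top of.

Also, a common pattern is to build an item with data from more than one page, using a trick to pass additional data to the callbacks.

Using spider arguments

You can provide command line arguments to your spiders by using the -a option when running them: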

scrapy crawl quotes -o quotes-humor.json -a tag=humor

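These arguments are passed to the Spider's __init__ method and become spider attributes by default.

In this example, the value provided for the tag argument will be available via self.tag. You can use this to make your spider fetch only quotes with a specific tag, building the URL based on the argument: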

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

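If you pass the tag=humor argument to this spider, you'll notice that it will only visit URLs from the humor tag, such as http://quotes.toscrape.com/tag/humor. You can learn more about handling spider arguments in the Scrapy documentation.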
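
How a -a name=value pair ends up as self.tag can be sketched with a small stand-in class. SpiderArgsSketch is hypothetical and is not scrapy.Spider; it only mirrors the two steps described above: keyword arguments become attributes, and the start URL is built from the optional tag.

```python
class SpiderArgsSketch:
    """Hypothetical stand-in for a spider, used to show argument handling only."""

    def __init__(self, **kwargs):
        # Each -a name=value pair arrives as a keyword argument
        # and is stored as an attribute of the same name.
        self.__dict__.update(kwargs)

    def start_url(self):
        url = "http://quotes.toscrape.com/"
        tag = getattr(self, "tag", None)  # None when no -a tag=... was given
        if tag is not None:
            url = url + "tag/" + tag
        return url


default_url = SpiderArgsSketch().start_url()
humor_url = SpiderArgsSketch(tag="humor").start_url()
```

Values given with -a arrive as strings, so numeric arguments need an explicit conversion before use.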

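Next steps

This tutorial covered only the basics of Scrapy, and there are many other features not mentioned here. Check the "What else?" section in the Scrapy at a glance chapter for a quick overview of the most important ones.

You can continue from the table of contents to learn more about the command line tool, spiders, selectors and other things the tutorial hasn't covered. The next chapter is the example project.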

http://www.cnblogs.com/-E6-/p/7213872.html

Original English documentation: https://docs.scrapy.org/en/latest/topics/commands.html
Source code on GitHub: https://github.com/scrapy/scrapy/tree/1.4

xpath,selector:

Selectors

When you’re scraping web pages, the most common task you need to perform is to extract data from the HTML source. There are several libraries available to achieve this:

  • BeautifulSoup is a very popular web scraping library among Python programmers which constructs a Python object based on the structure of the HTML code and also deals with bad markup reasonably well, but it has one drawback: it’s slow.
  • lxml is an XML parsing library (which also parses HTML) with a pythonic API based on ElementTree. (lxml is not part of the Python standard library.)

Scrapy comes with its own mechanism for extracting data. They’re called selectors because they “select” certain parts of the HTML document specified either by XPath or CSS expressions.

XPath is a language for selecting nodes in XML documents, which can also be used with HTML. CSS is a language for applying styles to HTML documents. It defines selectors to associate those styles with specific HTML elements.

Scrapy selectors are built over the lxml library, which means they’re very similar in speed and parsing accuracy.

This page explains how selectors work and describes their API which is very small and simple, unlike the lxml API which is much bigger because the lxml library can be used for many other tasks, besides selecting markup documents.

For a complete reference of the selectors API see Selector reference

Using selectors

Constructing selectors

Scrapy selectors are instances of the Selector class, constructed by passing text or a TextResponse object. The selector automatically chooses the best parsing rules (XML vs HTML) based on the input type:

>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse

Constructing from text:

>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').extract()
[u'good']

Constructing from response:

>>> response = HtmlResponse(url='http://example.com', body=body)
>>> Selector(response=response).xpath('//span/text()').extract()
[u'good']

For convenience, response objects expose a selector on .selector attribute, it’s totally OK to use this shortcut when possible:

>>> response.selector.xpath('//span/text()').extract()
[u'good']

Using selectors

To explain how to use the selectors we’ll use the Scrapy shell (which provides interactive testing) and an example page located in the Scrapy documentation server:

Here’s its HTML code:

<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>

First, let’s open the shell:

scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html

Then, after the shell loads, you’ll have the response available as response shell variable, and its attached selector in response.selector attribute.

Since we’re dealing with HTML, the selector will automatically use an HTML parser.

So, by looking at the HTML code of that page, let’s construct an XPath for selecting the text inside the title tag:

>>> response.selector.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]

Querying responses using XPath and CSS is so common that responses include two convenience shortcuts: response.xpath() and response.css():

>>> response.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]
>>> response.css('title::text')
[<Selector (text) xpath=//title/text()>]

As you can see, .xpath() and .css() methods return a SelectorList instance, which is a list of new selectors. This API can be used for quickly selecting nested data:

>>> response.css('img').xpath('@src').extract()
[u'image1_thumb.jpg',
 u'image2_thumb.jpg',
 u'image3_thumb.jpg',
 u'image4_thumb.jpg',
 u'image5_thumb.jpg']

To actually extract the textual data, you must call the selector .extract() method, as follows:

>>> response.xpath('//title/text()').extract()
[u'Example website']

If you want to extract only the first matched element, you can call the selector's .extract_first() method:

>>> response.xpath('//div[@id="images"]/a/text()').extract_first()
u'Name: My image 1 '

It returns None if no element was found:

>>> response.xpath('//div[@id="not-exists"]/text()').extract_first() is None
True

A default return value can be provided as an argument, to be used instead of None:

>>> response.xpath('//div[@id="not-exists"]/text()').extract_first(default='not-found')
'not-found'

Notice that CSS selectors can select text or attribute nodes using CSS3 pseudo-elements:

>>> response.css('title::text').extract()
[u'Example website']

Now we’re going to get the base URL and some image links:

>>> response.xpath('//base/@href').extract()
[u'http://example.com/']

>>> response.css('base::attr(href)').extract()
[u'http://example.com/']

>>> response.xpath('//a[contains(@href, "image")]/@href').extract()
[u'image1.html',
 u'image2.html',
 u'image3.html',
 u'image4.html',
 u'image5.html']

>>> response.css('a[href*=image]::attr(href)').extract()
[u'image1.html',
 u'image2.html',
 u'image3.html',
 u'image4.html',
 u'image5.html']

>>> response.xpath('//a[contains(@href, "image")]/img/@src').extract()
[u'image1_thumb.jpg',
 u'image2_thumb.jpg',
 u'image3_thumb.jpg',
 u'image4_thumb.jpg',
 u'image5_thumb.jpg']

>>> response.css('a[href*=image] img::attr(src)').extract()
[u'image1_thumb.jpg',
 u'image2_thumb.jpg',
 u'image3_thumb.jpg',
 u'image4_thumb.jpg',
 u'image5_thumb.jpg']

Nesting selectors

The selection methods (.xpath() or .css()) return a list of selectors of the same type, so you can call the selection methods for those selectors too. Here’s an example:

>>> links = response.xpath('//a[contains(@href, "image")]')
>>> links.extract()
[u'<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
 u'<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
 u'<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
 u'<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
 u'<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']

>>> for index, link in enumerate(links):
...     args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
...     print 'Link number %d points to url %s and image %s' % args

Link number 0 points to url [u'image1.html'] and image [u'image1_thumb.jpg']
Link number 1 points to url [u'image2.html'] and image [u'image2_thumb.jpg']
Link number 2 points to url [u'image3.html'] and image [u'image3_thumb.jpg']
Link number 3 points to url [u'image4.html'] and image [u'image4_thumb.jpg']
Link number 4 points to url [u'image5.html'] and image [u'image5_thumb.jpg']
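
The same nesting idea can be tried without Scrapy installed. The block below is only a sketch that uses the standard library's xml.etree.ElementTree as a stand-in (it is not the Scrapy selector API and supports only a small XPath subset). It works here because the sample page happens to be well-formed XML; the names PAGE, links and pairs are ours.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the sample page (well-formed, so ElementTree can parse it).
PAGE = """
<html>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
  </div>
 </body>
</html>
"""

root = ET.fromstring(PAGE)

# First query: select the <a> elements inside the div with id="images".
links = root.findall(".//div[@id='images']/a")

# Nested queries: look up the attribute and the <img> child relative to each link.
pairs = [(link.get("href"), link.find("img").get("src")) for link in links]

for index, (href, src) in enumerate(pairs):
    print("Link number %d points to url %s and image %s" % (index, href, src))
```

As in the Scrapy version, the second lookup runs against each previously selected element rather than against the whole document.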

Using selectors with regular expressions

Selector also has a .re() method for extracting data using regular expressions. However, unlike using .xpath() or .css() methods, .re() returns a list of unicode strings. So you can’t construct nested .re() calls.

Here’s an example used to extract image names from the HTML code above:

>>> response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
[u'My image 1',
 u'My image 2',
 u'My image 3',
 u'My image 4',
 u'My image 5']

There’s an additional helper reciprocating .extract_first() for .re(), named .re_first(). Use it to extract just the first matching string:

>>> response.xpath('//a[contains(@href, "image")]/text()').re_first(r'Name:\s*(.*)')
u'My image 1'
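
The pattern is meant to skip the whitespace after "Name:" with \s*. A quick check with Python's re module, independent of Scrapy, shows why the backslash matters; the sample strings below are assumed copies of the link texts from the page above.

```python
import re

# Text nodes as they appear inside the <a> elements of the sample page.
texts = ["Name: My image 1 ", "Name: My image 2 "]

# \s* consumes the whitespace after the colon before the capture group starts.
good = re.compile(r"Name:\s*(.*)")
# Without the backslash, s* means "zero or more letters s", so the space is captured.
bad = re.compile(r"Name:s*(.*)")

good_names = [good.search(t).group(1) for t in texts]
bad_names = [bad.search(t).group(1) for t in texts]

print(good_names)
print(bad_names)
```

Note that (.*) keeps any trailing whitespace it captures; call .strip() on the results if that is not wanted.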

Working with relative XPaths

Keep in mind that if you are nesting selectors and use an XPath that starts with /, that XPath will be absolute to the document and not relative to the Selector you’re calling it from.

For example, suppose you want to extract all <p> elements inside <div> elements. First, you would get all <div> elements:

>>> divs = response.xpath('//div')

At first, you may be tempted to use the following approach, which is wrong, as it actually extracts all <p> elements from the document, not only those inside <div> elements:

>>> for p in divs.xpath('//p'):  # this is wrong - gets all <p> from the whole document
...     print p.extract()

This is the proper way to do it (note the dot prefixing the .//p XPath):

>>> for p in divs.xpath('.//p'):  # extracts all <p> inside
...     print p.extract()

Another common case would be to extract all direct <p> children:

>>> for p in divs.xpath('p'):
...     print p.extract()

For more details about relative XPaths see the Location Paths section in the XPath specification.
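
The difference between "all descendants" and "direct children" can be sketched with the standard library alone. This is an illustration with xml.etree.ElementTree, not the Scrapy API, and the small document is made up for the purpose. ElementTree does not allow absolute paths on an element, so the document-wide case is shown with root.iter() instead.

```python
import xml.etree.ElementTree as ET

DOC = """
<body>
  <p>outside any div</p>
  <div>
    <p>direct child</p>
    <section><p>nested deeper</p></section>
  </div>
</body>
"""

root = ET.fromstring(DOC)
div = root.find("div")

# Every <p> in the document: what an absolute //p effectively returns.
everywhere = [p.text for p in root.iter("p")]
# Relative .//p: only the <p> elements somewhere below this div.
inside_div = [p.text for p in div.findall(".//p")]
# Relative p: only the direct <p> children of this div.
children = [p.text for p in div.findall("p")]

print(everywhere, inside_div, children)
```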

Variables in XPath expressions

XPath allows you to reference variables in your XPath expressions, using the $somevariable syntax. This is somewhat similar to parameterized queries or prepared statements in the SQL world where you replace some arguments in your queries with placeholders like ?, which are then substituted with values passed with the query.

Here’s an example to match an element based on its “id” attribute value, without hard-coding it (that was shown previously):

>>> # `$val` used in the expression, a `val` argument needs to be passed
>>> response.xpath('//div[@id=$val]/a/text()', val='images').extract_first()
u'Name: My image 1 '

Here’s another example, to find the “id” attribute of a <div> tag containing five <a> children (here we pass the value 5 as an integer):

>>> response.xpath('//div[count(a)=$cnt]/@id', cnt=5).extract_first()
u'images'

All variable references must have a binding value when calling .xpath() (otherwise you’ll get a ValueError: XPath error: exception). This is done by passing as many named arguments as necessary.

parsel, the library powering Scrapy selectors, has more details and examples on XPath variables.

Using EXSLT extensions

Being built atop lxml, Scrapy selectors also support some EXSLT extensions and come with these pre-registered namespaces to use in XPath expressions:

prefix  namespace                             usage
re      http://exslt.org/regular-expressions  regular expressions
set     http://exslt.org/sets                 set manipulation

Regular expressions

The test() function, for example, can prove quite useful when XPath’s starts-with() or contains() are not sufficient.

Example selecting links in list item with a “class” attribute ending with a digit:

>>> from scrapy import Selector
>>> doc = """
... <div>
...     <ul>
...         <li class="item-0"><a href="link1.html">first item</a></li>
...         <li class="item-1"><a href="link2.html">second item</a></li>
...         <li class="item-inactive"><a href="link3.html">third item</a></li>
...         <li class="item-1"><a href="link4.html">fourth item</a></li>
...         <li class="item-0"><a href="link5.html">fifth item</a></li>
...     </ul>
... </div>
... """
>>> sel = Selector(text=doc, type="html")
>>> sel.xpath('//li//@href').extract()
[u'link1.html', u'link2.html', u'link3.html', u'link4.html', u'link5.html']
>>> sel.xpath('//li[re:test(@class, "item-\d$")]//@href').extract()
[u'link1.html', u'link2.html', u'link4.html', u'link5.html']
>>>

Warning

The C library libxslt doesn't natively support EXSLT regular expressions, so lxml's implementation uses hooks to Python's re module. Thus, using regexp functions in your XPath expressions may add a small performance penalty.
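
Since the regular expression ends up in Python's re module, the pattern from the example can be checked there directly. The sketch below assumes that re:test() behaves like a search (a match anywhere in the string); the two lists simply copy the class and href values from the sample document.

```python
import re

# class and href values of the five <li> items, in document order.
classes = ["item-0", "item-1", "item-inactive", "item-1", "item-0"]
hrefs = ["link1.html", "link2.html", "link3.html", "link4.html", "link5.html"]

# "item-" followed by one digit at the very end of the attribute value.
pattern = re.compile(r"item-\d$")

matched = [href for cls, href in zip(classes, hrefs) if pattern.search(cls)]
print(matched)
```

Only "item-inactive" fails the test, so link3.html is the one link left out.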

Set operations

These can be handy for excluding parts of a document tree before extracting text elements for example.

Example extracting microdata (sample content taken from http://schema.org/Product) with groups of itemscopes and corresponding itemprops:

>>> doc = """
... <div itemscope itemtype="http://schema.org/Product">
...   <span itemprop="name">Kenmore White 17" Microwave</span>
...   <img src="kenmore-microwave-17in.jpg" alt='Kenmore 17" Microwave' />
...   <div itemprop="aggregateRating"
...     itemscope itemtype="http://schema.org/AggregateRating">
...    Rated <span itemprop="ratingValue">3.5</span>/5
...    based on <span itemprop="reviewCount">11</span> customer reviews
...   </div>
...
...   <div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
...     <span itemprop="price">$55.00</span>
...     <link itemprop="availability" href="http://schema.org/InStock" />In stock
...   </div>
...
...   Product description:
...   <span itemprop="description">0.7 cubic feet countertop microwave.
...   Has six preset cooking categories and convenience features like
...   Add-A-Minute and Child Lock.</span>
...
...   Customer reviews:
...
...   <div itemprop="review" itemscope itemtype="http://schema.org/Review">
...     <span itemprop="name">Not a happy camper</span> -
...     by <span itemprop="author">Ellie</span>,
...     <meta itemprop="datePublished" content="2011-04-01">April 1, 2011
...     <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
...       <meta itemprop="worstRating" content = "1">
...       <span itemprop="ratingValue">1</span>/
...       <span itemprop="bestRating">5</span>stars
...     </div>
...     <span itemprop="description">The lamp burned out and now I have to replace
...     it. </span>
...   </div>
...
...   <div itemprop="review" itemscope itemtype="http://schema.org/Review">
...     <span itemprop="name">Value purchase</span> -
...     by <span itemprop="author">Lucas</span>,
...     <meta itemprop="datePublished" content="2011-03-25">March 25, 2011
...     <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
...       <meta itemprop="worstRating" content = "1"/>
...       <span itemprop="ratingValue">4</span>/
...       <span itemprop="bestRating">5</span>stars
...     </div>
...     <span itemprop="description">Great microwave for the price. It is small and
...     fits in my apartment.</span>
...   </div>
...   ...
... </div>
... """
>>> sel = Selector(text=doc, type="html")
>>> for scope in sel.xpath('//div[@itemscope]'):
...     print "current scope:", scope.xpath('@itemtype').extract()
...     props = scope.xpath('''
...                 set:difference(./descendant::*/@itemprop,
...                                .//*[@itemscope]/*/@itemprop)''')
...     print "    properties:", props.extract()
...     print

current scope: [u'http://schema.org/Product']
    properties: [u'name', u'aggregateRating', u'offers', u'description', u'review', u'review']

current scope: [u'http://schema.org/AggregateRating']
    properties: [u'ratingValue', u'reviewCount']

current scope: [u'http://schema.org/Offer']
    properties: [u'price', u'availability']

current scope: [u'http://schema.org/Review']
    properties: [u'name', u'author', u'datePublished', u'reviewRating', u'description']

current scope: [u'http://schema.org/Rating']
    properties: [u'worstRating', u'ratingValue', u'bestRating']

current scope: [u'http://schema.org/Review']
    properties: [u'name', u'author', u'datePublished', u'reviewRating', u'description']

current scope: [u'http://schema.org/Rating']
    properties: [u'worstRating', u'ratingValue', u'bestRating']

>>>

Here we first iterate over itemscope elements, and for each one, we look for all itemprops elements and exclude those that are themselves inside another itemscope.
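
The set-difference logic can be mirrored in plain Python to see what is being subtracted. This is a sketch with xml.etree.ElementTree, not the EXSLT set:difference() function itself. The fragment is a trimmed version of one review, and itemscope is written as itemscope="" only so that the standard-library XML parser accepts it.

```python
import xml.etree.ElementTree as ET

REVIEW = """
<div itemscope="" itemtype="http://schema.org/Review">
  <span itemprop="name">Value purchase</span>
  <span itemprop="author">Lucas</span>
  <div itemprop="reviewRating" itemscope="" itemtype="http://schema.org/Rating">
    <span itemprop="ratingValue">4</span>
    <span itemprop="bestRating">5</span>
  </div>
</div>
"""

scope = ET.fromstring(REVIEW)

# Every descendant that carries an itemprop attribute.
all_props = scope.findall(".//*[@itemprop]")
# The itemprop children of any nested itemscope: these belong to the inner scope.
nested_props = scope.findall(".//*[@itemscope]/*[@itemprop]")

# The difference: properties that belong to the current scope only.
own_props = [e.get("itemprop") for e in all_props if e not in nested_props]
print(own_props)
```

The element with itemprop="reviewRating" stays in the result because it is a property of the review, while its own children are excluded.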

Some XPath tips

Here are some tips that you may find useful when using XPath with Scrapy selectors, based on this post from ScrapingHub’s blog. If you are not very familiar with XPath yet, you may want to take a look at this XPath tutorial first.

Using text nodes in a condition

When you need to use the text content as argument to an XPath string function, avoid using .//text() and use just . instead.

This is because the expression .//text() yields a collection of text elements – a node-set. And when a node-set is converted to a string, which happens when it is passed as argument to a string function like contains() or starts-with(), it results in the text for the first element only.

Example:

>>> from scrapy import Selector
>>> sel = Selector(text='<a href="#">Click here to go to the <strong>Next Page</strong></a>')

Converting a node-set to string:

>>> sel.xpath('//a//text()').extract() # take a peek at the node-set
[u'Click here to go to the ', u'Next Page']
>>> sel.xpath("string(//a[1]//text())").extract() # convert it to string
[u'Click here to go to the ']

A node converted to a string, however, puts together its own text plus that of all its descendants:

>>> sel.xpath("//a[1]").extract() # select the first node
[u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']
>>> sel.xpath("string(//a[1])").extract() # convert it to string
[u'Click here to go to the Next Page']

So, using the .//text() node-set won’t select anything in this case:

>>> sel.xpath("//a[contains(.//text(), 'Next Page')]").extract()
[]

But using the . to mean the node, works:

>>> sel.xpath("//a[contains(., 'Next Page')]").extract()
[u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']
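
The two string conversions can be imitated with the standard library to make the difference concrete. In this sketch (ElementTree rather than Scrapy), a.text plays the role of the first text node, and joining itertext() plays the role of the string value of the whole node.

```python
import xml.etree.ElementTree as ET

a = ET.fromstring('<a href="#">Click here to go to the <strong>Next Page</strong></a>')

# Only the first text node, which is what contains(.//text(), ...) ends up testing.
first_text_node = a.text
# The text of the element and all of its descendants, which is what contains(., ...) tests.
full_text = "".join(a.itertext())

found_in_first = "Next Page" in first_text_node
found_in_full = "Next Page" in full_text
print(found_in_first, found_in_full)
```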

Beware of the difference between //node[1] and (//node)[1]

//node[1] selects all the nodes occurring first under their respective parents.

(//node)[1] selects all the nodes in the document, and then gets only the first of them.

Example:

>>> from scrapy import Selector
>>> sel = Selector(text="""
....:     <ul class="list">
....:         <li>1</li>
....:         <li>2</li>
....:         <li>3</li>
....:     </ul>
....:     <ul class="list">
....:         <li>4</li>
....:         <li>5</li>
....:         <li>6</li>
....:     </ul>""")
>>> xp = lambda x: sel.xpath(x).extract()

This gets every <li> element that is the first <li> under its parent, whatever that parent is:

>>> xp("//li[1]")
[u'<li>1</li>', u'<li>4</li>']

And this gets the first <li> element in the whole document:

>>> xp("(//li)[1]")
[u'<li>1</li>']

This gets all first <li> elements under an <ul> parent:

>>> xp("//ul/li[1]")
[u'<li>1</li>', u'<li>4</li>']

And this gets the first <li> element under an <ul> parent in the whole document:

>>> xp("(//ul/li)[1]")
[u'<li>1</li>']
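
The same distinction shows up in the standard library's limited XPath support, which can serve as a quick sanity check. This sketch uses ElementTree, not Scrapy; a wrapping <div> is added because ElementTree needs a single root element, and list slicing stands in for the (//li)[1] form.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""
<div>
  <ul class="list"><li>1</li><li>2</li><li>3</li></ul>
  <ul class="list"><li>4</li><li>5</li><li>6</li></ul>
</div>
""")

# Like //li[1]: the position is evaluated per parent, so each <ul> contributes one item.
first_under_each_parent = [li.text for li in root.findall(".//li[1]")]

# Like (//li)[1]: collect every <li> first, then keep only the first of the whole list.
first_in_document = [li.text for li in root.findall(".//li")[:1]]

print(first_under_each_parent, first_in_document)
```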

When querying by class, consider using CSS

Because an element can contain multiple CSS classes, the XPath way to select elements by class is the rather verbose:

*[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]

If you use @class='someclass' you may end up missing elements that have other classes, and if you just use contains(@class, 'someclass') to make up for that you may end up with more elements than you want, if they have a different class name that shares the string someclass.
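
The three matching strategies can be compared on plain strings. The helper names below are hypothetical and only mirror what each XPath expression tests against the value of the class attribute.

```python
def exact_match(class_attr, name):
    # Mirrors @class='name': the whole attribute must equal the class name.
    return class_attr == name


def substring_match(class_attr, name):
    # Mirrors contains(@class, 'name'): any substring hit counts.
    return name in class_attr


def token_match(class_attr, name):
    # Mirrors contains(concat(' ', normalize-space(@class), ' '), ' name '):
    # collapse whitespace, pad with spaces, then look for the padded name.
    normalized = " ".join(class_attr.split())
    return (" %s " % name) in (" %s " % normalized)


samples = ["shout", "hero shout", "shoutbox"]
results = {
    "exact": [exact_match(c, "shout") for c in samples],
    "substring": [substring_match(c, "shout") for c in samples],
    "token": [token_match(c, "shout") for c in samples],
}
print(results)
```

Exact matching misses "hero shout", substring matching wrongly accepts "shoutbox", and only the token test gets both right.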

As it turns out, Scrapy selectors allow you to chain selectors, so most of the time you can just select by class using CSS and then switch to XPath when needed:

>>> from scrapy import Selector
>>> sel = Selector(text='<div class="hero shout"><time datetime="2014-07-23 19:00">Special date</time></div>')
>>> sel.css('.shout').xpath('./time/@datetime').extract()
[u'2014-07-23 19:00']

This is cleaner than using the verbose XPath trick shown above. Just remember to use the . in the XPath expressions that will follow.

Built-in Selectors reference

Selector objects

class scrapy.selector.Selector(response=None, text=None, type=None)

An instance of Selector is a wrapper over response to select certain parts of its content.

response is an HtmlResponse or an XmlResponse object that will be used for selecting and extracting data.

text is a unicode string or utf-8 encoded text for cases when a response isn’t available. Using text and response together is undefined behavior.

type defines the selector type; it can be "html", "xml" or None (default).

If type is None, the selector automatically chooses the best type based on the response type (see below), or defaults to "html" in case it is used together with text.

If type is None and a response is passed, the selector type is inferred from the response type as follows:

  • "html" for HtmlResponse type
  • "xml" for XmlResponse type
  • "html" for anything else

Otherwise, if type is set, the selector type will be forced and no detection will occur.

xpath(query)

Find nodes matching the xpath query and return the result as a SelectorList instance with all elements flattened. List elements implement Selector interface too.

query is a string containing the XPATH query to apply.

Note

For convenience, this method can be called as response.xpath()

css(query)

Apply the given CSS selector and return a SelectorList instance.

query is a string containing the CSS selector to apply.

In the background, CSS queries are translated into XPath queries using the cssselect library and then run through the .xpath() method.

Note

For convenience this method can be called as response.css()

extract()

Serialize and return the matched nodes as a list of unicode strings. Percent encoded content is unquoted.

re(regex)

Apply the given regex and return a list of unicode strings with the matches.

regex can be either a compiled regular expression or a string which will be compiled to a regular expression using re.compile(regex)

Note

Note that re() and re_first() both decode HTML entities (except &lt; and &amp;).

register_namespace(prefix, uri)

Register the given namespace to be used in this Selector. Without registering namespaces you can’t select or extract data from non-standard namespaces. See examples below.

remove_namespaces()

Remove all namespaces, allowing to traverse the document using namespace-less xpaths. See example below.

__nonzero__()

Returns True if there is any real content selected or False otherwise. In other words, the boolean value of a Selector is given by the contents it selects.

SelectorList objects

class scrapy.selector.SelectorList

The SelectorList class is a subclass of the builtin list class, which provides a few additional methods.

xpath(query)

Call the .xpath() method for each element in this list and return their results flattened as another SelectorList.

query is the same argument as the one in Selector.xpath()

css(query)

Call the .css() method for each element in this list and return their results flattened as another SelectorList.

query is the same argument as the one in Selector.css()

extract()

Call the .extract() method for each element in this list and return their results flattened, as a list of unicode strings.

re()

Call the .re() method for each element in this list and return their results flattened, as a list of unicode strings.

Selector examples on HTML response

Here’s a couple of Selector examples to illustrate several concepts. In all cases, we assume there is already a Selector instantiated with a HtmlResponse object like this:

sel = Selector(html_response)

  1. Select all <h1> elements from an HTML response body, returning a list of Selector objects (ie. a SelectorList object):

    sel.xpath("//h1")
    
  2. Extract the text of all <h1> elements from an HTML response body, returning a list of unicode strings:

    sel.xpath("//h1").extract()         # this includes the h1 tag
    sel.xpath("//h1/text()").extract()  # this excludes the h1 tag
    
  3. Iterate over all <p> tags and print their class attribute:

    for node in sel.xpath("//p"):
        print node.xpath("@class").extract()
    

Selector examples on XML response

Here’s a couple of examples to illustrate several concepts. In both cases we assume there is already a Selector instantiated with an XmlResponse object like this:

sel = Selector(xml_response)

  1. Select all <product> elements from an XML response body, returning a list of Selector objects (ie. a SelectorList object):

    sel.xpath("//product")
    
  2. Extract all prices from a Google Base XML feed which requires registering a namespace:

    sel.register_namespace("g", "http://base.google.com/ns/1.0")
    sel.xpath("//g:price").extract()
    

Removing namespaces

When dealing with scraping projects, it is often quite convenient to get rid of namespaces altogether and just work with element names, to write more simple/convenient XPaths. You can use the Selector.remove_namespaces() method for that.

Let’s show an example that illustrates this with GitHub blog atom feed.

First, we open the shell with the url we want to scrape:

$ scrapy shell https://github.com/blog.atom

Once in the shell we can try selecting all <link> objects and see that it doesn’t work (because the Atom XML namespace is obfuscating those nodes):

>>> response.xpath("//link")
[]

But once we call the Selector.remove_namespaces() method, all nodes can be accessed directly by their names:

>>> response.selector.remove_namespaces()
>>> response.xpath("//link")
[<Selector xpath='//link' data=u'<link xmlns="http://www.w3.org/2005/Atom'>,
 <Selector xpath='//link' data=u'<link xmlns="http://www.w3.org/2005/Atom'>,
 ...

If you wonder why the namespace removal procedure isn’t always called by default instead of having to call it manually, this is because of two reasons, which, in order of relevance, are:

  1. Removing namespaces requires iterating over and modifying all nodes in the document, which is a reasonably expensive operation to perform for all documents crawled by Scrapy
  2. There could be some cases where using namespaces is actually required, in case some element names clash between namespaces. These cases are very rare though.
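
Both approaches, binding a prefix and stripping namespaces, can be sketched with the standard library. This uses ElementTree rather than Scrapy's register_namespace() and remove_namespaces(); the tiny feed and its example.com URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

FEED = """
<feed xmlns="http://www.w3.org/2005/Atom">
  <link href="https://example.com/one"/>
  <entry><link href="https://example.com/two"/></entry>
</feed>
"""

root = ET.fromstring(FEED)

# A bare name matches nothing, because every element lives in the Atom namespace.
without_ns = root.findall(".//link")

# Binding a prefix to the namespace URI makes the elements reachable.
ns = {"atom": "http://www.w3.org/2005/Atom"}
with_ns = [link.get("href") for link in root.findall(".//atom:link", ns)]

# Stripping the "{uri}" part of every tag mimics namespace removal.
# It has to visit each element, which is the cost mentioned in the list above.
for elem in root.iter():
    elem.tag = elem.tag.split("}", 1)[-1]
after_strip = [link.get("href") for link in root.findall(".//link")]

print(len(without_ns), with_ns, after_strip)
```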

https://docs.scrapy.org/en/latest/topics/selectors.html

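Scrapy Learning Series (1): Querying Page Elements with CSS Selectors and XPath Selectors

This article walks through creating a simple spider and, along the way, introduces two ways of selecting page elements (CSS selectors and XPath selectors).

Step 1: Create the spider project

Open a command line and run the following command:

scrapy startproject homelink_selling_index

The generated project structure looks like this: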

│  scrapy.cfg
│
└─lianjia_shub
    │  items.py
    │  pipelines.py
    │  settings.py
    │  __init__.py
    │
    └─spiders
            __init__.py

Step 2: Define the spider (homelink_selling_index)

The page elements to scrape are shown in the figure below:

Import the module:

import scrapy

Define the spider:

class homelink_selling_index_spider(scrapy.Spider):

    # Name of the spider; it is used when launching a crawl:
    #   scrapy crawl <spider.name>
    name = "homelink_selling_index"
    # Unless other URLs are specified, the spider starts crawling from the links in start_urls
    start_urls = ["http://bj.lianjia.com/ershoufang/pg1tt2/"]

    # parse is the default callback scrapy.Spider uses to handle an HTTP response
    # parse processes the links in start_urls one by one
    def parse(self, response):
        # Get the list of houses on the current page
        #house_lis = response.css('.house-lst .info-panel')
        house_lis = response.xpath('//ul[@class="house-lst"]/li/div[@class="info-panel"]')
        # Write the results to a file (on the command line, house titles show up garbled because of encoding)
        with open("homelink.log", "wb") as f:
            ## Using CSS selectors
            #average_price = response.css('.secondcon.fl li:nth-child(1)').css('.botline a::text').extract_first()
            #f.write("Average Price: " + str(average_price) + "\n")
            #yesterday_count = response.css('.secondcon.fl li:last-child').css('.botline strong::text').extract_first()
            #f.write("Yesterday Count: " + str(yesterday_count) + "\n")
            #for house_li in house_lis:
            #    link = house_li.css('a::attr("href")').extract_first()             # link to the house
            #    title = house_li.css('a::text').extract_first()                    # title of the house
            #    price = house_li.css('.price .num::text').extract_first()          # price of the house

            # Using XPath selectors
            average_price = response.xpath('//div[@class="secondcon fl"]//li[1]/span[@class="botline"]//a/text()').extract_first()
            f.write("Average Price: " + str(average_price) + "\n")
            yesterday_count = response.xpath('//div[@class="secondcon fl"]//li[last()]//span[@class="botline"]/strong/text()').extract_first()
            f.write("Yesterday Count: " + str(yesterday_count) + "\n")
            for house_li in house_lis:
                # Note the leading "." in the XPath: without it the query starts from the document root instead of the current node
                link = house_li.xpath('.//a/@href').extract_first()
                title = house_li.xpath('.//a/text()').extract_first()
                price = house_li.xpath('.//div[@class="price"]/span[@class="num"]/text()').extract_first()
                f.write("Title: {0}\tPrice:{1}\n\tLink: {2}\n".format(title.encode('utf-8'), price, link))
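
The spider above is written for Python 2, and title.encode('utf-8') raises AttributeError whenever extract_first() returns None. A minimal sketch of a None-tolerant formatting step follows, assuming Python 3; format_house and the "N/A" placeholder are hypothetical choices, not part of the original post.

```python
def or_placeholder(value, placeholder="N/A"):
    """Return the value unchanged, or the placeholder if nothing was extracted."""
    return placeholder if value is None else value


def format_house(title, price, link):
    """Build one record in the same layout the spider writes to homelink.log."""
    return "Title: {0}\tPrice:{1}\n\tLink: {2}\n".format(
        or_placeholder(title), or_placeholder(price), or_placeholder(link)
    )


# A record whose title was not found on the page.
record = format_house(None, "660", "http://bj.lianjia.com/ershoufang/xxx.html")
print(record)
```

In Python 3 the log file would then be opened in text mode, for example open("homelink.log", "w", encoding="utf-8"), so the strings can be written without calling .encode().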

Step 3: Check the results

Average Price: 44341
Yesterday Count: 33216
Title: 万科假日风景全明格局 南北精装三居 满五唯一	Price:660
	Link: http://bj.lianjia.com/ershoufang/xxx.html
Title: 南北通透精装三居 免税带车位 前后对花园 有钥匙	Price:910
	Link: http://bj.lianjia.com/ershoufang/xxx.html
Title: 西直门 时代之光名苑 西南四居 满五唯一 诚心出售	Price:1200
	Link: http://bj.lianjia.com/ershoufang/xxx.html
......

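Conclusion:

With the three steps above, we can do simple scraping of page elements. However, we have not yet made real use of the many convenient and powerful features Scrapy provides, such as ItemLoader and Pipeline. These will be covered in later articles.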

https://www.cnblogs.com/silverbullet11/p/scrapy_series_1.html

Original article: https://www.cnblogs.com/softidea/p/8033286.html