Python crawlers: using Scrapy (part 1)

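  This post covers installing and using Scrapy, a crawling framework for Python. Mediocrity is like a stain on a white shirt: once it sets in, it never washes out, and there is no undoing it.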

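Installing and using Scrapy

My machine runs 64-bit Windows 10 with Python 3.6.3. Below are the installation steps and a first Scrapy example.

1. Preparing to install Scrapy

Run the following command: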

pip install scrapy

Because Microsoft Visual C++ 14.0 is not installed on my machine, the command fails with the following error.

building 'twisted.test.raiser' extension
    error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools

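There are two ways to fix this. One is to install Microsoft Visual C++ Build Tools, which is a fairly large download, so I did not take that route. The other is to install a Twisted build that someone has already compiled. Precompiled Python libraries are available at https://www.lfd.uci.edu/~gohlke/pythonlibs, and that is where we look for the Twisted package Scrapy depends on. In the file name, cp36 means Python 3.6 and amd64 means 64-bit.

Once the file is downloaded, run the following command to install Twisted.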

pip install D:\360Download\Twisted-17.9.0-cp36-cp36m-win_amd64.whl

Finally, run pip install scrapy again; this time the installation succeeds.

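A whl file is essentially a compressed archive that contains py files along with already compiled pyd files. This lets you install a package without a build environment, as long as you pick the file that matches your Python environment.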
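
To make the two points above concrete, here is a minimal sketch using only the standard library. It splits a wheel file name into its tags and lists the members of a wheel as an ordinary zip archive. The helper names `parse_wheel_filename` and `list_wheel_contents` and the in-memory stand-in archive are assumptions made for illustration; they are not part of pip or Scrapy.

```python
import io
import zipfile


def parse_wheel_filename(filename):
    """Split a wheel file name into its dash-separated fields."""
    stem = filename[:-len(".whl")]
    parts = stem.split("-")
    # The last three fields are the Python, ABI and platform tags.
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


def list_wheel_contents(wheel_file):
    """Return the member names of a wheel, read as a plain zip archive."""
    with zipfile.ZipFile(wheel_file) as archive:
        return archive.namelist()


info = parse_wheel_filename("Twisted-17.9.0-cp36-cp36m-win_amd64.whl")
print(info["python"], info["abi"], info["platform"])

# Stand-in archive built in memory; the member names are hypothetical.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("demo/__init__.py", "")
    archive.writestr("demo/_speedups.pyd", b"")
print(list_wheel_contents(buffer))
```

Passing the path of a real downloaded whl file to `list_wheel_contents` should work the same way, since `zipfile.ZipFile` accepts either a path or a file-like object.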

2. Running a first Scrapy example

Create a Python file named quotes_spider.py with the following content:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }

        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
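
The last line of the spider passes the raw href to response.follow, which accepts relative URLs and resolves them against the URL of the current page. A rough standard-library sketch of that resolution step is shown here; the href values are assumptions about what the "next" link might contain, not output captured from the site.

```python
from urllib.parse import urljoin

page_url = "http://quotes.toscrape.com/tag/humor/"

# Assumed examples of a "next" link: one root-relative, one page-relative.
root_relative = "/tag/humor/page/2/"
page_relative = "page/2/"

# Both forms resolve to the same absolute URL for this page.
print(urljoin(page_url, root_relative))
print(urljoin(page_url, page_relative))
```

This is only an approximation of what the framework does internally, but it shows why the spider does not need to build the absolute URL by hand.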

In the directory that contains the file, run:

scrapy runspider quotes_spider.py -o quotes.json

This fails with the following error:

    import win32api
ModuleNotFoundError: No module named 'win32api'

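The win32api module is missing and has to be installed. It is available at https://sourceforge.net/projects/pywin32/files/pywin32/Build%20221/. Pick the installer there that matches your Python version and architecture, and install it.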
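
Before rerunning the spider, it can help to check which of the modules mentioned so far are importable. The following is a small sketch using importlib from the standard library; the helper name `missing_modules` is made up for this example.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means all three modules can be imported.
print(missing_modules(["twisted", "win32api", "scrapy"]))
```

pywin32 is also published on PyPI, so pip install pywin32 may work as an alternative to the installer, depending on your Python version.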

After the installation, run scrapy runspider quotes_spider.py -o quotes.json again. This time the quotes.json file is generated, with the following content:

[
{"text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d", "author": "Jane Austen"},
{"text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin"},
{"text": "\u201cAnyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.\u201d", "author": "Garrison Keillor"},
{"text": "\u201cBeauty is in the eye of the beholder and it may be necessary from time to time to give a stupid or misinformed beholder a black eye.\u201d", "author": "Jim Henson"},
{"text": "\u201cAll you need is love. But a little chocolate now and then doesn't hurt.\u201d", "author": "Charles M. Schulz"},
{"text": "\u201cRemember, we're madly in love, so it's all right to kiss me anytime you feel like it.\u201d", "author": "Suzanne Collins"},
{"text": "\u201cSome people never go crazy. What truly horrible lives they must lead.\u201d", "author": "Charles Bukowski"},
{"text": "\u201cThe trouble with having an open mind, of course, is that people will insist on coming along and trying to put things in it.\u201d", "author": "Terry Pratchett"},
{"text": "\u201cThink left and think right and think low and think high. Oh, the thinks you can think up if only you try!\u201d", "author": "Dr. Seuss"},
{"text": "\u201cThe reason I talk to myself is because I\u2019m the only one whose answers I accept.\u201d", "author": "George Carlin"},
{"text": "\u201cI am free of all prejudice. I hate everyone equally. \u201d", "author": "W.C. Fields"},
{"text": "\u201cA lady's imagination is very rapid; it jumps from admiration to love, from love to matrimony in a moment.\u201d", "author": "Jane Austen"}
]
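
By default Scrapy's JSON export writes non-ASCII characters such as curly quotes as \uXXXX escape sequences. The data is intact; any JSON parser turns the escapes back into the real characters. A minimal sketch with the standard json module, using one record copied from the output above:

```python
import json

# One exported record; the raw string keeps the backslashes literal.
line = r'{"text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin"}'

quote = json.loads(line)
print(quote["text"])  # the curly quotes are real characters now

# Re-serialising with ensure_ascii=False keeps the characters readable.
print(json.dumps(quote, ensure_ascii=False))
```

To get readable characters in quotes.json directly, Scrapy's FEED_EXPORT_ENCODING setting can be set to utf-8, for example by adding -s FEED_EXPORT_ENCODING=utf-8 to the runspider command.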

Links

Original article: https://www.cnblogs.com/huhx/p/baseusepythonscrapy1.html