First Steps with Scrapy

Scrapy now supports Python 3.

In PyCharm, open File > Default Settings > Project Interpreter, then search for scrapy and install it.

It can also be installed with pip:

$ pip install scrapy==1.1.0rc1

In Scrapy, each Item object represents one page of a website. An item can declare different fields, for example url, content, header, and image.

First, create a Scrapy project in the current directory:

$ scrapy startproject wikiSpider

This creates a wikiSpider project folder that contains items.py, settings.py, a spiders folder, and a few other files.

In the spiders folder, create articleSpider.py:

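from scrapy import Spider
from wikiSpider.items import Article


class ArticleSpider(Spider):
    # Name used on the command line: scrapy crawl article
    name = 'article'
    # Links outside this domain are not followed
    allowed_domains = ['en.wikipedia.org']
    start_urls = [
        'http://en.wikipedia.org/wiki/Main_Page',
        'http://en.wikipedia.org/wiki/Python_%28programming_language%29',
    ]

    def parse(self, response):
        # Called with the downloaded response for each start URL
        item = Article()
        # Take the first text node directly inside an <h1> as the page title
        title = response.xpath('//h1/text()')[0].extract()
        print('title is :' + title)
        item['title'] = title
        return item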

Change items.py to:

from scrapy import Item, Field


class Article(Item):
    # Each Field() declares one key that can be stored on the item
    title = Field()

Also set the log level in settings.py so that the printed output is easier to spot:

LOG_LEVEL = 'ERROR'
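Scrapy logs through Python's standard logging module, so the effect of this threshold can be illustrated with the standard library alone. The sketch below is not Scrapy code: the logger name and the three messages are invented for illustration, and visible_messages is a hypothetical helper that returns the lines that survive a given level.

```python
import io
import logging


def visible_messages(level_name):
    """Return the log lines that pass a given threshold level."""
    stream = io.StringIO()
    # Hypothetical logger name, kept separate per level for the demo
    logger = logging.getLogger('log_level_demo_' + level_name)
    logger.setLevel(getattr(logging, level_name))
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter('%(levelname)s: %(message)s'))
    logger.addHandler(handler)

    # Invented messages standing in for typical crawl output
    logger.debug('Crawled a page')
    logger.info('Spider opened')
    logger.error('Spider error processing a page')

    logger.removeHandler(handler)
    return stream.getvalue().splitlines()


if __name__ == '__main__':
    print(visible_messages('DEBUG'))
    print(visible_messages('ERROR'))
```

With the threshold at ERROR, the DEBUG and INFO lines are dropped, which is why the print output from parse is no longer buried in routine crawl messages.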

Then run the spider from the wikiSpider project root directory:

$ scrapy crawl article

The printed titles should then appear:

title is :Main Page
title is :Python (programming language)

Original post: https://www.cnblogs.com/vivivi/p/5917577.html