My First Stop with Scrapy

http://www.jianshu.com/p/fa614bea98eb

I followed this tutorial step by step and am writing up my own notes here. The goal is to scrape books rated 9.0 or above.

First, create a Scrapy project by running the following in cmd: scrapy startproject douban

Then open the project in PyCharm and create a new main file (at the same level as scrapy.cfg):

from scrapy import cmdline
cmdline.execute("scrapy crawl dbbook".split())

Here dbbook is the spider's name; change it to match whatever name you define.

Create a new Python file in the spiders directory:

import scrapy
import re
from douban.items import doubanBook  # the project was created as "douban" above

class DbbookSpider(scrapy.Spider):
    name = "dbbook"
    start_urls = ('https://www.douban.com/doulist/1264675/',)  # the list page to crawl

    def parse(self, response):
        selector = scrapy.Selector(response)
        books = selector.xpath('//div[@class="bd doulist-subject"]')
        for each in books:
            item = doubanBook()  # create a fresh item per book
            title = each.xpath('div[@class="title"]/a/text()').extract()[0]
            title = title.replace(' ', '').replace('\n', '')
            rate = each.xpath('div[@class="rating"]/span[@class="rating_nums"]/text()').extract()[0]
            author = re.search('<div class="abstract">(.*?)<br', each.extract(), re.S).group(1)
            author = author.replace(' ', '').replace('\n', '')
            item['title'] = title
            item['rate'] = rate
            item['author'] = author
            yield item

        # Follow the "next page" link (outside the for loop, once per page)
        nextPage = selector.xpath('//div[@class="paginator"]/span[@class="next"]/link/@href').extract()
        if nextPage:
            yield scrapy.http.Request(nextPage[0], callback=self.parse)
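The author-extraction step above can be tried in isolation. The HTML fragment below is made up for illustration, but it mirrors the structure the regex expects — an abstract div whose first line, up to the `<br`, is the author:

```python
import re

# A made-up fragment mimicking one book entry on the doulist page
html = '''<div class="bd doulist-subject">
<div class="title"><a> Demo Book </a></div>
<div class="abstract">Author: Someone<br/>Publisher: X</div>
</div>'''

# Non-greedy match up to the first <br, with re.S so '.' spans newlines
author = re.search('<div class="abstract">(.*?)<br', html, re.S).group(1)
author = author.replace(' ', '').replace('\n', '')
print(author)  # Author:Someone
```

Note that replace(' ', '') removes all spaces, including ones inside names; str.strip() is a gentler alternative if you only want to trim the edges.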

The items.py file is the container that holds the scraped data — you can think of it as a struct; every field you want to extract is stored here.

class doubanBook(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    rate = scrapy.Field()
    author = scrapy.Field()

This defines the three fields we need to store.

Finally, edit settings.py to configure the output of the scraped data:

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0'

FEED_URI = u'file:///G:/douban.csv'
FEED_FORMAT = 'csv'

Attaching a site on regular expressions with the re library for review: http://cuiqingcai.com/977.html
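As a mini review of the two re features the spider relies on — re.S lets '.' match newlines, and .*? matches non-greedily — here is a small made-up example:

```python
import re

text = 'rate:\n9.1\nauthor: Someone'

# re.S makes '.' match newlines too; .*? stops at the first 'author'
m = re.search('rate:(.*?)author', text, re.S)
print(m.group(1).strip())  # 9.1

# Without re.S, '.' cannot cross the newline, so the same search fails
print(re.search('rate:(.*?)author', text))  # None
```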

Original post: https://www.cnblogs.com/Qmelbourne/p/6728913.html