Goal: crawl the names and addresses of newspapers nationwide
Link: http://news.xinhuanet.com/zgjx/2007-09/13/content_6714741.htm
Purpose: practice scraping data with Scrapy
Having covered the basics of Scrapy, let's write the simplest possible spider.
Target screenshot:
1. Create the project

$ cd ~/code/crawler/scrapyProject
$ scrapy startproject newSpapers
2. Generate the spider

$ cd newSpapers/
$ scrapy genspider nationalNewspaper news.xinhuanet.com
3. Define the item fields to scrape

$ cat items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class NewspapersItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    addr = scrapy.Field()
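A `scrapy.Item` behaves like a dict that only accepts its declared fields. The following is a stdlib-only sketch of that behavior (it is an analogy, not Scrapy's actual implementation), mirroring the `name` and `addr` fields above:

```python
# Stdlib sketch (not Scrapy itself) of how a scrapy.Item behaves:
# a dict that only accepts the declared fields; any other key is rejected.
class SketchItem(dict):
    fields = {'name', 'addr'}          # mirrors name/addr in NewspapersItem

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError('%s is not a declared field' % key)
        super().__setitem__(key, value)

item = SketchItem()
item['name'] = ['人民日报']
item['addr'] = ['http://paper.people.com.cn/rmrb/']
print(dict(item))
```

Assigning an undeclared key such as `item['title']` raises `KeyError`, which is also how Scrapy catches field-name typos early.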
4. Write the spider

$ cat spiders/nationalNewspaper.py
# -*- coding: utf-8 -*-
import scrapy
from newSpapers.items import NewspapersItem


class NationalnewspaperSpider(scrapy.Spider):
    name = "nationalNewspaper"
    allowed_domains = ["news.xinhuanet.com"]
    start_urls = ['http://news.xinhuanet.com/zgjx/2007-09/13/content_6714741.htm']

    def parse(self, response):
        # Row 2 of the page's table holds the national newspapers;
        # row 4 holds the local ones (selected here but not used below).
        sub_country = response.xpath('//*[@id="Zoom"]/div/table/tbody/tr[2]')
        sub2_local = response.xpath('//*[@id="Zoom"]/div/table/tbody/tr[4]')
        tags_a_country = sub_country.xpath('./td/table/tbody/tr/td/p/a')
        items = []
        for each in tags_a_country:
            item = NewspapersItem()
            item['name'] = each.xpath('./strong/text()').extract()
            item['addr'] = each.xpath('./@href').extract()
            items.append(item)
        return items
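The core of `parse()` is walking into the table, then pulling each link's text and `href`. The same extraction pattern can be sketched with the stdlib's `xml.etree.ElementTree` instead of Scrapy's selectors; the snippet below is a hypothetical, simplified version of one table row (the real page nests the links more deeply):

```python
# Stdlib sketch of the extraction logic in parse(), using
# xml.etree.ElementTree in place of Scrapy's XPath selectors.
import xml.etree.ElementTree as ET

snippet = """
<tr><td><p>
  <a href="http://paper.people.com.cn/rmrb/"><strong>人民日报</strong></a>
  <a href="http://www.gmw.cn/01gmrb/"><strong>光明日报</strong></a>
</p></td></tr>
"""

row = ET.fromstring(snippet)
items = []
for a in row.findall('./td/p/a'):       # analogous to the './td/.../p/a' path
    items.append({
        'name': a.find('strong').text,  # like './strong/text()'
        'addr': a.get('href'),          # like './@href'
    })
print(items)
```

Each iteration yields one `{'name': ..., 'addr': ...}` record, just as each loop pass in the spider fills one `NewspapersItem`.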
5. Tell Scrapy which pipeline processes the scraped items

$ cat settings.py
……
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
ITEM_PIPELINES = {
    'newSpapers.pipelines.NewspapersPipeline': 100,
}
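The value 100 is the pipeline's order number: Scrapy accepts values from 0 to 1000, and when several pipelines are registered, lower numbers run first. A hypothetical example with a second pipeline (the `SomeExportPipeline` name is made up for illustration):

```python
# settings.py (sketch): SomeExportPipeline is hypothetical, shown only
# to illustrate ordering; items flow through lower numbers first.
ITEM_PIPELINES = {
    'newSpapers.pipelines.NewspapersPipeline': 100,  # runs first
    'newSpapers.pipelines.SomeExportPipeline': 300,  # runs second
}
```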
6. Write the item pipeline

$ cat pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class NewspapersPipeline(object):
    def process_item(self, item, spider):
        filename = 'newspaper.txt'
        print('=================')
        print(item)
        print('=================')
        # Append one "name<TAB>addr" record per line.
        with open(filename, 'a', encoding='utf-8') as fp:
            fp.write(item['name'][0] + '\t' + item['addr'][0] + '\n')
        return item
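Because `process_item()` only needs a dict-like item, its file-writing step can be sanity-checked without running a crawl. The function below is a hypothetical standalone copy of that step, fed a plain dict and a temp file instead of a real item and the working directory:

```python
# Standalone sketch of the pipeline's write step: a plain dict stands in
# for the Scrapy item, and the output goes to a temporary directory.
import os
import tempfile

def write_record(item, filename):
    # Append one "name<TAB>addr" line, as NewspapersPipeline does.
    with open(filename, 'a', encoding='utf-8') as fp:
        fp.write(item['name'][0] + '\t' + item['addr'][0] + '\n')

path = os.path.join(tempfile.mkdtemp(), 'newspaper.txt')
write_record({'name': ['人民日报'],
              'addr': ['http://paper.people.com.cn/rmrb/']}, path)
with open(path, encoding='utf-8') as fp:
    content = fp.read()
print(content)
```

Opening the file in append mode means every scraped item adds one line, which is what produces the list shown in the final step.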
7. Run the spider (scrapy crawl nationalNewspaper) and check the output file, which is created in the directory the crawl was run from:

$ cat spiders/newspaper.txt
人民日报	http://paper.people.com.cn/rmrb/html/2007-09/20/node_17.htm
海外版	http://paper.people.com.cn/rmrbhwb/html/2007-09/20/node_34.htm
光明日报	http://www.gmw.cn/01gmrb/2007-09/20/default.htm
经济日报	http://www.economicdaily.com.cn/no1/
解放军报	http://www.gmw.cn/01gmrb/2007-09/20/default.htm
中国日报	http://pub1.chinadaily.com.cn/cdpdf/cndy/
Full source code: