Scrapy spider example (1)

A spider example

  1. Set up the items in advance
import scrapy

class SuperspiderItem(scrapy.Item):
    title = scrapy.Field()    # post title
    date = scrapy.Field()     # post date
    content = scrapy.Field()  # post body text
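A scrapy.Item behaves like a dict restricted to its declared fields; assigning to an undeclared field raises a KeyError. A quick sketch of that behaviour:

item = SuperspiderItem()
item['title'] = 'some title'    # fine: 'title' is a declared Field
# item['author'] = 'x'          # would raise KeyError: undeclared field
print(dict(item))               # {'title': 'some title'}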
  2. Crawl scope and start_urls
class Spider1Spider(scrapy.Spider):
    name = 'spider1'
    allowed_domains = ['wz.sun0769.com']   # bare domain, not a full URL
    start_urls = ['http://wz.sun0769.com/html/top/report.shtml']
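A side note that foreshadows the problem list at the end: allowed_domains must hold bare domain names. An earlier version of this spider had the full URL 'http://wz.sun0769.com/' here, which made the offsite middleware filter every request; those filter messages are logged at DEBUG level, so raising the log level to WARNING in settings.py hid them completely. While debugging, keep the default:

# settings.py -- keep DEBUG (the default) while developing, so that messages
# like "Filtered offsite request" stay visible
LOG_LEVEL = 'DEBUG'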
  3. parse does three main jobs: grabbing each detail-page url, grabbing the next-page url, and extracting title and date
    def parse(self, response):
        tr_list = response.xpath("//div[@class='newsHead clearfix']/table[2]//tr")
        for tr in tr_list:
            items = SuperspiderItem()
            items['title'] = tr.xpath("./td[3]/a[1]/@title").extract_first()  # extract the title with xpath
            items['date'] = tr.xpath("./td[6]//text()").extract_first()       # extract the date the same way
            content_href = tr.xpath("./td[3]/a[1]/@href").extract_first()     # extract the detail-page link
            # hand the detail link to the next function, along with date and title,
            # so all of the data gets assembled in one place at the end
            # about yield: pass the url, and callback names the callback function
            yield scrapy.Request(
                content_href,
                callback=self.get_content,
                # meta carries data over to the callback;
                # it is a dict-like data type
                meta={
                    'date': items['date'],
                    'title': items['title']
                }
            )
        new_url = response.xpath("//div[contains(@align,'center')]//@href").extract()
        print(new_url[-2])
        # page_num (a module-level constant, see the complete code below) caps the number of pages crawled
        if "page=" + str(page_num * 30) not in new_url[-2]:
            yield scrapy.Request(
                new_url[-2],
                callback=self.parse
            )
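One defensive variant worth noting: the loop above assumes every row has a link and that the href is absolute. A sketch of a safer version of the same loop body (same names as above; response.urljoin resolves relative links):

            content_href = tr.xpath("./td[3]/a[1]/@href").extract_first()
            if content_href:    # skip rows that have no link
                yield scrapy.Request(
                    response.urljoin(content_href),    # safe for relative hrefs too
                    callback=self.get_content,
                    meta={'date': items['date'], 'title': items['title']}
                )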
  4. The second function: collects all the fields and passes them to the pipelines
    def get_content(self, response):
        items = SuperspiderItem()
        items['date'] = response.meta['date']    # read back what parse attached via meta
        items['title'] = response.meta['title']
        items['content'] = response.xpath("//td[@class='txt16_3']/text()").extract_first()
        yield items    # hand the finished item to the pipelines
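Note that extract_first() returns only the first matching text node. If an article body were split across several nodes, a common alternative is to join them all (a sketch, assuming the same td[@class='txt16_3'] container):

        parts = response.xpath("//td[@class='txt16_3']/text()").extract()
        items['content'] = ''.join(p.strip() for p in parts)    # merge all text nodes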
  5. The pipeline does not do much here: since the data needs no real processing, it simply prints it
class SuperspiderPipeline(object):
    def process_item(self, item, spider):
        print('*' * 100)        # visual separator between items
        print(item['date'])
        print(item['title'])
        print(item['content'])
        return item             # pass the item on, as Scrapy convention expects
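Even a do-nothing pipeline only runs if it is registered in settings.py. A minimal sketch, assuming the default layout from scrapy startproject superspider:

# settings.py -- enable the pipeline; the number (0-1000) sets its running order
ITEM_PIPELINES = {
    'superspider.pipelines.SuperspiderPipeline': 300,
}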

Complete code

  • The items part

import scrapy

class SuperspiderItem(scrapy.Item):
    title = scrapy.Field()
    date = scrapy.Field()
    content = scrapy.Field()
  • The spider code
# -*- coding: utf-8 -*-
import scrapy
from superspider.items import SuperspiderItem
page_num = 3  # crawl at most this many pages
class Spider1Spider(scrapy.Spider):
    name = 'spider1'
    allowed_domains = ['wz.sun0769.com']
    start_urls = ['http://wz.sun0769.com/html/top/report.shtml']

    def parse(self, response):
        tr_list = response.xpath("//div[@class='newsHead clearfix']/table[2]//tr")
        for tr in tr_list:
            items = SuperspiderItem()
            items['title'] = tr.xpath("./td[3]/a[1]/@title").extract_first()
            items['date'] = tr.xpath("./td[6]//text()").extract_first()
            content_href = tr.xpath("./td[3]/a[1]/@href").extract_first()
            yield scrapy.Request(
                content_href,
                callback=self.get_content,
                meta={
                    'date': items['date'],
                    'title': items['title']
                      }
            )
        new_url = response.xpath("//div[contains(@align,'center')]//@href").extract()
        print(new_url[-2])
        if "page="+str(page_num*30) not in new_url[-2]:
            yield scrapy.Request(
                new_url[-2],
                callback=self.parse
            )

    def get_content(self, response):
        items = SuperspiderItem()
        items['date'] = response.meta['date']
        items['title'] = response.meta['title']
        items['content'] = response.xpath("//td[@class='txt16_3']/text()").extract_first()
        yield items
  • The pipelines code
class SuperspiderPipeline(object):
    def process_item(self, item, spider):
        print('*' * 100)        # visual separator between items
        print(item['date'])
        print(item['title'])
        print(item['content'])
        return item             # pass the item on, as Scrapy convention expects
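With the three files in place, the spider is started by its name from the project root (the directory holding scrapy.cfg):

scrapy crawl spider1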

Problems encountered along the way

  • The crawl scope (allowed_domains) was written wrong while the log level was set to WARNING, so the offsite-filter messages never showed and the problem was hard to track down
  • Was not clear on how yield works (see the sketch after this list)
  • A SuperspiderItem() has to be imported and instantiated first (with the parentheses)
  • There is no need to import SuperspiderItem in the pipelines
  • Forgot to write extract()
  • xpath: //div[contains(@align,'center')] (mind the bracket syntax)
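Since yield was one of the sticking points, here is the smallest illustration, independent of Scrapy: a function containing yield returns a generator that produces values lazily, and Scrapy simply iterates over whatever parse yields (requests and items alike).

def numbers():
    yield 1    # execution pauses here and hands 1 to the caller
    yield 2    # resumes on the next iteration and hands back 2

for n in numbers():
    print(n)   # prints 1, then 2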
Original post: https://www.cnblogs.com/l0nmar/p/12553851.html