Scrapy spider example (1)

A spider example

  1. Set up the items in advance
import scrapy

class SuperspiderItem(scrapy.Item):
    title = scrapy.Field()    # post title
    date = scrapy.Field()     # post date
    content = scrapy.Field()  # post body text
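A scrapy.Item behaves like a dict restricted to its declared fields; assigning to an undeclared field raises a KeyError. A quick sketch of that behaviour:

item = SuperspiderItem()
item['title'] = 'some title'    # fine: 'title' is a declared Field
# item['author'] = 'x'          # would raise KeyError: undeclared field
print(dict(item))               # {'title': 'some title'}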
  2. Crawl scope and start_urls
class Spider1Spider(scrapy.Spider):
    name = 'spider1'
    allowed_domains = ['wz.sun0769.com']   # bare domain, not a full URL
    start_urls = ['http://wz.sun0769.com/html/top/report.shtml']
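A side note that foreshadows the problem list at the end: allowed_domains must hold bare domain names. An earlier version of this spider had the full URL 'http://wz.sun0769.com/' here, which made the offsite middleware filter every request; those filter messages are logged at DEBUG level, so raising the log level to WARNING in settings.py hid them completely. While debugging, keep the default:

# settings.py -- keep DEBUG (the default) while developing, so that messages
# like "Filtered offsite request" stay visible
LOG_LEVEL = 'DEBUG'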
  3. parse does three main jobs: grabbing each detail-page url, grabbing the next-page url, and extracting title and date
    def parse(self, response):
        tr_list = response.xpath("//div[@class='newsHead clearfix']/table[2]//tr")
        for tr in tr_list:
            items = SuperspiderItem()
            items['title'] = tr.xpath("./td[3]/a[1]/@title").extract_first()  # extract the title with xpath
            items['date'] = tr.xpath("./td[6]//text()").extract_first()       # extract the date the same way
            content_href = tr.xpath("./td[3]/a[1]/@href").extract_first()     # extract the detail-page link
            # hand the detail link to the next function, along with date and title,
            # so all of the data gets assembled in one place at the end
            # about yield: pass the url, and callback names the callback function
            yield scrapy.Request(
                content_href,
                callback=self.get_content,
                # meta carries data over to the callback;
                # it is a dict-like data type
                meta={
                    'date': items['date'],
                    'title': items['title']
                }
            )
        new_url = response.xpath("//div[contains(@align,'center')]//@href").extract()
        print(new_url[-2])
        # page_num (a module-level constant, see the complete code below) caps the number of pages crawled
        if "page=" + str(page_num * 30) not in new_url[-2]:
            yield scrapy.Request(
                new_url[-2],
                callback=self.parse
            )
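One defensive variant worth noting: the loop above assumes every row has a link and that the href is absolute. A sketch of a safer version of the same loop body (same names as above; response.urljoin resolves relative links):

            content_href = tr.xpath("./td[3]/a[1]/@href").extract_first()
            if content_href:    # skip rows that have no link
                yield scrapy.Request(
                    response.urljoin(content_href),    # safe for relative hrefs too
                    callback=self.get_content,
                    meta={'date': items['date'], 'title': items['title']}
                )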
  4. The second function: collects all the fields and passes them to the pipelines
    def get_content(self, response):
        items = SuperspiderItem()
        items['date'] = response.meta['date']    # read back what parse attached via meta
        items['title'] = response.meta['title']
        items['content'] = response.xpath("//td[@class='txt16_3']/text()").extract_first()
        yield items    # hand the finished item to the pipelines
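Note that extract_first() returns only the first matching text node. If an article body were split across several nodes, a common alternative is to join them all (a sketch, assuming the same td[@class='txt16_3'] container):

        parts = response.xpath("//td[@class='txt16_3']/text()").extract()
        items['content'] = ''.join(p.strip() for p in parts)    # merge all text nodes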
  5. The pipeline does not do much here: since the data needs no real processing, it simply prints it
class SuperspiderPipeline(object):
    def process_item(self, item, spider):
        print('*' * 100)        # visual separator between items
        print(item['date'])
        print(item['title'])
        print(item['content'])
        return item             # pass the item on, as Scrapy convention expects
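Even a do-nothing pipeline only runs if it is registered in settings.py. A minimal sketch, assuming the default layout from scrapy startproject superspider:

# settings.py -- enable the pipeline; the number (0-1000) sets its running order
ITEM_PIPELINES = {
    'superspider.pipelines.SuperspiderPipeline': 300,
}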

Complete code

  • The items part

import scrapy

class SuperspiderItem(scrapy.Item):
    title = scrapy.Field()
    date = scrapy.Field()
    content = scrapy.Field()
  • The spider code
# -*- coding: utf-8 -*-
import scrapy
from superspider.items import SuperspiderItem
page_num = 3  # crawl at most this many pages
class Spider1Spider(scrapy.Spider):
    name = 'spider1'
    allowed_domains = ['wz.sun0769.com']
    start_urls = ['http://wz.sun0769.com/html/top/report.shtml']

    def parse(self, response):
        tr_list = response.xpath("//div[@class='newsHead clearfix']/table[2]//tr")
        for tr in tr_list:
            items = SuperspiderItem()
            items['title'] = tr.xpath("./td[3]/a[1]/@title").extract_first()
            items['date'] = tr.xpath("./td[6]//text()").extract_first()
            content_href = tr.xpath("./td[3]/a[1]/@href").extract_first()
            yield scrapy.Request(
                content_href,
                callback=self.get_content,
                meta={
                    'date': items['date'],
                    'title': items['title']
                      }
            )
        new_url = response.xpath("//div[contains(@align,'center')]//@href").extract()
        print(new_url[-2])
        if "page="+str(page_num*30) not in new_url[-2]:
            yield scrapy.Request(
                new_url[-2],
                callback=self.parse
            )

    def get_content(self, response):
        items = SuperspiderItem()
        items['date'] = response.meta['date']
        items['title'] = response.meta['title']
        items['content'] = response.xpath("//td[@class='txt16_3']/text()").extract_first()
        yield items
  • The pipelines code
class SuperspiderPipeline(object):
    def process_item(self, item, spider):
        print('*' * 100)        # visual separator between items
        print(item['date'])
        print(item['title'])
        print(item['content'])
        return item             # pass the item on, as Scrapy convention expects
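With the three files in place, the spider is started by its name from the project root (the directory holding scrapy.cfg):

scrapy crawl spider1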

Problems encountered along the way

  • The crawl scope (allowed_domains) was written wrong while the log level was set to WARNING, so the offsite-filter messages never showed and the problem was hard to track down
  • Was not clear on how yield works (see the sketch after this list)
  • A SuperspiderItem() has to be imported and instantiated first (with the parentheses)
  • There is no need to import SuperspiderItem in the pipelines
  • Forgot to write extract()
  • xpath: //div[contains(@align,'center')] (mind the bracket syntax)
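Since yield was one of the sticking points, here is the smallest illustration, independent of Scrapy: a function containing yield returns a generator that produces values lazily, and Scrapy simply iterates over whatever parse yields (requests and items alike).

def numbers():
    yield 1    # execution pauses here and hands 1 to the caller
    yield 2    # resumes on the next iteration and hands back 2

for n in numbers():
    print(n)   # prints 1, then 2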
Original post: https://www.cnblogs.com/l0nmar/p/12553851.html