Scrapy: Crawling Media Files, a Detailed Example

Crawling image data with Scrapy

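  • Crawling image data with Scrapy:

    • In the spider file, you only need to parse out the image URLs and submit them to the pipeline
    • Set the file storage location in the settings file: IMAGES_STORE = './imgsLib'
    • Define the pipeline class in the pipelines file:
      • 1.from scrapy.pipelines.images import ImagesPipeline
      • 2.Change the parent class of the pipeline class to ImagesPipeline
      • 3.Override three methods of the parent class: get_media_requests, file_path and item_completed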
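
The settings side of these steps can be sketched as below. IMAGES_STORE is taken from the text; the pipeline path 'imgspider.pipelines.ImgspiderPipeline' and the priority 300 are assumptions for illustration (the project name imgspider comes from the spider's import, the class name is hypothetical). Note that ImagesPipeline needs the Pillow package installed.

```python
# settings.py (fragment, a sketch under the assumptions stated above)

# Directory where downloaded images are written
IMAGES_STORE = './imgsLib'

# Enable the custom image pipeline; the dotted path is an assumed name
ITEM_PIPELINES = {
    'imgspider.pipelines.ImgspiderPipeline': 300,
}
```
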
  • Example: crawling the 校花网 site (www.521609.com)

    • spider.py file

      import scrapy
      from imgspider.items import ImgspiderItem
      
      
      class ImgSpiderSpider(scrapy.Spider):
          name = 'img_spider'
          # allowed_domains = ['www.xxx.com']
          start_urls = ['http://www.521609.com/daxuemeinv/']
          # Template for the follow-up list pages (list82.html, list83.html)
          url = 'http://www.521609.com/daxuemeinv/list8%d.html'
          pageNum = 1
      
          def parse(self, response):
              li_list = response.xpath('//*[@id="content"]/div[2]/div[2]/ul/li')
              for li in li_list:
                  src = li.xpath('./a[1]/img/@src').extract_first()
                  if src is None:
                      # Skip list entries that contain no image
                      continue
                  # Build the absolute image URL and hand it to the pipeline
                  item = ImgspiderItem()
                  item['src'] = response.urljoin(src)
                  yield item
      
              # Request the next list page once per response, up to page 3
              if self.pageNum < 3:
                  self.pageNum += 1
                  new_url = self.url % self.pageNum
                  yield scrapy.Request(new_url, callback=self.parse)
      
    • pipelines.py file

      
Original article: https://www.cnblogs.com/bigox/p/11447918.html