Scrapy image pipeline study notes

To use Scrapy you first need to install it.

The Python environment used here is 3.6.

On Windows, activate the python36 environment:

activate python36 

On macOS:

mac@macdeMacBook-Pro:~$     source activate python36
(python36) mac@macdeMacBook-Pro:~$  

Install Scrapy:

(python36) mac@macdeMacBook-Pro:~$     pip install scrapy
(python36) mac@macdeMacBook-Pro:~$     scrapy --version
Scrapy 1.8.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command
(python36) mac@macdeMacBook-Pro:~$     scrapy startproject images
New Scrapy project 'images', using template directory '/Users/mac/anaconda3/envs/python36/lib/python3.6/site-packages/scrapy/templates/project', created in:
    /Users/mac/images

You can start your first spider with:
    cd images
    scrapy genspider example example.com

(python36) mac@macdeMacBook-Pro:~$     cd images
(python36) mac@macdeMacBook-Pro:~/images$     scrapy genspider -t crawl pexels www.pexels.com
Created spider 'pexels' using template 'crawl' in module:
  images.spiders.pexels
(python36) mac@macdeMacBook-Pro:~/images$  

In settings.py, disable robots.txt compliance:

ROBOTSTXT_OBEY = False

Analyze the URL patterns of the target site www.pexels.com:

https://www.pexels.com/photo/man-using-black-camera-3136161/

https://www.pexels.com/video/beach-waves-and-sunset-855633/

https://www.pexels.com/photo/white-vehicle-2569855/

https://www.pexels.com/photo/monochrome-photo-of-city-during-daytime-3074526/

From these URLs, derive the rule for the pages to crawl:

rules = (
    Rule(LinkExtractor(allow=r'^https://www.pexels.com/photo/.*/$'), callback='parse_item', follow=True),
)
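
A quick sanity check of that pattern against the example URLs above (a throwaway snippet, not part of the project) shows that photo pages match while video pages do not:

import re

pattern = r'^https://www.pexels.com/photo/.*/$'
# photo detail pages match and will be followed / parsed
print(bool(re.match(pattern, 'https://www.pexels.com/photo/white-vehicle-2569855/')))       # True
# video pages do not match, so they are ignored
print(bool(re.match(pattern, 'https://www.pexels.com/video/beach-waves-and-sunset-855633/')))  # False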


The image pipeline needs two fields defined on the item:
class ImagesItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()

image_urls holds the scraped image URLs; the spider has to fill it in.

images records the download results (useful for checking image integrity); the pipeline fills it in after the spider yields the item, which is why printing the item inside the spider does not show this field.
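
For reference, once the download finishes, the built-in pipeline stores one dict per downloaded file in images; in Scrapy 1.8 it looks roughly like this (the values below are made up for illustration):

# item['images'] after ImagesPipeline has run (illustrative values)
[
    {
        'url': 'https://images.pexels.com/photos/2569855/example.jpeg',
        'path': 'full/0a79c461d4062f6f8ce2e1a0aa0b4a2d9e7f1c3b.jpg',   # relative to IMAGES_STORE
        'checksum': 'b9628c4ab9b595f72f280b90c4fd093d',
    },
]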

In pexels.py, import the item class and create an item object:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from images.items import ImagesItem

class PexelsSpider(CrawlSpider):
    name = 'pexels'
    allowed_domains = ['www.pexels.com']
    start_urls = ['http://www.pexels.com/']

    rules = (
        Rule(LinkExtractor(allow=r'^https://www.pexels.com/photo/.*/$'), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        item = ImagesItem()
        # grab the src of every <img> whose URL contains "photos"
        item['image_urls'] = response.xpath('//img[contains(@src,"photos")]/@src').extract()
        print(item['image_urls'])
        return item
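
Before running the full crawl, the XPath can be tried out in scrapy shell against one of the example photo pages (the output will vary; shown here only as a sketch):

scrapy shell "https://www.pexels.com/photo/white-vehicle-2569855/"
>>> response.xpath('//img[contains(@src,"photos")]/@src').extract()
# expected: a list of image URLs served from images.pexels.com/photos/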

In settings.py, enable the image pipeline and set the storage path:

ITEM_PIPELINES = {
   #'images.pipelines.ImagesPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1
}



IMAGES_STORE = '/www/crawl'
# which item field holds the image URLs that should be downloaded
IMAGES_URLS_FIELD = 'image_urls'
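
IMAGES_URLS_FIELD defaults to 'image_urls', so the line above is optional for this item; it only matters when the item uses other field names. A hedged sketch with hypothetical fields img_src / img_info:

IMAGES_URLS_FIELD = 'img_src'      # item field that holds the URLs to download
IMAGES_RESULT_FIELD = 'img_info'   # item field the pipeline writes the download results into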

Run the spider:

scrapy crawl pexels --nolog

The images have indeed been downloaded.

However, the downloaded images are not the high-resolution versions; the suffix (the query string) on each image URL needs to be cleaned up.

In settings.py, also enable the project's own pipeline and give it a higher priority (a smaller number), so it runs before the built-in image pipeline:

ITEM_PIPELINES = {
    'images.pipelines.ImagesPipeline': 1,
    'scrapy.pipelines.images.ImagesPipeline': 2
}

In the pipeline file, strip that suffix from each URL:

class ImagesPipeline(object):
    def process_item(self, item, spider):
        # drop the query string (everything after '?') so the full-size image URL remains
        tmp = item['image_urls']
        item['image_urls'] = []
        for i in tmp:
            if '?' in i:
                item['image_urls'].append(i.split('?')[0])
            else:
                item['image_urls'].append(i)

        return item

The downloads are now the large images. Note, however, that the image pipeline still compresses / re-encodes images by default, so only the files pipeline downloads the completely untouched originals, which are very large.
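
A minimal sketch of that files-pipeline alternative (assumptions: the item carries file_urls / files fields instead of image_urls / images, and the store path is just an example):

ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = '/www/crawl/files'    # example path
FILES_URLS_FIELD = 'file_urls'      # default name; the spider would put the cleaned URLs here
FILES_RESULT_FIELD = 'files'        # default name; download results end up here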

If you don't want to download the images and just want to store the image URLs in MySQL, see:

https://www.cnblogs.com/php-linux/p/11792393.html

Image pipeline: configure a minimum width and height, so images below that resolution are skipped:

IMAGES_MIN_HEIGHT = 800

IMAGES_MIN_WIDTH = 600

IMAGES_EXPIRES = 90  # expiry in days; images downloaded within this period are not downloaded again

Generate thumbnails:

IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (600, 600),
}
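
With IMAGES_THUMBS set, the pipeline saves the original-size file plus one copy per thumbnail size under IMAGES_STORE; the layout looks roughly like this (file names are SHA1 hashes of the URL, shown as placeholders):

/www/crawl/full/<sha1>.jpg
/www/crawl/thumbs/small/<sha1>.jpg
/www/crawl/thumbs/big/<sha1>.jpg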

Original article: https://www.cnblogs.com/brady-wang/p/11795582.html