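Web Crawling with the Scrapy Framework

Overview

Scrapy is a framework written for crawling websites and extracting data from them. It ships with many built-in features, is general-purpose, and is easy to learn.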

Installing Scrapy

pip install scrapy -i https://pypi.douban.com/simple

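On Windows, installing Scrapy also requires Twisted and pywin32. For Twisted, download the wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted, then open cmd, change to the download directory, and run pip install Twisted-19.2.1-cp36-cp36m-win_amd64.whl. pywin32 can be installed directly with pip install pywin32 -i https://pypi.douban.com/simple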

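Usage

Create a project: on the command line, run scrapy startproject <project name>

Create a basic spider file: cd into the project and run scrapy genspider <spider name> <initial URL> (the URL can be anything; you can change it later in the spider file)

Run a spider: scrapy crawl <spider name> (the value of the name attribute in the spider class)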

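There are optional arguments when running a spider: --nolog suppresses the log output, and -o <file name> writes the scraped results to the given file. Usually neither is needed.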
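If you would rather start the crawl from a Python script than type the command each time, the options above can be assembled programmatically. This is only a minimal sketch: build_crawl_command and run_spider are hypothetical helper names, not part of Scrapy, and the script assumes the scrapy executable is on your PATH.

```python
import subprocess


def build_crawl_command(spider_name, nolog=False, output=None):
    """Build the argument list for `scrapy crawl` (hypothetical helper)."""
    command = ["scrapy", "crawl", spider_name]
    if nolog:
        command.append("--nolog")        # suppress log output
    if output is not None:
        command.extend(["-o", output])   # write scraped items to this file
    return command


def run_spider(spider_name, project_dir, **options):
    """Run the spider from inside the project directory (hypothetical helper)."""
    command = build_crawl_command(spider_name, **options)
    return subprocess.run(command, cwd=project_dir, check=True)


# Example call, assuming a project in ./myproject with a spider named "project":
# run_spider("project", "myproject", nolog=True, output="result.json")
```

The command is kept as a list rather than a single string so that file names containing spaces do not need shell quoting.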

A newly created project has the following directory layout

└─myproject
    │  items.py          # defines the fields submitted to the pipeline
    │  middlewares.py    # middleware file
    │  pipelines.py      # pipeline file
    │  settings.py       # settings file
    │  __init__.py
    │
    ├─spiders            # spider directory
    │  │  project.py     # the spider file you created
    │  │  __init__.py

How each file is used

settings.py (only the most commonly used settings are shown)

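# User-Agent string used to disguise the crawler as a browser
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'

# Log level
LOG_LEVEL = 'INFO'

# Whether to obey robots.txt
ROBOTSTXT_OBEY = False

# Maximum number of concurrent requests (these are not threads)
CONCURRENT_REQUESTS = 32

# Enable downloader middlewares
DOWNLOADER_MIDDLEWARES = {
   'wangyi.middlewares.WangyiDownloaderMiddleware': 543,
}

# Enable item pipelines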
ITEM_PIPELINES = {
   'wangyi.pipelines.WangyiPipeline': 300,
}
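Note that the class paths above come from a project named wangyi. For the myproject layout shown earlier, the entries would presumably look like the sketch below; the class names are assumed to match the generated templates. The integers are ordering priorities: for item pipelines, lower values run first.

```python
# settings.py (sketch for the myproject layout; class names are assumed
# to match what `scrapy startproject myproject` generates)
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.MyprojectDownloaderMiddleware": 543,
}

ITEM_PIPELINES = {
    "myproject.pipelines.MyprojectPipeline": 300,
}
```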

Spider file

# -*- coding: utf-8 -*-
import scrapy


class ProjectSpider(scrapy.Spider):
    name = 'project'
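    # Domains the spider is allowed to crawl; usually commented out
    # allowed_domains = ['www.xxx.com']
    # Initial URLs the spider starts from
    start_urls = ['http://www.xxx.com/']

    def parse(self, response):
        '''
        Callback invoked after a start URL has been requested
        :param response: the response object
        :return:
        '''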
        pass

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class MyprojectItem(scrapy.Item):
    '''
    Fields submitted to the pipeline; define each one with scrapy.Field()
    '''
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

middlewares.py

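Two classes are generated automatically in the middleware file. Only the commonly used one, the downloader middleware, is covered here, and only its commonly used methods.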

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals

# This class is generally not used
class MyprojectSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class MyprojectDownloaderMiddleware(object):

    def process_request(self, request, spider):
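        '''
        Intercepts every normally issued request; each request passes through here before it is sent
        :param request: the request object
        :param spider: the instance of the spider class, whose attributes and methods can be used
        :return:
        '''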
        return None

    def process_response(self, request, response, spider):
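        '''
        Intercepts every response; the response object can be processed here
        :param request: the request object
        :param response: the response object
        :param spider: the instance of the spider class, whose attributes and methods can be used
        :return:
        '''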
        return response

    def process_exception(self, request, exception, spider):
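        '''
        Intercepts requests that raised an exception; a failed request passes through here
        :param request: the request object
        :param exception: the exception that was raised
        :param spider: the instance of the spider class, whose attributes and methods can be used
        :return:
        '''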
        pass
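A common use of process_request is to rotate the User-Agent header. The sketch below is one possible way to do that, not the only one: RandomUserAgentMiddleware and the USER_AGENTS list are hypothetical, and the middleware would still have to be registered in DOWNLOADER_MIDDLEWARES to take effect.

```python
import random

# Illustrative pool of User-Agent strings; replace with your own list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:70.0) Gecko/20100101 Firefox/70.0",
]


class RandomUserAgentMiddleware:      # hypothetical middleware class
    def process_request(self, request, spider):
        # Pick a different User-Agent for each outgoing request.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Returning None lets the request continue to the downloader.
        return None

    def process_response(self, request, response, spider):
        # Pass the response through unchanged.
        return response

    def process_exception(self, request, exception, spider):
        # Log the failure and return None so other handlers can deal with it.
        spider.logger.warning("Request to %s failed: %r", request.url, exception)
        return None
```

Returning None from process_request is what keeps the request moving; returning a Response or a Request instead would short-circuit the download.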


pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class MyprojectPipeline(object):
    def process_item(self, item, spider):
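        # Data submitted from the spider arrives as item. Treat it like a dict: the keys are the fields defined in items.py and the values are the data submitted from the spider file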
        return item
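To show where persistence code usually goes, here is a sketch of a pipeline that writes each item as one JSON line. JsonLinesPipeline and the default file name items.jl are hypothetical choices. open_spider and close_spider are optional hooks that Scrapy calls when the spider starts and finishes, and the class must be listed in ITEM_PIPELINES before it runs.

```python
import json


class JsonLinesPipeline:              # hypothetical pipeline class
    def __init__(self, path="items.jl"):
        self.path = path              # assumed output file name
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Serialize the item as one JSON object per line.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        # Return the item so that later pipelines also receive it.
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: release the file handle.
        self.file.close()
```

Opening the file once per spider run, rather than once per item, avoids repeated open and close calls during the crawl.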