Configuring a Custom Scrapy Project Template

1. Locate the Scrapy custom template files

Python installation directory + \Lib\site-packages\scrapy\templates\project\module
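If you are not sure where that directory is on your machine, you can print it from the installed scrapy package itself; a small sketch:

import os
import scrapy

# The project templates ship inside the installed scrapy package,
# so the directory can be derived from scrapy.__file__.
templates_dir = os.path.join(os.path.dirname(scrapy.__file__),
                             'templates', 'project', 'module')
print(templates_dir)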

 

2. Write the custom template files

settings.py.tmpl: the project's settings template

# Scrapy settings for $project_name project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = '$project_name'

SPIDER_MODULES = ['$project_name.spiders']
NEWSPIDER_MODULE = '$project_name.spiders'

'''
Scrapy provides 5 log levels:
CRITICAL - critical errors
ERROR - regular errors
WARNING - warning messages
INFO - informational messages
DEBUG - debugging messages
'''
LOG_LEVEL = 'WARNING'

'''
Some websites do not like being visited by crawlers, so they inspect the client making the connection;
if the visit comes from a crawler, i.e. not a human clicking, they may refuse further access.
So, to keep the program running normally, the crawler needs to hide its identity.
This can be done by setting the User Agent, the header identifying the client software, commonly abbreviated UA.
'''
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = '$project_name (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0'
'''
USER_AGENT = {"User-Agent": random.choice(
    ['Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6',
     'Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5',
     'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
     'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
     'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
     'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
     'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
     'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
     'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)',
     'Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)',
     'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20',
     'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6',
     'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
     'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
     'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1',
     'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)',
     'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12',
     'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)',
     'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1',
     'Mozilla/5.0 (iPhone; CPU iPhone OS 10_3 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.3 Mobile/14E277 Safari/603.1.30',
     'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'])}
'''


'''
Obey robots.txt rules
robots.txt is a file that follows the Robots protocol and is stored on the website's server.
Purpose: it tells search-engine crawlers which directories of the site they are NOT welcome to crawl and index. When Scrapy starts, it fetches the site's robots.txt first and then decides the crawl scope for that site accordingly.
Of course, we are not building a search engine, and in some cases the content we want is exactly what robots.txt forbids. So at times we set this option to False and simply do not obey the Robots protocol.
'''
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1  # delay between downloads, to reduce the risk of getting banned
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    '$project_name.middlewares.${ProjectName}SpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    '$project_name.middlewares.${ProjectName}DownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
# Disable the Telnet console extension (avoids twisted.internet.error.CannotListenError)
EXTENSIONS = {
    'scrapy.extensions.telnet.TelnetConsole': None,
}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    '$project_name.pipelines.${ProjectName}Pipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
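Note that the commented-out USER_AGENT block above sets the setting to a dict and evaluates random.choice only once, when the settings module is loaded (it also references random without importing it), so it would not rotate the user agent per request; Scrapy expects USER_AGENT to be a plain string. If you do want a rotating user agent, the usual approach is a small downloader middleware. A minimal sketch (the class name and the shortened UA list are placeholders, not part of the original template):

import random


class RandomUserAgentMiddleware:
    # Placeholder list; extend it with the user agents you want to rotate.
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
        'Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0',
    ]

    def process_request(self, request, spider):
        # Overwrite the User-Agent header of every outgoing request.
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)

To enable it, the class would be registered in DOWNLOADER_MIDDLEWARES in settings.py.tmpl, e.g. '$project_name.middlewares.RandomUserAgentMiddleware': 400.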

pipelines.py.tmpl: the project's item pipeline template

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
import pymysql

'''
@Author: System
@Date: newTimes
@Description: TODO pipeline configuration
'''
class ${ProjectName}Pipeline:
    '''
    @Author: System
    @Date: newTimes
    @Description: TODO database connection >>> open
    '''

    def open_spider(self, spider):
        # connect to the database
        self.conn = pymysql.connect(
            host='127.0.0.1',  # database server IP
            port=3306,  # server port: an integer, no quotes needed
            user='root',
            password='123456',
            db='database',
            charset='utf8')
        # get a cursor object that can execute SQL statements
        self.cursor = self.conn.cursor()  # by default, result sets are returned as tuples
        print(spider.name, 'database connection opened, spider starting...')

    '''
    @Author: System
    @Date: newTimes
    @Description: TODO database connection >>> close
    '''

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
        # self.file.close()
        print(spider.name, 'database connection closed, spider finished...')

    '''
    @Author: System
    @Date: newTimes
    @Description: TODO items enter the pipeline through this method by default
    '''

    def process_item(self, item, spider):
        print("into Pipeline's process_item")
        if spider.name == 'first_py':  # match the name of the spider whose items should be saved
            print("into pipeline if")
            self.save_test(item)
        else:
            print("into pipeline else")
        return item

    '''
    @Author: System
    @Date: newTimes
    @Description: TODO save data into the test table
    '''

    def save_test(self, item):
        print("into save_test")
        # first check whether the row already exists in the database; insert only if it does not
        # define the SQL statement to execute
        sql_count = 'select count(id) from test where name = %s'
        # bind the parameter and execute the statement
        self.cursor.execute(sql_count, [item['name']])
        # fetch the query result >>> a single row
        results = self.cursor.fetchone()
        if 0 == results[0]:
            try:
                '''
                print(item['name'])
                print(item['type'])
                print(item['content'])
                '''
                sql = "insert into test(name, type, content) values(%s, %s, %s)"
                self.cursor.execute(sql, [item['name'], item['type'], item['content']])
                thisId = self.cursor.lastrowid
                print('test table row saved, id: ' + repr(thisId))
                self.conn.commit()
            except Exception as ex:
                print("出现如下异常%s" % ex)
                print('回滚')
                self.conn.rollback()
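The pipeline above assumes a MySQL table named test with an auto-increment id (used for lastrowid) plus name, type and content columns. A minimal sketch of creating such a table with pymysql, reusing the template's connection parameters (the column types here are assumptions, adjust them to your real data):

import pymysql

conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                       password='123456', db='database', charset='utf8')
with conn.cursor() as cursor:
    # Assumed schema matching the columns used in save_test().
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS test (
            id      INT AUTO_INCREMENT PRIMARY KEY,
            name    VARCHAR(255),
            type    VARCHAR(64),
            content TEXT
        )
    """)
conn.commit()
conn.close()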

middlewares.py.tmpl: the project's middleware template (kept at the Scrapy default, no changes needed)

items.py.tmpl: the custom item (entity) definitions

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

'''
@Author: System
@Date: newTimes
@Description: TODO custom item fields
'''
class ${ProjectName}Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass  # placeholder


'''
@Author: System
@Date: newTimes
@Description: TODO maps to the test table in the database
'''


class TestItem(scrapy.Item):
    # name (maps to the name column of the test table)
    name = scrapy.Field()
    # type
    type = scrapy.Field()
    # detail content
    content = scrapy.Field()
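For reference, a scrapy.Item is used like a dict, which is exactly how the pipeline reads its fields; a small usage sketch (the import assumes the pyProject package created in step 3):

from pyProject.items import TestItem  # adjust the package name to your project

item = TestItem(name='demo', type=1, content='hello')
print(item['name'])        # fields are read dict-style, as the pipeline does
item['content'] = 'world'  # only fields declared with scrapy.Field() can be assigned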

spiders > test.py: a test spider (optional; include it or not as you like)

from urllib.request import urlopen

import scrapy
from bs4 import BeautifulSoup

import requests

'''
@Author: System
@Date: newTimes
@Description: TODO test spider
'''


class TestSpider(scrapy.Spider):
    # must match the spider name used in the run.py launcher
    name = 'test'
    # domains the spider is allowed to crawl
    allowed_domains = ['baidu.com']
    # URLs the crawl starts from
    start_urls = ['https://image.baidu.com/']
    # custom variable
    link = 'https://image.baidu.com'

    '''
    @Author: System
    @Date: newTimes
    @Description: TODO responses are handled by this method by default
    '''
    def parse(self, response, link=link):
        '''ways of fetching the page data in Python >>> start'''
        # option 1: use the Scrapy response
        data = response.text
        # option 2: fetch with requests
        data = requests.get(link)
        data = data.text
        # option 3: fetch with urlopen
        data = urlopen(link).read()
        # Beautiful Soup converts the input document to Unicode and its output to UTF-8
        data = BeautifulSoup(data, "html.parser")
        # option 4: extract with an XPath selector
        data = response.xpath('//div[@id="endText"]').get()
        # Beautiful Soup converts the input document to Unicode and its output to UTF-8
        data = BeautifulSoup(data, 'html.parser')
        print(data)
        '''ways of fetching the page data in Python >>> end'''
        # hand the link off to the getLinkContent callback
        request = scrapy.Request(link, callback=self.getLinkContent)
        # attach extra values to the request
        request.meta['link'] = link
        request.meta['data'] = data
        yield request

    '''
    @Author: System
    @Date: newTimes
    @Description: TODO extract and save the data behind the link
    '''
    def getLinkContent(self, response):
        print('start scraping the XXX link...')
        print(response.meta['link'])
        content = response.xpath('//div[@id="content"]')
        content = "".join(content.extract())
        # instantiate the TestItem class; the fields are those defined in items.py (note: import TestItem yourself)
        items = TestItem(name='name',
                         type=1,
                         content=content)
        # yield the item so it is passed on to the pipeline
        yield items
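Besides the run.py launcher described below, a single spider can also be started from the command line at the project root (the directory that contains scrapy.cfg), which is handy for a quick check:

cd pyProject       # the directory that contains scrapy.cfg
scrapy crawl test  # 'test' is the spider's name attribute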

run.py: a single launcher script for starting the project's spiders

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
'''
@Author: System
@Date: newTimes
@Description: TODO run multiple spiders in the same process
'''
process = CrawlerProcess(get_project_settings())
'''start the project and begin crawling'''
process.crawl('test')

process.start()
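The docstring above mentions running several spiders in one process; with CrawlerProcess that only requires registering each spider before calling start(). A sketch, assuming a hypothetical second spider named 'test2' exists in the project:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
# Register every spider that should run; 'test2' is a hypothetical second spider.
process.crawl('test')
process.crawl('test2')
# start() blocks until all registered spiders have finished.
process.start()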

3. Test the custom template

Create a new project named pyProject with Scrapy: scrapy startproject pyProject
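After the command finishes, the generated project should look roughly like the standard Scrapy layout sketched below; any extra files you placed in the template directory (for example run.py and spiders/test.py) are copied alongside the generated ones:

pyProject/
    scrapy.cfg            # project configuration file
    pyProject/            # the project's Python package
        __init__.py
        items.py          # generated from items.py.tmpl
        middlewares.py    # generated from middlewares.py.tmpl
        pipelines.py      # generated from pipelines.py.tmpl
        settings.py       # generated from settings.py.tmpl
        spiders/
            __init__.py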

Open the newly created pyProject project in PyCharm

Things that need to be adjusted:

1. The database connection settings

2. The import in the test spider has to be added by hand (note: the test spider is for reference only)

3. Configure the run.py launcher

Click Add Configurations in the top-right corner of PyCharm and point the new run configuration at run.py.

Original article: https://www.cnblogs.com/mjtabu/p/13596449.html