Installing and Introducing the Scrapy Framework

1. Installing Scrapy

1.1 Upgrade pip and setuptools first

python -m pip install --upgrade pip
python -m pip install --upgrade setuptools

1.2 Install the third-party libraries (pywin32 is needed only on Windows)

pip install pywin32 -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
pip install constantly -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
pip install queuelib -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
pip install lxml -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
pip install six -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
pip install parsel==1.6.0 -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
pip install itemloaders==1.0.1 -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
pip install incremental==21.3.0 -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
pip install pyopenssl -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
pip install Twisted==21.2.0 -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
pip install Scrapy==2.4.1 -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
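
To confirm that the pinned versions above were actually installed, the standard library can report them. A minimal sketch, assuming Python 3.8 or later (for importlib.metadata); the helper name installed_versions is hypothetical:

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# A few of the packages installed in step 1.2
for name, version in installed_versions(["Scrapy", "Twisted", "parsel", "itemloaders"]).items():
    print(f"{name}: {version or 'not installed'}")
```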

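1.3 Add the scrapy command to the PATH environment variable

Add the Scripts directory of the Python installation to PATH so that scrapy can be run from any directory. Running scrapy with no arguments confirms the installation: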

D:\Program Files\python39\Scripts>scrapy
Scrapy 2.4.1 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  commands
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command
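
Whether the scrapy command is reachable through PATH can also be checked from Python. A minimal sketch using only the standard library; the helper name find_command is hypothetical:

```python
import shutil


def find_command(name):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


path = find_command("scrapy")
if path is None:
    print("scrapy is not on PATH; add Python's Scripts directory to PATH")
else:
    print(f"scrapy found at {path}")
```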

2. Creating a crawler project

2.1 Create a crawler project (myscrapy) from the command line

scrapy startproject myscrapy

2.2 Import the project into PyCharm and select a Python interpreter (a Python virtual environment is recommended)

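# Brief overview of the initial project layout
|-myscrapy
  |-spiders         # spider files; new ones can be created with the scrapy genspider command
    |-__init__.py
  |-__init__.py
  |-items.py        # field definitions for the parsed data
  |-middlewares.py  # middlewares
  |-pipelines.py    # pipelines that process and store the data
  |-settings.py     # project-wide settings
|-scrapy.cfg        # project configuration file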

2.3 Create a spider named funddata

1) In the spiders directory, create a spider file from the command line

scrapy genspider funddata fund.eastmoney.com

2) After the command succeeds, a file named funddata.py is generated in the spiders directory

# -*- coding: utf-8 -*-
# @Time    : 2021/5/23 23:00
# @Author  : chinablue
# @File    : funddata.py
import scrapy


class FunddataSpider(scrapy.Spider):
    # Spider name
    name = 'funddata'
    # Domains the spider is allowed to crawl
    allowed_domains = ['fund.eastmoney.com']
    # Initial request URLs
    start_urls = ['http://fund.eastmoney.com/']

    # Callback that parses the response of each request
    def parse(self, response):
        pass

2.4 Create main.py and run it

1) To make the project easy to run from PyCharm, create a main.py file in the project root directory

# -*- coding: utf-8 -*-
# @Time    : 2021/5/23 23:07
# @Author  : chinablue
# @File    : main.py

import os
import sys

from scrapy.cmdline import execute

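# Put the project root on sys.path so the myscrapy package can be imported
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

# Equivalent to running `scrapy crawl funddata` on the command line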
execute(["scrapy", "crawl", "funddata"])

2) Edit settings.py to turn off robots.txt checking

# Obey robots.txt rules
# ROBOTSTXT_OBEY = True
ROBOTSTXT_OBEY = False
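
With ROBOTSTXT_OBEY = True, Scrapy fetches each site's robots.txt and skips requests that the file disallows; setting it to False turns that check off. What such a check decides can be sketched with the standard library's parser. This is not Scrapy's own implementation, and the robots.txt content and example.com URLs below are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A crawler that obeys robots.txt would skip the second URL
allowed_public = parser.can_fetch("*", "http://example.com/public/page.html")
allowed_private = parser.can_fetch("*", "http://example.com/private/page.html")
print(allowed_public, allowed_private)
```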