【Scrapy】Basic Usage of the Scrapy Crawler Framework

Scrapy is a crawler framework that makes scraping a website quick and simple. It is especially well suited to sites that do not separate front end and back end, where the data is rendered directly into the HTML. This post records its basic usage, taking as an example the scraping of problem IDs and titles from HDU OJ (http://acm.hdu.edu.cn).

For reference, see https://www.jianshu.com/p/7dee0837b3d2

Installing Scrapy

  • Install with pip

    pip install scrapy
    

Writing the Code

  • Create a project named myspider (the generated layout is shown below)

    scrapy startproject myspider
    
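  The command generates Scrapy's standard project skeleton, which looks like this:

    myspider/
        scrapy.cfg            # deployment configuration
        myspider/
            __init__.py
            items.py          # item definitions
            middlewares.py
            pipelines.py      # item pipelines
            settings.py       # project settings
            spiders/
                __init__.py
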
  • Create a spider named hdu for the site acm.hdu.edu.cn; the template it generates is shown below

    scrapy genspider hdu acm.hdu.edu.cn
    
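  For reference, genspider fills the new file with Scrapy's default spider template, which looks roughly like this (the exact boilerplate depends on the Scrapy version):

    # -*- coding: utf-8 -*-
    import scrapy


    class HduSpider(scrapy.Spider):
        name = 'hdu'
        allowed_domains = ['acm.hdu.edu.cn']
        start_urls = ['http://acm.hdu.edu.cn/']

        def parse(self, response):
            pass
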
  • Running the command above creates hdu.py under the spiders folder; modify its code as follows (a standalone sketch of the parsing step comes after the block):

    import scrapy


    class HduSpider(scrapy.Spider):
        # spider name
        name = 'hdu'
        # domains the spider is allowed to crawl
        allowed_domains = ['acm.hdu.edu.cn']
        # page the spider starts from
        start_urls = ['http://acm.hdu.edu.cn/listproblem.php?vol=1']

        # scraping logic
        def parse(self, response):
            # the problem list is generated by the second <script> on the page;
            # first pull the text of every script into problem_list
            problem_list = response.xpath('//script/text()').extract()
            # the problem list is the second entry (index 1); split it on semicolons
            problems = problem_list[1].split(";")
            # print each entry to the console; nothing is handed to a pipeline yet
            for item in problems:
                print(item)
    
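  For context, the second script on listproblem.php builds the problem table from JavaScript calls shaped like p(1,1000,0,"A + B Problem",...); the trailing arguments are omitted here. A minimal standalone sketch of the split-and-extract step, run on a made-up two-entry string:

    import re

    # hypothetical sample of the script text; the numeric arguments are invented
    script_text = 'p(1,1000,0,"A + B Problem",1,2);p(1,1001,0,"Sum Problem",3,4);'
    for entry in script_text.split(";"):
        if not entry.strip():  # the trailing ";" leaves an empty entry behind
            continue
        args = re.findall(r'[(](.*)[)]', entry)[0].split(",")
        print(args[1], args[3])  # 1000 "A + B Problem" / 1001 "Sum Problem"
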
  • In items.py, define a class for the problem item (a quick usage note follows the snippet)

    class ProblemItem(scrapy.Item):
        id = scrapy.Field()
        title = scrapy.Field()
    
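  A scrapy.Item behaves much like a dict, which is what the pipeline below relies on when it calls dict(item). A quick check:

    from myspider.items import ProblemItem

    p = ProblemItem(id='1000', title='A + B Problem')
    print(dict(p))  # {'id': '1000', 'title': 'A + B Problem'}
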
  • In pipelines.py, create an item pipeline that saves the data to the hdu.json file

    # -*- coding: utf-8 -*-

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    import json


    # example pipeline kept from an earlier exercise; not used by this spider
    class ItcastPipeline(object):
        def __init__(self):
            self.filename = open("teacher.json", "wb+")

        def process_item(self, item, spider):
            # write each item as one JSON object per line
            jsontext = json.dumps(dict(item), ensure_ascii=False) + "\n"
            self.filename.write(jsontext.encode("utf-8"))
            return item

        def close_spider(self, spider):
            self.filename.close()


    class HduPipeline(object):
        full_json = ''

        def __init__(self):
            self.filename = open("hdu.json", "wb+")
            self.filename.write("[".encode("utf-8"))

        def process_item(self, item, spider):
            # accumulate items in memory; they are written out when the spider closes
            json_text = json.dumps(dict(item), ensure_ascii=False) + ",\n"
            self.full_json += json_text
            return item

        def close_spider(self, spider):
            # strip the trailing comma so the file is valid JSON
            self.filename.write(self.full_json.rstrip(",\n").encode("utf-8"))
            self.filename.write("]".encode("utf-8"))
            self.filename.close()

    
  • Register the pipeline in settings.py (a simpler feed-export alternative is noted below)

    ITEM_PIPELINES = {
       'myspider.pipelines.HduPipeline': 300
    }
    # do not obey the site's robots.txt rules
    ROBOTSTXT_OBEY = False
    
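  As an aside, Scrapy's built-in feed exports can write the items to a valid JSON file without any hand-written pipeline; the custom pipeline above is kept here to show how pipelines work:

    scrapy crawl hdu -o hdu.json
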
  • Modify hdu.py so that items are handed to the pipeline

    # -*- coding: utf-8 -*-
    import scrapy
    import re
    from myspider.items import ProblemItem


    class HduSpider(scrapy.Spider):
        name = 'hdu'
        allowed_domains = ['acm.hdu.edu.cn']
        start_urls = ['http://acm.hdu.edu.cn/listproblem.php?vol=1']

        def parse(self, response):
            problem_list = response.xpath('//script/text()').extract()
            problems = problem_list[1].split(";")
            for item in problems:
                # pull the argument list out of each p(...) call
                p = re.compile(r'[(](.*)[)]', re.S)
                str1 = re.findall(p, item)[0]
                detail = str1.split(",")
                # build a fresh item per problem and yield it to the pipeline
                hdu = ProblemItem()
                hdu['id'] = detail[1]
                hdu['title'] = detail[3]
                yield hdu
    
    
  • Run the spider, writing the log to all.log

    scrapy crawl hdu -s LOG_FILE=all.log
    
  • The first page of problem titles now shows up in the hdu.json file

    {"id": "1000", "title": ""A + B Problem""}
    {"id": "1001", "title": ""Sum Problem""}
    {"id": "1002", "title": ""A + B Problem II""}
    {"id": "1003", "title": ""Max Sum""}
    {"id": "1004", "title": ""Let the Balloon Rise""}
    {"id": "1005", "title": ""Number Sequence""}
    
    ...
    
    {"id": "1099", "title": ""Lottery ""}
    
    
  • Modify hdu.py again so that it crawls the content of every valid page

    # -*- coding: utf-8 -*-
    import scrapy
    import re
    from myspider.items import ProblemItem


    class HduSpider(scrapy.Spider):
        name = 'hdu'
        allowed_domains = ['acm.hdu.edu.cn']
        # download_delay = 1  # optionally wait between requests
        base_url = 'http://acm.hdu.edu.cn/listproblem.php?vol=%s'
        start_urls = ['http://acm.hdu.edu.cn/listproblem.php']

        # spider entry point
        def parse(self, response):
            # first collect all valid page numbers from the pager in the footer
            real_pages = response.xpath('//p[@class="footer_link"]/font/a/text()').extract()
            for page in real_pages:
                url = self.base_url % page
                yield scrapy.Request(url, callback=self.parse_problem)

        def parse_problem(self, response):
            # extract the useful fields from the script text
            problem_list = response.xpath('//script/text()').extract()
            problems = problem_list[1].split(";")
            for item in problems:
                # HDU has invalid empty entries; skip them
                if not item.strip():
                    continue
                p = re.compile(r'[(](.*)[)]', re.S)
                str1 = re.findall(p, item)
                detail = str1[0].split(",")
                hdu = ProblemItem()
                hdu['id'] = detail[1]
                hdu['title'] = detail[3]
                yield hdu
    
  • Run the command again, writing the log to all.log

    scrapy crawl hdu -s LOG_FILE=all.log
    
  • Now every problem title on every valid page gets scraped. Note, however, that the items are not in numeric order: Scrapy schedules requests concurrently, so the responses (and therefore the items) arrive in nondeterministic order. A small sorting sketch follows the sample below

    [{"id": "4400", "title": ""Mines""},
    {"id": "4401", "title": ""Battery""},
    {"id": "4402", "title": ""Magic Board""},
    {"id": "4403", "title": ""A very hard Aoshu problem""},
    {"id": "4404", "title": ""Worms""},
    {"id": "4405", "title": ""Aeroplane chess""},
    {"id": "4406", "title": ""GPA""},
    {"id": "4407", "title": ""Sum""},
    
    ...
    
    {"id": "1099", "title": ""Lottery ""},
    ]
    
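  Since the output order is nondeterministic, a minimal post-processing sketch (assuming the hdu.json produced above) can restore numeric order:

    import json

    with open("hdu.json", encoding="utf-8") as f:
        problems = json.load(f)
    # sort by the numeric problem id
    problems.sort(key=lambda item: int(item["id"]))
    print(problems[0])  # {'id': '1000', 'title': '"A + B Problem"'}
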
    
  • So far the results only go to a text file; loading them into a database will come later and is skipped in this post

Original post (in Chinese): https://www.cnblogs.com/axiangcoding/p/12096894.html