Scrapy: starting multiple spiders

The common way to start a spider

scrapy crawl spider_name
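This starts one spider per command. If that is all you need, a small wrapper script can simply launch one scrapy crawl process per spider; a minimal sketch, assuming the hypothetical spider names spider_one and spider_two and that it is run from the project directory:

import subprocess

# hypothetical spider names; replace with the names registered in your project
spider_names = ["spider_one", "spider_two"]

# launch one "scrapy crawl <name>" process per spider, then wait for all of them
procs = [subprocess.Popen(["scrapy", "crawl", name]) for name in spider_names]
for proc in procs:
    proc.wait()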

Start-up methods from the official documentation

Starting from a script with CrawlerProcess

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess(settings={
    "FEEDS": {
        "items.json": {"format": "json"},
    },
})

process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished
Save the script as run.py and run it with python run.py.
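The snippet above schedules only one spider, but CrawlerProcess can schedule several crawls in the same process before start() is called. A minimal self-contained sketch with two hypothetical spiders:

import scrapy
from scrapy.crawler import CrawlerProcess

class BlogSpider(scrapy.Spider):
    name = "blog"  # hypothetical spider
    start_urls = ["https://example.com/"]

    def parse(self, response):
        yield {"title": response.css("title::text").get()}

class NewsSpider(scrapy.Spider):
    name = "news"  # hypothetical spider
    start_urls = ["https://example.org/"]

    def parse(self, response):
        yield {"title": response.css("title::text").get()}

process = CrawlerProcess()
process.crawl(BlogSpider)  # schedule the first crawl
process.crawl(NewsSpider)  # schedule the second crawl
process.start()  # blocks until both crawls have finished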

Starting with CrawlerRunner (recommended)

import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
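# join() returns a Deferred that fires once every scheduled crawl has finished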
d = runner.join()
d.addBoth(lambda _: reactor.stop())

reactor.run() # the script will block here until all crawling jobs are finished
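The two spiders above run in parallel inside the same reactor. If they should run one after another instead, the Scrapy docs chain the crawl deferreds with Twisted's inlineCallbacks; a sketch reusing MySpider1 and MySpider2 from above:

from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)  # finishes before the next crawl starts
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl is finished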

Starting all spiders with a custom command

Add the following to settings.py:

COMMANDS_MODULE = "commands"

Create a commands/startall.py file in the same directory as scrapy.cfg.

The Scrapy version used here is 2.2.0; if you are on 1.8.0, adapt the command by referring to scrapy/commands/crawl.py in that version.
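The layout then looks roughly like this (the project package name myproject is an assumption; the commands directory should be a regular Python package, i.e. contain an __init__.py):

├── scrapy.cfg
├── commands
│   ├── __init__.py
│   └── startall.py
└── myproject
    ├── settings.py
    └── spiders

startall.py defines the command: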

from scrapy.commands import BaseRunSpiderCommand

class Command(BaseRunSpiderCommand):
    requires_project = True

    def syntax(self):
        return "[options] <spider>"

    def short_desc(self):
        return "Run all spider"

    def run(self, args, opts):
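        # schedule every spider registered in the project, then start them all together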
        for spider_name in sorted(self.crawler_process.spider_loader.list()):
            self.crawler_process.crawl(spider_name, **opts.spargs)
        self.crawler_process.start()
        if self.crawler_process.bootstrap_failed:
            self.exitcode = 1
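With this in place, running scrapy startall from the project root schedules every spider found by the spider loader and runs them all in the same process.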
Original article: https://www.cnblogs.com/iFanLiwei/p/13257462.html