Customizing Start Requests in Scrapy

How the Scrapy engine pulls the start URLs from the spider (a conceptual sketch of this flow follows the list):

1. Call the spider's start_requests method (defined on the parent class) and take its return value.

2. Turn that return value into an iterator with iter().

3. Call __next__() on the iterator to pull requests one at a time.

4. Put every request it yields into the scheduler.
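To make the flow concrete, here is a rough, purely illustrative sketch of how those four steps consume whatever start_requests returns; the toy generator and print call stand in for Scrapy internals and are not the real engine code.

# Conceptual sketch only -- not Scrapy's actual source code.
def start_requests():
    # A toy generator; a real spider would yield scrapy.Request objects.
    for url in ['https://xx.com/search?q=a', 'https://xx.com/search?q=b']:
        yield url

requests = iter(start_requests())   # steps 1-2: call the method, wrap the result in iter()
while True:
    try:
        request = next(requests)    # step 3: __next__() pulls the next item
    except StopIteration:
        break
    print('enqueue:', request)      # step 4: Scrapy would hand the request to the scheduler here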

To customize the start requests, override the start_requests method in your spider class:

from scrapy import Request, Spider
from urllib.parse import quote

class XXSpider(Spider):
    name = 'XX'
    allowed_domains = ['www.xx.com']
    base_url = 'https://xx.com/search?q='

    def start_requests(self):
        # KEYWORDS and MAX_PAGE come from the project settings (settings.py)
        for key in self.settings.get('KEYWORDS'):
            for page in range(1, self.settings.get('MAX_PAGE') + 1):
                url = self.base_url + quote(key)
                # The page number rides along in meta for later use (callback or
                # middleware); dont_filter=True stops the duplicate filter from
                # dropping requests that share the same URL.
                yield Request(url=url, callback=self.parse, meta={'page': page}, dont_filter=True)

Note: delete the original start_urls attribute, since start_requests now replaces it.
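The spider above reads KEYWORDS and MAX_PAGE from the project settings, and each request carries its page number in meta. A minimal sketch of both pieces, with placeholder keyword/page values and a hypothetical parse callback:

# settings.py -- placeholder values, adjust to your project
KEYWORDS = ['ipad', 'iphone']
MAX_PAGE = 3

# In the spider: the page number set in start_requests is available
# on the response via response.meta.
def parse(self, response):
    page = response.meta['page']
    self.logger.info('Parsed page %s of %s', page, response.url)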

Original article (Chinese): https://www.cnblogs.com/wt7018/p/11745303.html