In HTTP, the response status code 429 Too Many Requests indicates that the user has sent too many requests within a given window of time, i.e. has exceeded a rate limit.
The response may include a Retry-After header telling the client how long to wait before sending a new request. An example response:
HTTP/1.1 429 Too Many Requests
Content-Type: text/html
Retry-After: 3600

<html>
  <head>
    <title>Too Many Requests</title>
  </head>
  <body>
    <h1>Too Many Requests</h1>
    <p>I only allow 50 requests per hour to this Web site per
       logged in user. Try again soon.</p>
  </body>
</html>

---------------------
Author: 爬不下来就自闭
Source: CSDN
Original: https://blog.csdn.net/weixin_43870646/article/details/90671642
Copyright notice: this is the author's original post; please include a link to it when reposting.
The server does not reject requests outright or ban the IP; it only limits the request rate. So we should respect the server's settings and lower our request frequency appropriately rather than try to defeat the limit.
That said, some people (myself included) just won't take no for an answer, so when fetching large amounts of data there are still measures you can take to probe right at the edge.
We can modify Scrapy's downloader middleware so that when this error is received, the crawler pauses, waits a while, and then resumes.
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message
import time


class TooManyRequestsRetryMiddleware(RetryMiddleware):

    def __init__(self, crawler):
        super(TooManyRequestsRetryMiddleware, self).__init__(crawler.settings)
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        elif response.status == 429:
            # Pause the whole engine, sleep out the rate-limit window,
            # then resume. If the limit resets every minute, 60 seconds
            # is enough; adjust to match the server's window.
            self.crawler.engine.pause()
            time.sleep(60)
            self.crawler.engine.unpause()
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        elif response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        return response
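The hard-coded 60-second sleep ignores the Retry-After header the server may send (as in the example response above). One possible refinement is a small helper that derives the wait time from that header instead. The sketch below assumes Retry-After carries either a number of seconds or an HTTP-date, per the HTTP specification; `parse_retry_after` is a hypothetical helper name, not part of Scrapy.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(value, default=60):
    """Turn a Retry-After header value into a wait time in seconds.

    Retry-After may be an integer number of seconds or an HTTP-date;
    anything unparseable falls back to `default`.
    """
    if value is None:
        return default
    # Case 1: delay in seconds, e.g. "3600".
    try:
        return max(0, int(value))
    except ValueError:
        pass
    # Case 2: an HTTP-date, e.g. "Wed, 21 Oct 2026 07:28:00 GMT".
    try:
        target = parsedate_to_datetime(value)
        return max(0, (target - datetime.now(timezone.utc)).total_seconds())
    except (TypeError, ValueError):
        return default
```

Inside the middleware, the fixed `time.sleep(60)` could then become something like `time.sleep(parse_retry_after(response.headers.get('Retry-After')))`, keeping 60 seconds as the fallback when the server sends no header.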
Finally, add 429 to the retryable status codes in settings.py. Note that assigning a new list replaces Scrapy's default RETRY_HTTP_CODES (500, 502, 503, etc.), so extend the defaults instead if you still want those codes retried.

RETRY_HTTP_CODES = [429]
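For the custom middleware to take effect, it also has to be registered in settings.py and the built-in RetryMiddleware disabled, so the two do not retry the same request twice. A minimal sketch, assuming the class lives in a module named `myproject.middlewares` (adjust the path to your project):

```python
# settings.py
# Disable the built-in retry middleware and register the custom one
# at the same priority slot (543 is the built-in's default order).
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'myproject.middlewares.TooManyRequestsRetryMiddleware': 543,
}

RETRY_HTTP_CODES = [429]
```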