Parsing a site's robots.txt to check whether crawling is allowed

Given a user agent and a URL, urllib's robotparser can tell you whether that page may be crawled:

>>> from urllib import robotparser
>>> rb = robotparser.RobotFileParser()
>>> rb.set_url("https://www.jd.com/robots.txt")
>>> rb.read()  # download and parse the robots.txt file
>>> url = "https://www.jd.com"
>>> user_agent = "HuihuiSpider"
>>> rb.can_fetch(user_agent, url)  # HuihuiSpider is explicitly disallowed by JD's robots.txt
False
>>> rb.can_fetch("sougou", url)  # an agent with no matching Disallow rule may fetch the homepage
True
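In practice a crawler would run this check (plus any Crawl-delay rule) before every request. Below is a minimal sketch of that pattern; the polite_fetch helper, its default user agent string, and the use of urllib.request are illustrative assumptions, not part of the original post. RobotFileParser.crawl_delay() returns the site's Crawl-delay value for the given agent, or None if none is declared.

import time
from urllib import robotparser, request

def polite_fetch(url, robots_url, user_agent="MySpider"):
    # polite_fetch and the default user_agent are hypothetical names for illustration
    rb = robotparser.RobotFileParser()
    rb.set_url(robots_url)
    rb.read()  # download and parse robots.txt
    if not rb.can_fetch(user_agent, url):
        return None  # disallowed by robots.txt, so skip this URL
    delay = rb.crawl_delay(user_agent)  # None if no Crawl-delay rule applies
    if delay:
        time.sleep(delay)  # respect the site's requested crawl interval
    req = request.Request(url, headers={"User-Agent": user_agent})
    with request.urlopen(req) as resp:
        return resp.read()

page = polite_fetch("https://www.jd.com", "https://www.jd.com/robots.txt")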


Original post: https://www.cnblogs.com/g2thend/p/12631747.html