Parsing a site's robots.txt to check whether crawling is allowed

Given a user agent and a URL, urllib's robotparser can tell you whether that page may be crawled:

>>> from urllib import robotparser
>>> rb = robotparser.RobotFileParser()
>>> rb.set_url("https://www.jd.com/robots.txt")
>>> rb.read()  # download and parse the robots.txt file
>>> url = "https://www.jd.com"
>>> user_agent = "HuihuiSpider"
>>> rb.can_fetch(user_agent, url)  # HuihuiSpider is explicitly disallowed by JD's robots.txt
False
>>> rb.can_fetch("sougou", url)  # an agent with no matching Disallow rule may fetch the homepage
True
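In practice a crawler would run this check (plus any Crawl-delay rule) before every request. Below is a minimal sketch of that pattern; the polite_fetch helper, its default user agent string, and the use of urllib.request are illustrative assumptions, not part of the original post. RobotFileParser.crawl_delay() returns the site's Crawl-delay value for the given agent, or None if none is declared.

import time
from urllib import robotparser, request

def polite_fetch(url, robots_url, user_agent="MySpider"):
    # polite_fetch and the default user_agent are hypothetical names for illustration
    rb = robotparser.RobotFileParser()
    rb.set_url(robots_url)
    rb.read()  # download and parse robots.txt
    if not rb.can_fetch(user_agent, url):
        return None  # disallowed by robots.txt, so skip this URL
    delay = rb.crawl_delay(user_agent)  # None if no Crawl-delay rule applies
    if delay:
        time.sleep(delay)  # respect the site's requested crawl interval
    req = request.Request(url, headers={"User-Agent": user_agent})
    with request.urlopen(req) as resp:
        return resp.read()

page = polite_fetch("https://www.jd.com", "https://www.jd.com/robots.txt")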


Original post: https://www.cnblogs.com/g2thend/p/12631747.html