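What is the Robots protocol?

In the age of data, web crawling carries some legal risk, but as long as we follow the protocol and know which data must not be crawled, we can largely avoid it.

Most websites publish Robots rules; a site that has none places no restrictions on crawlers under this protocol.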

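    Robots Exclusion Standard: the standard protocol for excluding web crawlers

Purpose:

    Tells web crawlers which pages they may crawl and which they may not

Form:

    A robots.txt file in the website's root directory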
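Because the file always sits at the root, its address can be derived from any page URL on the same site. The following is a minimal Python sketch of that rule; the function name `robots_url` and the example.com address are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page URL, used only to illustrate the rule.
print(robots_url("https://www.example.com/articles/page.html"))
# prints https://www.example.com/robots.txt
```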

Take YouTube as an example:

    https://www.youtube.com/robots.txt

Opening it shows the following:

# robots.txt file for YouTube
# Created in the distant future (the year 2000) after
# the robotic uprising of the mid 90's which wiped out all humans.

User-agent: Mediapartners-Google*
Disallow:

User-agent: *
Disallow: /channel/*/community
Disallow: /comment
Disallow: /get_video
Disallow: /get_video_info
Disallow: /live_chat
Disallow: /login
Disallow: /results
Disallow: /signup
Disallow: /t/terms
Disallow: /timedtext_video
Disallow: /user/*/community
Disallow: /verify_age
Disallow: /watch_ajax
Disallow: /watch_fragments_ajax
Disallow: /watch_popup
Disallow: /watch_queue_ajax

Sitemap: https://www.youtube.com/sitemaps/sitemap.xml

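    Here # starts a comment, * matches everything, and / stands for the root directory

    User-agent names the crawler that the rules beneath it apply to, matched against the User-agent header the crawler sends. Each Disallow line gives a path that crawler should not fetch; an empty Disallow (as for Mediapartners-Google*) blocks nothing.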
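To see how a crawler might apply these rules before requesting a page, here is a small sketch using `urllib.robotparser` from Python's standard library. It parses a few rules copied from the YouTube file as a string, so it needs no network access. The crawler name `MyCrawler` and the `/about/` path are assumptions for illustration, and the sample sticks to non-wildcard paths because the standard-library parser compares paths by prefix.

```python
from urllib.robotparser import RobotFileParser

# A few non-wildcard rules copied from the YouTube file shown above.
RULES = """\
User-agent: *
Disallow: /comment
Disallow: /login
Disallow: /results
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt content that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(RULES)
BOT = "MyCrawler"  # hypothetical crawler name

# /login is listed under Disallow, so a polite crawler should skip it.
print(parser.can_fetch(BOT, "https://www.youtube.com/login"))   # False
# /about/ (an illustrative path) matches no Disallow rule, so it is allowed.
print(parser.can_fetch(BOT, "https://www.youtube.com/about/"))  # True
```

In a real crawler, `set_url()` followed by `read()` would download the live file instead of parsing a string.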

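Finally, the Robots protocol is a recommendation, not a binding constraint: a crawler can technically ignore it, but doing so carries legal risk.

    Please follow the Robots protocol and help keep the web a healthy environment.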