[Python Study Diary] Bilibili 小甲鱼: Web Crawlers

Web Spider

How Python accesses the Internet

URL + lib --> urllib

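　　The general format of a URL is protocol://hostname[:port]/path/[;parameters][?query]#fragment, where the parts in [] are optional.

　　A URL consists of three parts:

　　　　The first part is the protocol.

　　　　The second part is the domain name or IP address of the server that hosts the resource (sometimes a port number must be included; every transfer protocol has a default port).

　　　　The third part is the specific location of the resource, such as a directory or file name.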
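　　As a small illustration of these parts, the standard library's urllib.parse.urlparse can split a URL into its components. This is only a sketch; the example.com URL below is made up for the demonstration and is not from the tutorial.

```python
from urllib.parse import urlparse

# Hypothetical URL, chosen only to show every component
parts = urlparse('http://www.example.com:8080/docs/index.html;type=a?name=cat#top')

print(parts.scheme)    # protocol: http
print(parts.hostname)  # server: www.example.com
print(parts.port)      # optional port: 8080
print(parts.path)      # resource location: /docs/index.html
print(parts.params)    # optional parameters: type=a
print(parts.query)     # optional query: name=cat
print(parts.fragment)  # fragment: top
```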

urllib is a Python package.

　　The following program fetches the page data of the Baidu News home page.

import urllib.request

response = urllib.request.urlopen('http://news.baidu.com/')
html = response.read()
html = html.decode('utf-8')
print(html)

　　response.read() returns raw bytes, so the result has to be decoded with utf-8 to get a string.
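　　Not every page is encoded in utf-8. A possible sketch, assuming the server declares its encoding in the Content-Type header, is to decode with the declared charset and fall back to utf-8 otherwise. The helper name decode_body is hypothetical and not part of urllib.

```python
def decode_body(raw, declared_charset=None, fallback='utf-8'):
    """Decode response bytes with the declared charset, or a fallback.

    decode_body is a hypothetical helper written for this note.
    """
    charset = declared_charset or fallback
    # errors='replace' avoids a crash on the occasional bad byte
    return raw.decode(charset, errors='replace')


# With a real response (needs network access), the declared charset
# can be read from the headers:
#   charset = response.headers.get_content_charset()
#   html = decode_body(response.read(), charset)

print(decode_body('你好'.encode('gbk'), 'gbk'))
```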

Exercise: save a kitten picture from placekitten

import urllib.request

response = urllib.request.urlopen('http://placekitten.com/g/500/600')
cat_img = response.read()
with open('cat_500_600.jpg','wb') as f:
    f.write(cat_img)

　　Note that the argument to urlopen can be either a string or a Request object.

　　So the code can also be written by instantiating a Request first:

import urllib.request

req = urllib.request.Request('http://placekitten.com/g/500/600')
response = urllib.request.urlopen(req)

cat_img = response.read()
with open('cat_500_600.jpg', 'wb') as f:
    f.write(cat_img)

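　　Submitting a POST form to Youdao Translate with Python

　　I tried to crawl Youdao Translate this way, but it did not work, because Youdao has added an anti-crawling mechanism based on the salt and sign fields.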

import urllib.request
import urllib.parse

url1 = 'http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule'
data = {'i': '你好!', 'type': 'AUTO', 'doctype': 'json', 'version': '2.1', 'keyfrom': 'fanyi.web', 'ue': 'UTF-8',
        'typoresult': 'true'}
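data = urllib.parse.urlencode(data).encode('utf-8')  # URL-encode the form, then convert it to bytes

response = urllib.request.urlopen(url1, data)  # passing data makes this a POST request; returns the response
html = response.read().decode('utf-8')      # read() returns UTF-8 encoded bytes; decode them into a str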
print(html)
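　　To see what the encoding step and the data argument actually do, the following sketch builds a request without sending it. The URL and form fields are placeholders, not the real Youdao API.

```python
import urllib.parse
import urllib.request

# Placeholder form and URL, for illustration only
form = {'i': 'hello', 'doctype': 'json'}
data = urllib.parse.urlencode(form).encode('utf-8')
print(data)  # b'i=hello&doctype=json'

# A Request carrying data is sent as POST; without data it is a GET
post_req = urllib.request.Request('http://example.com/translate', data=data)
get_req = urllib.request.Request('http://example.com/translate')
print(post_req.get_method())  # POST
print(get_req.get_method())   # GET
```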

Request has a headers parameter, which takes a dictionary.

The headers can be set in two ways:

　　1. Through the headers parameter of Request

　　2. Through the Request.add_header() method
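The two ways listed above can be sketched as follows. The URL and the shortened User-Agent string are placeholders; no request is actually sent.

```python
import urllib.request

url = 'http://example.com/'  # placeholder URL
user_agent = 'Mozilla/5.0'   # shortened User-Agent, for illustration

# Way 1: pass a dictionary through the headers parameter
req1 = urllib.request.Request(url, headers={'User-Agent': user_agent})

# Way 2: create the Request first, then call add_header()
req2 = urllib.request.Request(url)
req2.add_header('User-Agent', user_agent)

# Request stores header names capitalized, so look them up as 'User-agent'
print(req1.get_header('User-agent'))
print(req2.get_header('User-agent'))
```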

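To make the crawler behave more like a human visitor, you can:

1. Use the time module to pause between requests, which limits how many times a single IP hits the server per unit of time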

　　import time

　　...

　　time.sleep(5)

2. Use a proxy

　　Access the server through a proxy.

　　1. The argument is a dictionary: {'type': 'proxy ip:port'}

　　　　proxy_support = urllib.request.ProxyHandler({})

　　2. Build a custom opener

　　　　opener = urllib.request.build_opener(proxy_support)

　　3.1. Install the opener (later urlopen calls will then use it)

　　　　urllib.request.install_opener(opener)

　　3.2. Or call the opener directly

　　　　opener.open(url)
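The two techniques above (pausing between requests and going through a proxy) can be combined. The sketch below is one possible arrangement, not the tutorial's code: build_proxy_opener and fetch_politely are hypothetical helper names, and 127.0.0.1:8080 is a placeholder proxy address. Nothing is sent over the network until fetch_politely is called with a real fetch function.

```python
import time
import urllib.request


def build_proxy_opener(proxy_address, user_agent):
    """Return an opener that routes HTTP requests through the given proxy."""
    proxy_support = urllib.request.ProxyHandler({'http': proxy_address})
    opener = urllib.request.build_opener(proxy_support)
    # addheaders must be a list of (name, value) tuples
    opener.addheaders = [('User-Agent', user_agent)]
    return opener


def fetch_politely(urls, fetch, delay=5.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between calls."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # wait before every request except the first
        pages.append(fetch(url))
    return pages


# Placeholder proxy address; replace it with a proxy that actually works
opener = build_proxy_opener('127.0.0.1:8080', 'Mozilla/5.0')

# Real usage (needs network access and a working proxy):
#   pages = fetch_politely(['http://example.com/'],
#                          lambda u: opener.open(u).read())
```

Passing fetch and sleep as arguments keeps the pacing logic separate from the network call, so it can be tried out with a dummy function first.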

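The sites used in the tutorial have all added sophisticated anti-crawling mechanisms by now, so the following program did not run successfully.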

import urllib.request

url = 'http://www.whatismyip.com.tw'
proxy_support = urllib.request.ProxyHandler({'http': '221.122.91.66:80'})

opener = urllib.request.build_opener(proxy_support)
# addheaders must be a list of (name, value) tuples, not a set
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                                    'Chrome/84.0.4147.105 Safari/537.36')]

urllib.request.install_opener(opener)

response = urllib.request.urlopen(url)

html = response.read().decode('utf-8')

print(html)