Python Web Scraping (2): The Python 3 Built-in Module Urllib

Preface

For the material on Urllib below, anyone who wants a deeper understanding is strongly encouraged to consult the official documentation.

English version: Urllib https://docs.python.org/3/library/urllib.html

Urllib

  • Urllib is Python's built-in HTTP request library and includes the following modules:
    • urllib.request: the request module
    • urllib.error: the exception-handling module
    • urllib.parse: the URL-parsing module
    • urllib.robotparser: the robots.txt-parsing module

urlopen

urlopen can be used for simple requests that do not require setting any header information.

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

  • The main parameters to set are url, data, and timeout. Let's look at the code:
import urllib.request
import urllib.parse
import urllib.error
import socket

"""
1.url:就是打开的测试地址  http://httpbin.org
2.data:发送post请求必须设置的参数,通过bytes(urllib.parse.urlencode())可以将post的数据进行转换放到urllib.request.urlopen的data参数中。
3.timeout:是一个超时设置,超时则抛出异常
"""
data = bytes(urllib.parse.urlencode({'word':'hello'}), encoding='utf8')
try:
    response = urllib.request.urlopen(url='http://httpbin.org/post', data=data, timeout=5)
    print(response.read())
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('Timed out...')
  • The response object returned by urllib.request.urlopen
    urlopen returns an http.client.HTTPResponse object, e.g. <http.client.HTTPResponse object at 0x0331E430>; response.read() returns the content of the response body (a decoding sketch follows the example below).
import urllib.request

response = urllib.request.urlopen(url='https://www.baidu.com/')
print(type(response))
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))
"""
<class 'http.client.HTTPResponse'>
200
[('Accept-Ranges', 'bytes'), ('Cache-Control', 'no-cache'),  ......]
BWS/1.1
"""

Request

If you need to set header information on a request, use the Request class. The focus here is on how to add request headers:

from urllib import request, parse

url = 'http://httpbin.org/post'
# Method 1: construct a headers dict and pass it to Request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36',
    'Host': 'httpbin.org'
}
# Method 2: call req.add_header() on the Request object (shown in full after this block)

# named form_data to avoid shadowing the built-in name dict
form_data = {
    'word': 'hello'
}
data = bytes(parse.urlencode(form_data), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
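
The second approach mentioned in the comment above, written out in full: build the Request first, then attach headers one at a time with add_header(). A minimal sketch:

from urllib import request, parse

data = bytes(parse.urlencode({'word': 'hello'}), encoding='utf8')
req = request.Request(url='http://httpbin.org/post', data=data, method='POST')
# Attach each header individually instead of passing a headers dict
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36')
response = request.urlopen(req)
print(response.read().decode('utf-8'))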

Advanced Usage

  • Proxies: ProxyHandler
    Some sites limit how many times or how often a single IP may visit, so you need to be able to switch IPs at any time to keep the crawler from erroring out and stopping.
import urllib.request

proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://httpbin.org/get')
print(response.read())
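
If every subsequent request should go through the proxy, you can also install the opener globally so that plain urllib.request.urlopen() uses it; a minimal sketch:

import urllib.request

proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = urllib.request.build_opener(proxy_handler)
# Make this opener the process-wide default used by urlopen()
urllib.request.install_opener(opener)
response = urllib.request.urlopen('http://httpbin.org/get')
print(response.read())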
  • Cookies: HTTPCookieProcessor
    Cookies store login state; saving them with http.cookiejar makes them convenient to reuse (a sketch of persisting them to a file follows the output below). More often, for sites whose cookies are hard to obtain, we use Selenium to fetch the cookies and then crawl by other, more efficient means.
import http.cookiejar, urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name+"="+item.value)
"""
BAIDUID=68AF7F00874AE2D8206AC4B524B49EAB:FG=1
BIDUPSID=68AF7F00874AE2D8206AC4B524B49EAB
H_PS_PSSID=1451_21090_18559_29064_28519_29098_28836_28584_26350
PSTM=1558969682
delPer=0
BDSVRTM=0
BD_HOME=0
"""

Exception Handling

Handle exceptions so that a 404 or 500 response does not bring the crawler to a halt.
The two exception classes are URLError and HTTPError; HTTPError is a subclass of URLError, so it must be caught first.

  • URLError: reason
  • HTTPError: code, reason, headers
from urllib import request,error

try:
    response = request.urlopen("http://pythonsite.com/1111.html")
except error.HTTPError as e:
    print(e.reason)
    print(e.code)
    print(e.headers)
except error.URLError as e:
    print(e.reason)
else:
    print("reqeust successfully")
"""
Not Found
404
Date: Mon, 27 May 2019 15:12:43 GMT
Server: Apache
Vary: Accept-Encoding
Content-Length: 207
Connection: close
Content-Type: text/html; charset=iso-8859-1
"""

Utility Module urlparse

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

from urllib.parse import urlparse

o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html', scheme='https')
print(o)
print(o.scheme, o.port, o.geturl())

"""
ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html', params='', query='', fragment='')
http 80 http://www.cwi.nl:80/%7Eguido/Python.html
"""


Original article: https://www.cnblogs.com/l0zh/p/13739740.html