Python urllib Explained

Urllib

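Official documentation: https://docs.python.org/3/library/urllib.html

urllib consists of the following modules:

urllib.request: opens and reads URLs

urllib.error: the exceptions raised by urllib.request

urllib.parse: URL parsing

urllib.robotparser: robots.txt parsing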
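
As a quick illustration of urllib.parse, the sketch below splits a URL into its components and builds a query string. The URL and the query values are made-up examples and do not refer to a real service.

```python
from urllib.parse import parse_qs, urlencode, urljoin, urlparse

# Hypothetical URL used only for illustration.
url = "https://example.com/search/index.html?word=python&page=2#results"

parts = urlparse(url)
print(parts.scheme)    # https
print(parts.netloc)    # example.com
print(parts.path)      # /search/index.html
print(parts.fragment)  # results

# parse_qs maps every query key to a list of values.
query = parse_qs(parts.query)
print(query)           # {'word': ['python'], 'page': ['2']}

# urlencode turns a dict into a percent-encoded query string.
encoded = urlencode({"word": "python urllib", "lang": "zh"})
print(encoded)         # word=python+urllib&lang=zh

# urljoin resolves a relative reference against a base URL.
joined = urljoin(url, "../about.html")
print(joined)          # https://example.com/about.html
```

The same urlencode call is what prepares form data for a POST request.
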
urllib.request.urlopen

    urlopen has the following signature:

    urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

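Common parameters:

    url: the address to open. It can be either a URL string or a urllib.request.Request object.

    data: optional. Note that when data is supplied, the request is sent as a POST instead of a GET. The value must be bytes; if you have a dict, encode it with urllib.parse.urlencode and convert the result to bytes: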

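import urllib.parse
import urllib.request

# urlencode returns a str, so it has to be converted to bytes before sending.
data = bytes(urllib.parse.urlencode({"word": "python"}), encoding='utf8')
# Passing data makes urlopen issue a POST request.
response = urllib.request.urlopen("http://xxxxx", data=data)
print(response.read().decode('utf-8'))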
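
The switch from GET to POST can be observed without sending anything. The following is a minimal sketch that only builds Request objects and asks them which method they would use; the endpoint is a hypothetical placeholder.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint; nothing is sent because urlopen is never called.
url = "http://example.com/search"

payload = urllib.parse.urlencode({"word": "python"}).encode("utf-8")

get_request = urllib.request.Request(url)
post_request = urllib.request.Request(url, data=payload)

print(get_request.get_method())   # GET
print(post_request.get_method())  # POST
print(post_request.data)          # b'word=python'
```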

    timeout: the timeout for the request, in seconds. If it is omitted, the global default socket timeout is used.

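Other parameters:

    context: must be an ssl.SSLContext instance; it specifies the SSL settings.

    cafile and capath: a CA certificate file and a directory of CA certificates, used when requesting HTTPS URLs.

    cadefault: deprecated and ignored; defaults to False.

    Note that cafile, capath, and cadefault have been deprecated since Python 3.6 and were removed in Python 3.13; use context instead.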
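
A minimal sketch of how timeout and context might be combined. The helper name fetch and its default timeout are assumptions made for illustration; the helper is defined but not called, so the block itself needs no network access.

```python
import ssl
import urllib.request

# A default context verifies certificates and host names.
context = ssl.create_default_context()
print(context.verify_mode == ssl.CERT_REQUIRED)  # True
print(context.check_hostname)                    # True

def fetch(url, timeout=10):
    """Hypothetical helper: open url with a timeout and the context above."""
    with urllib.request.urlopen(url, timeout=timeout, context=context) as response:
        return response.read()

# Example call (needs network access):
# body = fetch("https://www.python.org/")
```

If a custom CA bundle is needed, it can be given to ssl.create_default_context through its cafile or capath arguments, which replaces the removed urlopen parameters.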

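The object returned by urlopen provides the following methods:

    read(), readline(), readlines(), fileno(), close(): operate on the HTTPResponse data in the same way as on a file object.

    info(): returns an HTTPMessage object holding the headers sent back by the remote server.

    getcode(): returns the HTTP status code.

    geturl(): returns the URL of the resource actually retrieved, which can differ from the requested URL after a redirect.

    Since Python 3.9, info(), getcode(), and geturl() are deprecated in favor of the headers, status, and url attributes.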

import urllib.request
response = urllib.request.urlopen('http://python.org/')

# Check the type of the response object
print(type(response))

# Print the response object itself
print(response)

# Headers, variant 1: the HTTPMessage object
print(response.info())

# Headers, variant 2: a list of (name, value) tuples
print(response.getheaders())

# A single header value
print(response.getheader("Server"))

# Status code, variant 1
print(response.status)

# Status code, variant 2
print(response.getcode())

# URL of the retrieved resource
print(response.geturl())

# Print the page source
page = response.read()
print(page.decode('utf-8'))
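
The response can also be used in a with statement, which closes it automatically, and the charset can be taken from the headers instead of being hard-coded. The sketch below is one way to do this. It uses a data: URL as a stand-in for a real page so that it runs without network access; this is an assumption made for the example, and with an http URL the status attribute would additionally hold the HTTP status code.

```python
import urllib.request

# A data: URL stands in for a real page so that no network access is needed.
url = "data:text/plain;charset=utf-8,hello%20urllib"

with urllib.request.urlopen(url) as response:
    headers = response.headers      # same object that info() returns
    # Fall back to UTF-8 when the server does not declare a charset.
    charset = headers.get_content_charset() or "utf-8"
    final_url = response.url        # same value that geturl() returns
    text = response.read().decode(charset)

print(charset)  # utf-8
print(text)     # hello urllib
```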


urllib.request.Request

import urllib.request

# Request headers that make the request look like it comes from a browser
headers = {'Host': 'www.xicidaili.com',
           'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT6.0)',
           'Accept': r'application/json, text/javascript, */*; q=0.01',
           'Referer': r'http://www.xicidaili.com/', }

req = urllib.request.Request('http://www.xicidaili.com/', headers=headers)
response = urllib.request.urlopen(req)
html = response.read().decode('utf-8')
print(html)


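Request has the following signature:

    urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

Common parameters:

    url: the address to open.

    data: optional. The value must be bytes; if you have a dict, encode it with urllib.parse.urlencode and convert the result to bytes.

    headers: the HTTP request headers to send. They can be passed as a dict through the headers parameter of the constructor, or added afterward by calling the Request object's add_header() method.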

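Other parameters:

    origin_req_host: the host name or IP address of the party that originated the request.

    unverifiable: indicates whether the request is unverifiable; defaults to False. A request is unverifiable when the user had no opportunity to approve it, for example an image fetched automatically while a page is rendered. In that case unverifiable should be True.

    method: the HTTP method to use, such as GET, POST, or PUT.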
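
The add_header() and method options can be tried without sending anything. This is a small sketch under the assumption of a hypothetical endpoint and made-up header values; the request is only built and inspected.

```python
import urllib.request

# Hypothetical endpoint; the request is only built, never sent.
req = urllib.request.Request("http://example.com/items/1", data=b"{}", method="PUT")

# add_header() attaches headers after construction.
req.add_header("User-Agent", "Mozilla/5.0 (example)")
req.add_header("Content-Type", "application/json")

print(req.get_method())                # PUT
print(req.full_url)                    # http://example.com/items/1
# Request stores header names in str.capitalize() form.
print(req.get_header("User-agent"))    # Mozilla/5.0 (example)
print(req.has_header("Content-type"))  # True
```

Because of the capitalization, get_header("User-Agent") would return None here, which is a common source of confusion.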

urllib.request.ProxyHandler (IP proxy)

import urllib.request

# Fetch url through the given HTTP proxy
def use_proxy(proxy_addr, url):

    # urllib.request.ProxyHandler() maps each scheme to its proxy server
    proxy = urllib.request.ProxyHandler({'http': proxy_addr})

    # urllib.request.build_opener() creates a custom opener object;
    # the first argument is the proxy handler, the second the urllib.request.HTTPHandler class
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)

    # For convenience, urllib.request.install_opener() makes it the global default opener
    urllib.request.install_opener(opener)
    # urllib.request.urlopen() now goes through the proxy; return the decoded page
    data = urllib.request.urlopen(url).read().decode('utf-8')
    return data

proxy_addr = '61.135.217.7:80'
data = use_proxy(proxy_addr, "http://www.baidu.com")
print(len(data))
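
install_opener() changes the behavior of every later urlopen() call in the process. When that is not wanted, the opener can be used directly. The following is a sketch of that alternative; build_proxy_opener is a hypothetical helper name, and the proxy address is the one from the example above, which may no longer be reachable.

```python
import urllib.request

def build_proxy_opener(proxy_addr):
    """Hypothetical helper: return an opener that routes HTTP and HTTPS through proxy_addr."""
    proxy = urllib.request.ProxyHandler({
        "http": "http://" + proxy_addr,
        "https": "http://" + proxy_addr,
    })
    return urllib.request.build_opener(proxy)

opener = build_proxy_opener("61.135.217.7:80")

# The opener carries the proxy handler; the global default opener is untouched.
proxy_handlers = [h for h in opener.handlers
                  if isinstance(h, urllib.request.ProxyHandler)]
print(proxy_handlers[0].proxies)

# Example call (needs network access and a working proxy):
# html = opener.open("http://www.baidu.com", timeout=10).read().decode("utf-8")
```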


urllib.request.HTTPCookieProcessor

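import http.cookiejar, urllib.request

# The CookieJar collects the cookies that the server sets
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
# Print every cookie received
for item in cookie:
    print(item.name + "=" + item.value)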


Saving cookies (MozillaCookieJar)

import http.cookiejar, urllib.request

filename = 'cookie.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
# Also save session cookies (ignore_discard) and expired cookies (ignore_expires)
cookie.save(ignore_discard=True, ignore_expires=True)


Using saved cookies

import http.cookiejar, urllib.request

# Load the cookies saved earlier and send them with the request
cookie = http.cookiejar.MozillaCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
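
The save and load round trip can be checked without contacting a server. The sketch below builds one cookie by hand, writes it with MozillaCookieJar, and reads it back. The helper names, the cookie name and value, and the domain are all assumptions made for illustration.

```python
import http.cookiejar
import os
import tempfile
import time

def make_cookie(name, value, domain):
    """Hypothetical helper: build a cookie by hand so that no server is needed."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith("."),
        path="/", path_specified=True,
        secure=False, expires=int(time.time()) + 3600,
        discard=False, comment=None, comment_url=None, rest={},
    )

def save_and_reload(path):
    """Save one cookie to path, load it into a fresh jar, and return {name: value}."""
    jar = http.cookiejar.MozillaCookieJar(path)
    jar.set_cookie(make_cookie("session", "abc123", ".example.com"))
    jar.save(ignore_discard=True, ignore_expires=True)

    loaded = http.cookiejar.MozillaCookieJar()
    loaded.load(path, ignore_discard=True, ignore_expires=True)
    return {cookie.name: cookie.value for cookie in loaded}

# A temporary directory keeps the example from leaving files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    cookies = save_and_reload(os.path.join(tmp_dir, "cookie.txt"))

print(cookies)  # {'session': 'abc123'}
```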


urllib.error

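    Exceptions are caught with try-except.

    There are two main exception types:

        URLError (carries the cause of the failure in its reason attribute)

        HTTPError (carries the HTTP status code in its code attribute)

    HTTPError is a subclass of URLError, so it has to be caught first.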

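import urllib.error
import urllib.request

url = 'http://www.baidu.com'  # example URL
try:
    data = urllib.request.urlopen(url)
    print(data.read().decode('utf-8'))
except urllib.error.HTTPError as e:
    # The server answered with an error status, such as 404 or 500
    print(e.code)
except urllib.error.URLError as e:
    # The request failed before an HTTP response arrived
    print(e.reason)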
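
One possible way to package this pattern is a small wrapper that reports the outcome as a tuple. The helper name safe_fetch and its return convention are assumptions made for this sketch, not part of urllib. The demonstration uses an unsupported URL scheme, which raises URLError without any network traffic.

```python
import urllib.error
import urllib.request

def safe_fetch(url, timeout=10):
    """Hypothetical helper: return ("ok", text), ("http_error", code) or ("url_error", reason)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return ("ok", response.read().decode("utf-8"))
    except urllib.error.HTTPError as e:
        # Must come first: HTTPError is a subclass of URLError.
        return ("http_error", e.code)
    except urllib.error.URLError as e:
        return ("url_error", e.reason)

print(issubclass(urllib.error.HTTPError, urllib.error.URLError))  # True

# An unsupported scheme fails with URLError before anything is sent.
result = safe_fetch("nosuchscheme://example")
print(result[0])  # url_error
```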
