Python Web Scraping (2): The Python 3 Built-in Module Urllib

Preface

For the material on Urllib below, anyone who wants a deeper understanding is strongly encouraged to consult the official documentation.

English version: Urllib https://docs.python.org/3/library/urllib.html

Urllib

  • Urllib is Python's built-in HTTP request library and includes the following modules:
    • urllib.request: the request module
    • urllib.error: the exception-handling module
    • urllib.parse: the URL-parsing module
    • urllib.robotparser: the robots.txt-parsing module

urlopen

urlopen can be used for simple requests that do not require setting any header information.

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

  • The main parameters to set are url, data, and timeout. Let's look at the code:
import urllib.request
import urllib.parse
import urllib.error
import socket

"""
1.url:就是打开的测试地址  http://httpbin.org
2.data:发送post请求必须设置的参数,通过bytes(urllib.parse.urlencode())可以将post的数据进行转换放到urllib.request.urlopen的data参数中。
3.timeout:是一个超时设置,超时则抛出异常
"""
data = bytes(urllib.parse.urlencode({'word':'hello'}), encoding='utf8')
try:
    response = urllib.request.urlopen(url='http://httpbin.org/post', data=data, timeout=5)
    print(response.read())
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('Timed out...')
  • The response object returned by urllib.request.urlopen
    urlopen returns an http.client.HTTPResponse object, e.g. <http.client.HTTPResponse object at 0x0331E430>; response.read() returns the content of the response body (a decoding sketch follows the example below).
import urllib.request

response = urllib.request.urlopen(url='https://www.baidu.com/')
print(type(response))
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))
"""
<class 'http.client.HTTPResponse'>
200
[('Accept-Ranges', 'bytes'), ('Cache-Control', 'no-cache'),  ......]
BWS/1.1
"""

Request

If you need to set header information on a request, use the Request class. The focus here is on how to add request headers:

from urllib import request, parse

url = 'http://httpbin.org/post'
# Method 1: construct a headers dict and pass it to Request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36',
    'Host': 'httpbin.org'
}
# Method 2: call req.add_header() on the Request object (shown in full after this block)

# named form_data to avoid shadowing the built-in name dict
form_data = {
    'word': 'hello'
}
data = bytes(parse.urlencode(form_data), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
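
The second approach mentioned in the comment above, written out in full: build the Request first, then attach headers one at a time with add_header(). A minimal sketch:

from urllib import request, parse

data = bytes(parse.urlencode({'word': 'hello'}), encoding='utf8')
req = request.Request(url='http://httpbin.org/post', data=data, method='POST')
# Attach each header individually instead of passing a headers dict
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36')
response = request.urlopen(req)
print(response.read().decode('utf-8'))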

Advanced Usage

  • Proxies: ProxyHandler
    Some sites limit how many times or how often a single IP may visit, so you need to be able to switch IPs at any time to keep the crawler from erroring out and stopping.
import urllib.request

proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://httpbin.org/get')
print(response.read())
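
If every subsequent request should go through the proxy, you can also install the opener globally so that plain urllib.request.urlopen() uses it; a minimal sketch:

import urllib.request

proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = urllib.request.build_opener(proxy_handler)
# Make this opener the process-wide default used by urlopen()
urllib.request.install_opener(opener)
response = urllib.request.urlopen('http://httpbin.org/get')
print(response.read())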
  • Cookies: HTTPCookieProcessor
    Cookies store login state; saving them with http.cookiejar makes them convenient to reuse (a sketch of persisting them to a file follows the output below). More often, for sites whose cookies are hard to obtain, we use Selenium to fetch the cookies and then crawl by other, more efficient means.
import http.cookiejar, urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name+"="+item.value)
"""
BAIDUID=68AF7F00874AE2D8206AC4B524B49EAB:FG=1
BIDUPSID=68AF7F00874AE2D8206AC4B524B49EAB
H_PS_PSSID=1451_21090_18559_29064_28519_29098_28836_28584_26350
PSTM=1558969682
delPer=0
BDSVRTM=0
BD_HOME=0
"""

Exception Handling

Handle exceptions so that a 404 or 500 response does not bring the crawler to a halt.
The two exception classes are URLError and HTTPError; HTTPError is a subclass of URLError, so it must be caught first.

  • URLError: reason
  • HTTPError: code, reason, headers
from urllib import request,error

try:
    response = request.urlopen("http://pythonsite.com/1111.html")
except error.HTTPError as e:
    print(e.reason)
    print(e.code)
    print(e.headers)
except error.URLError as e:
    print(e.reason)
else:
    print("reqeust successfully")
"""
Not Found
404
Date: Mon, 27 May 2019 15:12:43 GMT
Server: Apache
Vary: Accept-Encoding
Content-Length: 207
Connection: close
Content-Type: text/html; charset=iso-8859-1
"""

Utility Module urlparse

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

from urllib.parse import urlparse

o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html', scheme='https')
print(o)
print(o.scheme, o.port, o.geturl())

"""
ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html', params='', query='', fragment='')
http 80 http://www.cwi.nl:80/%7Eguido/Python.html
"""


Original article: https://www.cnblogs.com/l0zh/p/13739740.html