"Data Collection and Web Crawlers": Fetching Web Pages

I. Using the urllib library

1. Fetching the full source of a web page with urllib

import urllib.request
url='http://www.baidu.com'
response=urllib.request.urlopen(url)
print(type(response)) # <class 'http.client.HTTPResponse'>
print(dir(response)) # dir() lists all attributes and methods of the object
body=response.read() # the body can be read only once; a second read() returns b''
print(type(body)) # <class 'bytes'>
print(body.decode()) # decode() uses UTF-8 by default
print(response.geturl()) # http://www.baidu.com
print(response.getcode()) # 200
print(response.info()) # Bdpagetype: 1...
print(response.headers) # same as info()
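
The response body is a stream, so `read()` consumes it: a second call finds nothing left. The sketch below illustrates this offline, using `io.BytesIO` as a stand-in for the object returned by `urlopen()` (an assumption made so the example runs without network access); `fake_response`, `body`, and `leftover` are illustrative names only.

```python
import io

# Stand-in for the object returned by urlopen(); like HTTPResponse,
# it is a binary stream that can be consumed only once.
fake_response = io.BytesIO('<html>百度一下</html>'.encode('utf-8'))

body = fake_response.read()      # first read returns the whole body
leftover = fake_response.read()  # second read finds nothing left

text = body.decode('utf-8')      # the saved bytes can be decoded as often as needed
print(type(body), len(leftover), text)
```

Saving the result of the first `read()` in a variable avoids printing an empty string later.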

2. Sending a GET request with parameters using urllib

import urllib.request
import urllib.parse

url='http://www.baidu.com/s'
params={'wd':'NBA全明星'} # parameters as a dict
params=urllib.parse.urlencode(params) # URL-encode the parameters
print(params) # wd=NBA%E5%85%A8%E6%98%8E%E6%98%9F
url=url+'?'+params # http://www.baidu.com/s?wd=NBA%E5%85%A8%E6%98%8E%E6%98%9F
response=urllib.request.urlopen(url)
print(response.read().decode())
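
To see what URL encoding does without sending anything, the query string can be built and then decoded again. This is a small offline sketch; `base_url`, `query`, `decoded_query`, and `decoded_params` are illustrative names.

```python
from urllib.parse import urlencode, unquote, parse_qs, urlsplit

base_url = 'http://www.baidu.com/s'
params = {'wd': 'NBA全明星'}

query = urlencode(params)               # percent-encodes the UTF-8 bytes of each value
full_url = base_url + '?' + query

# Going the other way: split the URL, then decode the query string.
parts = urlsplit(full_url)
decoded_query = unquote(parts.query)    # back to readable text
decoded_params = parse_qs(parts.query)  # dict mapping each key to a list of values

print(query)
print(decoded_query)
print(decoded_params)
```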

3. Sending a POST request with urllib

import urllib.request
import urllib.parse
url='http://httpbin.org/post'
data={'username':'恩比德','age':28}
data=urllib.parse.urlencode(data)
print(data) # username=%E6%81%A9%E6%AF%94%E5%BE%B7&age=28 (use urllib.parse.unquote(str) to decode)
data=data.encode() # POST data must be bytes: URL-encode first, then convert to bytes
request=urllib.request.Request(url,data=data) # when data is given, the method defaults to POST
response=urllib.request.urlopen(request) # send the constructed Request object with urlopen
print(response.read().decode())
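
How urllib chooses the HTTP method can be checked without sending a request, by asking the `Request` object through `get_method()`. A minimal sketch; the variable names are illustrative, and the `method='PUT'` case is included only to show that the default can be overridden.

```python
import urllib.parse
import urllib.request

url = 'http://httpbin.org/post'
payload = urllib.parse.urlencode({'username': '恩比德', 'age': 28}).encode()

get_request = urllib.request.Request(url)                              # no data
post_request = urllib.request.Request(url, data=payload)               # data given
put_request = urllib.request.Request(url, data=payload, method='PUT')  # explicit override

# Nothing is sent here; get_method() only reports which method would be used.
methods = [r.get_method() for r in (get_request, post_request, put_request)]
print(methods)
```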

4. Sending a request that identifies as a browser with urllib

import urllib.request
url='http://httpbin.org/get'
# The User-Agent header tells the server which client (browser) is making the request
headers={'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36'}
request=urllib.request.Request(url,headers=headers) # build the Request object with the headers argument
response=urllib.request.urlopen(request) # send the request using the Request object
print(response.read().decode())
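
The reason for setting this header is that urllib otherwise announces itself as a Python client, which some sites reject. The sketch below inspects both the default header and a custom one without sending anything; `browser_ua` is a hypothetical, shortened browser string used only for illustration.

```python
import urllib.request

# Headers that a default opener adds when the request itself sets none.
default_headers = urllib.request.build_opener().addheaders
print(default_headers)  # one ('User-agent', 'Python-urllib/<version>') pair

# Hypothetical, shortened browser string used only for this illustration.
browser_ua = 'Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0'
request = urllib.request.Request('http://httpbin.org/get',
                                 headers={'User-Agent': browser_ua})

# Request normalises header names with str.capitalize(), hence 'User-agent'.
print(request.header_items())
print(request.get_header('User-agent'))
```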

5. Sending a request through a proxy IP

import urllib.request
import random

# A list of free proxy IPs (from zhimahttp.com); free proxies are short-lived, slow, and offer little anonymity
proxy_list = [
    {'http': '183.166.180.184:9999'},
    {'http': '221.230.216.169:8888'},
    {'http': '182.240.34.61:9999'},
    {'http': '121.226.154.250:8080'}
]

proxy = random.choice(proxy_list) # pick one at random
print(proxy)

# 1. Build a ProxyHandler
proxy_handler = urllib.request.ProxyHandler(proxy)

# 2. Use the ProxyHandler to build a custom opener
opener = urllib.request.build_opener(proxy_handler)

# 3. Send the request through the proxy
request = urllib.request.Request('http://www.baidu.com') # build the Request object and pass it to the opener's open() method
response = opener.open(request)

print(response.read().decode())

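With proxy servers, we can switch to a different proxy at regular intervals. If one IP gets banned, the crawler can switch to another IP and keep fetching data, which helps work around IP-based blocking by the site.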
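
One way to organise this switching is to keep the proxies in a small pool and drop any that get banned. The sketch below is only an illustration: `ProxyPool`, `pick`, and `mark_banned` are hypothetical names, not part of urllib, and no request is actually sent.

```python
import random

class ProxyPool:
    """Hypothetical helper: hands out proxies and forgets banned ones."""

    def __init__(self, proxies):
        self._available = list(proxies)

    def pick(self):
        """Return a random proxy that has not been marked as banned."""
        if not self._available:
            raise RuntimeError('no usable proxies left')
        return random.choice(self._available)

    def mark_banned(self, proxy):
        """Remove a proxy after the target site has blocked it."""
        if proxy in self._available:
            self._available.remove(proxy)

    def __len__(self):
        return len(self._available)


pool = ProxyPool([
    {'http': '183.166.180.184:9999'},
    {'http': '221.230.216.169:8888'},
])

first = pool.pick()
pool.mark_banned(first)   # pretend the site rejected this IP
second = pool.pick()      # only the other proxy can be returned now
print(first, second, len(pool))
```

Each value returned by `pool.pick()` has the same shape as the entries of `proxy_list`, so it can be passed to `urllib.request.ProxyHandler` directly.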

6. URLError exceptions and how to catch them

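import urllib.request
import urllib.error
url='http://www.whit.edu.cn/net' # a URL that does not exist
request = urllib.request.Request(url)
try:
    urllib.request.urlopen(request)
except urllib.error.HTTPError as e: # the server answered with an error status
    print(e.code) # 404
except urllib.error.URLError as e: # no valid response at all (DNS failure, refused connection, ...)
    print(e.reason)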
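
`HTTPError` is a subclass of `URLError`, so when both are handled the more specific one has to be checked first. The sketch below shows this offline with hand-built exception objects; `describe` is a hypothetical helper and `http://example.com/missing` is a placeholder URL.

```python
import urllib.error

def describe(error):
    """Summarise a urllib error; the subclass is checked before its parent."""
    if isinstance(error, urllib.error.HTTPError):
        return 'HTTP error, status %d' % error.code
    if isinstance(error, urllib.error.URLError):
        return 'URL error, reason: %s' % error.reason
    return 'other error: %r' % (error,)

# Errors constructed by hand so the sketch runs without network access.
not_found = urllib.error.HTTPError('http://example.com/missing', 404, 'Not Found', {}, None)
unreachable = urllib.error.URLError('connection refused')

print(issubclass(urllib.error.HTTPError, urllib.error.URLError))
print(describe(not_found))
print(describe(unreachable))
```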

7. Catching timeout exceptions

import socket
import urllib.request
import urllib.error
try:
    url = 'http://218.56.132.157:8080'
    # timeout sets the time limit in seconds
    response = urllib.request.urlopen(url, timeout=1)
    result = response.read()
    print(result)
except (urllib.error.URLError, socket.timeout) as error:
    print(error) # <urlopen error timed out>
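
A timeout can surface in two forms: wrapped in a `URLError` when it happens while connecting, or as a bare `socket.timeout` when it happens while reading the body. The sketch below distinguishes timeouts from other failures using hand-built exceptions; `is_timeout` is a hypothetical helper name.

```python
import socket
import urllib.error

def is_timeout(error):
    """Return True if the exception represents a timed-out request."""
    if isinstance(error, socket.timeout):
        return True  # raised directly, e.g. while reading the body
    if isinstance(error, urllib.error.URLError):
        # urlopen wraps connection-phase failures; the cause is in .reason
        return isinstance(error.reason, socket.timeout)
    return False

# Exceptions constructed by hand so the sketch runs without network access.
wrapped = urllib.error.URLError(socket.timeout('timed out'))
direct = socket.timeout('timed out')
refused = urllib.error.URLError(ConnectionRefusedError(111, 'Connection refused'))

print(wrapped)
print(is_timeout(wrapped), is_timeout(direct), is_timeout(refused))
```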

II. Using the requests library

1. Sending a GET request with parameters using requests

import requests
params={'wd':'爬虫'}
url='http://www.baidu.com/s'
response=requests.get(url,params=params)
print(response.text)
# print(type(response)) # <class 'requests.models.Response'>
# print(response.encoding) # utf-8
# print(type(response.content)) # <class 'bytes'>
# print(response.content.decode()) # same result as response.text when the encoding is UTF-8

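Note: when sending a GET request with parameters through requests, pass the parameters as a dict; there is no need to URL-encode them or to append them to the URL by hand. Compare this with item 2 of Section I.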
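
To check that requests builds the same URL as the manual urllib approach, a request can be prepared without being sent. This is a sketch that assumes the requests package is installed; `prepared` and `manual_url` are illustrative names.

```python
import urllib.parse
import requests

url = 'http://www.baidu.com/s'
params = {'wd': '爬虫'}

# Prepare the request without sending it, to inspect the final URL.
prepared = requests.Request('GET', url, params=params).prepare()

manual_url = url + '?' + urllib.parse.urlencode(params)  # the urllib way
print(prepared.url)
print(prepared.url == manual_url)
```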

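Commonly used attributes of the Response object:

  text: the response body as a string

  encoding: the encoding used to decode the response body

  status_code: the HTTP status code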

2. Sending a POST request with requests

import requests
url="http://httpbin.org/post"
data={'name':'库里'} # the data to send in the POST request
response=requests.post(url,data)
print(response.text)

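Note: when sending a POST request through requests, pass the data as a dict; there is no need to URL-encode it or to convert it to bytes. Compare this with item 3 of Section I.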
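
The encoding that requests performs on the dict can be inspected by preparing the request without sending it. A sketch assuming the requests package is installed; `prepared` and `manual_body` are illustrative names.

```python
import urllib.parse
import requests

url = 'http://httpbin.org/post'
data = {'name': '库里'}

# Prepare the POST request without sending it.
prepared = requests.Request('POST', url, data=data).prepare()

manual_body = urllib.parse.urlencode(data)  # what urllib would need done by hand

print(prepared.method)
print(prepared.headers['Content-Type'])  # form encoding is chosen automatically
print(prepared.body)
print(prepared.body == manual_body)
```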

III. Accessing HTTPS websites

import urllib.request
import requests
import ssl

url="https://www.runoob.com/"
headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.192 Safari/537.36'}

# 1. urllib without browser headers, direct access
# print("1."+'-'*50)
# response=urllib.request.urlopen(url)
# print(response.read().decode())
# urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:833)>

# 2. urllib with browser headers
# print("2."+'-'*50)
# request=urllib.request.Request(url,headers=headers)
# response=urllib.request.urlopen(request)
# print(response.read().decode())
# urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:833)>

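# 3. urllib without browser headers, using the ssl module
# urllib verifies the server certificate on HTTPS requests by default, and the request fails when verification fails (for example, when the local CA certificates are missing or outdated).
# The line below disables certificate verification globally, so use it only for testing.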
# ssl._create_default_https_context = ssl._create_unverified_context
# print("3."+'-'*50)
# response=urllib.request.urlopen(url)
# print(response.read().decode())
# OK

# 4. requests without browser headers
print("4."+'-'*50)
response=requests.get(url)
response.encoding="UTF-8"
html=response.text
print(html)
# OK

# 5. requests with browser headers
# print("5."+'-'*50)
# response=requests.get(url,headers=headers)
# response.encoding="UTF-8"
# html=response.text
# print(html)
# OK

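1. For HTTPS sites, urllib may fail with a certificate verification error whether or not it identifies as a browser, as in runs 1 and 2 above. In that case you can use the ssl module to skip certificate verification, at the cost of losing protection against forged certificates.
2. For HTTPS sites, requests can usually access the page directly, since it ships with its own CA bundle; if it cannot, try identifying as a browser.
3. Some sites have strict anti-crawling measures: if one IP exceeds the allowed number of requests, that IP may be banned, and you can continue through a proxy IP.
In short, prefer the requests library; if that fails, try identifying as a browser; if that still fails, access the site through a proxy IP.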
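
As an alternative to changing the global default, certificate verification can be relaxed for a single urllib call by passing an SSL context. The sketch below only builds and inspects the two contexts; `verified` and `unverified` are illustrative names, and the network call is left commented out.

```python
import ssl

# Default context: verifies the certificate chain and the host name.
verified = ssl.create_default_context()

# Unverified context built through public attributes; the order matters,
# check_hostname must be switched off before verify_mode is relaxed.
unverified = ssl.create_default_context()
unverified.check_hostname = False
unverified.verify_mode = ssl.CERT_NONE

print(verified.verify_mode == ssl.CERT_REQUIRED, verified.check_hostname)
print(unverified.verify_mode == ssl.CERT_NONE, unverified.check_hostname)

# Usage (needs network access, so it is left commented out):
# import urllib.request
# response = urllib.request.urlopen('https://www.runoob.com/', context=unverified)
```

Limiting the unverified context to one call keeps verification on for every other HTTPS request in the program.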