Python Web Scraping Study Notes 3: Using the Basic Libraries

Reference: Python3网络爬虫开发实战 (Python 3 Web Crawler Development in Practice)

3.1 urllib

Official documentation: https://docs.python.org/3/library/urllib.html

3.1.1 Sending Requests

1. urlopen()

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
# print(response.read().decode('utf-8'))
print(type(response))  # inspect the type of the response object
'''
The response is an HTTPResponse object. Its main methods are read(), readinto(), getheader(name),
getheaders() and fileno(); its main attributes are msg, version, status, reason, debuglevel and closed.
'''
print(response.status)  # status code of the response
print(response.getheaders())  # response headers
print(response.getheader('Server'))  # pass 'Server' to getheader() to get the Server value from the response headers
print(response.getheader('Set-Cookie'))


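• The data parameter

The data parameter is optional. If you pass it, it must be a byte stream, that is, of type bytes, so a str has to be converted with bytes() first. Also, once this parameter is passed, the request method is no longer GET but POST.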

# The data parameter
import urllib.parse
import urllib.request

# bytes() needs a str as its first argument, so urlencode() from urllib.parse first turns
# the parameter dict into a string; the second argument sets the encoding, utf8 here.
data = bytes(urllib.parse.urlencode({'word':'hello'}),encoding='utf8')
# Converting with str gave a 405 error
response = urllib.request.urlopen('http://httpbin.org/post',data=data)
print(response.read())
print(response.status)

When I ran this, it still returned a 405 error, reporting that the method is not allowed.
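The effect of data on the request method can be checked without sending anything over the network. The following is a minimal sketch that builds Request objects (covered in the next subsection) only to inspect them; the httpbin.org URL is the same one used above, and nothing is actually requested.

```python
import urllib.parse
import urllib.request

# urlencode() turns the dict into a query string; encode() turns the str into bytes
data = urllib.parse.urlencode({'word': 'hello'}).encode('utf-8')
print(data)  # b'word=hello'

# Without data the request defaults to GET; with data it becomes POST
get_req = urllib.request.Request('http://httpbin.org/post')
post_req = urllib.request.Request('http://httpbin.org/post', data=data)
print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
```

Passing a plain str as data instead of bytes raises a TypeError when the request is sent, which is why the encoding step is needed.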

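• The timeout parameter

The timeout parameter sets a timeout in seconds: if the request exceeds this time without receiving a response, an exception is raised. If the parameter is not specified, the global default timeout is used. It is supported for HTTP, HTTPS, and FTP requests.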

# The timeout parameter
# import urllib.request
#
# response = urllib.request.urlopen('http://www.baidu.com',timeout=1) # raise an exception if there is no response within 1 second
# print(response.status)
# print(response.read())

import urllib.request
import urllib.error
import socket

try:
    response = urllib.request.urlopen('http://www.baidu.com',timeout=0.01) # 0.01 seconds is short enough to force a timeout
    print(response.status)
    print(response.read())
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout): # check whether the underlying reason is a timeout
        print('TIME OUT')

2. Request

# request
# import urllib.request
#
# request = urllib.request.Request('http://www.baidu.com')
# response = urllib.request.urlopen(request)
# print(response.read())

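'''
Request parameters
url
data: if passed, it must be of type bytes (a byte stream). If it is a dict, encode it first with urlencode() from the urllib.parse module.
headers: a dict. It can be built directly through the headers parameter, or added by calling add_header() on the request instance. Changing User-Agent is a way to imitate a browser; the default User-Agent is Python-urllib.
origin_req_host: the host name or IP address of the requester
unverifiable:
'''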

import urllib.request
import urllib.parse

url = 'http://www.baidu.com'
headers = {
    'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    'Host':'www.baidu.com'  # Host takes a host name, not a full URL
}
form = {  # renamed from dict to avoid shadowing the built-in
    'name':'abcd'
}
data = bytes(urllib.parse.urlencode(form),encoding='utf-8')
req = urllib.request.Request(url=url,data=data,headers=headers, method='POST')  # pass data so the form is actually sent
response = urllib.request.urlopen(req)
print(response.read().decode('utf-8'))
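The notes above mention that headers can also be added with add_header() on the request instance. A minimal sketch of that variant follows; the User-Agent string and the httpbin.org URL are placeholders chosen for illustration, and the request is only inspected, not sent.

```python
import urllib.request

req = urllib.request.Request('http://httpbin.org/post', method='POST')
# add_header() is an alternative to passing a headers dict to the constructor
req.add_header('User-Agent', 'Mozilla/5.0 (compatible; example-client)')

# Request stores header names in capitalized form, so the key becomes 'User-agent'
print(req.has_header('User-agent'))  # True
print(req.get_header('User-agent'))  # Mozilla/5.0 (compatible; example-client)
print(req.get_method())              # POST
```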

3. Advanced Usage

Handlers are processors of various kinds: some handle login authentication, some handle Cookies, and some handle proxy settings.

https://docs.python.org/3/library/urllib.request.html#urllib.request.BaseHandler

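The BaseHandler class in the urllib.request module is the parent class of all other Handlers. It provides the most basic methods, such as default_open() and protocol_request().

·Authentication

For URLs that require login authentication, use HTTPBasicAuthHandler.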

# Authentication
from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler,build_opener
from urllib.error import URLError

username = 'username'
password = 'password'
url = 'http://localhost:5000'

p = HTTPPasswordMgrWithDefaultRealm()  # create the password manager
p.add_password(None,url,username,password) # register the credentials for this URL
auth_handler = HTTPBasicAuthHandler(p)
opener = build_opener(auth_handler)

try:
    result = opener.open(url)
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)

·Proxy

# Proxy
# from urllib.error import URLError
# from urllib.request import ProxyHandler, build_opener
# 
# proxy_handler = ProxyHandler({
#     'http':'http://',
#     'https':'https://'
# })
# opener = build_opener(proxy_handler)
# try:
#     response = opener.open('http://www.baidu.com')
#     print(response.read().decode('utf-8'))
# except URLError as e:
#     print(e.reason)

• Cookies

# cookie
# import urllib.request
# import http.cookiejar
#
# cookie = http.cookiejar.CookieJar()
# handler = urllib.request.HTTPCookieProcessor(cookie)
# opener = urllib.request.build_opener(handler)
# response = opener.open('http://www.baidu.com')
# for item in cookie:
#     print(item.name + "=" + item.value)

# Save cookies to a file using MozillaCookieJar
# import urllib.request
# import http.cookiejar
#
# filename = 'cookie.txt'
# cookie = http.cookiejar.MozillaCookieJar(filename)
# handler = urllib.request.HTTPCookieProcessor(cookie)
# opener = urllib.request.build_opener(handler)
# response = opener.open('http://www.baidu.com')
# cookie.save(ignore_discard=True,ignore_expires=True)


# Load the saved cookie file
# import urllib.request
# import http.cookiejar
# 
# cookie = http.cookiejar.MozillaCookieJar()
# cookie.load('cookie.txt',ignore_expires=True,ignore_discard=True)
# handler = urllib.request.HTTPCookieProcessor(cookie)
# opener = urllib.request.build_opener(handler)
# response = opener.open('http://www.baidu.com')
# print(response.read().decode('utf-8'))

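3.1.2 Handling Exceptions

1. URLError

The URLError class comes from the error module of the urllib library. It inherits from OSError and is the base class of the error module; exceptions raised by the request module can all be handled by catching this class.

2. HTTPError

It is a subclass of URLError dedicated to HTTP request errors, such as a failed authentication request. It has three attributes: code, reason, and headers.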
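Because HTTPError is a subclass of URLError, the more specific class should be checked first. The sketch below illustrates that ordering; describe_error is a hypothetical helper written for this note, and the exceptions are built by hand (with example.com as a placeholder URL) so that it runs without network access.

```python
import io
from email.message import Message
from urllib import error


def describe_error(exc):
    """Return a short description, checking the subclass HTTPError before URLError."""
    if isinstance(exc, error.HTTPError):
        return 'HTTPError: {} {}'.format(exc.code, exc.reason)
    if isinstance(exc, error.URLError):
        return 'URLError: {}'.format(exc.reason)
    return 'Other: {!r}'.format(exc)


# Hand-built exceptions stand in for what urlopen() would raise
http_exc = error.HTTPError('http://example.com/missing', 404, 'Not Found',
                           Message(), io.BytesIO())
url_exc = error.URLError('connection refused')

print(describe_error(http_exc))  # HTTPError: 404 Not Found
print(describe_error(url_exc))   # URLError: connection refused
```

In real code the same ordering applies to except clauses: put except HTTPError before except URLError, otherwise the URLError branch catches both.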

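3.1.3 Parsing URLs

The urllib library also provides the parse module, which defines a standard interface for working with URLs, for example extracting and combining the parts of a URL and converting links.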

1. urlparse()

# urlparse
from urllib.parse import urlparse

result = urlparse("http://www.baidu.com/index.html;user?id=S#comment")
print(result)
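# ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=S', fragment='comment')

The return value is a ParseResult object with 6 parts: scheme, netloc, path, params, query, and fragment.

Whatever precedes :// is the scheme, i.e. the protocol. Whatever precedes the first / is the netloc, i.e. the domain name, and what follows it is the path, i.e. the access path. After the semicolon ; come the params. After the question mark ? comes the query, generally used in GET-style URLs.

After the hash # comes the fragment (anchor), used to jump directly to a position inside the page.

scheme://netloc/path;params?query#fragment

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)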
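The two optional arguments in the signature can be illustrated with a short sketch. The URLs are variations of the one used above; the scheme argument only acts as a default when the URL itself carries no scheme.

```python
from urllib.parse import urlparse

# No scheme in the URL itself, so the scheme argument is used as the default.
# Without a leading //, the host part ends up in path rather than netloc.
no_scheme = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(no_scheme.scheme)  # https
print(no_scheme.path)    # www.baidu.com/index.html

# With allow_fragments=False the fragment is left attached to the preceding part
no_frag = urlparse('http://www.baidu.com/index.html;user?id=5#comment',
                   allow_fragments=False)
print(no_frag.query)     # id=5#comment
print(no_frag.fragment)  # (empty string)

# ParseResult is a named tuple, so index and attribute access are equivalent
print(no_frag[0] == no_frag.scheme)  # True
```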

2. urlunparse()

3. urlsplit()

4. urlunsplit()

5. urljoin()
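Items 2 to 5 are listed here without examples. As a rough sketch of what each one does, using baidu.com as in the rest of these notes and example.com as a placeholder for a second site:

```python
from urllib.parse import urlunparse, urlsplit, urlunsplit, urljoin

# urlunparse() builds a URL from exactly 6 parts (the reverse of urlparse)
built6 = urlunparse(['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment'])
print(built6)  # http://www.baidu.com/index.html;user?a=6#comment

# urlsplit() is like urlparse() but returns 5 parts: params stays inside path
parts = urlsplit('http://www.baidu.com/index.html;user?id=5#comment')
print(parts.path)  # /index.html;user

# urlunsplit() builds a URL from exactly 5 parts (the reverse of urlsplit)
built5 = urlunsplit(['http', 'www.baidu.com', 'index.html', 'a=6', 'comment'])
print(built5)  # http://www.baidu.com/index.html?a=6#comment

# urljoin() resolves a link against a base URL; an absolute link replaces the base
relative = urljoin('http://www.baidu.com', 'FAQ.html')
absolute = urljoin('http://www.baidu.com/about.html', 'https://example.com/FAQ.html')
print(relative)  # http://www.baidu.com/FAQ.html
print(absolute)  # https://example.com/FAQ.html
```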

6. urlencode()

# urlencode()
from urllib.parse import urlencode

params = {
    'name':'germy',
    'age':22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)
# http://www.baidu.com?name=germy&age=22

7. parse_qs(): deserializes a query string into a dict

8. parse_qsl(): deserializes a query string into a list of tuples
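A minimal sketch of both functions, applied to the query string produced by the urlencode() example above. Note that parse_qs() wraps every value in a list, since a key may appear more than once.

```python
from urllib.parse import parse_qs, parse_qsl

query = 'name=germy&age=22'

as_dict = parse_qs(query)
print(as_dict)   # {'name': ['germy'], 'age': ['22']}

as_pairs = parse_qsl(query)
print(as_pairs)  # [('name', 'germy'), ('age', '22')]
```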

9. quote()

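Converts content into URL-encoded form. When a URL carries Chinese parameters, it can sometimes end up garbled; this method converts the Chinese characters into URL encoding.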

# quote
from urllib.parse import quote

keyword = '中国'
URL = 'http://www.baidu.com/s?wd=' + quote(keyword)
print(URL)
# http://www.baidu.com/s?wd=%E4%B8%AD%E5%9B%BD

10. unquote(): decoding
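As a small sketch, unquote() reverses the quote() example above and recovers the original keyword:

```python
from urllib.parse import quote, unquote

encoded = 'http://www.baidu.com/s?wd=%E4%B8%AD%E5%9B%BD'
decoded = unquote(encoded)
print(decoded)  # http://www.baidu.com/s?wd=中国

# A round trip through quote() and unquote() returns the original text
print(unquote(quote('中国')) == '中国')  # True
```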

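3.1.4 Analyzing the Robots Protocol

1. The Robots protocol

It tells crawlers and search engines which pages may be fetched and which may not. It usually takes the form of a text file named robots.txt, generally placed in the root directory of the site.

3. robotparser

The urllib.robotparser module parses robots.txt. It provides the RobotFileParser class, which uses a site's robots.txt file to decide whether a given crawler is allowed to fetch a given page.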
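The following is a minimal sketch of RobotFileParser. To keep it runnable offline, it parses a made-up robots.txt held in a string (the rules and the example.com URLs are assumptions for illustration) instead of downloading one.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not taken from any real site
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /public/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse() accepts an iterable of lines

# can_fetch(useragent, url) reports whether that crawler may fetch the URL
allowed = parser.can_fetch('*', 'http://example.com/public/index.html')
blocked = parser.can_fetch('*', 'http://example.com/private/data.html')
print(allowed)  # True
print(blocked)  # False
```

Against a live site, the usual pattern is to call set_url() with the address of the robots.txt file, then read() to download and parse it, before calling can_fetch().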

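3.2 Using requests

Other request types can likewise be made with a single line.

To attach extra information to a GET request, use the params parameter.

To parse the returned result directly into a dict, call the json() method.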
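A minimal sketch of both points, assuming the third-party requests package is installed. The httpbin.org URL and the parameter values are placeholders, and fetch_json is a hypothetical helper. The first part only prepares a request, so it runs without network access; fetch_json needs a live connection.

```python
import requests

# Prepare the request without sending it, to see how params is appended to the URL
req = requests.Request('GET', 'http://httpbin.org/get',
                       params={'name': 'germey', 'age': 22})
prepared = req.prepare()
print(prepared.url)  # http://httpbin.org/get?name=germey&age=22


def fetch_json(url, params=None):
    """Send a GET request and decode the JSON body into a dict (needs network access)."""
    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()  # raise for 4xx/5xx instead of parsing an error page
    return response.json()
```

json() raises an exception when the body is not valid JSON, so it is best used only on endpoints known to return JSON.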

·Scraping a web page

import requests
import re

# r = requests.get('http://www.baidu.com')
# r = requests.post('http://www.baidu.com')
headers = {
    'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}
r = requests.get("https://www.zhihu.com/explore",headers=headers)
pattern = re.compile('explore-feed.*?question_link.*?>(.*?)</a>',re.S)
titles = re.findall(pattern,r.text)
print(titles)

·Scraping binary data

import requests

r = requests.get("https://github.com/favicon.ico")
with open('favicon.ico','wb') as f:
    f.write(r.content)

3.2.2 Advanced Usage

1. File upload

# import requests
# 
# files = {'file': open('favicon.ico','rb')}
# r = requests.post('http://httpbin.org/post',files = files)
# print(r.text)

2. Cookies

import requests

r = requests.get('http://www.baidu.com')
print(r.cookies)

for key, value in r.cookies.items():
    print(key + "=" + value)

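3. Session persistence

Maintaining one session is like opening a new tab in the same browser rather than launching a new browser. To avoid setting cookies by hand on every request, use a Session object.

# Session persistence
import requests

s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/123456789')
r = s.get('http://httpbin.org/cookies')  # same host as the line above, so the cookie is sent back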
print(r.text)
'''

{
"cookies": {
"number": "123456789"
}
}

'''

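3.3 Regular Expressions

\w: matches a letter, digit, or underscore

\W: matches a character that is not a letter, digit, or underscore

\s: matches any whitespace character

\S: matches any non-whitespace character

\d: matches any digit

\D: matches any non-digit character

\A: matches the start of the string

\Z: matches the end of the string; if there is a trailing newline, it matches only up to just before the newline

\z: matches the end of the string; if there is a trailing newline, the newline is matched as well

\G: matches the position where the last match finished

\n: newline

\t: tab

^: matches the start of a line

$: matches the end

.: matches any character except a newline

*: matches 0 or more occurrences

+: matches 1 or more occurrences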

# Regular expression
# import re
# 
# content = "Hello 123 4567 World_This is a Regex Demo"
# result = re.match(r'^Hello\s\d{3}\s\d{4}\s\w{10}',content)
# print(result)
# print(result.group())
# print(result.span())

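·Modifiers

match() starts matching at the beginning of the string; if the beginning does not match, the whole match fails.

search() scans the entire string while matching and returns the first successful match.

findall() searches the entire string and returns everything that matches the regular expression.

sub() replaces text.

compile() compiles a regex string into a regular expression object so that it can be reused in later matches.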
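The differences between these functions can be sketched on the demo string used earlier in this section. The patterns are small illustrations chosen for this note.

```python
import re

content = 'Hello 123 4567 World_This is a Regex Demo'

# match() only succeeds if the pattern fits at the very start of the string
m = re.match(r'Hello\s(\d{3})\s(\d{4})', content)
print(m.group(1), m.group(2))  # 123 4567
print(re.match(r'\d{4}', content))  # None, because the string does not start with digits

# search() scans the whole string and returns the first match
first = re.search(r'\d{4}', content).group()
print(first)  # 4567

# findall() returns every non-overlapping match as a list
numbers = re.findall(r'\d+', content)
print(numbers)  # ['123', '4567']

# sub() replaces each match with the given text
masked = re.sub(r'\d+', '#', content)
print(masked)  # Hello # # World_This is a Regex Demo

# compile() builds a reusable pattern object with the same methods
digits = re.compile(r'\d+')
print(digits.findall('a1b22'))  # ['1', '22']
```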

3.4 Scraping the Maoyan Movie Rankings

Scrape the information for the top 100 movies on the Maoyan board.

import requests
import re
import json

def get_one_page(url):
    headers = {
        'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        return response.text
    return None

def parse_one_page(html):
    pattern = re.compile(r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)@.*?".*?name"><a'
                        + r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
                        + r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>',re.S)
    items = re.findall(pattern,html)
    for item in items:
        yield {
            'index':item[0],
            'image':item[1],
            'title':item[2].strip(),
            'actor':item[3].strip()[3:] if len(item[3])>3 else '',
            'time':item[4].strip()[5:] if len(item[4])>5 else '',
            'score':item[5].strip() + item[6].strip()
        }

def write_to_file(content):
    with open('result1.txt','a', encoding='utf-8') as f:
        print(type(json.dumps(content)))
        f.write(json.dumps(content,ensure_ascii=False)+'\n')

def main(offset):
    url = 'http://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    if html is None:  # the request did not return 200, so there is nothing to parse
        return
    for item in parse_one_page(html):
        write_to_file(item)

if __name__ == "__main__":
    for i in range(10):
        main(offset=i * 10)
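The parsing step can be tried offline before running the full crawler. In the sketch below, SAMPLE_HTML is a hand-written fragment that only imitates the structure the pattern expects; it is an assumption for illustration, not a copy of the live page, whose markup may have changed.

```python
import re

# Hypothetical fragment shaped like one <dd> entry; all values are made up
SAMPLE_HTML = '''
<dd>
  <i class="board-index board-index-1">1</i>
  <img data-src="https://example.com/poster.jpg@160w_220h_1e_1c" class="board-img" />
  <p class="name"><a href="/films/1" title="Sample Movie">Sample Movie</a></p>
  <p class="star">主演：Actor A,Actor B</p>
  <p class="releasetime">上映时间：1993-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>
'''


def parse_one_page(html):
    """Yield one dict per <dd> entry, using the same pattern as the crawler above."""
    pattern = re.compile(
        r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)@.*?".*?name"><a'
        r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
        r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
    for item in pattern.findall(html):
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2].strip(),
            # drop the 3-character label "主演：" and the 5-character label "上映时间："
            'actor': item[3].strip()[3:] if len(item[3]) > 3 else '',
            'time': item[4].strip()[5:] if len(item[4]) > 5 else '',
            'score': item[5].strip() + item[6].strip(),
        }


items = list(parse_one_page(SAMPLE_HTML))
print(items[0]['title'], items[0]['score'])  # Sample Movie 9.5
```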