Installing Dependencies and Parsing Pages

Date: 2019-06-19

Author: Sun

This section covers two libraries:

Networking library: requests

Page-parsing library: Beautiful Soup

1 The Requests library

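Python's standard library module urllib already covers most everyday HTTP tasks, but its API is awkward to use. Requests bills itself as "HTTP for Humans", which signals a simpler, more convenient interface.

Requests is an HTTP library written in Python, built on urllib3 rather than the standard library's urllib, and released under the Apache 2.0 license. It is more convenient than urllib, saves a great deal of work, and fully meets the needs of HTTP testing. Its design follows the idioms of PEP 20, so code written with it is more Pythonic than the urllib equivalent. Just as importantly, it supports Python 3.

In its own words, Requests is the only non-GMO HTTP library for Python, safe for human consumption :)

Requests covers what urllib offers and adds more: keep-alive and connection pooling, sessions with cookie persistence, file uploads, automatic detection of the response encoding, internationalized URLs, and automatic encoding of POST data.

In other words, the low-level HTTP work in Requests is done by urllib3.

The Requests documentation is thorough, and the Chinese translation is also quite good. Requests meets the needs of today's web and, at the time of writing (2019), supports both Python 2.7 and Python 3.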
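The session and cookie persistence listed above can be tried without touching the network. The following is a minimal sketch, assuming a made-up header value and cookie; it uses Session.prepare_request only to show what a session would attach to an outgoing request.

```python
import requests

# A Session keeps headers and cookies (and pooled connections) across requests.
session = requests.Session()
session.headers.update({'user-agent': 'my-app/0.0.1'})  # illustrative value
session.cookies.set('session_id', 'abc123')             # illustrative cookie

# prepare_request merges the session state into the request without sending it.
request = requests.Request('GET', 'http://httpbin.org/cookies')
prepared = session.prepare_request(request)

print(prepared.headers['User-Agent'])
print(prepared.headers['Cookie'])
```

In real code you would call session.get(...) directly; every call made through the same session reuses its cookies, default headers, and open connections.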

1.1 Installing Requests

pip install requests

Official Requests documentation:

http://docs.python-requests.org/zh_CN/latest/user/quickstart.html

A site for testing HTTP requests:

http://httpbin.org/

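1.2 Basic usage

import requests

response = requests.get('http://www.baidu.com')
print(response.request.url)  # same as response.url
print(response.status_code)  # HTTP status code, e.g. 200
# response.headers holds the response headers; the request headers are in response.request.headers
print(response.headers['content-type'])  # header lookup is case-insensitive
print(response.encoding)  # the encoding Requests will use to decode the body
print(response.text)  # body as text, normally decoded automatically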
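response.text is decoded with response.encoding. Requests takes that value from the Content-Type header and falls back to ISO-8859-1 for text content when no charset is declared, which can garble Chinese pages. The sketch below imitates such a response offline by filling in a Response object by hand (setting the private _content attribute is a testing shortcut assumed here, not something real code needs) and shows that assigning response.encoding fixes the decoding.

```python
import requests

# Build a Response by hand so the example runs without network access.
# Assumption: _content is a private attribute, set here only to imitate a download.
response = requests.Response()
response.status_code = 200
response._content = '<title>百度一下</title>'.encode('utf-8')

# Pretend the server declared no charset, so Requests fell back to ISO-8859-1.
response.encoding = 'ISO-8859-1'
garbled = response.text

# State the real encoding before reading .text again.
response.encoding = 'utf-8'
fixed = response.text

print(garbled)
print(fixed)
```

With a live response, response.apparent_encoding offers a guess based on the body itself, which helps when the header is missing or wrong.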

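1.3 Request methods

Unlike urllib, Requests does not require you to build Request, opener, and handler objects. You call the function for the HTTP method you want and pass in the parameters you need.
Each HTTP method has a corresponding function; a GET request, for example, uses get().

A POST request uses post(), with the data to submit passed through the data parameter.

To set a timeout, pass the timeout parameter (in seconds):

requests.get('http://github.com', timeout=0.01)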
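A timeout as short as the one above will usually raise an exception rather than return a response. The following is a minimal sketch of handling it, assuming a hypothetical helper named fetch_status; its get parameter is an injection point added here so the function can be tried without network access.

```python
import requests

def fetch_status(url, timeout=3.0, get=requests.get):
    """Return the HTTP status code, or None if the request timed out.

    `get` is a hypothetical injection point for offline testing;
    by default it is requests.get.
    """
    try:
        response = get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        # Raised for both connection timeouts and read timeouts.
        return None
    return response.status_code

# Example call (needs network access):
# fetch_status('http://github.com', timeout=0.01)
```

Catching requests.exceptions.Timeout keeps other failures, such as DNS errors, visible instead of silently swallowing them.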

Worked examples

import requests
response = requests.get('https://httpbin.org/get')  # retrieve data
response = requests.post('http://httpbin.org/post', data={'key': 'value'})  # submit data

# A POST body is commonly sent in one of four ways:
#   - application/x-www-form-urlencoded (form fields)
#   - multipart/form-data (file uploads)
#   - raw (a text payload such as JSON or XML)
#   - binary (an arbitrary byte stream)

response = requests.put('http://httpbin.org/put', data={'key': 'value'})
response = requests.delete('http://httpbin.org/delete')
response = requests.head('http://httpbin.org/get')
response = requests.options('http://httpbin.org/get')
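The four POST body types listed above map onto different arguments in Requests. The sketch below builds each request with requests.Request(...).prepare() so that nothing is sent; the file name and payloads are illustrative assumptions.

```python
import requests

url = 'http://httpbin.org/post'

# data=dict produces an application/x-www-form-urlencoded body.
form = requests.Request('POST', url, data={'key': 'value'}).prepare()
print(form.headers['Content-Type'], form.body)

# json=dict serializes the payload as a raw JSON text body.
raw = requests.Request('POST', url, json={'key': 'value'}).prepare()
print(raw.headers['Content-Type'], raw.body)

# files=... produces a multipart/form-data body with a generated boundary.
upload = requests.Request('POST', url,
                          files={'file': ('hello.txt', b'hello')}).prepare()
print(upload.headers['Content-Type'])

# data=bytes is sent unchanged as a binary body; set the content type yourself.
binary = requests.Request('POST', url, data=b'\x00\x01\x02',
                          headers={'Content-Type': 'application/octet-stream'}).prepare()
print(binary.body)
```

Passing the same arguments to requests.post(...) sends the request for real.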

1.4 Passing URL parameters

(1) Unlike urllib, you do not need to build the query string by hand. Put the parameters in a dictionary and pass it as params.

(2) Sometimes a URL repeats a parameter name with different values. A Python dictionary cannot hold duplicate keys, so give the key a list of values instead.

import requests
# (1) A plain dictionary becomes the query string.
params = {'key1': 'value1', 'key2': 'value2'}
response = requests.get('http://httpbin.org/get', params=params)
# (2) A list value repeats the parameter name once per item.
params = {'key1': 'value1', 'key2': ['value2', 'value3']}
response = requests.get('http://httpbin.org/get', params=params)
print(response.url)
# http://httpbin.org/get?key1=value1&key2=value2&key2=value3
print(response.content)  # raw response body as bytes

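1.5 Custom headers

To customize the request headers, pass a dictionary as the headers parameter:

import requests
url = 'http://api.github.com/some/endpoint'
headers = {'user-agent': 'my-app/0.0.1'}  # custom request headers
response = requests.get(url, headers=headers)

print(response.headers)  # response headers
print(response.request.headers)  # request headers that were actually sent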

1.6 Custom cookies

With Requests you do not need to build a CookieJar object to send cookies. Pass a dictionary as the cookies parameter:

import requests
url = 'http://httpbin.org/cookies'
co = {'cookies_are': 'working'}
response = requests.get(url, cookies=co)
print(response.text)  # {"cookies": {"cookies_are": "working"}}

1.7 Using a proxy

# To route requests through a proxy, build a dictionary of proxies and pass it as proxies.
import requests
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
response = requests.get('http://httpbin.org/ip', proxies=proxies)
print(response.text)

2 Requests usage examples

Example 1: searching Baidu with Requests

# -*- coding: utf-8 -*-
__author__ = 'sun'
__date__ = '2019/6/19 14:47'
import requests

def getfromBaidu(key):
    # Manual alternative: url = 'http://www.baidu.com.cn/s?wd=' + urllib.parse.quote(key) + '&pn='
    # wd is the search keyword and pn is the paging parameter.
    kv = {'wd': key}
    r = requests.get("http://www.baidu.com/s", params=kv)
    print(r.request.url)
    with open("baidu.html", "w", encoding='utf8') as f:
        f.write(r.text)

key = 'python'
getfromBaidu(key)

Example 2: using GET and POST

# -*- coding: utf-8 -*-
__author__ = 'sun'
__date__ = '2019/6/19 21:32'

import requests

r = requests.get(url='http://www.sina.com')  # the most basic GET request
print(r.status_code)  # HTTP status code of the response
r = requests.get(url='http://dict.baidu.com/s', params={'wd': 'python'})  # GET request with parameters
print(r.url)
print(r.text)  # print the decoded response body

print("#####################")
# A sequence of tuples lets the same key appear more than once.
payload = (('key1', 'value1'), ('key1', 'value2'))
# Sent as an application/x-www-form-urlencoded body
r = requests.post('http://httpbin.org/post', data=payload)

print("code: " + str(r.status_code) + ", text:" + r.text)

url = 'http://httpbin.org/post'
# Upload a file as multipart/form-data; report.xls must exist in the working directory.
with open('report.xls', 'rb') as fh:
    files = {'file': ('report.xls', fh, 'application/vnd.ms-excel', {'Expires': '0'})}
    r = requests.post(url, files=files)
print(r.text)

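3 BeautifulSoup

Introduction

Beautiful Soup is a Python library whose main job is extracting data from web pages. The official description reads:

Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to extract; because it is so simple, a complete application takes very little code.

Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it; in that case you only need to state the original encoding.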
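Stating the original encoding, as described above, is done with the from_encoding argument. The following is a minimal sketch, assuming a small GBK-encoded fragment as input; it uses the built-in html.parser so that it needs nothing beyond bs4.

```python
from bs4 import BeautifulSoup

# Bytes in GBK with no charset declaration (illustrative input).
raw = '<p>中文</p>'.encode('gbk')

# from_encoding states the original encoding explicitly.
soup = BeautifulSoup(raw, 'html.parser', from_encoding='gbk')

print(soup.p.string)            # the text is now a Unicode string
print(soup.original_encoding)   # the encoding Beautiful Soup settled on
print(soup.p.encode('utf-8'))   # serialized output as UTF-8 bytes
```

When the input is already a Python str, for example response.text from Requests, no decoding takes place and from_encoding is not needed.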

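Beautiful Soup works on top of parsers such as lxml and html5lib, so you can choose between different parsing strategies or trade flexibility for speed.

Installation

Beautiful Soup 3 is no longer developed; new projects should use Beautiful Soup 4. The library now lives in the bs4 package, so it is imported from bs4.

Activate your Python virtual environment, then install lxml and Beautiful Soup 4:

pip install lxml

pip install beautifulsoup4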

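Usage

First import the class from the bs4 package:

from bs4 import BeautifulSoup

Beautiful Soup turns a complex HTML document into a tree of Python objects. Every node is a Python object, and all of them fall into four kinds: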

1. Tag
2. NavigableString
3. BeautifulSoup
4. Comment
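The four kinds of object can be seen side by side in a very small document. This is a sketch using an illustrative one-line fragment and the built-in html.parser rather than lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# A tiny illustrative document; html.parser avoids the lxml dependency here.
soup = BeautifulSoup('<p class="note">text<!-- hidden --></p>', 'html.parser')

tag = soup.p                # Tag: an HTML element
text = tag.contents[0]      # NavigableString: text inside a tag
comment = tag.contents[1]   # Comment: a special kind of NavigableString

print(type(soup).__name__, type(tag).__name__,
      type(text).__name__, type(comment).__name__)
```

Because Comment is a subclass of NavigableString, code that collects text should check for Comment explicitly if comments are to be skipped.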

Syntax reference: see the attachment 《Beautiful Soup 4.2.0 文档 — Beautiful Soup.pdf》.

Worked example

Suppose the document string is:

html_doc = """
<html>
<head>
    <title>The Dormouse's story</title>
</head>
<body>
<p class="title aq">
    <b>
        The Dormouse's story
    </b>
</p>
<p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
    and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""

Create the soup object:

soup = BeautifulSoup(html_doc, 'lxml')

(1) Tag

Put simply, a Tag corresponds to a tag in the HTML, for example

<title>The Dormouse's story</title>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

An HTML tag such as title or a, together with the content it encloses, is a Tag. Here is how Beautiful Soup retrieves Tags; accessing a tag by name returns the first match in the document.

print(soup.title)
# <title>The Dormouse's story</title>

print(soup.head)
# the <head> element, including the <title> inside it

print(soup.a)
# the first <a> tag: the "Elsie" link with id="link1"

print(soup.p)
# the first <p> tag: class="title aq", wrapping the <b> element

print(type(soup.a))
# <class 'bs4.element.Tag'>

A Tag has two important attributes: name and attrs.

print(soup.name)
print(soup.head.name)
# [document]
# head

print(soup.p.attrs)
# {'class': ['title', 'aq']}

print(soup.p['class'])
# ['title', 'aq']

print(soup.p.get('class'))  # same as soup.p['class'], but returns None if the attribute is missing
# ['title', 'aq']

These attributes and contents can also be modified, for example

soup.p['class'] = "newClass"
print(soup.p)
# the same <p> tag, now printed with class="newClass"

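More involved operations

import re

# Get all the text in the document
print(soup.get_text())

# Print all attributes of the first a tag
print(soup.a.attrs)

for link in soup.find_all('a'):
    # Get the href attribute of each link
    print(link.get('href'))

# Loop over the direct children of soup.p
for child in soup.p.children:
    print(child)

# Regular-expression match: tags whose name contains "b" (here body and b)
for tag in soup.find_all(re.compile("b")):
    print(tag.name)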

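(2) NavigableString

Now that we can reach a tag, how do we get the text inside it? Use .string, for example

print(soup.title.string)
# The Dormouse's story

.string returns a value only when the tag has a single child. In html_doc the first <p> contains whitespace around its <b> element, so soup.p.string is None; use soup.p.get_text() in that case.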
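How .string behaves depends on how many children a tag has, which is easy to trip over when the HTML is indented. The sketch below uses an illustrative fragment modelled on the first paragraph of html_doc, parsed with the built-in html.parser.

```python
from bs4 import BeautifulSoup

# Illustrative fragment with whitespace around the inner <b> tag.
html = "<p>\n  <b>The Dormouse's story</b>\n</p>"
soup = BeautifulSoup(html, 'html.parser')

print(soup.b.string)                  # a single text child, so .string returns it
print(soup.p.string)                  # whitespace, <b>, whitespace: three children, so None
print(soup.p.get_text(strip=True))    # all text under the tag, whitespace trimmed
print(list(soup.p.stripped_strings))  # each non-empty string, trimmed
```

get_text() and stripped_strings work on any tag regardless of how many children it has, so they are the safer choice when the markup is not under your control.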

Case 2:

Create a file named test.html:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Hello</title>
</head>
<body>
   <div class="aaa" id="xxx">
       <p>Hello <span>world</span></p>
   </div>
   <div class="bbb" s="sss">bbbb1</div>
   <div class="ccc">ccc</div>
   <div class="ddd">dddd</div>
   <div class="eeee">eeeee</div>
</body>
</html>

The Python test script:

from bs4 import BeautifulSoup
import re
# 1. Create the BeautifulSoup object
with open("test.html", encoding="utf-8") as f:
    html_doc = f.read()

soup = BeautifulSoup(html_doc, 'lxml')
# 2. Find an element by tag name
print(f"2.:{soup.title}")
print(f"2.:{soup.title.string}")
# 3. Get the text with get_text()
print(f"3.get_text():{soup.div.get_text()}")
# 4. Read attributes
print("4.", soup.div['class'])
print("4.get", soup.div.get("class"))
print("4.attrs:", soup.div.attrs)
# 5. find_all(self, name=None, attrs={}, recursive=True, text=None,
#                 limit=None, **kwargs):
# 1) Returns every Tag that matches the filters
# 2) Filters can be combined or used on their own
# 3) Filters accept regular expressions
# 4) Parameters
# - name: tag name; defaults to None
# - attrs: a dictionary of tag attributes to match
# - recursive: search all descendants (True, the default) or only direct children
# - text: match on the text inside a tag; accepts regular expressions; defaults to None
# - limit: maximum number of results; None (the default) means no limit,
#   and limit=2 returns only the first two matches
# - kwargs: keyword arguments that filter on specific attributes, e.g. id='', class_=''

divs = soup.find_all("div")
for div in divs:
    print("type(div)", type(div))
    print(div.get_text())
print(soup.find_all(name='div', class_='bbb'))
print("==", soup.find_all(limit=1, attrs={"class": re.compile('^b')}))
print(soup.find_all(text="bbbb1"))
print(soup.find_all(id="xxx"))  # the id in test.html is "xxx"
# 6. find() behaves like find_all() with limit=1, but returns the single match (or None) instead of a list
# 7. Calling a soup or Tag object is a shortcut for find_all() on that object
soup.find_all(name='div', class_='bbb')
soup('div', class_='bbb')  # same result as the line above
# Note: Tag and BeautifulSoup objects are searched in the same way.
# 8. Find the children of the current Tag
# 1) Step by step
div_tag = soup.div
print(type(soup))
print(type(div_tag))
print(div_tag.p)
# 2) Use contents to get a tag's children as a list
print("8.2):", soup.div.contents)
# 9. children returns an iterator over the same child nodes
body_children = soup.body.children
for child in body_children:
    print("9. ", child)
# 10. Parent node
tag_p = soup.p
print("10.", tag_p.parent)

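# 11. Sibling nodes: find_next_siblings
# Finds all siblings that come after the current tag
div_ccc = soup.find(name='div', class_='ccc')
print("11.", div_ccc)
print("11.", div_ccc.find_next_siblings(name='div'))
# 12. Sibling nodes: find_previous_siblings
print("12.", div_ccc.find_previous_siblings(name='div'))

# find_previous_sibling (singular) returns only the nearest preceding match
print("12.", div_ccc.find_previous_sibling(name='div'))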
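The difference between find() and find_all(), and the call shorthand mentioned in the script, can be checked on a shortened copy of test.html. This is a sketch with an illustrative two-div fragment and the built-in html.parser.

```python
from bs4 import BeautifulSoup

# A shortened, illustrative copy of the divs from test.html.
html = '<div class="aaa" id="xxx"><p>Hello</p></div><div class="bbb">bbbb1</div>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('div', class_='bbb')       # a Tag, or None when nothing matches
every = soup.find_all('div', class_='bbb')   # always a list-like result
missing = soup.find('div', class_='zzz')
none_found = soup.find_all('div', class_='zzz')

# Calling the soup object is shorthand for find_all() on it.
shorthand = soup('div', class_='bbb')

# Calling a Tag searches only inside that tag, so this finds nothing:
inside_first_div = soup.div(class_='bbb')

print(first, every, missing, none_found, shorthand, inside_first_div)
```

Since find() can return None, check the result before chaining attribute access on it.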

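Homework:
Use the requests library to crawl Baidu search result pages for a given keyword, fetching multiple pages with multiple threads or multiple processes.

https://www.baidu.com/s?wd=python&pn=20

Crawl the results page by page (10 pages in total).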
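One possible starting point for the homework is sketched below, not a reference solution. It assumes that pn is a result offset that grows by 10 per page (consistent with pn=20 in the URL above) and that the pages are served as UTF-8; the names page_url, fetch, and crawl are hypothetical helpers introduced here.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode

import requests

BASE_URL = 'https://www.baidu.com/s'

def page_url(keyword, page, page_size=10):
    """Build the URL for a 1-based page number (assumes pn is an offset)."""
    return BASE_URL + '?' + urlencode({'wd': keyword, 'pn': (page - 1) * page_size})

def fetch(url):
    """Download one page and return its text."""
    response = requests.get(url, timeout=5)
    response.encoding = 'utf-8'  # assumption: the page is served as UTF-8
    return response.text

def crawl(keyword, pages=10, workers=5, fetch=fetch):
    """Fetch pages 1..pages concurrently; results come back in page order."""
    urls = [page_url(keyword, page) for page in range(1, pages + 1)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))

# Example call (needs network access):
# html_pages = crawl('python', pages=10)
```

Threads suit this task because the work is dominated by waiting on the network; a process pool would follow the same shape with ProcessPoolExecutor.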