Python Web Scraping: Using Beautiful Soup

1. Introduction to Beautiful Soup

Put simply, Beautiful Soup is a Python library whose main job is extracting data from web pages. The official description reads:

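Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying a parse tree. It is a toolkit that parses a document and hands you the data you want to extract; because it is simple, a complete application takes very little code.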
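
To make "navigating, searching, and modifying" concrete, here is a minimal sketch. The markup and the titles in it are invented for illustration, and `html.parser` is used only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for this sketch.
markup = (
    "<html><head><title>Old title</title></head>"
    "<body><a href='/a'>A</a><a href='/b'>B</a></body></html>"
)
soup = BeautifulSoup(markup, "html.parser")

# Navigate: walk from the document to <title> by attribute access.
title = soup.title.string

# Search: collect the href of every <a> tag.
hrefs = [a["href"] for a in soup.find_all("a")]

# Modify: replace the text inside <title>.
soup.title.string = "New title"
new_title = str(soup.title)

print(title, hrefs, new_title)
```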

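Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it, in which case you only need to specify the original encoding.
Beautiful Soup sits on top of popular Python parsers such as lxml and html5lib, so you can choose between different parsing strategies or trade flexibility for speed.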
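
The encoding behavior described above can be sketched as follows. The byte string and the choice of Latin-1 are assumptions made up for this example; `from_encoding` is the constructor argument for stating the original encoding when the document does not declare one.

```python
from bs4 import BeautifulSoup

# Hypothetical input: "café" encoded as Latin-1, with no encoding declared.
raw = b"<p>caf\xe9</p>"

# Tell Beautiful Soup what the original encoding is.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

text = soup.p.string       # decoded to a Unicode string on input
encoded = soup.p.encode()  # encoded to UTF-8 bytes by default on output

print(text)
print(encoded)
```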

2. Downloading and Installing Beautiful Soup

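# Install Beautiful Soup
pip install beautifulsoup4

# Install a parser
Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers. One of them is lxml. Depending on your operating system, you can install lxml with one of the following commands (easy_install is deprecated, so prefer pip where it is available):

$ apt-get install python3-lxml

$ easy_install lxml

$ pip install lxml

Another option is html5lib, a pure-Python parser that parses HTML the same way a web browser does. You can install html5lib with one of the following commands:

$ apt-get install python3-html5lib

$ easy_install html5lib

$ pip install html5lib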
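
After installing, one way to check which parsers Beautiful Soup can actually use is sketched below. The helper name `available_parsers` is hypothetical (it is not part of bs4); it only tests whether the `lxml` and `html5lib` modules can be found.

```python
import importlib.util

from bs4 import BeautifulSoup


def available_parsers():
    """Return the parser names usable in this environment (hypothetical helper)."""
    parsers = ["html.parser"]  # always present: part of the standard library
    for name in ("lxml", "html5lib"):
        # find_spec returns None when the module is not installed
        if importlib.util.find_spec(name) is not None:
            parsers.append(name)
    return parsers


# Parse the same snippet with every parser that was found.
for parser in available_parsers():
    soup = BeautifulSoup("<p>hello</p>", parser)
    print(parser, "->", soup.p.string)
```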

3. Basic Usage of Beautiful Soup

'''
pip3 install beautifulsoup4  # install bs4
pip3 install lxml  # install the lxml parser
'''
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="sister"><b>$37</b></p>
<p class="story" id="p">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" >Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# Import BeautifulSoup from bs4
from bs4 import BeautifulSoup

# Instantiate BeautifulSoup to get a soup object
# Argument 1: the markup to parse
# Argument 2: the parser to use (html.parser, lxml, ...)
soup = BeautifulSoup(html_doc, 'lxml')

print(soup)
print('*' * 100)
print(type(soup))
print('*' * 100)
# Pretty-print the document
html = soup.prettify()
print(html)
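
The example above assumes lxml is installed. As a hedged sketch, a script can fall back to the standard-library parser when it is not. The helper `make_soup` is hypothetical; `FeatureNotFound` is the exception bs4 raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse with the preferred parser, falling back to html.parser (hypothetical helper)."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; use the built-in one.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<html><head><title>The Dormouse's story</title></head></html>")
title = soup.title.string
print(title)
```

Different parsers can build slightly different trees from malformed markup, so a fallback like this is best suited to well-formed documents.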

4. Beautiful Soup: Traversing the Document Tree

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="sister"><b>$37</b></p>
<p class="story" id="p">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" >Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')

'''
Traversing the document tree:
    1. Direct access
    2. Getting a tag's name
    3. Getting a tag's attributes
    4. Getting a tag's text content
    5. Nested selection
    6. Children and descendants
    7. Parent and ancestors
    8. Siblings
'''
27 
# 1. Direct access
print(soup.p)  # the first <p> tag
print(soup.a)  # the first <a> tag

# 2. Getting a tag's name
print(soup.head.name)  # the name of the <head> tag

# 3. Getting a tag's attributes
print(soup.a.attrs)  # all attributes of the first <a> tag, as a dict
print(soup.a.attrs['href'])  # the href attribute of the first <a> tag

# 4. Getting a tag's text content
print(soup.p.text)  # $37

# 5. Nested selection
print(soup.html.head)

# 6. Children and descendants
print(soup.body.children)  # direct children of <body>; returns an iterator
print(list(soup.body.children))  # convert to a list

print(soup.body.descendants)  # all descendants of <body>; also lazy
print(list(soup.body.descendants))

# 7. Parent and ancestors
print(soup.p.parent)  # the parent of the first <p> tag
# .parents returns a generator
print(soup.p.parents)  # all ancestors of the first <p> tag
print(list(soup.p.parents))

# 8. Siblings
# The next sibling
print(soup.p.next_sibling)
# All following siblings; returns a generator
print(soup.p.next_siblings)
print(list(soup.p.next_siblings))

# The previous sibling
print(soup.a.previous_sibling)  # the node just before the first <a> tag
# All preceding siblings of the first <a> tag
print(soup.a.previous_siblings)  # returns a generator
print(list(soup.a.previous_siblings))
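
One detail the sibling and children calls above can hide: the whitespace between tags is itself a node, so `next_sibling` often returns a newline string rather than the next tag. The sketch below uses a shortened, made-up version of the document and `html.parser`; `find_next_sibling()` and an `isinstance` check against `Tag` are two ways to skip the text nodes.

```python
from bs4 import BeautifulSoup, Tag

# Shortened, hypothetical version of the example document.
html_doc = """<body>
<p class="sister"><b>$37</b></p>
<p class="story" id="p">Once upon a time</p>
</body>"""

soup = BeautifulSoup(html_doc, "html.parser")
first_p = soup.p

# The node right after the first <p> is the newline between the two tags.
whitespace_sibling = first_p.next_sibling

# find_next_sibling() skips text nodes and returns the next matching tag.
next_p = first_p.find_next_sibling("p")

# Keep only Tag objects when iterating over children.
child_tag_names = [child.name for child in soup.body.children if isinstance(child, Tag)]

print(repr(whitespace_sibling), next_p["id"], child_tag_names)
```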

5. Beautiful Soup: Searching the Document Tree

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="sister"><b>$37</b></p>
<p class="story" id="p">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" >Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
'''
Searching the document tree:
    find()      returns the first match
    find_all()  returns all matches

Searching by tag and by attribute:
    Tags:
            name    matches the tag name
            attrs   matches tag attributes
            string  matches text content (formerly text)

        - String filter
            exact match against the whole string

        - Regular expression filter
            match using the re module

        - List filter
            match any item in the list

        - Boolean filter
            True matches any value

        - Function filter
            for searches that combine attributes a tag must have
            with attributes it must not have

    Attributes:
        - class_
        - id
'''
 43 
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')

# String filter
# name
p_tag = soup.find(name='p')
print(p_tag)  # the first tag named p
# Find all tags named p
tag_s1 = soup.find_all(name='p')
print(tag_s1)

# attrs
# Find the first tag whose class is sister
p = soup.find(attrs={"class": "sister"})
# print(p)
# Find all tags whose class is sister
tag_s2 = soup.find_all(attrs={"class": "sister"})
print(tag_s2)

# string (called text in older versions of Beautiful Soup)
text = soup.find(string="$37")
print(text)

# Combined:
# Find the <a> tag whose id is link2 and whose text is Lacie
a_tag = soup.find(name="a", attrs={"id": "link2"}, string="Lacie")
print(a_tag)

 76 
# Regular expression filter
import re
# name: matches any tag whose name contains "p"
p_tag = soup.find(name=re.compile('p'))
print(p_tag)

# List filter
# name: a tag matches if any item in the list matches
tags = soup.find_all(name=['p', 'a', re.compile('html')])
print(tags)

# Boolean filter
# True matches any value
# Find the first <p> tag that has an id attribute
p = soup.find(name='p', attrs={"id": True})
print(p)

# Function filter
# Match <a> tags that have both an id and a class attribute
def has_id_and_class(tag):
    return tag.name == 'a' and tag.has_attr('id') and tag.has_attr('class')

tag = soup.find(name=has_id_and_class)
print(tag)
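
The outline above lists `class_` and `id` as attribute searches, and describes function filters as a way to require some attributes while excluding others. A minimal sketch of both follows, using a shortened copy of the example document and `html.parser`; the function name `has_class_but_no_id` is chosen for this illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the example document.
html_doc = """
<p class="story" id="p">
<a href="http://example.com/elsie" class="sister">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# class_ (with a trailing underscore, because class is a Python keyword)
sisters = soup.find_all("a", class_="sister")
sister_names = [a.string for a in sisters]

# id as a keyword argument
lacie = soup.find(id="link2")


def has_class_but_no_id(tag):
    """Function filter: return True for tags to keep."""
    return tag.has_attr("class") and not tag.has_attr("id")


no_id_names = [tag.string for tag in soup.find_all(has_class_but_no_id)]

print(sister_names, lacie.string, no_id_names)
```

A function filter only needs to return a truthy or falsy value for each tag, so returning a boolean expression directly is enough.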