Python Web Scraping (6): BeautifulSoup, a Powerful Parsing Tool

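Preface

These notes on BeautifulSoup mainly record the most commonly used features. For anything deeper, consult the official documentation.

BeautifulSoup documentation (Chinese edition): https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/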

Introduction to BeautifulSoup

Beautiful Soup is a Python library for extracting data from HTML and XML files.

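Installing beautifulsoup4
(env) pip install beautifulsoup4 -i https://pypi.doubanio.com/simple/
beautifulsoup4 parsers
Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers, one of which is lxml. How lxml is installed can vary by operating system.
Installing the beautifulsoup4 module alone is enough for the standard-library parser. The lxml parsers (the second and third rows of the table below) additionally require the lxml package:
pip install lxml -i https://pypi.doubanio.com/simple/

The html5lib parser (the fourth row) additionally requires the html5lib module:
pip install html5lib -i https://pypi.doubanio.com/simple/

The commonly used parsers are listed below: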

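Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; moderate speed; tolerant of malformed documents	Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; tolerant of malformed documents	Requires a C library
lxml XML parser	BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml")	Fast; the only parser that supports XML	Requires a C library
html5lib	BeautifulSoup(markup, "html5lib")	Best tolerance of malformed documents; parses the way a browser does; produces HTML5-format documents	Slow; does not depend on external extensions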
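
As a minimal sketch of how the parser name from the table is passed to the constructor, the snippet below parses a deliberately malformed fragment with the built-in html.parser. The markup string and variable names are illustrative assumptions and not part of the original notes; other parsers may repair the same fragment differently.

```python
from bs4 import BeautifulSoup

# Illustrative malformed fragment: an unclosed <a> followed by a stray </p>.
broken_markup = "<a></p>"

# The second argument selects the parser; "html.parser" needs no extra install.
soup = BeautifulSoup(broken_markup, "html.parser")

# html.parser does not wrap the fragment in <html>/<body> elements.
has_anchor = soup.a is not None
has_html_wrapper = soup.find("html") is not None
print(has_anchor, has_html_wrapper)
```

Swapping "html.parser" for "lxml" or "html5lib" is the only change needed to try another parser, provided the corresponding package is installed.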

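beautifulsoup4 encoding
- When Beautiful Soup outputs a document, the output is encoded as UTF-8 regardless of the encoding of the input document.
- To output in an encoding other than UTF-8, pass the encoding name to prettify(): print(soup.prettify("latin-1"))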
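
The following sketch shows the difference between calling prettify() with and without an encoding name. The sample markup is an assumption chosen so that the encoded bytes are easy to recognize.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>café</p>", "html.parser")

# With no argument, prettify() returns a Python str.
as_text = soup.prettify()

# With an encoding name, prettify() returns bytes in that encoding.
as_latin1 = soup.prettify("latin-1")

print(type(as_text).__name__, type(as_latin1).__name__)
print("café".encode("latin-1") in as_latin1)
```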

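Kinds of objects

Beautiful Soup converts a document into Python objects of four kinds: Tag, NavigableString, BeautifulSoup, and Comment.

Tag
A tag's attributes are accessed the same way as dictionary entries. The most important properties of a tag are its name and its attributes:
tag.name # get the tag name
tag.name = "blockquote" # change the tag name
tag['class'] # get the class attribute
tag.attrs # get all attributes as a dictionary
del tag['class'] # delete an attribute
rel_soup.a['rel'] = ['index', 'contents'] # modify an attribute, giving: Back to the <a rel="index contents">homepage</a>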

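Attributes with multiple values:
css_soup.p['class'] # ["body", "strikeout"]
If an attribute looks like it has multiple values but is not defined as multi-valued in any version of the HTML standard, Beautiful Soup returns it as a single string. The id attribute is an example:
id_soup.p['id'] # 'my id'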
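
A small runnable sketch of the multi-valued behavior described above. The css_soup and id_soup documents are assumed here, since the notes use those names without defining them.

```python
from bs4 import BeautifulSoup

# Assumed definitions for the css_soup / id_soup names used in the notes.
css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")
id_soup = BeautifulSoup('<p id="my id"></p>', "html.parser")

class_value = css_soup.p["class"]   # class is multi-valued: returned as a list
id_value = id_soup.p["id"]          # id is single-valued: returned as one string
print(class_value)
print(id_value)

# Assigning a list to a multi-valued attribute joins the values on output.
css_soup.p["class"] = ["body", "bold"]
rendered = str(css_soup.p)
print(rendered)
```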
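Navigable strings: NavigableString
Strings are usually contained inside tags. Beautiful Soup wraps the strings inside a tag in the NavigableString class.
A string inside a tag cannot be edited in place, but it can be replaced with another string using the replace_with() method.
BeautifulSoup
The BeautifulSoup object represents the whole document. Most of the time it can be treated as a Tag object: it supports most of the methods described in the sections on navigating and searching the tree.
soup.name # '[document]'
Comments and special strings: Comment
Tag, NavigableString, and BeautifulSoup cover almost everything in an HTML or XML document, but a few special objects remain. The one most likely to need attention is the comment portion of a document.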

# The following code demonstrates the four kinds of objects
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest b">Extremely bold</b>', 'lxml')
tag = soup.b
type(tag)  # <class 'bs4.element.Tag'>
tag.name  # 'b'
tag.attrs  # {'class': ['boldest', 'b']}
tag['class']  # ['boldest', 'b']

print(type(tag.string))  # <class 'bs4.element.NavigableString'>
tag.string.replace_with("No longer bold")
print(tag.string)  # No longer bold

print(soup.name)  # [document]

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'lxml')
comment = soup.b.string
print(type(comment))  # <class 'bs4.element.Comment'>

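If Beautiful Soup emits the warning "UserWarning: No parser was explicitly specified", the cause is that no parser name was passed to the BeautifulSoup constructor.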

Navigating the tree

The examples below use the following document, an excerpt based on Alice in Wonderland:

html_doc = """
<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
# Name the parser explicitly; lxml adds the <body> element that the examples below rely on
soup = BeautifulSoup(html_doc, 'lxml')

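Child nodes
A Tag may contain multiple strings or other Tags; these are all its child nodes. Beautiful Soup provides many attributes for navigating and iterating over them:
# Get the <head> tag
soup.head # <head><title>The Dormouse's story</title></head>
# Get the first <b> tag inside the <body> tag
soup.body.b # <b>The Dormouse's story</b>

To get all the <a> tags, or anything more than the first tag with a given name, use the methods described in the section on searching the tree, such as find_all().
- .contents, .children, .descendants
 - A tag's .contents attribute returns its child nodes as a list.
 - .children is an iterator over the tag's child nodes, suitable for looping.
 - .contents and .children include only the tag's direct children.
 - .descendants iterates recursively over all of the tag's descendants.
- .string, .strings, .stripped_strings
 - If a tag has exactly one child and that child is a NavigableString, .string returns it: title_tag.string
 - If a tag contains more than one string, use .strings to iterate over them.
 - The strings produced may include a lot of whitespace and blank lines; .stripped_strings removes the extra whitespace.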
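
The attributes in the list above can be sketched as follows. The shortened markup is an assumption (a trimmed variant of the example document), and html.parser is used only so the snippet runs without extra packages.

```python
from bs4 import BeautifulSoup

# Trimmed, illustrative variant of the example document.
html = (
    "<html><head><title>The Dormouse's story</title></head>"
    '<body><p class="story">Once <a id="link1">Elsie</a> and '
    '<a id="link2">Lacie</a></p></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

head = soup.head
direct_children = [child.name for child in head.children]  # direct children only
descendant_count = len(list(head.descendants))             # <title> plus its string

title_string = soup.title.string        # single string child, so .string works
p_string = soup.p.string                # several children, so .string is None
p_strings = list(soup.p.strings)        # every string, whitespace included
p_clean = list(soup.p.stripped_strings) # same strings with whitespace stripped

print(direct_children, descendant_count)
print(title_string, p_string)
print(p_strings)
print(p_clean)
```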
Parent nodes
- Direct parent: .parent
- All ancestors: .parents

link = soup.a
for parent in link.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)

"""
p
body
html
[document]
"""

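Sibling nodes
- .next_siblings and .previous_siblings
  The .next_siblings and .previous_siblings attributes iterate over the siblings that follow or precede the current node.
- .next_sibling and .previous_sibling
  Next sibling: .next_sibling
  Previous sibling: .previous_sibling
  Each returns None when there is no such sibling.
  One case to watch for: whitespace between two tags is itself a string node, so the next sibling of a tag is often that whitespace rather than the next tag, as in the following code: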

html = """
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a><a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
"""
sibling_soup = BeautifulSoup(html, 'lxml')
print(sibling_soup.a.next_sibling)   # prints only whitespace: the next sibling is the newline string between the two tags
print(sibling_soup.a.next_sibling.next_sibling)  # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
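
One possible way to skip such whitespace nodes is to keep only Tag objects while iterating, or to use find_next_sibling(). This is a sketch under the assumption that only element siblings are of interest; the markup and variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    '<p><a id="link1">Elsie</a>\n'
    '<a id="link2">Lacie</a><a id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")
first_link = soup.a

# The immediate next sibling is the newline string, not a tag.
raw_next = first_link.next_sibling

# Keep only element siblings, dropping string nodes such as "\n".
tag_sibling_ids = [
    sibling["id"] for sibling in first_link.next_siblings
    if isinstance(sibling, Tag)
]

# find_next_sibling() searches the following siblings for a matching tag.
next_link_id = first_link.find_next_sibling("a")["id"]

print(repr(raw_next), tag_sibling_ids, next_link_id)
```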

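Searching the tree: find() and find_all()

Beautiful Soup defines many search methods. This section concentrates on two of them: find() and find_all().

Filters
Before describing find_all(), here are the kinds of filters, which are used throughout the search API. A filter can be applied to a tag's name, to its attributes, to string content, or to a combination of these.
- String: soup.find_all('b') # find all <b> tags in the document
- Regular expression: soup.find_all(re.compile("^b")) # find all tags whose names start with "b"; here both <body> and <b> match
- List: soup.find_all(["a", "b"]) # all <a> tags and all <b> tags in the document
- True: soup.find_all(True) # matches every tag, but returns no string nodes
- Function: if none of the other filters fits, define a function that takes a single element as its only argument and returns True if the element matches and False otherwise.
 The following function returns True if the element has a class attribute but no id attribute: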

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were...</p>,
#  <p class="story">...</p>]
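
For comparison, the five filter kinds can be run side by side on one small document. The markup and the names() and has_href() helpers are hypothetical, introduced only for this sketch.

```python
import re
from bs4 import BeautifulSoup

html = '<html><body><b>bold</b><a href="#">link</a><p>text</p></body></html>'
soup = BeautifulSoup(html, "html.parser")

def names(tags):
    """Hypothetical helper: reduce a result list to tag names for display."""
    return [tag.name for tag in tags]

def has_href(tag):
    """Function filter: match any tag that carries an href attribute."""
    return tag.has_attr("href")

by_string = names(soup.find_all("b"))                # exact tag name
by_regex = names(soup.find_all(re.compile("^b")))    # names starting with "b"
by_list = names(soup.find_all(["a", "b"]))           # any name in the list
by_true = names(soup.find_all(True))                 # every tag, no strings
by_function = names(soup.find_all(has_href))         # custom predicate

print(by_string, by_regex, by_list, by_true, by_function)
```

Results come back in document order, which is why the list filter yields the <b> tag before the <a> tag here.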

The following code finds all tags that are surrounded by strings:

from bs4 import NavigableString
def surrounded_by_strings(tag):
    return (isinstance(tag.next_element, NavigableString)
            and isinstance(tag.previous_element, NavigableString))

for tag in soup.find_all(surrounded_by_strings):
    print(tag.name)
# p
# a
# a
# a
# p

find_all()
find_all( name , attrs , recursive , text , **kwargs )

find_all() searches all the tag descendants of the current tag and returns those that match the filter conditions.

soup.find_all("title")
# [<title>The Dormouse's story</title>]

soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse's story</b></p>]

soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find_all(id="link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

import re
soup.find(text=re.compile("sisters"))
# 'Once upon a time there were three little sisters; and their names were\n'

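Parameters
- name parameter
 The name parameter finds all tags whose name is name: soup.find_all("title") # [<title>The Dormouse's story</title>]
- keyword parameters
 Any keyword argument that is not one of the built-in parameter names is treated as a filter on the tag attribute of that name. Passing an argument named id, for example, makes Beautiful Soup filter on each tag's "id" attribute.
 soup.find_all(id='link2')
 soup.find_all(href=re.compile("elsie")) # filter on each tag's "href" attribute
 soup.find_all(id=True) # find all tags that have an id attribute, whatever its value
 
 Passing several keyword arguments filters on several attributes at once:
 soup.find_all(href=re.compile("elsie"), id='link1')
 
 Some attribute names cannot be used as keyword arguments, such as the HTML5 data-* attributes. Tags with such attributes can still be found by passing a dictionary to the attrs parameter of find_all():
 data_soup.find_all(attrs={"data-foo": "value"})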
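- Searching by CSS class
 Searching for tags by CSS class is very useful, but class is a reserved word in Python, so using class as a keyword argument is a syntax error. Since Beautiful Soup 4.1.1, tags with a given CSS class can be found with the class_ parameter:
 soup.find_all("a", class_="sister")
 
 class_ accepts the same kinds of filter: a string, a regular expression, a function, or True:
 soup.find_all(class_=re.compile("itl"))
 
 A tag's class attribute is multi-valued, so a search by CSS class matches against each of the tag's classes individually:
 css_soup.find_all("p", class_="body")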
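- text parameter
 The text parameter searches the string content of the document. Like the name parameter, it accepts a string, a regular expression, a list, or True. Since Beautiful Soup 4.4.0 the same parameter is also available under the name string.
 The following code finds the <a> tags whose string is "Elsie":
 soup.find_all("a", text="Elsie")
- limit parameter
 find_all() returns every match, which can be slow on a large document tree. If not all results are needed, the limit parameter caps the number of results returned.
- recursive parameter
 When find_all() is called on a tag, Beautiful Soup examines all of that tag's descendants. To search only the tag's direct children, pass recursive=False.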
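- Calling a tag like find_all()
 find_all() is probably the most commonly used search method in Beautiful Soup, so it has a shorthand: a BeautifulSoup object or a Tag object can be called like a function, with the same result as calling its find_all() method. The following two lines are equivalent:
 soup.find_all("a")
 soup("a")
 These two lines are also equivalent:
 soup.title.find_all(text=True)
 soup.title(text=True)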
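
The parameters above can be exercised together in one runnable sketch. The document below is an assumed, shortened variant of the example markup with a data-foo element added for illustration, and the string= spelling is used in place of text=, on the assumption that Beautiful Soup 4.4.0 or later is installed.

```python
import re
from bs4 import BeautifulSoup

html = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
<div data-foo="value">foo!</div>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")

by_id = soup.find_all(id="link2")                          # keyword argument
by_two_attrs = soup.find_all(href=re.compile("elsie"), id="link1")
by_data = soup.find_all(attrs={"data-foo": "value"})       # attrs dictionary
by_class = soup.find_all("a", class_="sister")             # CSS class
by_string = soup.find_all("a", string="Elsie")             # exact string match
limited = soup.find_all("a", limit=1)                      # stop after one hit

# <title> is a child of <head>, not a direct child of <html>.
direct_titles = soup.html.find_all("title", recursive=False)

# Calling the object is shorthand for find_all().
shorthand_matches = soup("a") == soup.find_all("a")

print(len(by_id), len(by_two_attrs), len(by_data), len(by_class))
print(by_string[0]["id"], len(limited), direct_titles, shorthand_matches)
```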
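find()
find( name , attrs , recursive , text , **kwargs )
When only one result is needed, calling find() directly is simpler than calling find_all() with limit=1.
The only difference is that find_all() returns a list containing a single element, while find() returns the element itself.
When nothing matches, find_all() returns an empty list and find() returns None.
The following two calls are nearly equivalent: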

soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]

soup.find('title')
# <title>The Dormouse's story</title>
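
Because find() returns None when nothing matches, chained attribute access on its result can fail. A minimal sketch of guarding against that follows; the text_or_default() helper is hypothetical and not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

html = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(html, "html.parser")

title = soup.find("title")            # a Tag
missing = soup.find("table")          # None: no <table> in the document
missing_list = soup.find_all("table") # empty list

def text_or_default(tag, default=""):
    """Hypothetical helper: return the tag's text, or a default for None."""
    return tag.get_text() if tag is not None else default

title_text = text_or_default(title)
missing_text = text_or_default(missing, default="(no table)")
print(title_text, missing_text, len(missing_list))
```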

Other find functions
For details on using the following functions, see the official documentation.
- find_parents() and find_parent()
- find_next_siblings() and find_next_sibling()
- find_previous_siblings() and find_previous_sibling()
- find_all_next() and find_next()
- find_all_previous() and find_previous()
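
As a rough illustration of how these functions differ in search direction, the sketch below starts from the middle link of an assumed, shortened document. Each call takes the same kinds of filter as find_all().

```python
from bs4 import BeautifulSoup

html = (
    '<html><body><p class="story"><a id="link1">Elsie</a>, '
    '<a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
    '<p class="end">...</p></body></html>'
)
soup = BeautifulSoup(html, "html.parser")
middle = soup.find(id="link2")

# Upward: the nearest enclosing <p>.
parent_class = middle.find_parent("p")["class"]

# Sideways: siblings after and before the starting tag.
next_sibling_id = middle.find_next_sibling("a")["id"]
previous_sibling_id = middle.find_previous_sibling("a")["id"]

# Document order: anything parsed after or before the starting tag.
later_link_ids = [a["id"] for a in middle.find_all_next("a")]
next_paragraph_class = middle.find_next("p")["class"]
earlier_link_id = middle.find_previous("a")["id"]

print(parent_class, next_sibling_id, previous_sibling_id)
print(later_link_ids, next_paragraph_class, earlier_link_id)
```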

CSS selectors

Beautiful Soup supports most CSS selectors. Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax:

Find tags nested at any depth beneath other tags:
soup.select("body a")
soup.select("html head title")
Find tags that are direct children of another tag:
soup.select("p > #link1")
soup.select("head > title")
Find sibling tags:
All following siblings: soup.select("#link1 ~ .sister")
Immediately following sibling: soup.select("#link1 + .sister")
Find tags by CSS class or id:
soup.select(".sister")
soup.select("[class~=sister]")

soup.select("#link1")
soup.select("a#link2")
Find tags by the presence of an attribute:
soup.select('a[href]')
Find tags by attribute value:
soup.select('a[href="http://example.com/elsie"]')
soup.select('a[href$="tillie"]')
soup.select('a[href*=".com/el"]')
Find tags by language setting:
multilingual_soup.select('p[lang|=en]')
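
A few of the selectors above can be checked against a shortened copy of the example document. The ids() helper is hypothetical, and the sketch assumes a Beautiful Soup version that provides select_one() alongside select().

```python
from bs4 import BeautifulSoup

html = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")

def ids(tags):
    """Hypothetical helper: reduce matched tags to their id values."""
    return [tag["id"] for tag in tags]

children = ids(soup.select("p > a"))               # direct children of <p>
following = ids(soup.select("#link1 ~ .sister"))   # all following siblings
adjacent = ids(soup.select("#link1 + .sister"))    # immediately following sibling
by_suffix = ids(soup.select('a[href$="tillie"]'))  # href ends with "tillie"
first_sister = soup.select_one(".sister")["id"]    # first match only

print(children, following, adjacent, by_suffix, first_sister)
```

Sibling combinators look only at element siblings, so the text between the links (", " and " and ") does not affect the matches.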

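Modifying the tree

Beautiful Soup's main strength is searching the document tree, but it also makes modifying the tree straightforward.
The details are omitted here; see the official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#id40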

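Personal blog: Loak 正 - a personal blog focused on artificial intelligence and the Internet
Article address: Python Web Scraping (6): BeautifulSoup, a Powerful Parsing Tool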