If a few lines of code can do the job, there's no need to click around with the mouse.

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

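As the official docs put it: Beautiful Soup is a Python library for pulling data out of HTML or XML files.

What data? The data you want.
And what data do you want? Well, that I can't tell you~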

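3. How to install bs4

Easy: install it with the pip package manager. The package is published on PyPI as beautifulsoup4 (pip install bs4 also works, because bs4 is a thin wrapper that pulls in the same package).

pip install beautifulsoup4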

What? You don't have pip?
It comes along when you install Python!
If the command is "not found", you probably haven't set up your PATH environment variable.

4. How to use bs4

That's for the next lesson. Class dismissed!

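II. A First Taste: How to Use bs4

1. Making code easier to read

True to its name, this library ("beautiful soup") is for cooking soup. Even if you don't need to extract any data, it can make the HTML you fetched easier to read.

Tags all crammed together are ugly, or rather, hard to read.
Take this example: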

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())

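The code above imports bs4, defines an HTML document as a string, builds a BeautifulSoup object from it, and finally prints the result of prettify().

Note that prettify() is the key call here: it is what reformats the markup, one tag or string per line with indentation.
The result looks like this:

[Figure: formatted code]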

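2. Finding tags

Merely prettifying code would be pretty dull. Code is written to be used, not mainly to be read.
What matters is being able to quickly find the tags I need, and their contents, in this markup.

The examples here, and in the rest of the article, build on the previous example's code, which won't be repeated.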

print(soup.title)
print(soup.title.name)
print(soup.title.parent.name)
print(soup.p)
print(soup.p['class'])
print(soup.a)
print(soup.find_all('a'))
print(soup.find_all(id='link3'))

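I haven't commented this code, but one look should tell you what it does.
The result looks like this:

[Figure: output]
Note: the seventh line of output is cut off in the screenshot; it is a list. That call finds all the a tags and prints them.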

3. An example: finding every hyperlink on a page

for link in soup.find_all('a'):
    print(link.get('href'))

The result looks like this:

[Figure: output]
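
Links on real pages are often relative. Here is a minimal sketch of turning each href into an absolute URL with the standard library's urljoin. The base URL and the small HTML snippet are made up for illustration and are not part of the example above.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical base URL and markup, used only for illustration.
BASE_URL = "http://example.com/stories/"
sample_html = '<a href="elsie">Elsie</a> <a href="/lacie">Lacie</a> <a>no link</a>'

sample_soup = BeautifulSoup(sample_html, "html.parser")

absolute_links = []
for link in sample_soup.find_all("a"):
    href = link.get("href")  # get() returns None when the attribute is missing
    if href is not None:
        absolute_links.append(urljoin(BASE_URL, href))

print(absolute_links)
```

Using link.get('href') rather than link['href'] means an a tag without an href is skipped instead of raising a KeyError.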

4. Getting all the text on a page

print(soup.get_text())

The result looks like this:

[Figure: output]

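5. Summary

Pretty simple, right?
If you like it so far, read on.

Now is a good time to introduce the concept of a parser. A parser is what bs4 uses to parse the HTML document. The code above explicitly names html.parser, the parser from Python's standard library. If you leave the parser out, bs4 picks the best one installed (lxml, then html5lib, then html.parser).

html.parser is there as soon as you install Python, but you can still install more powerful third-party parsers such as lxml and html5lib.

If you want to install a parser such as lxml, you can do this:

# Install the parser with pip
pip install lxml

Here is a figure borrowed from the official docs:
[Figure: parsers]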
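
If you're not sure whether lxml is installed on the machine running your script, one option is to try it first and fall back to the built-in parser. This is only a sketch: make_soup is a helper name invented here, not part of bs4. FeatureNotFound is the exception bs4 raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse with the preferred parser, falling back to html.parser.

    make_soup is a hypothetical helper written for this example.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; use the built-in one.
        return BeautifulSoup(markup, "html.parser")


fallback_soup = make_soup("<p>Hello</p>")
print(fallback_soup.p.string)
```

Keep in mind that different parsers can build slightly different trees from the same broken HTML, so a fallback like this suits well-formed input best.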

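III. Showing Some Edge: Going Further with bs4

1. What goes into the soup

Good soup needs good ingredients. You can pass in an HTML string directly, or you can pass an open file handle.

# Pass an open file handle
from bs4 import BeautifulSoup

with open('index.html') as fp:
    soup = BeautifulSoup(fp, 'html.parser')

# Pass an HTML document as a string
soup = BeautifulSoup("<html>data</html>", 'html.parser')

Note: the code in the screenshot differs slightly from the code above, but it means the same thing.

[Figure: output]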
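
The snippet above assumes an index.html already exists. Below is a self-contained sketch that writes a throwaway file first, so you can check that both routes produce the same soup. The markup and the temporary file are invented for the demonstration.

```python
import os
import tempfile

from bs4 import BeautifulSoup

markup = "<html><body><p>data</p></body></html>"

# Write the markup to a temporary file so the example needs no existing file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(markup)
    path = tmp.name

try:
    # Route 1: hand BeautifulSoup an open file handle.
    with open(path, encoding="utf-8") as fp:
        soup_from_file = BeautifulSoup(fp, "html.parser")
finally:
    os.remove(path)

# Route 2: hand BeautifulSoup the string itself.
soup_from_string = BeautifulSoup(markup, "html.parser")

print(str(soup_from_file) == str(soup_from_string))
```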

2. What's in the soup

1) Tags

Note: a Tag here corresponds to a tag in the HTML, but it is a Python object with a type of its own!

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>

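A Tag has many attributes, for example:

Attribute	How to access it
`name`	`tag.name`
`id`	`tag['id']`

Note 1: as the table shows, a Tag has many attributes, and you can read them the way you would read a dictionary (with ['key']).
Note 2: you can add, modify, and delete these attributes, not just look at them. (To modify one, simply assign to it.)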

Example:

tag['id'] = 'verybold'
tag['another-attribute'] = 1
tag
# <b another-attribute="1" class="boldest" id="verybold">Extremely bold</b>

del tag['id']
del tag['another-attribute']
tag
# <b class="boldest">Extremely bold</b>

tag['id']
# KeyError: 'id'
print(tag.get('id'))
# None

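2) What if an attribute has multiple values

No problem: for multi-valued attributes such as class, BeautifulSoup returns a list containing the values.

css_soup = BeautifulSoup('<p class="body"></p>', 'html.parser')
css_soup.p['class']
# ['body']

css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
css_soup.p['class']
# ['body', 'strikeout']

But what if an attribute only looks multi-valued and actually isn't?
No problem, it won't be split by mistake:

id_soup = BeautifulSoup('<p id="my id"></p>', 'html.parser')
id_soup.p['id']
# 'my id'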

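And what if a single value really is being treated as two?
You can make your position clear by passing multi_valued_attributes=None:

no_list_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser', multi_valued_attributes=None)
no_list_soup.p['class']
# 'body strikeout'

Going the other way, you can call the get_attribute_list() method when you always want the value back as a list, whether or not the attribute is multi-valued.
Sure, your wish is granted:

id_soup.p.get_attribute_list('id')
# ['my id']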

If you ask for the document to be parsed as XML (which requires lxml to be installed), no attribute is treated as multi-valued either:

xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
xml_soup.p['class']
# 'body strikeout'

Next question: what if you parse as XML but still want multiple values handled?
Easy, just say what you want (pass multi_valued_attributes=class_is_multi):

class_is_multi = {'*': 'class'}
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml', multi_valued_attributes=class_is_multi)
xml_soup.p['class']
# ['body', 'strikeout']

3) Replacing a string

tag.string.replace_with("No longer bold")
tag
# <b class="boldest">No longer bold</b>

3. Roaming between Tags

Suppose we have a document like this:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

1) Just call it by name

Call it what it is. Couldn't be simpler.

soup.head
# <head><title>The Dormouse's story</title></head>

soup.title
# <title>The Dormouse's story</title>

2) Start from the dad and work downward

soup.body.b
# <b>The Dormouse's story</b>

3) If several Tags match, only the first is returned

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

4) If you want every match

No problem, just use find_all('tag_name'):

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

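5) How many children: no nesting dolls!

A Tag's direct children are all recorded in its family register (.contents), which is a list:

head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents
# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# ["The Dormouse's story"]

There is also .children, which covers exactly the same direct children, text and comment nodes included. The difference is that .children is an iterator while .contents is a list.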
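
To see for yourself that .contents and .children hold the same nodes, here is a small sketch on a made-up paragraph (not the story document) that mixes text and a tag:

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup("<p>Hello <b>world</b>!</p>", "html.parser")
p_tag = demo.p

contents = p_tag.contents        # a list of direct children
children = list(p_tag.children)  # an iterator, turned into a list here

print(contents == children)
print([type(node).__name__ for node in contents])
```

Both views include the two text nodes around the b tag. Only .descendants would also reach the "world" string inside it.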

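6) Iterating over descendants recursively: nesting dolls allowed!

.contents and .children only look after the sons; grandsons and anything further down are ignored.
For example:

head_tag.contents
# [<title>The Dormouse's story</title>]

In the code above, the text between the two <title> tags is a child of <title>, and therefore a grandchild of <head>, but the .contents and .children of <head> take no notice of it!

If you do want it, .descendants is the dependable one: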

for child in head_tag.descendants:
    print(child)
# <title>The Dormouse's story</title>
# The Dormouse's story

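7) A way out for those at the bottom

If a Tag has only one child, and that child is a NavigableString, then, out of pity, it gets a .string attribute.

title_tag.string
# "The Dormouse's story"

So here's the question: what if a Tag has only one child, and that child in turn has exactly one NavigableString child?
Fine, they're all in the same boat, so they get the same treatment:
the grandparent can reach the string directly through .string:

head_tag.contents
# [<title>The Dormouse's story</title>]

head_tag.string
# "The Dormouse's story"

But when a Tag contains more than one thing, .string becomes ambiguous, right? What then?
Shrug. It returns None; nothing I can do about it:

print(soup.html.string)
# None

But what if I just want to get at all the strings in one go?
Then we'll have to come up with something. How about a .strings attribute?
Great, that settles it: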

for string in soup.strings:
    print(repr(string))
# (the exact whitespace-only strings vary with the parser and bs4 version)
# "The Dormouse's story"
# '\n\n'
# "The Dormouse's story"
# '\n\n'
# 'Once upon a time there were three little sisters; and their names were\n'
# 'Elsie'
# ',\n'
# 'Lacie'
# ' and\n'
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '\n\n'
# '...'
# '\n'

Hey, are you trying to fob me off with all that whitespace?
What now?
Take it down!
Bring out the golden-bell armor, stripped_strings, and wipe out all the whitespace~

for string in soup.stripped_strings:
    print(repr(string))
# "The Dormouse's story"
# "The Dormouse's story"
# 'Once upon a time there were three little sisters; and their names were'
# 'Elsie'
# ','
# 'Lacie'
# 'and'
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '...'
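
A common use of stripped_strings is gluing the pieces back together into one clean line of text. Here is a sketch on a small made-up snippet; get_text() with a separator and strip=True gives the same result in one call.

```python
from bs4 import BeautifulSoup

snippet = "<div>\n  <h1> Title </h1>\n  <p>First <b>bold</b> line.</p>\n</div>"
text_soup = BeautifulSoup(snippet, "html.parser")

# Join the whitespace-stripped strings with single spaces.
joined = " ".join(text_soup.stripped_strings)

# get_text() can strip and join in one call.
same = text_soup.get_text(" ", strip=True)

print(joined)
print(joined == same)
```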

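8) Hold on, someone wants to take it to a higher authority

All this talk of sons and grandsons, so why has nobody called for dad?
Go on, say "dad", da…

With .parent you can go up a level. It's your legitimate right!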

title_tag = soup.title
title_tag
# <title>The Dormouse's story</title>
title_tag.parent
# <head><title>The Dormouse's story</title></head>

The dad of a string at the very bottom is the tag that contains it:

title_tag.string.parent
# <title>The Dormouse's story</title>

So who is the dad at the very top?
The Buddha himself!
Behold the golden body!

html_tag = soup.html
type(html_tag.parent)
# <class 'bs4.BeautifulSoup'>

And who is the Buddha's dad?
Now you're making things up out of thin air~~

print(soup.parent)
# None

What? You think one dad isn't enough?
Fine,
here come the dads!
With .parents you can visit all of them~

link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
for parent in link.parents:
    print(parent.name)
# p
# body
# html
# [document]
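
One handy use of .parents is printing where a tag sits in the tree. Here is a sketch on a tiny made-up document; the " > " separator is just a formatting choice for this example.

```python
from bs4 import BeautifulSoup

tree = BeautifulSoup(
    "<html><body><p><a href='#'>x</a></p></body></html>", "html.parser"
)
anchor = tree.a

# .parents walks upward: p, body, html, then the BeautifulSoup object itself.
names = [anchor.name] + [parent.name for parent in anchor.parents]

# Reverse so the path reads from the top of the tree down to the tag.
path = " > ".join(reversed(names))
print(path)
```

The BeautifulSoup object at the top reports its name as [document], which is why the path starts there.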

9) When it counts, turn to your brothers

With all this talk of sons and dads, you'd think brothers weren't close.
Suppose we have a document like this:

sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", 'html.parser')
print(sibling_soup.prettify())
# <a>
#  <b>
#   text1
#  </b>
#  <c>
#   text2
#  </c>
# </a>

You can reach the younger brother with next_sibling and the older brother with previous_sibling:

sibling_soup.b.next_sibling
# <c>text2</c>

sibling_soup.c.previous_sibling
# <b>text1</b>

So who is the big brother's big brother? And the little brother's little brother?
Oh come on, you're making things up again:

print(sibling_soup.b.previous_sibling)
# None
print(sibling_soup.c.next_sibling)
# None

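But let's be clear: no shared dad, no brotherhood. The strings text1 and text2 have different parents, so they are not siblings:

sibling_soup.b.string
# 'text1'

print(sibling_soup.b.string.next_sibling)
# None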

Remember that summer by the lake?
You thought the .next_sibling of the first <a> tag would be the next <a>.
You were wrong!
Did you think my newline characters didn't exist?

link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

link.next_sibling
# ',\n'

Looking for your brother?
Go one more step!

link.next_sibling.next_sibling
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
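
If you'd rather not count text nodes by hand, find_next_sibling() and find_next_siblings() skip straight to siblings that match a filter. Here is a sketch on a trimmed-down, made-up version of the sisters paragraph:

```python
from bs4 import BeautifulSoup

sisters = BeautifulSoup(
    '<p><a id="link1">Elsie</a>,\n<a id="link2">Lacie</a> and\n'
    '<a id="link3">Tillie</a></p>',
    "html.parser",
)
first = sisters.a

# .next_sibling is the text node sitting between the two tags.
print(repr(first.next_sibling))

# find_next_sibling("a") jumps over the text to the next <a> tag.
next_link = first.find_next_sibling("a")
print(next_link["id"])

# find_next_siblings("a") returns every later <a> sibling.
later_ids = [a["id"] for a in first.find_next_siblings("a")]
print(later_ids)
```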

10) A gathering of heroes takes a crowd

Where are all the younger and older brothers?
Iterate over them one by one; nobody gets away today!

for sibling in soup.a.next_siblings:
    print(repr(sibling))
# ',\n'
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# ' and\n'
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
# ';\nand they lived at the bottom of a well.'

for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))
# ' and\n'
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# ',\n'
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 'Once upon a time there were three little sisters; and their names were\n'

IV. One Step Further: Precise Matching

Suppose we have a document like this:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

1. Be a happy lumberjack

Not that kind of tree: this one is the DOM tree.

soup = BeautifulSoup(html_doc, 'html.parser')

# Find the first match
soup.find('a')

# Find all matches; returns a list
soup.find_all('a')

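2. The regex Heaven Sword

If you pass a regular expression object, bs4 filters tag names using its search() method.

# This example finds all tags whose names start with b
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

# This example finds all tags whose names contain t
for tag in soup.find_all(re.compile("t")):
    print(tag.name)
# html
# title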

3. A list matches them all

If you pass a list, bs4 matches anything that matches any item in the list.

# In this example, all a tags and all b tags are matched
soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

4. When True descends

If you pass True, every Tag is matched (but not the text strings).

for tag in soup.find_all(True):
    print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a
# a
# p

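5. Functions: the ultimate technique

If none of the matching methods above suits you, how about defining a pattern of your own?

You can define a function that decides whether a tag meets your criteria and returns a boolean.
For example: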

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
    
soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were…bottom of a well.</p>,
#  <p class="story">...</p>]

You can also define a matching function that looks at the value of a Tag attribute:

def not_lacie(href):
    return href and not re.compile("lacie").search(href)
soup.find_all(href=not_lacie)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

One more example: checking whether a Tag is surrounded by string objects:

from bs4 import NavigableString
def surrounded_by_strings(tag):
    return (isinstance(tag.next_element, NavigableString)
            and isinstance(tag.previous_element, NavigableString))

for tag in soup.find_all(surrounded_by_strings):
    print(tag.name)
# body
# p
# a
# a
# a
# p
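
A filter function can combine the tag name and its attributes in a single test. Here is a sketch on a small made-up fragment; is_link_with_numbered_id is a name invented for this example.

```python
from bs4 import BeautifulSoup

links_soup = BeautifulSoup(
    '<a href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" id="link2">Lacie</a>'
    '<b id="bold">not a link</b>',
    "html.parser",
)


def is_link_with_numbered_id(tag):
    """Match <a> tags whose id starts with 'link' (hypothetical rule)."""
    return tag.name == "a" and tag.get("id", "").startswith("link")


matches = links_soup.find_all(is_link_with_numbered_id)
print([tag["id"] for tag in matches])
```

tag.get("id", "") returns an empty string for tags without an id, so the function never calls startswith() on None.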

V. Hitting Your Stride: Working with Ease

find_all() comes up again and again; the examples below are all built on it.

1. Searching for Tags

# Usage: find_all(name, attrs, recursive, string, limit, **kwargs)
soup.find_all("title")
# [<title>The Dormouse's story</title>]

soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse's story</b></p>]

soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find_all(id="link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

import re
soup.find(string=re.compile("sisters"))
# 'Once upon a time there were three little sisters; and their names were\n'

soup.find_all(id=True)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

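# Note that some attributes cannot be used as keyword arguments, such as data-xxx. For example:
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')
data_soup.find_all(data-foo="value")
# SyntaxError (data-foo is not a valid keyword argument name)

# There are other exceptions too, such as name, which find_all() takes to mean the tag name:
name_soup = BeautifulSoup('<input name="email"/>', 'html.parser')
name_soup.find_all(name="email")
# []
name_soup.find_all(attrs={"name": "email"})
# [<input name="email"/>]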
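
The same attrs dictionary that rescues name also works for data-* attributes. Here is a minimal sketch; the second div is made up to show that only the matching one comes back.

```python
from bs4 import BeautifulSoup

data_demo = BeautifulSoup(
    '<div data-foo="value">foo!</div><div data-foo="other">bar</div>',
    "html.parser",
)

# Attribute names that are not valid Python identifiers go in the attrs dict.
found = data_demo.find_all(attrs={"data-foo": "value"})
print(found)
```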

2. Searching by CSS class

You can search for tags that carry a CSS class you specify.

But since class is a reserved word in Python, bs4 uses class_ instead.

soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In fact, you can pass a string, a regular expression, a function, or True as the value of class_:

soup.find_all(class_=re.compile("itl"))
# [<p class="title"><b>The Dormouse's story</b></p>]

def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6

soup.find_all(class_=has_six_characters)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Note: when a Tag has several CSS classes, matching any one of them is enough to hit it!

# When a Tag has several CSS classes, you can also write out the exact string:
css_soup.find_all("p", class_="body strikeout")
# [<p class="body strikeout"></p>]

# But if you get the order wrong, you come away empty-handed
css_soup.find_all("p", class_="strikeout body")
# []

# So if you want to match several CSS classes at once without coming up empty, use a CSS selector:
css_soup.select("p.strikeout.body")
# [<p class="body strikeout"></p>]

3. Tags aren't your one and only

You can also search for strings directly

soup.find_all(string="Elsie")
# ['Elsie']

soup.find_all(string=["Tillie", "Elsie", "Lacie"])
# ['Elsie', 'Lacie', 'Tillie']

soup.find_all(string=re.compile("Dormouse"))
# ["The Dormouse's story", "The Dormouse's story"]

def is_the_only_string_within_a_tag(s):
    """Return True if this string is the only child of its parent tag."""
    return (s == s.parent.string)

soup.find_all(string=is_the_only_string_within_a_tag)
# ["The Dormouse's story", "The Dormouse's story", 'Elsie', 'Lacie', 'Tillie', '...']

soup.find_all("a", string="Elsie")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

4. limit: capping the number of results

soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

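5. recursive: nesting dolls or not

You can call mytag.find_all() to search only beneath a given tag. By default the search covers all of its descendants; if you also pass recursive=False, the search is not recursive and only direct children are considered.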

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
...

soup.html.find_all("title")
# [<title>The Dormouse's story</title>]

soup.html.find_all("title", recursive=False)
# []

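6. A shortcut for find_all

As handy as find_all already is, the authors went and made it even shorter: calling a BeautifulSoup object or a Tag like a function is the same as calling find_all() on it.

# These two lines do the same thing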
soup.find_all("a")
soup("a")

# These two lines do the same thing
soup.title.find_all(string=True)
soup.title(string=True)

VI. Feeling Restless? Slow Down

find_all is extremely handy, but sometimes you only want the first result. Rather than passing limit=1 to find_all every time, just use find().

1. Meet find

soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]

soup.find('title')
# <title>The Dormouse's story</title>

How it differs from find_all():

find_all() returns a list; find() returns the result itself
When nothing matches, find_all() returns an empty list; find() returns None
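
Because find() returns None when nothing matches, chaining straight onto its result can blow up with an AttributeError. Here is a minimal sketch of guarding against that, using a made-up document that has a title but no h1:

```python
from bs4 import BeautifulSoup

page = BeautifulSoup("<html><head><title>Hi</title></head></html>", "html.parser")

title = page.find("title")
missing = page.find("h1")  # there is no <h1>, so this is None

# Check for None before touching attributes of the result.
title_text = title.string if title is not None else ""
heading_text = missing.string if missing is not None else ""

print(repr(title_text), repr(heading_text))
print(page.find_all("h1"))  # an empty list, safe to loop over
```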

VII. Enjoy!

To be continued…