Beautiful Soup的简介(1)

解析器使用方法优势劣势
Python标准库 BeautifulSoup(markup, “html.parser”)
  • Python的内置标准库
  • 执行速度适中
  • 文档容错能力强
  • Python 2.7.3 or 3.2.2)前 的版本中文档容错能力差
lxml HTML 解析器 BeautifulSoup(markup, “lxml”)
  • 速度快
  • 文档容错能力强
  • 需要安装C语言库
lxml XML 解析器 BeautifulSoup(markup, [“lxml”, “xml”])BeautifulSoup(markup, “xml”)
  • 速度快
  • 唯一支持XML的解析器
  • 需要安装C语言库
html5lib BeautifulSoup(markup, “html5lib”)
  • 最好的容错性
  • 以浏览器的方式解析文档
  • 生成HTML5格式的文档
  • 速度慢
  • 不依赖外部扩展

基本使用

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')        # 使用lxml解析器创建对本地html文件创建对象
print(soup.prettify())            # 补全文件中未闭合的标签
print(soup.title.string)           # 打印出The Dormouse's story

标签选择器

选择元素

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html5lib')        # 使用html5lib解析器创建对本地html文件创建对象
print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
以p标签为例,这种选择的方法只会选择第一个标签,其他的标签不会显示出来

获取名称

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html5lib')        # 使用html5lib解析器创建对本地html文件创建对象
print(soup.title.name)                # 这里只显示最外层的标签名称

title

获取属性

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html5lib')        # 使用html5lib解析器创建对本地html文件创建对象
print(soup.p.attrs['name'])
print(soup.p['name'])

dromouse

dromouse


获取内容

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html5lib')        # 使用html5lib解析器创建对本地html文件创建对象
print(soup.p.string)
The Dormouse's story

嵌套选择

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html5lib')        # 使用html5lib解析器创建对本地html文件创建对象
print(soup.head.title.string)
The Dormouse's story

子节点和子孙节点

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
<body>
    <p class="story">
        Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">
            <span>Elsie</span>
        </a>
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
        and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html5lib')        # 使用lxml解析器创建对本地html文件创建对象
print(soup.p.contents)                # contents来选择子节点,输出列表的方式,都放在一个列表内

[' Once upon a time there were three little sisters; and their names were ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, ' ', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' and ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '; and they lived at the bottom of a well. ']

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
<body>
    <p class="story">
        Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">
            <span>Elsie</span>
        </a>
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
        and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html5lib')        # 使用lxml解析器创建对本地html文件创建对象
print(soup.p.children)        # 同样都是获取子节点.children是迭代器的形式,不同于contens(放置在一个列表内),只能用循环的方式才能列出节点内容
    for i, child in enumerate(soup.p.children):
    print(i, child)

<list_iterator object at 0x104858a90>
0
Once upon a time there were three little sisters; and their names were

1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4
and

5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6 ;
and they lived at the bottom of a well.

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
    </body>
</html>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html5lib')
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):    #.descendants类似child的方式,同样都是一个迭代的方式,不同的是将子孙节点也分类得更清楚
    print(i, child)
<generator object descendants at 0x10744f0a0>
0 
            Once upon a time there were three little sisters; and their names were
            
1 <a class="sister" href="http://example.com/elsie" id="link1">
                <span>Elsie</span>
            </a>
2 
                
3 <span>Elsie</span>
4 Elsie
5 
            
6 
            
7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
8 Lacie
9 
            and
            
10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
11 Tillie
12 
            and they lived at the bottom of a well.

父节点和祖先节点
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
    </body>
</html>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html5lib')
print(soup.a.parent)        #使用parent获取父节点的内容
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
                <span>Elsie</span>
            </a>
            <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
    </body>
</html>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html5lib')
print(list(enumerate(soup.a.parents)))    #使用枚举的方式列出所有的,一层一层地向外包围
[(0, <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
                <span>Elsie</span>
            </a>
            <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>), (1, <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
                <span>Elsie</span>
            </a>
            <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
    

</body>), (2, <html><head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
                <span>Elsie</span>
            </a>
            <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
    

</body></html>), (3, <html><head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
                <span>Elsie</span>
            </a>
            <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
    

</body></html>)]

兄弟节点与其并列的标签
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
    </body>
</html>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html5lib')
print(list(enumerate(soup.a.next_siblings))) #后面的兄弟节点
print(list(enumerate(soup.a.previous_siblings)))  #前面的兄弟节点
[(0, '
            '), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, '
            and
            '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '
            and they lived at the bottom of a well.
        ')]
[(0, '
            Once upon a time there were three little sisters; and their names were
            ')]

 

 
原文地址:https://www.cnblogs.com/ecwork/p/7456350.html