Web Scraping: Beautiful Soup

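2.2 Introducing and Creating the BeautifulSoup Object

2.2.1 The BeautifulSoup Object

A BeautifulSoup object represents the entire parsed document tree. It supports most of the methods used for navigating and searching the tree.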

2.2.2 Creating a BeautifulSoup Object

1. Import the module
2. Create the BeautifulSoup object

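# 1. Import the module
from bs4 import BeautifulSoup

# 2. Create the BeautifulSoup object
# 'lxml' selects the lxml parser, which must be installed separately (pip install lxml)
soup = BeautifulSoup('<html>data</html>', 'lxml')
print(soup)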
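
As a small illustration of what "represents the entire document tree" means in practice, the sketch below parses a slightly larger fragment and reaches into it by tag name. It assumes the built-in 'html.parser' instead of lxml so that nothing beyond bs4 needs to be installed; the sample markup and variable names are made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this illustration
markup = '<html><head><title>Demo</title></head><body><p>data</p></body></html>'

# Assumption: 'html.parser' ships with Python, so no extra parser is required
soup = BeautifulSoup(markup, 'html.parser')

title_tag = soup.title            # first <title> tag anywhere in the tree
title_text = soup.title.string    # the text node inside <title>
body_text = soup.body.get_text()  # all text found under <body>

print(type(soup).__name__)
print(title_text, body_text)
```

Different parsers can repair incomplete markup differently, so the printed tree for the same input may vary between 'lxml' and 'html.parser'.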

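2.2.3 The find Method

Purpose: search the document tree.

Parameters:
name: the tag name to match
attrs: a dictionary of attribute names and values to match
recursive: whether to search all descendants (True, the default) or only direct children (False)
text: match on text content (Beautiful Soup 4.4.0 and later call this argument string; text is the older name)

Return value: the first matching element, or None if nothing matches.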
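
The recursive parameter and the None return value are easy to overlook, so here is a minimal sketch of both. It assumes the built-in 'html.parser' and a made-up fragment; the names outer, nested, shallow, and missing are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the <a> tag is a grandchild of the <div>, not a direct child
html = '<div id="outer"><p>direct child</p><span><a href="#">nested link</a></span></div>'
soup = BeautifulSoup(html, 'html.parser')

outer = soup.find('div', attrs={'id': 'outer'})

# Default (recursive=True): every descendant of the div is searched
nested = outer.find('a')

# recursive=False: only direct children of the div are checked, so nothing is found
shallow = outer.find('a', recursive=False)

# No <table> exists in the fragment, so find returns None instead of raising
missing = soup.find('table')

# Guard against None before touching the result
link_text = nested.get_text() if nested is not None else ''
print(link_text, shallow, missing)
```

Because find returns None on a miss, checking the result before reading attributes avoids an AttributeError in real scraping code.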

# 1. Import the module
from bs4 import BeautifulSoup
# 2. Prepare the document string
html = '''
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <meta http-equiv="X-UA-Compatible" content="IE=edge">
        <meta name="viewport" content="width=device-width, initial-scale=1.0">
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="title">
            <b>The Dormouse's story</b>
        </p>
        <p class="story">Once upon a time there were three little sisters;and their names were 
            <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
        </p>
        <p class="story">...</p>
    </body>
    </html>
'''
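# 3. Create the BeautifulSoup object
soup = BeautifulSoup(html, 'lxml')

# I. Search by tag name
# 4. Find the title tag
title = soup.find('title')
print(title)
# 5. Find the first a tag
a = soup.find('a')
print(a)

# Find all a tags (find_all returns a list-like collection of every match)
a_s = soup.find_all('a')
print(a_s)

# II. Search by attribute
# Find the tag whose id is link1
# Option 1: pass the attribute as a keyword argument
a = soup.find(id='link1')
print(a)
# Option 2: pass an attribute dictionary through attrs
a = soup.find(attrs={'id': 'link1'})
print(a)

# III. Search by text
text = soup.find(text='Elsie')
print(text)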
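
The searches above each use a single filter. As a hedged sketch of how they can be combined, the example below uses a trimmed copy of the sample document and the built-in 'html.parser'. It assumes Beautiful Soup 4.4.0 or later for the string argument; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample document, kept short for illustration
html = '''
<p class="story">
  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
'''
soup = BeautifulSoup(html, 'html.parser')

# class is a reserved word in Python, so bs4 accepts class_ as the keyword
first_sister = soup.find('a', class_='sister')

# Tag name and several attributes together: all conditions must match
second = soup.find('a', attrs={'class': 'sister', 'id': 'link2'})

# A text search returns the text node itself; .parent gives the enclosing tag
text_node = soup.find(string='Lacie')
owner = text_node.parent

# find_all returns every match, which makes it easy to collect one attribute
ids = [tag['id'] for tag in soup.find_all('a')]
print(first_sister['id'], second.get_text(), owner['id'], ids)
```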

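2.2.4 The Tag Object

A Tag object corresponds to an XML or HTML tag in the original document. It has many methods and attributes for navigating and searching the document tree and for reading the tag's content.

name: the tag name
attrs: a dictionary of all the tag's attributes and their values
text: the tag's text content as a string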

# Tag object (a is the tag found in the previous example)
print(type(a))
print('Tag name:', a.name)
print('All tag attributes:', a.attrs)
print('Tag text:', a.text)
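
Beyond name, attrs, and text, individual attributes are usually read one at a time. The following is a minimal self-contained sketch, assuming the built-in 'html.parser' and a made-up tag with two class values and padded text; it is an illustration, not part of the original example.

```python
from bs4 import BeautifulSoup

# Hypothetical tag with two class values and whitespace around its text
soup = BeautifulSoup(
    '<a href="http://example.com/elsie" class="sister big" id="link1"> Elsie </a>',
    'html.parser',
)
a = soup.find('a')

href = a['href']                   # dictionary-style access; raises KeyError if absent
classes = a['class']               # class is multi-valued, so a list comes back
missing = a.get('title')           # .get returns None for an absent attribute
fallback = a.get('title', 'n/a')   # or a default of your choice
clean = a.get_text(strip=True)     # text with surrounding whitespace removed

print(href, classes, missing, fallback, clean)
```

Using a.get(...) rather than a[...] is the safer choice when an attribute may be absent on some tags.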