The BeautifulSoup4 Library

I. BeautifulSoup4

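Beautiful Soup is a Python library for extracting data from HTML and XML files.

Three features of Beautiful Soup:

Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree. It is a toolkit that parses a document and hands the user the data they want to scrape.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not specify one and Beautiful Soup cannot detect it, in which case you only need to state the original encoding.
Beautiful Soup sits on top of popular Python parsers such as lxml and html5lib, allowing you to try different parsing strategies or trade speed for flexibility.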
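
As a small illustration of the encoding behavior described above, the sketch below feeds GBK-encoded bytes to Beautiful Soup, states the original encoding explicitly, and serializes a tag back out as UTF-8. The sample markup and the choice of `html.parser` (to avoid extra installs) are assumptions made for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical input: GBK-encoded bytes with no declared charset.
raw = "<p>你好</p>".encode("gbk")

# from_encoding states the original encoding of the incoming bytes.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string                 # decoded to a Unicode string
utf8_bytes = soup.p.encode("utf-8")  # serialized back out as UTF-8 bytes

print(text)
print(utf8_bytes)
```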

Installation:

pip install beautifulsoup4

# Install from a mirror
pip install beautifulsoup4 -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

Importing the library (for example, in PyCharm):

import bs4

soup = bs4.BeautifulSoup(html,'lxml')

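Parsers:

Parser	Typical usage	Advantages	Disadvantages
Python standard library (html.parser)	BeautifulSoup(markup, "html.parser")	Built into Python; decent speed; lenient with malformed documents	Less lenient in versions before Python 2.7.3 or 3.2.2
lxml HTML parser (recommended)	BeautifulSoup(markup, "lxml")	Very fast; lenient with malformed documents	Requires a C library to be installed
lxml XML parser	BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml")	Very fast; the only XML parser supported	Requires a C library to be installed
html5lib	BeautifulSoup(markup, "html5lib")	Most lenient; parses pages the way a browser does; produces valid HTML5	Slow; a separate pure-Python package (no C extension)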
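
Because lxml and html5lib are optional installs, one way to act on the table above is to try the faster parser first and fall back to the built-in one. The sketch below assumes a hypothetical helper named `make_soup`; it relies on Beautiful Soup raising `FeatureNotFound` when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=("lxml", "html.parser")):
    """Hypothetical helper: use the first parser in `parsers` that is installed."""
    for name in parsers:
        try:
            return BeautifulSoup(markup, name)
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
    raise RuntimeError("none of the requested parsers is available")


soup = make_soup("<p>hello</p>")
print(soup.p.string)
```

Note that different parsers can build different trees from the same malformed markup, so a fallback like this is best suited to reasonably well-formed input.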

HTML source:

<html>
 <head>
  <title>今日头条</title>
  <link rel="dns-prefetch" href="//s3.pstatp.com/" />
  <link rel="dns-prefetch" href="//s3a.pstatp.com/" />
  <link rel="dns-prefetch" href="//i.snssdk.com/" />
  <link rel="dns-prefetch" href="//p1.pstatp.com/" />
  <link rel="dns-prefetch" href="//p3.pstatp.com/" />
  <link rel="dns-prefetch" href="//p9.pstatp.com/" />
  <link rel="shortcut icon" href="//p3.pstatp.com/large/113f2000647359d21b305" type="image/x-icon" />
  <meta name="description" content="《今日头条》(TouTiao.com)是一款会自动学习的资讯软件,它会聪明地分析你的兴趣爱好,自动为你推荐喜欢的内容,并且越用越懂你.你关心的,才是头条!" />
  <meta charset="utf-8" />
  <meta name="viewport" content="width=device-width,initial-scale=1,shrink-to-fit=no,viewport-fit=cover,minimum-scale=1,maximum-scale=1,user-scalable=no" />
  <meta http-equiv="x-ua-compatible" content="ie=edge" />
  <meta name="renderer" content="webkit" />
  <meta name="layoutmode" content="standard" />
  <meta name="imagemode" content="force" />
  <meta name="wap-font-scale" content="no" />
  <meta name="format-detection" content="telephone=no" />
  <script>window.__publicUrl__ = '//sf1-scmcdn-tos.pstatp.com/goofy/mobile_share/';</script>
  <script>var group_id = '6817621907341312515';</script>
 </head>
 <body></body>
</html>

Parsing the HTML above:

Getting tag names:

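# `html` is a string holding the HTML source shown above
soup = bs4.BeautifulSoup(html, 'lxml')
# Access the first <title> tag
print(soup.title)  # result: <title>今日头条</title>
# Pretty-print the parsed document
print(soup.prettify())
# Get the tag name
print(soup.link.name)  # result: link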

Getting tag content:

# Get the text inside the <title> tag
print(soup.title.string)  # result: 今日头条
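
One caveat worth sketching: `.string` only returns text when a tag has a single child; for a tag with mixed content it returns `None`, and `get_text()` is the safer choice. The markup below is made up for the illustration, and `html.parser` is used only to keep the example free of extra installs.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one simple <p> and one <p> with mixed content.
html_doc = "<div><p>plain text</p><p>mixed <b>bold</b> text</p></div>"
soup = BeautifulSoup(html_doc, "html.parser")
first, second = soup.find_all("p")

simple = first.string       # the tag has exactly one text child
mixed = second.string       # None, because the tag has several children
joined = second.get_text()  # concatenates all nested text

print(simple)
print(mixed)
print(joined)
```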

Getting tag attributes:

# Get the name attribute of the first <meta> tag; both forms are equivalent
print(soup.meta.attrs["name"])  # result: description
print(soup.meta["name"])
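
Square-bracket access raises `KeyError` when the attribute is absent, so scraping code often uses `.get()` instead. The following is a minimal sketch on made-up markup; it also shows that `class` comes back as a list because it is a multi-valued attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for the illustration.
html_doc = '<meta name="description" content="demo"><p class="a b">x</p>'
soup = BeautifulSoup(html_doc, "html.parser")

meta = soup.meta
name = meta["name"]                        # present attribute
charset = meta.get("charset")              # missing attribute gives None
fallback = meta.get("charset", "unknown")  # missing attribute with a default
classes = soup.p["class"]                  # multi-valued attribute gives a list

print(name, charset, fallback, classes)
```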

Nested selection:

print(soup.head.title.string)

Second HTML sample (html2):

<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>BeautifulSoup test</title>
</head>
<body>
    <div id="test">
        <p style="align-items: baseline">Diligent practice pays off in the end</p>
        <p class="scr">A basic library for web crawlers; later crawler code depends on it.</p>
        <a href="https://www.baidu.com" name="bd">Baidu</a>
        <a href="https://www.toutiao.com" name="tt">Toutiao</a>
        <div id="scrapy" class="one_y">
            <p class="one_t">Baidu it and you will get dizzy</p>
            <img src="image/lxl.gif">
        </div>
    </div>
</body>
</html>

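Children and descendants:

soup2 = bs4.BeautifulSoup(html2, "lxml")
# Get the direct children of the first <p> tag as a list
print(soup2.p.contents)
# .children yields the same nodes as an iterator
for index, child in enumerate(soup2.p.children):
    print(index, child)

# .descendants yields the children and every nested descendant
for index_2, desc in enumerate(soup2.p.descendants):
    print(index_2, desc)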
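
The difference between children and descendants is easier to see on a tag with nested markup. The sketch below uses a made-up `<p>` containing a `<b>`; `.children` stops at the direct children, while `.descendants` also walks into the `<b>` tag.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a <p> with one text node and one nested tag.
soup = BeautifulSoup("<p>text <b>bold</b></p>", "html.parser")
p = soup.p

children = list(p.children)        # 'text ' and <b>bold</b>
descendants = list(p.descendants)  # 'text ', <b>bold</b>, and 'bold'

print(len(children), len(descendants))
```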

Parent and ancestor nodes:

print(soup2.a.parent)
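
`.parent` gives only the immediate parent. For ancestors, Beautiful Soup also provides the `.parents` iterator and `find_parent()`. A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for the illustration.
html_doc = '<html><body><div id="test"><a href="#">link</a></div></body></html>'
soup = BeautifulSoup(html_doc, "html.parser")

# Walk from the <a> tag up to the top of the tree.
names = [parent.name for parent in soup.a.parents]
# Jump directly to the nearest ancestor with a given tag name.
body = soup.a.find_parent("body")

print(names)
print(body.name)
```

The last entry in `names` is `[document]`, the name of the `BeautifulSoup` object itself.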

Sibling nodes:

# Iterate over every sibling that follows the first <a> tag
for index, nes in enumerate(soup2.a.next_siblings):
    print(index, nes)

# Iterate over every sibling that precedes the first <a> tag
for inde, nesg in enumerate(soup2.a.previous_siblings):
    print(inde, nesg)
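
Sibling iteration includes whitespace text nodes between tags, which is a common surprise. The sketch below, on made-up markup, contrasts `.next_sibling` with `find_next_sibling()` and `find_previous_sibling()`, which skip to the nearest matching tag.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two links separated by a newline.
html_doc = '<div><a name="bd">Baidu</a>\n<a name="tt">Toutiao</a></div>'
soup = BeautifulSoup(html_doc, "html.parser")

first = soup.a
raw_next = first.next_sibling                     # the newline text node
next_link = first.find_next_sibling("a")          # the second <a> tag
prev_link = next_link.find_previous_sibling("a")  # back to the first <a> tag

print(repr(raw_next))
print(next_link["name"], prev_link["name"])
```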

1. Standard selectors

# find_all returns every matching element; find returns only the first match
print(soup2.find_all("a"))  # returns a list-like ResultSet
print(soup2.find_all(id="scrapy"))

print(soup2.find("a"))
print(soup2.find(id="scrapy"))
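
A few more `find_all` filters are sketched below on a trimmed-down, made-up version of the sample markup. Two details are worth noting: `class` is a Python keyword, so the keyword argument is spelled `class_`, and an HTML attribute literally called `name` has to go through `attrs` because `name` is already the tag-name argument.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the sample document.
html_doc = """
<div id="test">
  <p class="scr">first</p>
  <a href="https://www.baidu.com" name="bd">Baidu</a>
  <a href="https://www.toutiao.com" name="tt">Toutiao</a>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

by_class = soup.find_all("p", class_="scr")              # filter by CSS class
by_name_attr = soup.find_all("a", attrs={"name": "bd"})  # filter by a "name" attribute
limited = soup.find_all("a", limit=1)                    # stop after one match
missing = soup.find("span")                              # find returns None when nothing matches

print(by_class, by_name_attr, limited, missing)
```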

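2. CSS selectors

# Select the div by its class attribute; returns a list
print(soup2.select('.one_y'))
# Select the <p> tag inside that div by class; returns a list
print(soup2.select('.one_y .one_t'))
# Select the div by its id attribute; returns a list
print(soup2.select('#test'))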
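
`select()` always returns a list, even for a single match. When only the first match is needed, `select_one()` returns the element itself, or `None` if nothing matches. The sketch below also shows a child combinator and an attribute selector on made-up markup; it assumes a Beautiful Soup version that ships with the Soup Sieve selector engine (4.7 or later).

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the sample document.
html_doc = """
<div id="test">
  <p class="scr">first</p>
  <a href="https://www.baidu.com" name="bd">Baidu</a>
  <a href="https://www.toutiao.com" name="tt">Toutiao</a>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first_p = soup.select_one("#test > p.scr")  # direct child <p> with class "scr"
tt_links = soup.select('a[name="tt"]')      # attribute selector
all_links = soup.select("#test a")          # any <a> under #test
nothing = soup.select_one(".missing")       # no match gives None

print(first_p.get_text(), len(tt_links), len(all_links), nothing)
```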

3. Getting attributes and content

# Get the text content of each matching tag
for item in soup2.select('#test'):
    print(item.get_text())
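
To cover the attribute half of this heading as well, the sketch below collects the `href` and text of each link and the `src` of each image from a made-up fragment modeled on the sample document.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the sample document.
html_doc = """
<div id="test">
  <a href="https://www.baidu.com" name="bd">Baidu</a>
  <a href="https://www.toutiao.com" name="tt">Toutiao</a>
  <div id="scrapy" class="one_y">
    <img src="image/lxl.gif">
  </div>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Pair each link's href attribute with its visible text.
links = [(a["href"], a.get_text(strip=True)) for a in soup.select("#test a")]
# .get() avoids a KeyError if an <img> has no src attribute.
img_srcs = [img.get("src") for img in soup.find_all("img")]

print(links)
print(img_srcs)
```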