Python Web Scraping Study Notes (6)

BS4:

Reference documentation:

https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/

Test1 (basic usage):

Sample markup:

"""
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

Test code:

# coding=gbk
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

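# 1. Convert the markup into a BeautifulSoup object
# If no parser is given, bs4 picks the best one installed (such as lxml) and emits a warning,
# so specify the parser explicitly
soup = BeautifulSoup(html_doc, 'lxml')

# 2. Pretty-print the document (missing closing tags are filled in)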
result = soup.prettify()
print(result)

Output:

E:Python3.9python.exe H:/code/Python爬虫/Day07/01-beautiful_soup.py
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

Process finished with exit code 0
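
The test above passes 'lxml' explicitly, which only works when the lxml package is installed. As a minimal sketch, the snippet below tries lxml first and falls back to the standard library's 'html.parser' if bs4 reports that lxml is unavailable. The helper name make_soup is hypothetical and not part of bs4.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse with lxml when available, otherwise use the built-in parser.

    make_soup is a hypothetical helper used only for this illustration.
    """
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Raised by bs4 when the requested parser is not installed
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p class='title'><b>The Dormouse's story</b></p>")
print(soup.b.string)
```

The two parsers can build slightly different trees for incomplete markup (lxml wraps fragments in html and body tags, html.parser does not), so it is worth sticking to one parser within a project.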

Test2 (reading content):

Code:

# coding=gbk
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

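# 1. Convert the markup into a BeautifulSoup object
# If no parser is given, bs4 picks the best one installed (such as lxml) and emits a warning,
# so specify the parser explicitly
soup = BeautifulSoup(html_doc, 'lxml')

# 2. Parse the data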
result1 = soup.head
result2 = soup.p
result3 = soup.a
print(result1)
print(result2)
print(result3)

# 3. Read the tag's text content
result4 = soup.a.string
print(result4)
# 4. Read an attribute
result5 = soup.a['href']
print(result5)

Output:

<head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
Elsie
http://example.com/elsie

Note:

As the output shows, accessing a tag by name (for example soup.a) returns only the first matching tag in the document.
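
To make the difference concrete, here is a small sketch comparing soup.a with find_all("a") on a shortened copy of the sample markup. It uses the built-in 'html.parser' (an assumption made so the snippet runs without lxml), and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first_link = soup.a                     # only the first <a> in the document
all_links = soup.find_all("a")          # every <a>, in document order
hrefs = [a["href"] for a in all_links]  # collect one attribute from each tag

print(first_link["id"])
print(hrefs)
```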

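Test3 (the four object types):

The four object types:

Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four types: Tag, NavigableString, BeautifulSoup, and Comment.

Code: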

# coding=gbk
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story"><!--s1mpL3...--></p>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

"""
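# 1. Convert the markup into a BeautifulSoup object
# If no parser is given, bs4 picks the best one installed (such as lxml) and emits a warning,
# so specify the parser explicitly
soup = BeautifulSoup(html_doc, 'lxml')
print(type(soup))
# 2. Parse the data
# Tag object: bs4.element.Tag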
result1 = soup.head
result2 = soup.p.string
print(result2)
result3 = soup.a
print(type(result1))

# The content of an HTML comment has type bs4.element.Comment
print(type(result2))

print(type(result3))

# 3. Read the text content: NavigableString
result4 = soup.a.string
print(type(result4))

# 4. Read an attribute
result5 = soup.a['href']
print(type(result5))
print(type(soup))

Output:

<class 'bs4.BeautifulSoup'>
s1mpL3...
<class 'bs4.element.Tag'>
<class 'bs4.element.Comment'>
<class 'bs4.element.Tag'>
<class 'bs4.element.NavigableString'>
<class 'str'>
<class 'bs4.BeautifulSoup'>
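
Because Comment is a subclass of NavigableString, .string happily returns comment text as if it were ordinary content. The sketch below shows one way to tell the two apart with isinstance. The helper visible_string is hypothetical, and 'html.parser' is used so the snippet does not depend on lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

html_doc = (
    '<p class="story"><!--s1mpL3...--></p>'
    '<p class="title"><b>The Dormouse\'s story</b></p>'
)
soup = BeautifulSoup(html_doc, "html.parser")


def visible_string(tag):
    """Return tag.string as str, or None if it is missing or a comment.

    visible_string is a hypothetical helper for this illustration.
    """
    value = tag.string
    if value is None or isinstance(value, Comment):
        return None
    return str(value)


comment_node = soup.p.string
print(isinstance(comment_node, NavigableString))  # a Comment is also a NavigableString
print(visible_string(soup.p))
print(visible_string(soup.b))
```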

Test4 (general-purpose methods - find()):

Overview:

find -- returns the first tag object that matches the query

Code:

# coding=gbk
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story"><!--s1mpL3...--></p>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

"""
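# 1. Convert the markup into a BeautifulSoup object
soup = BeautifulSoup(html_doc, 'lxml')
# 2. General-purpose search methods
# find -- returns the first tag that matches the query
result1 = soup.find(name="p")
result2 = soup.find(attrs={"class": "title"})
# string= replaces the older text= argument (renamed in Beautiful Soup 4.4.0)
result3 = soup.find(string="Tillie")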
result4 = soup.find(
    name="p",
    attrs={"class": "title"},
)
print(result1)
print(result2)
print(result3)
print(result4)

Output:

<p class="story"><!--s1mpL3...--></p>
<p class="title"><b>The Dormouse's story</b></p>
Tillie
<p class="title"><b>The Dormouse's story</b></p>
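
One detail the test does not exercise: when nothing matches, find() returns None, so chaining an attribute access onto the result raises an error. A minimal sketch of guarding against that, using 'html.parser' (an assumption so that lxml is not required) and illustrative variable names:

```python
from bs4 import BeautifulSoup

html_doc = '<p class="title"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html_doc, "html.parser")

missing = soup.find(name="table")  # there is no <table> in this markup
print(missing)

# Check for None before reading anything from the result
title_tag = soup.find("p", class_="title")
title_text = title_tag.get_text() if title_tag is not None else ""
print(title_text)
```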

Test5 (general-purpose methods - find_all()):

Overview:

find_all -- returns a list of all matching tag objects

Code:

# coding=gbk
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story"><!--s1mpL3...--></p>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

"""
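# 1. Convert the markup into a BeautifulSoup object
soup = BeautifulSoup(html_doc, 'lxml')
# 2. General-purpose search methods
# find_all -- returns a list of all matching tag objects
result1 = soup.find_all('a')
result2 = soup.find_all("a", limit=1)[0]  # roughly what find() does internally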
result3 = soup.find_all(attrs={"class": "sister"})

print(result1)
print(result2)
print(result3)

Output:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
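
find_all() accepts a few more filter forms than the test shows. The following sketch, run against a shortened copy of the sample markup with 'html.parser' (an assumption so that lxml is not required), illustrates the class_ keyword, limit, a list of tag names, and the empty result for a query with no matches.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

by_class = soup.find_all("a", class_="sister")  # class_ avoids the reserved word "class"
first_two = soup.find_all("a", limit=2)         # stop after two matches
mixed = soup.find_all(["b", "a"])               # match any of several tag names
no_match = soup.find_all("table")               # empty list rather than None

ids = [a["id"] for a in by_class]
print(ids)
print(len(first_two), len(mixed), len(no_match))
```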

Test6 (general-purpose methods - select_one()):

Overview:

select_one -- CSS selector; returns the first matching tag

Code:

# coding=gbk
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story"><!--s1mpL3...--></p>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

"""
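# 1. Convert the markup into a BeautifulSoup object
soup = BeautifulSoup(html_doc, 'lxml')
# 2. General-purpose search methods
# select_one -- CSS selector; returns the first matching tag
# Per the function's source, it behaves like select() with limit=1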
result1 = soup.select_one('.sister')

print(result1)

Output:

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
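
Like find(), select_one() returns a single tag or None. A small sketch, assuming 'html.parser' and a shortened copy of the sample markup; the ".brother" class is made up for the no-match case:

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

second = soup.select_one("#link2")     # ID selector, at most one result
missing = soup.select_one(".brother")  # hypothetical class that matches nothing

print(second.get_text())
print(missing)
```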

Test7 (general-purpose methods - select()):

Overview:

select -- CSS selector; returns a list of all matching tags

Code:

# coding=gbk
from bs4 import BeautifulSoup

html_doc = """
<html><head><title id=one>The Dormouse's story</title></head>
<body>
<p class="story"><!--s1mpL3...--></p>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

"""
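# 1. Convert the markup into a BeautifulSoup object
soup = BeautifulSoup(html_doc, 'lxml')
# 2. General-purpose search methods
# select -- CSS selector; returns a list of all matching tags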
result1 = soup.select('.sister')
result2 = soup.select('#one')
result3 = soup.select('head title')
result4 = soup.select('title, .title')
result5 = soup.select('a[id="link3"]')

print(result1)
print(result2)
print(result3)
print(result4)
print(result5)

Output:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<title id="one">The Dormouse's story</title>]
[<title id="one">The Dormouse's story</title>]
[<title id="one">The Dormouse's story</title>, <p class="title"><b>The Dormouse's story</b></p>]
[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
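
The selectors in the test cover class, ID, descendant, grouping, and exact attribute matches. As a further sketch, a few other standard CSS forms are shown below on a shortened copy of the markup. This assumes a Beautiful Soup release recent enough to delegate select() to the soupsieve package, and it uses 'html.parser' so that lxml is not required.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

children = soup.select("p.story > a")            # <a> tags directly inside p.story
ends_with = soup.select('a[href$="tillie"]')     # href attribute ending in "tillie"
second_link = soup.select("a:nth-of-type(2)")    # second <a> among its siblings

print([a["id"] for a in children])
print([a.get_text() for a in ends_with])
print([a.get_text() for a in second_link])
```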

Test8 (general-purpose methods - get_text()):

Code:

# coding=gbk
from bs4 import BeautifulSoup

html_doc = """
<html><head><title id=one>The Dormouse's story</title></head>
<body>
<p class="story"><!--s1mpL3...--></p>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

"""
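# 1. Convert the markup into a BeautifulSoup object
soup = BeautifulSoup(html_doc, 'lxml')
# 2. General-purpose search methods

# Text wrapped by the tag (select() returns a list, so index it first)
result1 = soup.select('b')[0].get_text()
# An attribute of the tag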
result2 = soup.select('#link1')[0].get('href')

print(result1)
print(result2)

Output:

The Dormouse's story
http://example.com/elsie
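
get_text() and get() each take optional arguments that are handy on messier markup. The sketch below is only an illustration: the markup is a trimmed variant of the sample in which the second link deliberately has no href, and 'html.parser' is assumed so that lxml is not required.

```python
from bs4 import BeautifulSoup

html_doc = """<p class="story">Once upon a time
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a class="sister" id="link2">Lacie</a></p>"""
soup = BeautifulSoup(html_doc, "html.parser")

story = soup.select_one("p.story")
# strip=True trims each text fragment; separator joins the fragments
text = story.get_text(separator=" ", strip=True)
print(text)

link2 = soup.select_one("#link2")
missing_href = link2.get("href")        # None, because the attribute is absent
fallback_href = link2.get("href", "#")  # the second argument is a default value
print(missing_href, fallback_href)
```

By contrast, link2["href"] would raise a KeyError here, which is why get() is the safer choice when an attribute may be missing.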

XML:

Data interchange format:

The format in which the front end, mobile clients, and the back end exchange data

Parameters:

Server, [ ], dict = {}

key = value

<key>value</key>
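
To illustrate the key = value versus <key>value</key> correspondence, here is a minimal sketch that turns a flat Python dict into XML and back using the standard library's xml.etree.ElementTree. The helper dict_to_xml, the root name "sister", and the sample payload are all hypothetical. Nested data and attributes are not handled.

```python
import xml.etree.ElementTree as ET


def dict_to_xml(root_name, data):
    """Build <root><key>value</key>...</root> from a flat dict.

    dict_to_xml is a hypothetical helper for this illustration.
    """
    root = ET.Element(root_name)
    for key, value in data.items():
        child = ET.SubElement(root, key)  # each key becomes a child element
        child.text = str(value)           # each value becomes that element's text
    return root


payload = {"name": "Elsie", "id": "link1"}
xml_text = ET.tostring(dict_to_xml("sister", payload), encoding="unicode")
print(xml_text)

# Reverse direction: read the child elements back into a dict
parsed = {child.tag: child.text for child in ET.fromstring(xml_text)}
print(parsed)
```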
