Python Web Scraping Study Notes (6)

BS4:

Reference documentation:

https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/

Test1 (basic usage):

Sample document:

"""
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

Test code:

# coding=gbk
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

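# 1. Convert the string into a BeautifulSoup object
# If no parser is named, bs4 picks the best one installed (lxml if available) and emits a warning,
# so name the parser explicitly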
soup = BeautifulSoup(html_doc, 'lxml')

# 2. Pretty-print the document (missing closing tags are filled in)
result = soup.prettify()
print(result)

Output:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>


Test2 (reading content):

Code:

# coding=gbk
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

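# 1. Convert the string into a BeautifulSoup object
# If no parser is named, bs4 picks the best one installed (lxml if available) and emits a warning,
# so name the parser explicitly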
soup = BeautifulSoup(html_doc, 'lxml')

# 2. Look up tags by name
result1 = soup.head
result2 = soup.p
result3 = soup.a
print(result1)
print(result2)
print(result3)

# 3. Read the text content of a tag
result4 = soup.a.string
print(result4)
# 4. Read an attribute of a tag
result5 = soup.a['href']
print(result5)

Output:

<head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
Elsie
http://example.com/elsie

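Note:

As the output shows, accessing a tag by name as an attribute (soup.p, soup.a) returns only the first matching tag in the document.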
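
To make the note above concrete, here is a minimal sketch that contrasts attribute-style access with find_all(). It assumes the built-in "html.parser" instead of the 'lxml' parser used in these notes, only so that it runs without an extra install; the shortened HTML snippet is taken from the sample document.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""

# Assumption: html.parser ships with Python; the notes use 'lxml'.
soup = BeautifulSoup(html_doc, "html.parser")

first_link = soup.a              # attribute access: the first <a> only
all_links = soup.find_all("a")   # every <a>, in document order

first_href = first_link["href"]
all_hrefs = [a["href"] for a in all_links]

print(first_href)
print(all_hrefs)
```

If every occurrence of a tag is needed, find_all() (covered in Test5) is the method to reach for.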

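Test3 (the four object types):

The four object types:

Beautiful Soup converts a complex HTML document into a tree structure.
Every node in the tree is a Python object,
and all of these objects fall into 4 types:
Tag, NavigableString, BeautifulSoup, Comment.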

Code:

# coding=gbk
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story"><!--s1mpL3...--></p>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

"""
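# 1. Convert the string into a BeautifulSoup object
# If no parser is named, bs4 picks the best one installed (lxml if available) and emits a warning,
# so name the parser explicitly
soup = BeautifulSoup(html_doc, 'lxml')
print(type(soup))
# 2. Look up tags by name
# Tag object: bs4.element.Tag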
result1 = soup.head
result2 = soup.p.string
print(result2)
result3 = soup.a
print(type(result1))

# The content of an HTML comment has type bs4.element.Comment
print(type(result2))

print(type(result3))

# 3. Read the text content: NavigableString
result4 = soup.a.string
print(type(result4))

# 4. Read an attribute (a plain str)
result5 = soup.a['href']
print(type(result5))
print(type(soup))

Output:

<class 'bs4.BeautifulSoup'>
s1mpL3...
<class 'bs4.element.Tag'>
<class 'bs4.element.Comment'>
<class 'bs4.element.Tag'>
<class 'bs4.element.NavigableString'>
<class 'str'>
<class 'bs4.BeautifulSoup'>
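
Because .string can hand back either a Comment or an ordinary NavigableString, it can help to check the type before treating the value as visible text. The sketch below is one way to do that; visible_text is a hypothetical helper name, and "html.parser" is assumed in place of 'lxml' so that no extra install is needed.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

html_doc = (
    '<p class="story"><!--s1mpL3...--></p>'
    '<p class="title"><b>The Dormouse\'s story</b></p>'
)
soup = BeautifulSoup(html_doc, "html.parser")

comment_node = soup.p.string   # the first <p> holds only a comment
text_node = soup.b.string      # ordinary text inside <b>

is_comment = isinstance(comment_node, Comment)
# Comment is a subclass of NavigableString, so this is also true
comment_is_navstring = isinstance(comment_node, NavigableString)
text_is_comment = isinstance(text_node, Comment)


def visible_text(node):
    """Hypothetical helper: return '' for comments and None, else the text."""
    if node is None or isinstance(node, Comment):
        return ""
    return str(node)


print(repr(visible_text(comment_node)))
print(repr(visible_text(text_node)))
```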

Test4 (general method - find()):

Overview:

find -- returns the first object that matches the query, or None if nothing matches

Code:

# coding=gbk
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story"><!--s1mpL3...--></p>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

"""
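# 1. Convert the string into a BeautifulSoup object
soup = BeautifulSoup(html_doc, 'lxml')
# 2. General-purpose lookup methods
# find -- returns the first object that matches the query
result1 = soup.find(name="p")
result2 = soup.find(attrs={"class": "title"})
# string= matches text nodes and returns a NavigableString (older code spells it text=)
result3 = soup.find(string="Tillie")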
result4 = soup.find(
    name="p",
    attrs={"class": "title"},
)
print(result1)
print(result2)
print(result3)
print(result4)

Output:

<p class="story"><!--s1mpL3...--></p>
<p class="title"><b>The Dormouse's story</b></p>
Tillie
<p class="title"><b>The Dormouse's story</b></p>
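
Since find() returns None when nothing matches, chaining a call onto a missing result raises AttributeError. The following sketch shows one defensive pattern; the "footer" class is a hypothetical value that does not occur in the sample document, and "html.parser" is assumed in place of 'lxml'.

```python
from bs4 import BeautifulSoup

html_doc = '<p class="title"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html_doc, "html.parser")

found = soup.find("p", attrs={"class": "title"})
missing = soup.find("p", attrs={"class": "footer"})  # hypothetical class, no match

# Guard against None before calling methods on the result
title_text = found.get_text() if found is not None else ""
footer_text = missing.get_text() if missing is not None else ""

print(repr(title_text))
print(repr(footer_text))
```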

Test5 (general method - find_all()):

Overview:

find_all -- returns a list of all matching tag objects

Code:

# coding=gbk
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story"><!--s1mpL3...--></p>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

"""
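# 1. Convert the string into a BeautifulSoup object
soup = BeautifulSoup(html_doc, 'lxml')
# 2. General-purpose lookup methods
# find_all -- returns a list of all matching tag objects
result1 = soup.find_all('a')
result2 = soup.find_all("a", limit=1)[0]  # roughly what find() does internally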
result3 = soup.find_all(attrs={"class": "sister"})

print(result1)
print(result2)
print(result3)

Output:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
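
A few other find_all() filters are worth knowing alongside the ones above. This is a small sketch rather than a full tour: class_ is the keyword bs4 provides for filtering on the class attribute, a list of names matches any of them, and limit caps the number of results. As before, "html.parser" is assumed instead of 'lxml'.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# class_ (with trailing underscore) avoids clashing with Python's keyword
first_two = soup.find_all("a", class_="sister", limit=2)
first_two_ids = [a["id"] for a in first_two]

# A list of names matches any of them, in document order
title_or_bold = soup.find_all(["title", "b"])
names = [tag.name for tag in title_or_bold]

print(first_two_ids)
print(names)
```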

Test6 (general method - select_one()):

Overview:

select_one -- returns the first tag that matches a CSS selector

Code:

# coding=gbk
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story"><!--s1mpL3...--></p>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

"""
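# 1. Convert the string into a BeautifulSoup object
soup = BeautifulSoup(html_doc, 'lxml')
# 2. General-purpose lookup methods
# select_one -- CSS selector; returns the first matching tag
# Its source shows that a limit is applied, i.e. limit=1 (details may differ between bs4 versions)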
result1 = soup.select_one('.sister')

print(result1)

Output:

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
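
As with find(), a selector that matches nothing gives None from select_one(), while select() gives an empty list. A short sketch, assuming "html.parser" in place of 'lxml' and using ".brother" as a hypothetical class that does not exist in the document:

```python
from bs4 import BeautifulSoup

html_doc = '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
soup = BeautifulSoup(html_doc, "html.parser")

hit = soup.select_one(".sister")
miss_one = soup.select_one(".brother")   # hypothetical class, no match
miss_all = soup.select(".brother")

hit_id = hit["id"] if hit is not None else None

print(hit_id)
print(miss_one)
print(len(miss_all))
```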

Test7 (general method - select()):

Overview:

select -- returns a list of all tags that match a CSS selector

Code:

# coding=gbk
from bs4 import BeautifulSoup

html_doc = """
<html><head><title id=one>The Dormouse's story</title></head>
<body>
<p class="story"><!--s1mpL3...--></p>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

"""
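# 1. Convert the string into a BeautifulSoup object
soup = BeautifulSoup(html_doc, 'lxml')
# 2. General-purpose lookup methods
# select -- CSS selector; returns a list of all matching tags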
result1 = soup.select('.sister')
result2 = soup.select('#one')
result3 = soup.select('head title')
result4 = soup.select('title, .title')
result5 = soup.select('a[id="link3"]')

print(result1)
print(result2)
print(result3)
print(result4)
print(result5)

Output:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<title id="one">The Dormouse's story</title>]
[<title id="one">The Dormouse's story</title>]
[<title id="one">The Dormouse's story</title>, <p class="title"><b>The Dormouse's story</b></p>]
[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
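
The selectors above cover class, id, descendant, grouping, and exact attribute matches. Two more that tend to be useful when scraping are the child combinator (>) and the attribute-suffix match ($=). The sketch below assumes "html.parser" rather than 'lxml' and a slightly shortened copy of the sample document.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title id="one">The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Child combinator: <a> tags that are direct children of <p class="story">
child_links = soup.select("p.story > a")
hrefs = [a.get("href") for a in child_links]

# Attribute suffix match: href ends with "tillie"
suffix_match = soup.select('a[href$="tillie"]')

print(hrefs)
print(suffix_match[0]["id"])
```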

Test8 (general method - get_text()):

Code:

# coding=gbk
from bs4 import BeautifulSoup

html_doc = """
<html><head><title id=one>The Dormouse's story</title></head>
<body>
<p class="story"><!--s1mpL3...--></p>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

"""
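# 1. Convert the string into a BeautifulSoup object
soup = BeautifulSoup(html_doc, 'lxml')
# 2. General-purpose lookup methods
# Text wrapped by a tag (select() returns a list, so index into it)
result1 = soup.select('b')[0].get_text()
# Attribute of a tag
result2 = soup.select('#link1')[0].get('href')
print(result1)
print(result2)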

Output:

The Dormouse's story
http://example.com/elsie
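
get_text() also accepts a separator and a strip flag, which helps when a tag mixes text with child tags. The sketch below uses a shortened, hypothetical one-line paragraph and assumes "html.parser" in place of 'lxml'. It also shows that .get() returns None for an attribute that is absent, whereas indexing with [] would raise KeyError.

```python
from bs4 import BeautifulSoup

html_doc = (
    '<p class="story">Once upon a time there were '
    '<a href="http://example.com/elsie" id="link1">Elsie</a> and '
    '<a href="http://example.com/lacie" id="link2">Lacie</a>.</p>'
)
soup = BeautifulSoup(html_doc, "html.parser")
p = soup.p

# Default: all text nodes concatenated as they appear
raw_text = p.get_text()

# Separator plus strip=True: each text node is stripped, then joined
pieces = p.get_text("|", strip=True)

# .get() returns None for a missing attribute instead of raising
missing_attr = soup.a.get("title")

print(raw_text)
print(pieces)
print(missing_attr)
```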

XML:

Data interchange format:

The data format exchanged between the front end, mobile clients, and the back end

Parameters:

Server, [ ], dict = {}

key = value

<key>value</key>
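
The key = value to <key>value</key> mapping above can be illustrated with Python's standard xml.etree.ElementTree module. This is only a sketch of the idea: the record contents and the "item" root element are hypothetical, and real interchange formats usually define their own schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; each key becomes an element, each value its text
record = {"name": "Elsie", "id": "link1"}

root = ET.Element("item")  # hypothetical root element name
for key, value in record.items():
    child = ET.SubElement(root, key)
    child.text = value

# Serialize to a str (encoding="unicode" returns text, not bytes)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)

# Parse it back into a dict to show the round trip
parsed = {child.tag: child.text for child in ET.fromstring(xml_text)}
print(parsed)
```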

Original article: https://www.cnblogs.com/3cH0-Nu1L/p/14487352.html