Python BeautifulSoup

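BeautifulSoup converts a complex HTML document into a tree of Python objects. Every node in the tree is one of four kinds of object:

Tag, NavigableString, BeautifulSoup, and Comment

A Tag object corresponds to a tag in the original XML or HTML document. For example, in <title>The Dormouse's story</title> or <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, the title and a tags, together with their contents, are Tag objects.

How do we extract a Tag from a soup object? The examples that follow use this HTML, saved as index.html: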

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href="http://example.com/lacie" class="sister" id="link2"><!-- Lacie --></a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>

and they lived at the bottom of a well</p>

<p class="story">...</p>

</body></html>

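# coding: utf-8

from bs4 import BeautifulSoup

soup = BeautifulSoup(open('index.html'), 'lxml')

print(soup.title)  # extract the title tag

print(soup.a)  # extract the first a tag

print(soup.p)  # extract the first p tag

As the example shows, writing soup followed by a tag name returns that tag together with its contents. Note that this form only finds the first matching tag in the document.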
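A small self-contained sketch of the "first match only" behaviour. The two-paragraph HTML string and the use of the built-in "html.parser" (instead of lxml) are assumptions made so the snippet runs on its own under Python 3:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two <p> tags; this is not the article's index.html.
html = '<p id="first">one</p><p id="second">two</p>'
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p        # attribute access returns only the first <p>
missing = soup.table    # there is no <table>, so the result is None

print(first_p)
print(missing)
```

Because a missing tag yields None rather than an exception, it is worth checking the result before chaining further attribute access onto it.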

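A Tag has two important properties: name and attributes. Every Tag has a name, available as .name:

print(soup.p.name)

print(soup.a.name)

A Tag's name can also be changed. The change is reflected in any HTML generated from the current BeautifulSoup object:

soup.title.name = 'mytitle'

print(soup.title)  # None: no tag is called title any more

print(soup.mytitle)

The title tag has now been renamed to mytitle.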
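The renaming step can be checked in isolation with a sketch like the following. The one-line document and the "html.parser" backend are assumptions chosen to keep it self-contained:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>Story</title></head>", "html.parser")

soup.title.name = "mytitle"   # rename the tag in place

print(soup.title)    # the old name no longer matches anything
print(soup.mytitle)  # the same tag, found under its new name
print(soup)          # the serialized document uses the new name
```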

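Now for attributes. The tag <p class="title"><b>The Dormouse's story</b></p> has a class attribute whose value is title. Tag attributes are read and written the same way as dictionary entries:

print(soup.p['class'])

print(soup.p.get('class'))

All of a Tag's attributes are also available as a dictionary through .attrs:

print(soup.p.attrs)

As with name, the attributes of a tag can be modified, for example:

soup.p['class'] = 'myClass'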
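The dictionary-style access can be tried out with a sketch like this one. The markup is made up for the illustration, and "html.parser" is used only to avoid the lxml dependency:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title big" id="intro">text</p>', "html.parser")
p = soup.p

classes = p["class"]       # class may hold several values, so this is a list
pid = p["id"]              # single-valued attributes are plain strings
missing = p.get("lang")    # .get() returns None instead of raising KeyError

p["class"] = "myClass"     # overwrite an attribute
del p["id"]                # remove an attribute

print(classes, pid, missing)
print(p)
```

Using .get() is the safer choice when an attribute may be absent, since p["lang"] would raise KeyError here.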

NavigableString

So far we have retrieved whole tags. To get the text inside a tag, use .string:

print(soup.p.string)

print(type(soup.p.string))

BeautifulSoup wraps the strings inside a Tag in the NavigableString class. A NavigableString behaves like a Python Unicode string and can be converted to a plain one with str() (unicode() in Python 2).
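A brief sketch of the difference between a NavigableString and a plain string, assuming Python 3 and the built-in "html.parser":

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<b>The Dormouse's story</b>", "html.parser")

text = soup.b.string   # a NavigableString that still knows its place in the tree
plain = str(text)      # a plain Python string with no link to the tree

print(type(text), text.parent.name)
print(type(plain))
```

Converting to a plain string is useful when the text should outlive the parsed document, because a NavigableString keeps a reference to the whole tree.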

BeautifulSoup

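The BeautifulSoup object represents the document as a whole. Most of the time it can be treated as a special kind of Tag. Because it does not correspond to a real HTML or XML tag, it has no attributes, and its .name is the special value '[document]'.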
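A quick way to see how the BeautifulSoup object behaves like a Tag, sketched with a made-up one-paragraph document and "html.parser":

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>hi</p>", "html.parser")

print(soup.name)               # the special name given to the document object
print(soup.p.parent is soup)   # top-level tags have the soup object as parent
print(soup.find_all("p"))      # search methods work on it as on any Tag
```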

Comment

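Tag, NavigableString, and BeautifulSoup cover almost everything in an HTML or XML document, but a few special objects remain. The one most likely to cause trouble is the comment:

print(soup.a.string)

The content of the first a tag is actually a comment, but .string returns it with the comment markers removed. Printing its type shows that it is a Comment. If we do not know in advance what a tag's .string holds, comments can get mixed into the extracted data, so it is worth checking the type when extracting strings:

from bs4.element import Comment

if isinstance(soup.a.string, Comment):
    print(soup.a.string)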
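The type check can also be used the other way round, to keep only visible text. A minimal sketch, assuming a two-link snippet and "html.parser" (the variable name visible is just for illustration):

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = '<a id="link1"><!-- Elsie --></a><a id="link3">Tillie</a>'
soup = BeautifulSoup(html, "html.parser")

visible = []
for a in soup.find_all("a"):
    s = a.string
    # Skip tags without a single string, and skip comment nodes.
    if s is not None and not isinstance(s, Comment):
        visible.append(str(s))

print(visible)
```

Comment is a subclass of NavigableString, which is why .string returns it without complaint and why an explicit isinstance check is needed.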

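Traversing the document tree

BeautifulSoup turns HTML into a document tree and searches that tree. Since it is a tree, the concept of a node is essential.

Child nodes:

Start with direct children. A Tag's .contents and .children attributes are the most important here. .contents returns the Tag's direct children as a list:

print(soup.head.contents)

Note that strings have no .contents attribute, because a string cannot have children. The .children attribute returns an iterator over the same direct children, suitable for use in a loop:

for child in soup.head.children:
    print(child)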

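.contents and .children only cover a Tag's direct children. For example, <head> has a single direct child, <title>. But <title> itself has a child, the string "The Dormouse's story", and that string is also a descendant of <head>. The .descendants attribute iterates recursively over all of a tag's descendants:

for child in soup.head.descendants:
    print(child)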
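The difference between direct children and descendants can be seen side by side in a sketch like this (head-only markup and "html.parser" are assumptions for a standalone run):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>",
                     "html.parser")
head = soup.head

direct = head.contents               # list of direct children
children = list(head.children)       # same nodes, produced by an iterator
descendants = list(head.descendants) # children, grandchildren, and so on

print(len(direct), len(children), len(descendants))
print(descendants)
```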

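That covers child nodes. To get the text content of nodes there are three attributes: .string, .strings, and .stripped_strings.

.string behaves as follows. If a tag contains only a string and no other tags, .string returns that string. If a tag contains exactly one child tag, .string returns whatever that child's .string is. If a tag contains several children, it is ambiguous which one .string should refer to, so .string is None.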
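The three cases for .string can be illustrated with a short sketch; the markup is invented for the purpose and parsed with "html.parser":

```python
from bs4 import BeautifulSoup

html = "<p><b>bold</b></p><div><i>a</i><i>b</i></div>"
soup = BeautifulSoup(html, "html.parser")

only_text = soup.b.string      # <b> holds just a string
one_child = soup.p.string      # <p> holds exactly one tag, so its string is used
many_children = soup.div.string  # <div> holds two tags, so there is no single string

print(only_text, one_child, many_children)
```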

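.strings is for tags that contain more than one string; it can be iterated over:

for string in soup.strings:
    print(repr(string))

.stripped_strings is similar to .strings, but it strips leading and trailing whitespace from each string and skips strings that consist only of whitespace:

for string in soup.stripped_strings:
    print(repr(string))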
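A side-by-side comparison of the two iterators, sketched on a small indented snippet (the markup and the "html.parser" backend are assumptions):

```python
from bs4 import BeautifulSoup

html = "<div>\n  <p> one </p>\n  <p>two</p>\n</div>"
soup = BeautifulSoup(html, "html.parser")

raw = list(soup.strings)             # includes the whitespace between tags
clean = list(soup.stripped_strings)  # whitespace-only strings are dropped

print(raw)
print(clean)
```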

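Parent nodes:

Every Tag and every string has a parent: the tag that contains it. Use .parent to get an element's parent:

print(soup.title.parent)

An element's .parents attribute iterates over all of its ancestors, from the nearest parent up to the top of the document:

for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)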
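The ancestor chain is easy to inspect by collecting the names, as in this sketch (minimal nested markup, "html.parser" assumed):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p><a>x</a></p></body></html>", "html.parser")

direct_parent = soup.a.parent.name
names = [parent.name for parent in soup.a.parents]

print(direct_parent)
print(names)        # ends with the BeautifulSoup object itself
print(soup.parent)  # the document object has no parent
```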

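Sibling nodes:

Siblings are nodes at the same level as the current node. .next_sibling returns the next sibling and .previous_sibling the previous one; if there is no such node, the result is None.

print(soup.p.next_sibling)

print(soup.p.previous_sibling)

print(soup.p.next_sibling.next_sibling)

The first result is whitespace: whitespace and line breaks between tags are nodes too, so a sibling may turn out to be a blank string or a newline rather than a tag.

.next_siblings and .previous_siblings iterate over all of the following or preceding siblings of the current node:

for sibling in soup.a.next_siblings:
    print(repr(sibling))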
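Because whitespace counts as a sibling, it often helps to skip string nodes. The sketch below shows two ways of doing that; the markup is made up and parsed with "html.parser":

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<p>\n<a id="link1">Elsie</a>\n<a id="link2">Lacie</a>\n</p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

after = first.next_sibling                 # the newline between the two links
next_tag = first.find_next_sibling("a")    # skips strings, returns the next <a>
tag_ids = [s["id"] for s in first.next_siblings if isinstance(s, Tag)]

print(repr(after), next_tag["id"], tag_ids)
```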

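Next and previous elements:

These use the .next_element and .previous_element attributes. Unlike .next_sibling and .previous_sibling, they are not restricted to siblings: they move through all nodes in the order they were parsed, regardless of level.

print(soup.head)

print(soup.head.next_element)

To walk over all following or preceding nodes, use the .next_elements and .previous_elements iterators, which move forward or backward through the parsed document:

for element in soup.a.next_elements:
    print(repr(element))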
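The contrast with sibling navigation shows up clearly on a tiny document. A sketch, assuming "html.parser" and invented markup:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<head><title>Story</title></head><body><p>text</p></body>"
soup = BeautifulSoup(html, "html.parser")
head = soup.head

sibling = head.next_sibling.name   # same level: <body>
element = head.next_element.name   # parse order: <title>, inside <head>

# Record tag names for tags and the text itself for strings.
order = [e.name if isinstance(e, Tag) else str(e) for e in head.next_elements]

print(sibling, element)
print(order)
```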

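Searching the document tree:

BeautifulSoup defines many search methods. This section focuses on find_all():

find_all(name, attrs, recursive, text, limit, **kwargs)

The name parameter finds all tags whose name matches; string nodes are ignored automatically. name may be a string, a regular expression, a list, True, or a function. The simplest filter is a string: pass one to a search method and BeautifulSoup looks for tags whose name matches it exactly:

print(soup.find_all('b'))

If you pass a regular expression, BeautifulSoup filters tag names with the pattern's search() method. The following example finds all tags whose names start with b:

import re

for tag in soup.find_all(re.compile("^b")):
    print(tag.name)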

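If you pass a list, BeautifulSoup returns everything that matches any item in the list. The following code finds all <a> and <b> tags in the document:

print(soup.find_all(['a', 'b']))

True matches every value. The following code finds every tag in the document but returns no string nodes:

for tag in soup.find_all(True):
    print(tag.name)

If none of these filters fits, you can define a function that takes a single Tag argument and returns True if the element matches and False otherwise. For example, to select elements that have both a class and an id attribute:

def hasClass_Id(tag):
    return tag.has_attr('class') and tag.has_attr('id')

tag = soup.find_all(hasClass_Id)

print(tag)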
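The five kinds of name filter can be compared on one small document. This is only a sketch: the markup, the helper name has_class_and_id, and the "html.parser" backend are assumptions for the illustration:

```python
import re
from bs4 import BeautifulSoup

html = ("<html><body><b>bold</b>"
        "<p class='story' id='p1'>x</p><a href='#'>y</a></body></html>")
soup = BeautifulSoup(html, "html.parser")

def names(tags):
    return [t.name for t in tags]

def has_class_and_id(tag):
    # Hypothetical filter: keep tags that carry both attributes.
    return tag.has_attr("class") and tag.has_attr("id")

by_string = names(soup.find_all("b"))
by_regex = names(soup.find_all(re.compile("^b")))   # names starting with b
by_search = names(soup.find_all(re.compile("od")))  # pattern found anywhere in the name
by_list = names(soup.find_all(["a", "b"]))
by_true = names(soup.find_all(True))
by_func = names(soup.find_all(has_class_and_id))

print(by_string, by_regex, by_search, by_list, by_true, by_func)
```

Results come back in document order, which is why the list filter yields b before a here.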

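kwargs parameters

In Python, kwargs means keyword arguments. If a keyword argument is not one of find_all()'s own parameter names, it is treated as a filter on the tag attribute with that name. The value may be a string, a regular expression, a list, or True.

If you pass id, BeautifulSoup filters on each tag's id attribute:

print(soup.find_all(id='link2'))

If you pass href, BeautifulSoup filters on each tag's href attribute. For example, to find tags whose href contains elsie:

print(soup.find_all(href=re.compile('elsie')))

The following code finds all tags that have an id attribute, whatever its value:

print(soup.find_all(id=True))

To filter by class, note that class is a reserved word in Python, so add a trailing underscore:

print(soup.find_all('a', class_='sister'))

Several keyword arguments can be combined to filter on several attributes at once:

print(soup.find_all(href=re.compile('elsie'), id='link1'))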

Some attributes cannot be used as keyword arguments, such as the HTML5 data-* attributes:

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'lxml')

data_soup.find_all(data-foo="value") is not valid Python, because data-foo is not a legal keyword argument name. Such attributes can still be searched by passing a dictionary as find_all()'s attrs parameter:

data_soup.find_all(attrs={"data-foo": "value"})
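The attribute filters above can be exercised together in one sketch. The three-element markup is invented for the example and parsed with "html.parser":

```python
import re
from bs4 import BeautifulSoup

html = ('<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
        '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
        '<div data-foo="value">foo!</div>')
soup = BeautifulSoup(html, "html.parser")

by_id = soup.find_all(id="link2")
by_href = soup.find_all(href=re.compile("elsie"))
with_id = soup.find_all(id=True)                     # any tag that has an id
by_class = soup.find_all("a", class_="sister")
by_data = soup.find_all(attrs={"data-foo": "value"}) # name not usable as a keyword

print(by_id, by_href, len(with_id), len(by_class), by_data)
```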

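text parameter

The text parameter searches the strings in the document instead of the tags. Like name, it accepts a string, a regular expression, a list, or True. Beautiful Soup 4.4.0 and later call this parameter string; text is the older name.

print(soup.find_all(text="Elsie"))

print(soup.find_all(text=["Tillie", "Elsie", "Lacie"]))

print(soup.find_all(text=re.compile("Dormouse")))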
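A sketch of string searching using the newer string keyword (available from Beautiful Soup 4.4.0). The markup here uses plain text inside the links, which is an assumption made so the exact-match searches have something to find:

```python
import re
from bs4 import BeautifulSoup

html = "<title>The Dormouse's story</title><a>Elsie</a><a>Tillie</a>"
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(string="Elsie")                 # matching strings
several = soup.find_all(string=["Tillie", "Elsie"])   # any of the listed strings
pattern = soup.find_all(string=re.compile("Dormouse"))
tags = soup.find_all("a", string="Elsie")             # tags whose .string matches

print(exact, several, pattern, tags)
```

Combining a tag name with string returns tags rather than strings, which is usually the more convenient form when the surrounding element is needed.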

limit parameter

find_all() returns every match, which can be slow on a large document. If you do not need all the results, use limit to cap how many are returned:

print(soup.find_all('a', limit=1))

print(soup.find_all('a', limit=3))
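A short sketch of limit, together with find(), which returns the first match directly instead of a one-element list (markup and "html.parser" are assumptions):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a>1</a><a>2</a><a>3</a>", "html.parser")

first_two = soup.find_all("a", limit=2)  # stops after two matches
only_one = soup.find_all("a", limit=1)   # still a list, with one element
first = soup.find("a")                   # the tag itself, or None if absent

print(first_two, only_one, first)
```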

recursive parameter

Called on a tag, find_all() examines all of that tag's descendants. To search only the tag's direct children, pass recursive=False.
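The effect of recursive=False is easiest to see when the target is a grandchild rather than a child. A minimal sketch with invented markup and "html.parser":

```python
from bs4 import BeautifulSoup

html = "<html><head><title>Story</title></head></html>"
soup = BeautifulSoup(html, "html.parser")

deep = soup.html.find_all("title")                      # searches all descendants
shallow = soup.html.find_all("title", recursive=False)  # direct children only
children = soup.html.find_all("head", recursive=False)

print(deep, shallow, children)
```

Here <title> sits inside <head>, so it is a descendant of <html> but not a direct child, and the non-recursive search comes back empty.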

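CSS selectors:

The method to use is soup.select(), which returns a list of matching tags.

1: Searching by tag name

Tag names can be used to search directly or level by level, and also to find a tag's direct children and its siblings.

# Find the title tag directly
print(soup.select('title'))

# Find the title tag level by level

print(soup.select('html head title'))

# Find direct children
# Find the title tag directly under head

print(soup.select("head > title"))

# Find the tag with id="link1" directly under a p tag

print(soup.select("p > #link1"))

# Find siblings
# Find all later siblings of id="link1" that have class="sister"

print(soup.select("#link1 ~ .sister"))

# Find the sibling with class="sister" that immediately follows id="link1"
print(soup.select("#link1 + .sister"))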
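The selectors above can be checked against a compact version of the sample document. This sketch assumes a Beautiful Soup release that ships CSS selector support through its soupsieve dependency, and uses "html.parser" with simplified markup:

```python
from bs4 import BeautifulSoup

html = ('<html><head><title>Story</title></head><body><p class="story">'
        '<a class="sister" id="link1">Elsie</a>'
        '<a class="sister" id="link2">Lacie</a>'
        '<a class="sister" id="link3">Tillie</a>'
        '</p></body></html>')
soup = BeautifulSoup(html, "html.parser")

def ids(tags):
    return [t["id"] for t in tags]

title = soup.select("head > title")           # direct child of <head>
child = ids(soup.select("p > #link1"))        # direct child of <p> with that id
later = ids(soup.select("#link1 ~ .sister"))  # all following siblings
adjacent = ids(soup.select("#link1 + .sister"))  # only the next sibling
one = soup.select_one("#link3")               # first match or None, not a list

print(title, child, later, adjacent, one)
```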