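Data Parsing

The lxml Library

lxml is an HTML/XML parser implemented in C. Its main job is to parse HTML/XML documents and extract data from them.

Basic Usage

We can use lxml to parse HTML. If the HTML is malformed, lxml repairs it automatically while parsing, for example by adding missing closing tags.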

from lxml import etree

# Note: the last <li> below is deliberately missing its closing </li> tag
text = """
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
 """
# Parse the string into an HTML document; lxml repairs the markup as it goes
html = etree.HTML(text)
# Serialize the document back to text (tostring() returns bytes, so decode them)
res = etree.tostring(html, encoding="utf-8").decode("utf-8")
print(res)
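
To check that the repair really happens, one can query the parsed tree instead of reading the serialized output by eye. The following is a minimal sketch; the `broken` snippet is a made-up example in which both `</li>` tags are missing.

```python
from lxml import etree

# Hypothetical malformed snippet: neither <li> is closed
broken = "<ul><li>first<li>second</ul>"

tree = etree.HTML(broken)

# After parsing, each <li> is a separate, properly closed element
items = tree.xpath("//li")
texts = [li.text for li in items]
print(len(items), texts)

# encoding="unicode" makes tostring() return str instead of bytes
serialized = etree.tostring(tree, encoding="unicode")
print(serialized)
```

The serialized output also shows that lxml wraps the fragment in `<html>` and `<body>` elements.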

Reading an HTML File

First we create an HTML file named pacong.html with the following content:

<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>

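Then we read the file with the following code:

from lxml import etree
# parse() uses an XML parser by default; passing HTMLParser() makes it
# tolerate real-world HTML that is not well-formed XML
parser = etree.HTMLParser()
html = etree.parse("pacong.html", parser)
res = etree.tostring(html, encoding="utf-8").decode("utf-8")
print(res)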
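
Once a document is parsed, data is usually pulled out with XPath, which the scraping examples in this section rely on. Here is a small sketch that works on an inline copy of part of the list, so it runs without the file; the variable names are only illustrative.

```python
from lxml import etree

snippet = """
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
    </ul>
</div>
"""
tree = etree.HTML(snippet)

# @href selects an attribute value, text() selects a text node
hrefs = tree.xpath("//li/a/@href")
# [@class="..."] filters elements by attribute value
item1_text = tree.xpath('//li[@class="item-1"]/a/text()')
classes = tree.xpath("//li/@class")

print(hrefs)
print(item1_text)
print(classes)
```

Every `xpath()` call returns a list, even when only one node matches.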

Example: Scraping Douban's Popular Movies

from lxml import etree
import requests

headers = {
    "Referer": "https://movie.douban.com/explore",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"
}
response = requests.get("https://movie.douban.com/chart", headers=headers)
html_tree = etree.HTML(response.text)
movie_info = []
# Assumes each movie sits in its own <table>. Every field is looked up relative
# to that table (".//") so that it belongs to the same movie; taking [0] of a
# page-wide query would give every movie the first movie's data.
for table in html_tree.xpath("//table"):
    names = table.xpath(".//@title")
    base_infos = table.xpath(".//p/text()")
    stars = table.xpath('.//span[@class="rating_nums"]/text()')
    posters = table.xpath(".//@src")
    if not (names and base_infos and stars and posters):
        continue  # skip tables that are not movie entries
    movie_info.append({
        "name": names[0],
        "base_info": base_infos[0].strip(),
        "stars": stars[0],
        "poster": posters[0],
    })
print(movie_info)
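
When each record lives in its own container element, running the field queries relative to that container (XPath starting with `.//`) keeps the fields of one record together. The sketch below shows the pattern offline on invented markup; it is not Douban's real page structure, and the movie names and file names are placeholders.

```python
from lxml import etree

# Hypothetical markup: one <table> per movie
sample = """
<table><tr><td><a title="Movie A"><img src="a.jpg"/></a>
  <p>2019 / Drama</p><span class="rating_nums">8.1</span></td></tr></table>
<table><tr><td><a title="Movie B"><img src="b.jpg"/></a>
  <p>2020 / Comedy</p><span class="rating_nums">7.4</span></td></tr></table>
"""
tree = etree.HTML(sample)

movies = []
for table in tree.xpath("//table"):
    # ".//" searches only inside the current table
    movies.append({
        "name": table.xpath(".//@title")[0],
        "base_info": table.xpath(".//p/text()")[0].strip(),
        "stars": table.xpath('.//span[@class="rating_nums"]/text()')[0],
        "poster": table.xpath(".//@src")[0],
    })

for movie in movies:
    print(movie)
```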

Example: Scraping Movie Resources from dy2018 (电影天堂)

from lxml import etree
import requests

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"
}
BASE_URL = "https://www.dy2018.com"


def p_text(tree, index):
    """Return the stripped first text node of //p[index], or "" if there is none."""
    texts = tree.xpath("//p[{}]/text()".format(index))
    return texts[0].strip() if texts else ""


# The first listing page has no number; later pages are index_2.html, index_3.html, ...
page_urls = [BASE_URL + "/html/bikan/index.html"]
for page in range(2, 3):  # only page 2 here; widen the range to crawl more pages
    page_urls.append(BASE_URL + "/html/bikan/index_{}.html".format(page))

for page_url in page_urls:
    listing = etree.HTML(requests.get(page_url, headers=HEADERS).text)
    movie_links = listing.xpath("//table//b/a[2]/@href")
    for link in movie_links:
        response = requests.get(BASE_URL + link, headers=HEADERS)
        # The detail pages are GBK-encoded
        details = etree.HTML(response.content.decode("gbk"))
        movie_infos = {}
        movie_actors = []
        movie_infos["海报"] = details.xpath("//p//@src")[0]  # poster
        movie_infos["译名"] = p_text(details, 2).replace("◎译　　名", "").strip()  # translated title
        movie_infos["片名"] = p_text(details, 3).replace("◎片　　名", "").strip()  # original title
        movie_infos["类别"] = p_text(details, 6).replace("◎类　　别", "").strip()  # genre
        movie_infos["豆瓣评分"] = p_text(details, 10).replace("◎豆瓣评分", "").strip()  # Douban rating
        movie_infos["片长"] = p_text(details, 15).replace("◎片　　长", "").strip()  # running time
        movie_infos["导演"] = p_text(details, 16).replace("◎导　　演", "").strip()  # director
        # The cast starts at the 17th <p> and runs until the "◎简　　介" marker;
        # the synopsis is the paragraph right after that marker.
        for n in range(17, 100):
            line = p_text(details, n)
            if line.startswith("◎简　　介"):
                movie_infos["简介"] = p_text(details, n + 1)  # synopsis
                break
            if n == 17:
                line = line.replace("◎主　　演", "").strip()
            if line:
                movie_actors.append(line)
        movie_infos["主演"] = movie_actors  # cast
        print(movie_infos)
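
The detail pages label each field with a "◎" marker followed by a four-character label padded with full-width spaces. As an alternative to hard-coding paragraph positions, the label itself can be used as the key. The sketch below assumes that layout and uses an invented, GBK-encoded fragment rather than a real page.

```python
from lxml import etree

# Hypothetical fragment; \u3000 is the full-width space used as padding
raw = (
    "<div><p>◎译\u3000\u3000名\u3000Example Title</p>"
    "<p>◎片\u3000\u3000长\u3000120 min</p></div>"
).encode("gbk")

# Decode the bytes explicitly, as done for the real pages
tree = etree.HTML(raw.decode("gbk"))

fields = {}
for line in tree.xpath("//p/text()"):
    if line.startswith("◎"):
        # Characters 1..4 are the padded label, e.g. "译　　名"
        label = line[1:5].replace("\u3000", "")
        # str.strip() also removes full-width spaces
        fields[label] = line[5:].strip()

print(fields)
```

This only works while every label is exactly four characters wide, which is an assumption about the page layout.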

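The `BeautifulSoup4` Library

Beautiful Soup's documentation is also available in Chinese.
BeautifulSoup4 is also an HTML/XML parser. It is built around the HTML DOM (Document Object Model): it loads the whole document and builds a complete tree of Python objects on top of an underlying parser. Its time and memory costs are therefore much higher, and it is slower than using lxml directly.

Parsers and How to Use Them

The first argument to BeautifulSoup is the document to parse, given as a string or an open file handle. The second argument says which parser to use. If the second argument is omitted, Beautiful Soup picks a parser from the libraries installed on the system, in this order of preference: lxml, html5lib, then Python's built-in html.parser. At present only the lxml parser supports XML documents.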
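
A short sketch of the two arguments follows. `io.StringIO` stands in for a file handle returned by `open()`, and the markup is a made-up fragment. Naming the parser explicitly keeps results the same across machines with different libraries installed.

```python
import io
from bs4 import BeautifulSoup

markup = "<p>Hello <b>world</b></p>"

# First argument as a string
soup_from_str = BeautifulSoup(markup, "html.parser")
# First argument as a file-like object (anything with a read() method)
soup_from_file = BeautifulSoup(io.StringIO(markup), "html.parser")
# Same markup, different parser: lxml adds the <html> and <body> wrappers
soup_lxml = BeautifulSoup(markup, "lxml")

print(soup_from_str)
print(soup_lxml)
```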

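Differences Between Parsers

| Parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | `BeautifulSoup(markup, "html.parser")` | Built into Python; moderate speed; lenient with malformed documents | Less lenient in versions before Python 2.7.3 or 3.2.2 |
| lxml HTML parser | `BeautifulSoup(markup, "lxml")` | Fast; lenient with malformed documents | Requires an external C library |
| lxml XML parser | `BeautifulSoup(markup, ["lxml", "xml"])` or `BeautifulSoup(markup, "xml")` | Fast; the only parser that supports XML | Requires an external C library |
| html5lib | `BeautifulSoup(markup, "html5lib")` | Most lenient; parses documents the way a browser does; produces valid HTML5 | Slow; requires an external Python package |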

from bs4 import BeautifulSoup

# Take the markup "<a><b /></a>" as an example
BeautifulSoup("<a><b /></a>")
# <html><head></head><body><a><b></b></a></body></html>
# The empty-element tag <b /> is not valid HTML, so an HTML parser turns it into a <b></b> pair
BeautifulSoup("<a><b /></a>", "xml")
# <?xml version="1.0" encoding="utf-8"?>
# <a><b/></a>
# The XML parser understands <b />, so the document comes back unchanged

# For well-formed HTML every HTML parser produces the same tree.
# For malformed markup such as "<a></p>" the results differ:
BeautifulSoup("<a></p>", "lxml")
# <html><body><a></a></body></html>
BeautifulSoup("<a></p>", "html5lib")
# <html><head></head><body><a><p></p></a></body></html>
BeautifulSoup("<a></p>", "html.parser")
# <a></a>

Usage

from bs4 import BeautifulSoup
text = """
<html><head><title>The Dormouse's story</title></head><body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
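# We could also pass a file handle obtained from open() instead of a string.
# Either way, Beautiful Soup converts the document to Unicode before parsing it.
soup = BeautifulSoup(text,"lxml")  # build the soup with the lxml parser
print(soup.prettify())  # prettify() returns the document as an indented string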

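Object Types

Tag

A Tag object corresponds to a tag in the original XML or HTML document; we can simply think of each Tag as one HTML element. It has two very important properties, Name and Attributes:
Name: every tag has a name, available as tag.name, for example: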

soup = BeautifulSoup(text,"lxml")
print(soup.p)
print(soup.p.name)
'''
result:
<p class="title"><b>The Dormouse's story</b></p>
p
'''

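We can also assign to tag.name (for example tag.name = "text"), which changes the document itself, but we do not recommend it.
Attributes: the value of one attribute is available as tag["class"], and tag.attrs returns all attributes as a dict, for example: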

print(soup.p["class"])
#['title']
print(soup.p.attrs)
#{'class': ['title']}

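We can also add, modify, and even delete a tag's attributes as if the tag were a dict, but again we do not recommend it.
Some attributes can hold several values (class is the most common). For these, Beautiful Soup returns a list, for example: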

css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")
css_soup.p['class']
# ["body", "strikeout"]

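If an attribute looks like it has several values but is not defined as multi-valued in any version of the HTML standard, Beautiful Soup returns it as a single string, for example: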

id_soup = BeautifulSoup('<p id="my id"></p>', "html.parser")
id_soup.p['id']
# 'my id'
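
The dictionary-style access to attributes can be sketched as follows. The markup and the `data-note` attribute are invented for illustration. Modifying a tag changes the parsed document, which is why the text advises caution.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout" id="my id">text</p>', "html.parser")
tag = soup.p

# Reading: class is multi-valued (list), id is a plain string
classes_before = tag["class"]
id_before = tag["id"]

# Writing works like a dict: add, modify, delete
tag["data-note"] = "added"
tag["id"] = "renamed"
del tag["class"]

# get() returns None for a missing attribute instead of raising KeyError
missing = tag.get("class")

print(tag.attrs)
```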

NavigableString

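If, after getting hold of a tag, we also want the text inside it, we can use tag.string and related attributes. A NavigableString behaves like a Python Unicode string and additionally supports some of the features described under navigating and searching the tree.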

soup = BeautifulSoup(text,"lxml")
print(type(soup.p.string))
# <class 'bs4.element.NavigableString'>

BeautifulSoup

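A BeautifulSoup object represents the document as a whole. Most of the time it can be treated as a Tag object: it supports most of the methods described under navigating and searching the tree. Because the BeautifulSoup object does not correspond to an actual HTML or XML tag, it has no real name or attributes. It is sometimes convenient to look at its .name anyway, so it has been given the special value "[document]".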
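
A minimal sketch of these properties, using a one-line made-up document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>hi</p>", "html.parser")

# The document object carries the special name "[document]"
doc_name = soup.name
# It is the top of the tree: the parent of the top-level tag, with no parent itself
top_parent = soup.p.parent
# Tag-style search methods work on it directly
found = soup.find("p")

print(doc_name, found)
```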

Comment

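# Comment is a special kind of NavigableString that holds the text of an HTML comment
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, "html.parser")
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>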
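
Because Comment is a subclass of NavigableString, code that wants only visible text has to filter comments out explicitly. One possible way to do that with `isinstance`, on an invented fragment:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

soup = BeautifulSoup("<p>visible<!-- hidden note --></p>", "html.parser")

contents = soup.p.contents  # one plain string and one comment

comments = [c for c in contents if isinstance(c, Comment)]
# Every Comment is also a NavigableString, so exclude it explicitly
text_only = [
    c for c in contents
    if isinstance(c, NavigableString) and not isinstance(c, Comment)
]

print(comments)
print(text_only)
```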

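Child Nodes

A Tag may contain strings and other Tags; all of these are its children. Beautiful Soup provides many attributes for navigating and iterating over them.

Navigating by Tag Name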

soup = BeautifulSoup(text,"lxml")
print(soup.p.b)
#<b>The Dormouse's story</b>

`.contents` and `.children`

The .contents attribute returns a tag's children as a list. Strings have no .contents because they cannot contain anything:

e_text = """
<html><head><title>The Dormouse's story</title></head><body>
<p class="title"><b>The Dormouse's story</b></p>
"""
soup = BeautifulSoup(e_text,"lxml")
print(soup.contents)
'''
result:
[<html><head><title>The Dormouse's story</title></head><body>
<p class="title"><b>The Dormouse's story</b></p>
</body></html>]
'''

The .children generator lets us loop over a tag's direct children:

e_text = """
<html><head><title>The Dormouse's story</title></head><body>
<p class="title"><b>The Dormouse's story</b></p>
"""
soup = BeautifulSoup(e_text,"lxml")
for s in soup.children:
    print(s)
'''
result:
<html><head><title>The Dormouse's story</title></head><body>
<p class="title"><b>The Dormouse's story</b></p>
</body></html>
'''

`.descendants`

The .descendants attribute iterates recursively over all of a tag's descendants: its children, their children, and so on.
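
The difference between direct children and all descendants can be sketched on a small made-up fragment:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p><b>bold</b> text</p></div>", "html.parser")

# .children stops at the first level below <div>
children = [child.name for child in soup.div.children]

# .descendants walks the whole subtree in document order;
# strings have no tag name, so show their text instead
descendants = [
    node.name if node.name else str(node)
    for node in soup.div.descendants
]

print(children)
print(descendants)
```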

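`.string`

.string returns the text inside a tag, but only when the tag holds exactly one string (directly, or through a single child tag); otherwise it is None. For example: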

e_text = """
<html><head><title>The Dormouse's story</title></head><body>
<p class="title"><b>The Dormouse's story</b></p></html>
"""
soup = BeautifulSoup(e_text,"lxml")
print(soup.head.string)
'''
result:
The Dormouse's story
'''
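
The single-string restriction is easy to trip over, so here is a small sketch of the case where `.string` gives up. The markup is invented, and `get_text()` is shown as one way to collect text from several children.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b><i>two</i></p>", "html.parser")

# <b> holds exactly one string, so .string returns it
single = soup.b.string
# <p> has two children, so .string cannot pick one and returns None
ambiguous = soup.p.string
# get_text() concatenates every string below the tag
combined = soup.p.get_text()

print(single, ambiguous, combined)
```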

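`.strings` and `.stripped_strings`

.strings yields every string in the document. .stripped_strings does the same but strips leading and trailing whitespace from each string and skips strings that are whitespace only. Both return generators, so wrap them in list() to see the contents: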

e_text = """
<html><head><title>The Dormouse's story</title></head><body>
<p class="title"><b>The Dormouse's story</b></p></html>
"""
soup = BeautifulSoup(e_text,"lxml")
print(list(soup.strings))
'''
result:
["The Dormouse's story", '\n', "The Dormouse's story", '\n']
'''

e_text = """
<html><head><title>The Dormouse's story</title></head><body>
<p class="title"><b>The Dormouse's story</b></p></html>
"""
soup = BeautifulSoup(e_text,"lxml")
print(list(soup.stripped_strings))
'''
["The Dormouse's story", "The Dormouse's story"]
'''

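Parent Nodes

We can reach an element's ancestors through .parent and .parents: .parent returns only the direct parent, while .parents iterates over every ancestor up to the document root: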

e_text = """
<html><head><title>The Dormouse's story</title></head><body>
<p class="title"><b>The Dormouse's story</b></p></html>
"""
soup = BeautifulSoup(e_text,"lxml")
child_node = soup.b
print(child_node,child_node.parent)
'''
result:<b>The Dormouse's story</b> <p class="title"><b>The Dormouse's story</b></p>
'''

e_text = """
<html><head><title>The Dormouse's story</title></head><body>
<p class="title"><b>The Dormouse's story</b></p></html>
"""
soup = BeautifulSoup(e_text,"lxml")
child_node = soup.b
print(child_node)
for c_n in child_node.parents:
    print(c_n)
'''
result:
<b>The Dormouse's story</b>
<p class="title"><b>The Dormouse's story</b></p>
<body>
<p class="title"><b>The Dormouse's story</b></p></body>
<html><head><title>The Dormouse's story</title></head><body>
<p class="title"><b>The Dormouse's story</b></p></body></html>
<html><head><title>The Dormouse's story</title></head><body>
<p class="title"><b>The Dormouse's story</b></p></body></html>
'''
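
Printing whole ancestors gets long quickly. Collecting only their names makes the chain easier to read, as in this sketch on a made-up document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p><b>x</b></p></body></html>", "html.parser")

# .parent is the direct parent only
direct_parent = soup.b.parent.name

# .parents walks upward; the last entry is the BeautifulSoup object itself
ancestor_names = [ancestor.name for ancestor in soup.b.parents]

print(direct_parent)
print(ancestor_names)
```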

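Searching the Tree

The `find` and `find_all` Methods

find returns as soon as it finds the first tag that matches, while find_all looks for every matching tag and returns them all. Because find_all() returns every result, the search can be slow on a large tree. If we do not need all the results, the limit argument caps how many are returned. We use find_all as the example: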

e_text = """
<html><head><title>The Dormouse's story</title></head><body>
<p class="title"><b>The Dormouse's story</b></p></html>
"""
soup = BeautifulSoup(e_text,"lxml")
res = soup.find_all("p",class_ = "title") # class is a Python keyword, so the argument is spelled "class_"
#res = soup.find_all("p",attrs = {"class":"title"})  # equivalent form using attrs
print(res)
#[<p class="title"><b>The Dormouse's story</b></p>]
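
The other behaviours mentioned above (`find` returning a single tag, `limit`, and what happens when nothing matches) can be sketched like this. The markup is a shortened, hypothetical version of the three-sisters document.

```python
from bs4 import BeautifulSoup

doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(doc, "html.parser")

first = soup.find("a")                     # a single Tag, the first match
limited = soup.find_all("a", limit=2)      # stop after two matches
by_id = soup.find_all("a", id="link3")     # filter on any attribute by keyword
nothing = soup.find("table")               # find returns None when nothing matches
empty = soup.find_all("table")             # find_all returns an empty list

print(first["id"], [a["id"] for a in limited], by_id[0].string)
```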

CSS Selectors

Sometimes CSS selectors are more convenient. To use CSS selector syntax, call the select method.

1. By tag name:

print(soup.select('a'))

2. By class name:
To search by class name, put a . in front of the class. For example, to find tags with class=sister:

print(soup.select('.sister'))

3. By id:
To search by id, put a # in front of the id.

print(soup.select("#link1"))

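4. Combined selectors:
Selectors combine the same way they do in a CSS file: tag names, class names, and ids can be put together. A space between two parts means "descendant of".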

print(soup.select("p #link1"))

To match direct children only, separate the parts with >:

print(soup.select("head > title"))

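5. By attribute:
Attributes can also be part of a selector; they go inside square brackets. The attribute belongs to the same node as the tag, so there must be no space between the tag name and the bracket, otherwise nothing will match.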

print(soup.select('a[href="http://example.com/elsie"]'))

6. The select method always returns a list, so we can iterate over the result and call get_text() on each element to get its text.

soup = BeautifulSoup(text, 'lxml')
print(type(soup.select('title')))
print(soup.select('title')[0].get_text())

for title in soup.select('title'):
    print(title.get_text())
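
The selector forms listed above can be tried together on a small document. This is a sketch on a shortened, hypothetical version of the three-sisters markup. `select_one` returns the first match instead of a list.

```python
from bs4 import BeautifulSoup

doc = """
<html><head><title>The Dormouse's story</title></head><body>
<p class="story"><a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a></p>
</body></html>
"""
soup = BeautifulSoup(doc, "html.parser")

# tag + class, then direct child: <a> elements directly inside <p class="story">
names = [a.get_text() for a in soup.select("p.story > a")]
# id selector; select_one returns a single Tag (or None)
lacie_href = soup.select_one("#link2")["href"]
# attribute selector: href ending in "elsie"
ends_with = [a["id"] for a in soup.select('a[href$="elsie"]')]
# select always returns a list, even for a single match
title_text = soup.select("title")[0].get_text()

print(names, lacie_href, ends_with, title_text)
```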