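XPath Syntax and the lxml Library

1. XPath

1) What is XPath?

XPath (XML Path Language) is a language for locating information in XML and HTML documents. It can be used to traverse the elements and attributes of such documents.

2) XPath development tools

XPath Helper, a Chrome extension.
Try XPath, a Firefox extension.

1.1 XPath syntax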

<?xml version="1.0" encoding="ISO-8859-1"?>

<bookstore>

<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>

<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>

</bookstore>

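Sample XML document

1.1.1 Selecting nodes

XPath uses path expressions to select nodes in an XML document. Nodes are selected by following a path, one step at a time.

The most useful path expressions are listed below:

Expression	Description
nodename	Selects the child elements of the current node that are named nodename.
/	At the start of a path, selects from the root node. Elsewhere, selects direct children of the preceding node.
//	Selects matching nodes at any depth below the preceding node (or anywhere in the document when used at the start of a path).
.	Selects the current node.
..	Selects the parent of the current node.
@	Selects attributes.

Examples: the table below lists some path expressions and their results:

Path expression	Result
bookstore	Selects the bookstore elements that are children of the current node.
/bookstore	Selects the root element bookstore. Note: a path that starts with a forward slash ( / ) is always an absolute path to an element.
bookstore/book	Selects all book elements that are children of bookstore.
//book	Selects all book elements, wherever they are in the document.
bookstore//book	Selects all book elements that are descendants of bookstore, however deep below bookstore they are.
//book[@lang]	Selects all book elements that have a lang attribute (none in the sample document, where lang is set on title).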
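The difference between relative and absolute paths is easy to miss when XPath is evaluated from an element rather than from the document. The following is a minimal sketch using lxml (introduced in section 2) on a compact copy of the sample document; the variable names are only illustrative.

```python
from lxml import etree

XML = b"""<bookstore>
  <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
  <book><title lang="eng">Learning XML</title><price>39.95</price></book>
</bookstore>"""

root = etree.fromstring(XML)   # root is the <bookstore> element

# Relative path: evaluated against the context node (<bookstore> itself),
# so it looks for <bookstore> children of <bookstore> and finds none.
relative = root.xpath('bookstore')

# Absolute path: starts from the document root, whatever the context node is.
first_book = root[0]
absolute = first_book.xpath('/bookstore')

# '//' searches the whole document; './/' searches only below the context node.
all_titles = first_book.xpath('//title')
own_titles = first_book.xpath('.//title')

print(len(relative), len(absolute), len(all_titles), len(own_titles))
```

When working from a sub-element, prefer './/' over '//' unless the whole document is really what you want to search.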

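1.1.2 Predicates

Predicates are used to find a specific node or a node that contains a specific value.

Predicates are written inside square brackets.

Examples: the table below lists some path expressions with predicates and their results:

Path expression	Result
/bookstore/book[1]	Selects the first book element that is a child of bookstore.
/bookstore/book[last()]	Selects the last book element that is a child of bookstore.
/bookstore/book[last()-1]	Selects the second-to-last book element that is a child of bookstore.
/bookstore/book[position()<3]	Selects the first two book elements that are children of bookstore.
//title[@lang]	Selects all title elements that have an attribute named lang.
//title[@lang='eng']	Selects all title elements whose lang attribute has the value eng.
/bookstore/book[price>35.00]	Selects all book elements of bookstore whose price child has a value greater than 35.00.
/bookstore/book[price>35.00]/title	Selects the title elements of those book elements of bookstore whose price child has a value greater than 35.00.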
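Positional predicates are 1-based and apply per parent, which can be surprising when combined with '//'. The sketch below, which assumes lxml and a compact copy of the sample document, contrasts //title[1] with (//title)[1] and also runs the price predicate from the table.

```python
from lxml import etree

XML = b"""<bookstore>
  <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
  <book><title lang="eng">Learning XML</title><price>39.95</price></book>
</bookstore>"""

root = etree.fromstring(XML)

# [1] is applied within each parent: every <title> that is the first
# <title> child of its own <book> matches.
per_parent = root.xpath('//title[1]')

# Parentheses apply [1] to the combined result: only the first <title>
# in the whole document matches.
overall = root.xpath('(//title)[1]')

# A value predicate: titles of books priced above 35.00.
expensive = root.xpath('/bookstore/book[price>35.00]/title/text()')

print(len(per_parent), len(overall), expensive)
```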

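1.1.3 Wildcards

XPath wildcards can be used to select XML elements whose names are not known in advance.

Wildcard	Description
*	Matches any element node.
@*	Matches any attribute node.
node()	Matches a node of any type.

Examples: the table below lists some path expressions and their results:

Path expression	Result
/bookstore/*	Selects all child elements of bookstore.
//*	Selects all elements in the document.
//title[@*]	Selects all title elements that have at least one attribute.

1.1.4 Selecting several paths

Use the "|" operator in a path expression to select several paths at once.

Examples: the table below lists some path expressions and their results:

Path expression	Result
//book/title | //book/price	Selects all title and price elements of book elements.
//title | //price	Selects all title and price elements in the document.
/bookstore/book/title | //price	Selects all title elements of book elements under bookstore, plus all price elements in the document.

For the full XPath syntax reference, see: http://www.w3school.com.cn/xpath/index.asp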

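2. The lxml library

lxml is a Python parsing library. It parses both HTML and XML, supports XPath queries, and is fast because it is built on the C libraries libxml2 and libxslt.

2.1 Common attributes and methods of lxml classes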

object ---+
          |
         _Element

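# =====================================
# Properties
# =====================================

attrib  # dictionary of the element's attributes
base  # URL of the original document, or None
sourceline  # line number in the original document, or None
tag  # tag name
tail  # text that follows this element's end tag, up to the next sibling or the parent's end tag
text  # text before the first child element
prefix  # namespace prefix (XML) (see the appendix at the end for details)
nsmap  # mapping of namespace prefixes to URIs (XML) (see the appendix at the end for details)


# =====================================
# Instance Methods (commonly used)
# =====================================

xpath(self, _path, namespaces=None, extensions=None, smart_strings=True, **_variables)
# Evaluates an XPath expression against this element. Node-selecting expressions return a list
# (empty if nothing matches); expressions such as string() or count() return a str, float, or bool

getparent(self)
# Returns the parent element, or None

getprevious(self)
# Returns the preceding sibling element, or None

getnext(self)
# Returns the following sibling element, or None

getchildren(self)
# Returns all direct child elements (deprecated; list(element) does the same)

getroottree(self)
# Returns the ElementTree of the document that contains this element

find(self, path, namespaces=None)
# Returns the first element matching the tag name or path, or None

findall(self, path, namespaces=None)
# Returns all elements matching the tag name or path

findtext(self, path, default=None, namespaces=None)
# Returns the text of the first element matching the tag name or path

clear(self)
# Resets the element: removes all children and attributes, and clears text and tail

get(self, key, default=None)
# Returns the value of the attribute key, or default if it is not set

items(self)
# Returns the attribute (name, value) pairs, in arbitrary order

keys(self)
# Returns a list of all attribute names, in arbitrary order

values(self)
# Returns a list of all attribute values, in arbitrary order

set(self, key, value)
# Sets an attribute on the element

Class _Element (top-level base class)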
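To make the difference between text and tail concrete, here is a small sketch that exercises several of the properties and methods listed above on a hand-written fragment (the markup is made up for illustration).

```python
from lxml import etree

root = etree.fromstring('<p>Hello <b class="x">bold</b> world<i>!</i></p>')
b = root.find('b')

# text is the text before the first child; tail is the text after the end tag.
print(root.tag, repr(root.text))   # p 'Hello '
print(b.text, repr(b.tail))        # bold ' world'

# Attribute access: attrib behaves like a dict, get() accepts a default.
css_class = b.get('class')
missing = b.get('id', 'none')
b.set('id', 'first')
attr_names = sorted(b.keys())

# Navigation between relatives.
parent_tag = b.getparent().tag
next_tag = b.getnext().tag
previous = b.getprevious()         # None: <b> has no preceding sibling element

# Iterating an element yields its direct children.
child_tags = [child.tag for child in root]
bang = root.findtext('i')

print(css_class, missing, attr_names, parent_tag, next_tag, previous, child_tags, bang)
```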

object ---+
          |
   _Element ---+
               |
              ElementBase

# =====================================
Functions (commonly used)
# =====================================

HTML(text, parser=None, base_url=None)
# Parses an HTML document given as a string and returns its root element

fromstring(text, parser=None, base_url=None)
# Parses an XML document or fragment given as a string and returns its root element

tostring(element_or_tree, encoding=None, method="xml", xml_declaration=None, pretty_print=False, with_tail=True, standalone=None, doctype=None, exclusive=False, with_comments=True, inclusive_ns_prefixes=None)
# Serializes an element or tree to an encoded byte string (or to a str with encoding="unicode")

tounicode(element_or_tree, method="xml", pretty_print=False, with_tail=True, doctype=None)
# Serializes an element or tree to a Unicode string (deprecated in favor of tostring(..., encoding="unicode"))

lxml.etree
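The functions above can be tried out in a few lines. This is a minimal sketch on made-up input; it mainly shows that tostring() returns bytes unless encoding="unicode" is passed, and that HTML() wraps fragments in a full document.

```python
from lxml import etree

root = etree.fromstring('<root><item>1</item></root>')

as_bytes = etree.tostring(root)                      # bytes by default
as_text = etree.tostring(root, encoding='unicode')   # str
pretty = etree.tostring(root, encoding='unicode', pretty_print=True)

# HTML() uses the lenient HTML parser and returns the <html> root element,
# even when the input is only a fragment with an unclosed tag.
html_root = etree.HTML('<p>unclosed')
paragraphs = html_root.xpath('//p/text()')

print(type(as_bytes).__name__, type(as_text).__name__)
print(pretty)
print(html_root.tag, paragraphs)
```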

 object ---+ 
            | 
etree._Element ---+
                  | 
    etree.ElementBase---+ 
                        | 
         object ---+    | 
                   |    |
           HtmlMixin ---+  
                        |
                       HtmlElement


# =====================================
Functions (commonly used)
# =====================================

fromstring(html, base_url=None, parser=None, **kwargs)
# Parses an HTML string and returns an element or a document, depending on whether the string is a fragment or a full document

tostring(doc, pretty_print=False, include_meta_content_type=False, encoding=None, method="html", with_tail=True, doctype=None)
# Serializes an element or document tree to a string

######################################
**Class HtmlMixin**

object ---+
          |
          HtmlMixin

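# =====================================
Properties
# =====================================
base_url  # URL of the document
head  # the <head> element
body  # the <body> element
forms  # list of all forms in the document
label  # the <label> element associated with this element
classes  # set-like view of the values in the class attribute

# =====================================
Instance Methods (commonly used)
# =====================================

drop_tag(self)
# Removes the tag but keeps its children and text, merging them into the parent

drop_tree(self)
# Removes the element with all its children and text, but keeps its tail text, which is merged into the parent or the previous sibling

find_class(self, class_name)
# Finds elements by class attribute value

get_element_by_id(self, rel)
# Finds an element by id attribute value

set(self, key, value=None)
# Sets an attribute on the element

text_content(self)
# Returns the full text content of the element and all its descendants

lxml.html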
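The HtmlMixin helpers are easiest to understand on a tiny document. The sketch below uses a made-up HTML fragment (the class names and id are hypothetical) to show find_class, classes, get_element_by_id, text_content, and drop_tag.

```python
import lxml.html

html = ('<div><p class="intro lead">Hello <b>bold</b> world</p>'
        '<p id="second" class="body">Second</p></div>')
doc = lxml.html.fromstring(html)   # a single-element fragment returns that element

intro = doc.find_class('intro')[0]
intro_text = intro.text_content()          # text of the element and its descendants
intro_classes = sorted(intro.classes)      # values of the class attribute
second = doc.get_element_by_id('second')

# drop_tag() removes <b> itself but keeps its text, merged into the parent.
intro.find('b').drop_tag()
after_drop = lxml.html.tostring(intro, encoding='unicode')

print(doc.tag, intro_text, intro_classes, second.text)
print(after_drop)
```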

2.2 Parsing HTML from a string

To parse an HTML string, use lxml.etree.HTML.

# Use the etree module from lxml
from lxml import etree

# Note: the last <li> below is deliberately missing its closing </li> tag
text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''

# Parse the string into an HTML document with etree.HTML
htmlElementTree = etree.HTML(text)

# Serialize the HTML document back to a string
result = etree.tostring(htmlElementTree, encoding='utf-8').decode('utf-8')

print(result)

The output is as follows:

<html><body>
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
 </div>
</body></html>

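As you can see, lxml repairs the HTML automatically. In this example it not only closes the li tag but also adds the body and html tags.

2.3 Parsing HTML from a file

Besides parsing strings directly, lxml can also read content from a file. Create a new file named hello.html: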

<!-- hello.html -->
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>

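To parse an HTML file, use lxml.etree.parse(). This function uses the XML parser by default, so it fails on HTML that is not well-formed XML. In that case, create an HTMLParser yourself. Example:

from lxml import etree
# Read the external file hello.html
parser = etree.HTMLParser()  # use HTMLParser, which repairs missing pieces of the HTML while parsing
htmlElementTree = etree.parse('hello.html', parser=parser)
result = etree.tostring(htmlElementTree, encoding='utf-8', pretty_print=True).decode('utf-8')
print(result)

The output is essentially the same as before: lxml again adds the html and body tags.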
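The difference between the two parsers can be checked without creating a file. The following sketch feeds the same malformed markup (made up for this example) to etree.parse() through an in-memory BytesIO object, first with the default XML parser and then with HTMLParser.

```python
from io import BytesIO
from lxml import etree

# Two <li> elements are never closed, so this is not well-formed XML.
broken = b'<div><ul><li>first item<li>second item</ul></div>'

try:
    etree.parse(BytesIO(broken))          # default XML parser
    xml_ok = True
except etree.XMLSyntaxError:
    xml_ok = False

# The HTML parser recovers and closes the <li> elements itself.
tree = etree.parse(BytesIO(broken), etree.HTMLParser())
items = tree.xpath('//li/text()')

print(xml_ok, tree.getroot().tag, items)
```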

2.4 Combining XPath with lxml

# -*- coding: utf-8 -*-
from lxml import etree
import requests

# Scrape the now-showing movies from Douban Movies
headers = {
    "User-Agent":'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
}

response = requests.request(method='get',url='https://movie.douban.com',headers=headers)
text = response.text
parser = etree.HTMLParser()
html = etree.fromstring(text,parser=parser)
ul = html.xpath('//ul[@class="ui-slide-content"]')[0]
li_list = ul.xpath('./li')
movie_list = []
for li in li_list:
    # Skip <li> items that carry no movie data
    if li.xpath('./@data-title')!= []:
        # Each xpath() call returns a list, even when only one attribute matches
        data_title = li.xpath('./@data-title')
        data_release = li.xpath('./@data-release')
        data_rate = li.xpath('./@data-rate')
        data_duration = li.xpath('./@data-duration')
        data_director = li.xpath('./@data-director')
        data_actors = li.xpath('./@data-actors')
        data_poster = li.xpath('.//img/@src')
        data = {
            'data_title':data_title,
            'data_release':data_release,
            'data_rate':data_rate,
            'data_duration':data_duration,
            'data_director':data_director,
            'data_actors':data_actors,
            'data_poster':data_poster
        }
        movie_list.append(data)


print(movie_list)

Scraping now-showing movie information from Douban Movies
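Because every xpath() call returns a list, each value in the dictionaries above is a one-element list. A possible variation, sketched here on a made-up HTML snippet rather than the live page (the markup, titles, and URL are hypothetical), filters the li elements with a predicate and reads attributes with get(), which returns plain strings.

```python
from lxml import etree

# Hypothetical markup that only imitates the structure used by the scraper.
SAMPLE = '''
<ul class="ui-slide-content">
  <li class="ui-slide-item" data-title="Movie A" data-rate="8.1" data-director="Someone">
    <img src="https://example.com/a.jpg">
  </li>
  <li class="ui-slide-item"></li>
</ul>
'''

root = etree.HTML(SAMPLE)
movies = []
# The [@data-title] predicate replaces the "!= []" check inside the loop.
for li in root.xpath('//ul[@class="ui-slide-content"]/li[@data-title]'):
    posters = li.xpath('.//img/@src')
    movies.append({
        'title': li.get('data-title'),        # plain str instead of a list
        'rate': li.get('data-rate'),
        'director': li.get('data-director'),
        'poster': posters[0] if posters else None,
    })

print(movies)
```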

Use the XML below to practice finding elements of interest with lxml and XPath.

<?xml version="1.0" encoding="utf8"?>
<bookstore>
    <book>
        <title lang="eng">Harry Potter</title>
        <price>29.99</price>
    </book>
    <book>
        <title lang="eng">Learning XML</title>
        <price>39.95</price>
    </book>
</bookstore>

xml="""<?xml version="1.0" encoding="utf8"?>
<bookstore>
    <book>
        <title lang="eng">Harry Potter</title>
        <price>29.99</price>
    </book>
    <book>
        <title lang="eng">Learning XML</title>
        <price>39.95</price>
    </book>
</bookstore>
"""
# 1) Get the root node
root = etree.fromstring(xml.encode('utf-8'))#<Element bookstore at 0x2044cf28e08>
# 2) Select all book child elements; note that xpath() returns a list
booklist=root.xpath('book')#[<Element book at 0x1bf9d0bddc8>, <Element book at 0x1bf9d0bdd88>]
# 3) Select the root node bookstore
bookstore = root.xpath('/bookstore')#[<Element bookstore at 0x2563d6e6ec8>]
# 4) Select the title child elements of all book child elements
titlelist1 = root.xpath('/bookstore/book/title')#[<Element title at 0x1ceb6736f48>, <Element title at 0x1ceb6736f88>]
titlelist2 = root.xpath('book/title')#[<Element title at 0x22da6316fc8>, <Element title at 0x22da6333048>]
# 5) Select all title elements anywhere in the document
titlelist = root.xpath('//title')#[<Element title at 0x195107c3048>, <Element title at 0x195107c3088>]
# 6) Starting from the book child elements, select the price elements among their descendants
pricelist = root.xpath('book//price')#[<Element price at 0x200d84321c8>, <Element price at 0x200d8432208>]
# 7) Select the values of all lang attributes in the document
langValue = root.xpath('//@lang')#['eng', 'eng']
# 8) Get the first book child element of bookstore
book = root.xpath('/bookstore/book[1]')#[<Element book at 0x25f421920c8>]
# 9) Get the last book child element of bookstore
book_last = root.xpath('/bookstore/book[last()]')#[<Element book at 0x1bf133f2048>]
# 10) Select the second-to-last book child element of bookstore
print(root.xpath('/bookstore/book[last()-1]'))#[<Element book at 0x1ff5cbf2088>]
# 11) Select the first two book child elements of bookstore
print(root.xpath('/bookstore/book[position()<3]'))#[<Element book at 0x172ac252088>, <Element book at 0x172ac252048>]
# 12) Select all title elements in the document that have a lang attribute
print(root.xpath('//title[@lang]'))#[<Element title at 0x1a2431cb188>, <Element title at 0x1a2431cb1c8>]
# 13) Select all title elements in the document whose lang attribute equals eng
print(root.xpath("//title[@lang='eng']"))#[<Element title at 0x1ac988f1188>, <Element title at 0x1ac988f11c8>]
# 14) Select the book children of bookstore whose price child is greater than 35
print(root.xpath('/bookstore/book[price>35.00]'))#[<Element book at 0x2a907bf1088>]
# 15) Select the title children of the book elements whose price child is greater than 35
print(root.xpath('/bookstore/book[price>35.00]/title'))#[<Element title at 0x1f309bf11c8>]
# 16) Select all child elements of bookstore
print(root.xpath('/bookstore/*'))#[<Element book at 0x24fe7e51108>, <Element book at 0x24fe7e510c8>]
# 17) Select all elements in the document
print(root.xpath('//*'))#[<Element bookstore at 0x195e1061188>, <Element book at 0x195e1061108>, <Element title at 0x195e10611c8>, <Element price at 0x195e10612c8>, <Element book at 0x195e10610c8>, <Element title at 0x195e1061208>, <Element price at 0x195e1061308>]
# 18) Select all title elements in the document that have at least one attribute
print(root.xpath('//title[@*]'))#[<Element title at 0x1eb712c1208>, <Element title at 0x1eb712c1248>]
# 19) Select all child nodes of the current node; the '\n    ' strings are whitespace text nodes
print(root.xpath('node()'))#['\n    ', <Element book at 0x23822bb1148>, '\n    ', <Element book at 0x23822bb1108>, '\n']
# 20) Select every element and text node in the document (attributes are not on the child axis, so they are not included)
print(root.xpath('//node()'))#[<Element bookstore at 0x2013d601208>, '\n    ', <Element book at 0x2013d601188>, '\n        ', <Element title at 0x2013d601248>, 'Harry Potter', '\n        ', <Element price at 0x2013d601348>, '29.99', '\n    ', '\n    ', <Element book at 0x2013d601148>, '\n        ', <Element title at 0x2013d601288>, 'Learning XML', '\n        ', <Element price at 0x2013d601388>, '39.95', '\n    ', '\n']
# 21) Select the title and price elements of all book elements
print(root.xpath('//book/title|//book/price'))#[<Element title at 0x1c64d751248>, <Element price at 0x1c64d751348>, <Element title at 0x1c64d751288>, <Element price at 0x1c64d751388>]
# 22) Select all title and price elements
print(root.xpath('//title|//price'))#[<Element title at 0x212757e1288>, <Element price at 0x212757e1388>, <Element title at 0x212757e12c8>, <Element price at 0x212757e13c8>]

xml_1="""<?xml version="1.0" encoding="utf8"?>
<bookstore>
    <book>
        <title lang="eng">Harry Potter</title>
        <price>29.99</price>
        <content>分部内容
            <part1>
                HarryPotter and the Philosopher's Stone
                    <br>
                    1.大难不死的男孩
                    <br>
                    2.悄悄消失的玻璃
                    <br>
                    3.猫头鹰传书
                    <br>
                    4.钥匙保管员
            </part1>
            <part2>HarryPotter and the Chamber of Secrets</part2>
            <part3>HarryPotter and the Prisoner of Azkaban</part3>
        </content>
    </book>
    <book>
        <title lang="eng">Learning XML</title>
        <price>39.95</price>
    </book>
</bookstore>
"""

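# 23) Get the text content of every price element
# xml_1 is not well-formed XML (its <br> tags are never closed), so it is parsed with the lenient HTMLParser
root = etree.fromstring(xml_1.encode('utf-8'),parser=etree.HTMLParser())
# Way 1
print(root.xpath('//price/text()'))#['29.99', '39.95']
print(type(root.xpath('//price/text()')[0]))#<class 'lxml.etree._ElementUnicodeResult'>
# Way 2
price_list = root.xpath('//price')
for price in price_list:
    # string(.) cannot be used as a path step: root.xpath('//price/string(.)') raises an error,
    # so call it on each matched element instead
    print(price.xpath("string(.)"))
    #29.99
    #39.95
print(root.xpath('//content/part1/text()'))
# Returns five text nodes, each padded with the newlines and indentation of the source.
# Stripped of that whitespace, they are:
# "HarryPotter and the Philosopher's Stone", '1.大难不死的男孩', '2.悄悄消失的玻璃', '3.猫头鹰传书', '4.钥匙保管员'
# 24) Notes
# 1. To use XPath syntax, call the Element.xpath method to select the elements of interest. For node-selecting expressions, xpath() always returns a list.
# 2. To get an attribute of a tag: href = html.xpath('//a/@href')
# 3. To get the text of a tag, use the XPath text() function: root.xpath('//price/text()')

XPath practice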
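Note 1 applies to expressions that select nodes. Expressions built on XPath functions return other Python types. The short sketch below, on a compact copy of the bookstore document, shows the result type for each kind of expression.

```python
from lxml import etree

root = etree.fromstring(
    '<bookstore>'
    '<book><title lang="eng">Harry Potter</title><price>29.99</price></book>'
    '<book><title lang="eng">Learning XML</title><price>39.95</price></book>'
    '</bookstore>')

nodes = root.xpath('//price')                    # list of elements
texts = root.xpath('//price/text()')             # list of "smart" strings
first = root.xpath('string(//title)')            # str: string value of the first match
count = root.xpath('count(//book)')              # float
has_expensive = root.xpath('boolean(//book[price > 35])')   # bool

# Smart strings remember the element they came from.
owner = texts[0].getparent().tag

for value in (nodes, texts, first, count, has_expensive):
    print(type(value).__name__, value)
print(owner)
```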

# -*- coding: utf-8 -*-
from lxml import etree
import requests

BASE_DOMAIN = 'https://www.dytt8.net'
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
}

def get_detail_urls(url):
    response = requests.request(method='get',url=url,headers=headers)
    html=response.text
    parser = etree.HTMLParser()
    root = etree.fromstring(html,parser=parser)
    movies_url_list = root.xpath('//table[@class="tbspan"]//a/@href')
    #movies_urls = list(map(lambda url:BASE_DOMAIN + url,movies_url_list))
    movies_urls = list(map(lambda url:''.join((BASE_DOMAIN,url)),movies_url_list))
    return movies_urls

def parse_detail_page(url):
    movie = {}
    response = requests.request(method='get',url=url,headers=headers)
    html = response.content.decode('gbk')
    parser = etree.HTMLParser()
    root = etree.fromstring(html, parser=parser)
    title = root.xpath('//h1/font[@color="#07519a"]/text()')[0]
    movie['title'] = title
    zoom = root.xpath('//div[@id="Zoom"]')[0]
    infors = zoom.xpath('.//p/text()')
    for index,infor in enumerate(infors):
        if infor.startswith('◎年　　代'):
            movie['年代'] = infor.replace('◎年　　代','').strip()
        elif infor.startswith('◎产　　地'):
            movie['产地'] = infor.replace('◎产　　地','').strip()
        elif infor.startswith('◎类　　别'):
            movie['类别'] = infor.replace('◎类　　别', '').strip()
        elif infor.startswith('◎语　　言'):
            movie['语言'] = infor.replace('◎语　　言', '').strip()
        elif infor.startswith('◎字　　幕'):
            movie['字幕'] = infor.replace('◎字　　幕', '').strip()
        elif infor.startswith('◎豆瓣评分'):
            movie['豆瓣评分'] = infor.replace('◎豆瓣评分', '').strip()
        elif infor.startswith('◎片　　长'):
            movie['片长'] = infor.replace('◎片　　长', '').strip()
        elif infor.startswith('◎导　　演'):
            movie['导演'] = infor.replace('◎导　　演', '').strip()
        elif infor.startswith('◎主　　演'):
            movie['主演'] = []
            movie['主演'].append(infor.replace('◎主　　演', '').strip())
            for infor in infors[index+1:len(infors)]:
                if infor.startswith('◎'):
                    break
                movie['主演'].append(infor.strip())
        elif infor.startswith('◎简　　介'):
            profile = infor.replace('◎简　　介', '').strip()
            for infor in infors[index+1:len(infors)]:
                profile = profile + infor.strip()
            movie['简介'] = profile
    # Look up the download link once, after all info lines have been processed
    movie['下载地址'] = root.xpath('//td[@bgcolor = "#fdfddf"]/a/@href')[0]
    return movie

def spider():
    #url = 'https://www.dytt8.net/html/gndy/dyzz/list_23_1.html'
    base_url = 'https://www.dytt8.net/html/gndy/dyzz/list_23_{}.html'
    movies = []
    for i in range(1,2):
        url = base_url.format(i)
        movies_urls = get_detail_urls(url)
        for detail_url in movies_urls:
            movie = parse_detail_page(detail_url)
            movies.append(movie)
    return movies
if __name__ == '__main__':
    movies = spider()
    print(movies)

Scraping movie information from dytt8.net (电影天堂)
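The long elif chain in parse_detail_page can also be written as a lookup table. The sketch below is one possible alternative, not the original author's method: FIELD_LABELS and parse_fields are hypothetical names, the sample lines are made up, and multi-line fields such as the cast list and synopsis would still need the special handling shown above. The label strings use \u3000 for the full-width spaces that appear on the page.

```python
# Map each page label to the dictionary key used by the scraper.
FIELD_LABELS = {
    '◎年\u3000\u3000代': '年代',
    '◎产\u3000\u3000地': '产地',
    '◎类\u3000\u3000别': '类别',
    '◎语\u3000\u3000言': '语言',
    '◎字\u3000\u3000幕': '字幕',
    '◎豆瓣评分': '豆瓣评分',
    '◎片\u3000\u3000长': '片长',
    '◎导\u3000\u3000演': '导演',
}


def parse_fields(lines):
    """Collect the single-line fields from a list of text lines."""
    movie = {}
    for line in lines:
        for label, key in FIELD_LABELS.items():
            if line.startswith(label):
                # Drop the label, then strip the surrounding whitespace
                # (str.strip() also removes full-width spaces).
                movie[key] = line[len(label):].strip()
                break
    return movie


# Made-up sample lines in the same shape as the page text.
sample = [
    '◎年\u3000\u3000代\u30002019',
    '◎产\u3000\u3000地\u3000USA',
    '◎豆瓣评分\u30007.9/10',
    'some unrelated line',
]
fields = parse_fields(sample)
print(fields)
```

Adding a new field then only requires a new entry in the table.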

>>>>> To be continued