An Introduction to Basic PyQuery Operations

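PyQuery gives Python a jQuery-like way of working with HTML: you can query an HTML document using jQuery syntax.
This article uses the Baidu home page as an example to introduce some basic PyQuery operations.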

Initializing PyQuery

from pyquery import PyQuery as pq

doc = pq(url='http://www.baidu.com')
print(type(doc))

# Get the parent element of the navigation links (id='u1')
products = doc('#u1')

print(type(products))

link_index_first = products('a:first')
link_index_last = products('a:last')
link_index_custom = products('a:eq(2)')

print(type(link_index_first))

PyQuery's text() method returns the text of the matched element:

print(link_index_first.text())
print(link_index_last.text())
print(link_index_custom.text())

糯米
更多产品
hao123

Likewise, PyQuery's attr() method returns the value of an element's attribute:

print(link_index_first.attr('name'))

tj_trnuomi

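Next, iterate over all the navigation links. P.S. Note that iterating over a PyQuery object yields elements of type "lxml.html.HtmlElement" rather than PyQuery objects, so each link is read with lxml's get() method and text attribute instead of attr() and text().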

# Iterate over all navigation links and print each link's name attribute and its displayed text
links = products('a')
for link in links:
    id_name = link.get('name')
    text = link.text
    print('Name: {0: <15}	Text: {1: <15}'.format(id_name, text))

Name: tj_trnuomi Text: 糯米
Name: tj_trnews Text: 新闻
Name: tj_trhao123 Text: hao123
Name: tj_trmap Text: 地图
Name: tj_trvideo Text: 视频
Name: tj_trtieba Text: 贴吧
Name: tj_login Text: 登录
Name: tj_settingicon Text: 设置
Name: tj_briicon Text: 更多产品

Below are two other ways to initialize PyQuery.

Converting a string directly

from lxml import etree
d = pq("<html></html>")
d = pq(etree.fromstring("<html></html>"))

Reading from a file

d = pq(filename=path_to_html_file)

When a file needs an explicit encoding, you can use the following approach:

from lxml.html import HTMLParser, fromstring
UTF8_PARSER = HTMLParser(encoding='utf-8')
# page is the path to the HTML file to load
with open(page, encoding='utf-8') as filehandler:
    file_contents = filehandler.read()
doc = pq(fromstring(file_contents, parser=UTF8_PARSER))