七月在线 Crawler Course Notes (3) — Crawler Basics and a Simple Crawler Implementation

The main topics of Lesson 3 are:

  • CSS
  • XPath
  • JSON and XML handling
  • Regular expressions
  • Introduction to Selenium and a hands-on example

 

CSS examples

Open the following four HTML pages in a browser to see the effect.

css_background_color.html:

<html>
<head>

<style type="text/css">

body {background-color: yellow}
h1 {background-color: #00ff00}
h2 {background-color: transparent}
p {background-color: rgb(250,0,255)}
p.no2 {background-color: gray; padding: 20px;}

</style>

</head>

<body>

<h1>这是标题 1</h1>
<h2>这是标题 2</h2>
<p>这是段落</p>
<p class="no2">这个段落设置了内边距。</p>

</body>
</html>

 css_board_color.html:

<html>
<head>

<style type="text/css">
p.one
{
border-style: solid;
border-color: #0000ff
}
p.two
{
border-style: solid;
border-color: #ff0000 #0000ff
}
p.three
{
border-style: solid;
border-color: #ff0000 #00ff00 #0000ff
}
p.four
{
border-style: solid;
border-color: #ff0000 #00ff00 #0000ff rgb(250,0,255)
}
</style>

</head>

<body>

<p class="one">One-colored border!</p>

<p class="two">Two-colored border!</p>

<p class="three">Three-colored border!</p>

<p class="four">Four-colored border!</p>

<p><b>注释:</b>"border-width" 属性如果单独使用的话是不会起作用的。请首先使用 "border-style" 属性来设置边框。</p>

</body>
</html>

 css_font_family.html:

<html>
<head>
<style type="text/css">
p.serif{font-family:"Times New Roman",Georgia,Serif}
p.sansserif{font-family:Arial,Verdana,Sans-serif}
</style>
</head>

<body>
<h1>CSS font-family</h1>
<p class="serif">This is a paragraph, shown in the Times New Roman font.</p>
<p class="sansserif">This is a paragraph, shown in the Arial font.</p>

</body>
</html>

 css_text_decoration.html:

<html>
<head>
<style type="text/css">
h1 {text-decoration: overline}
h2 {text-decoration: line-through}
h3 {text-decoration: underline}
h4 {text-decoration:blink}
a {text-decoration: none}
</style>
</head>

<body>
<h1>这是标题 1</h1>
<h2>这是标题 2</h2>
<h3>这是标题 3</h3>
<h4>这是标题 4</h4>
<p><a href="http://www.w3school.com.cn/index.html">这是一个链接</a></p>
</body>

</html>
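
For crawling, the reason CSS appears in this course is the selector syntax: the same tag and class selectors used in the style rules above are what a parser such as BeautifulSoup accepts in select(). A minimal sketch, assuming beautifulsoup4 is installed (pip install beautifulsoup4) and reusing markup from css_background_color.html:

from bs4 import BeautifulSoup

html = """
<body>
  <h1>这是标题 1</h1>
  <p>这是段落</p>
  <p class="no2">这个段落设置了内边距。</p>
</body>
"""
soup = BeautifulSoup(html, 'html.parser')
print(soup.select_one('p.no2').text)        # tag + class selector, like the p.no2 rule above
print([p.text for p in soup.select('p')])   # every <p> element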

Parsing XML. Below is the book.xml file used in the course:

<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
    <book>
        <title lang="eng">Harry Potter</title>
        <price>29.99</price>
    </book>
    <book>
        <title lang="eng">Learning XML</title>
        <price>39.95</price>
    </book>
</bookstore>

Processing XML in Python with DOM:

from xml.dom import minidom

doc = minidom.parse('book.xml')            # build the whole DOM tree in memory
root = doc.documentElement                 # the <bookstore> element
# print(dir(root))
print(root.nodeName)
books = root.getElementsByTagName('book')
print(type(books))
for book in books:
    titles = book.getElementsByTagName('title')
    print(titles[0].childNodes[0].nodeValue)   # the text node inside <title>

Output:

bookstore
<class 'xml.dom.minicompat.NodeList'>
Harry Potter
Learning XML
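
A small extension that is not in the course notes: minidom element nodes can also return attribute values, so the lang attribute on each <title> in book.xml can be read with getAttribute:

from xml.dom import minidom

doc = minidom.parse('book.xml')
for title in doc.getElementsByTagName('title'):
    # getAttribute returns '' when the attribute is absent
    print(title.getAttribute('lang'), title.childNodes[0].nodeValue)
# expected: eng Harry Potter, then eng Learning XML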

Processing XML in Python with SAX:

from xml.parsers.expat import ParserCreate

class DefaultSaxHandler(object):
    def start_element(self, name, attrs):
        # remember the current element so char_data can report whose text it is
        self.element = name
        print('element: %s, attrs: %s' % (name, str(attrs)))

    def end_element(self, name):
        print('end element: %s' % name)

    def char_data(self, text):
        if text.strip():  # skip whitespace-only text nodes between tags
            print("%s's text is %s" % (self.element, text))

handler = DefaultSaxHandler()
parser = ParserCreate()
parser.StartElementHandler = handler.start_element
parser.EndElementHandler = handler.end_element
parser.CharacterDataHandler = handler.char_data
with open('book.xml', 'r') as f:
    parser.Parse(f.read())
Output:

element: bookstore, attrs: {}
element: book, attrs: {}
element: title, attrs: {'lang': 'eng'}
title's text is Harry Potter
end element: title
element: price, attrs: {}
price's text is 29.99
end element: price
end element: book
element: book, attrs: {}
element: title, attrs: {'lang': 'eng'}
title's text is Learning XML
end element: title
element: price, attrs: {}
price's text is 39.95
end element: price
end element: book
end element: bookstore
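
The topic list at the top also mentions XPath and JSON, which these notes do not otherwise demonstrate. Here is a minimal sketch, not from the course: the XPath part assumes lxml is installed (pip install lxml) and queries the same book.xml, while the JSON part uses only the standard library:

import json
from lxml import etree

# XPath: pull out the title texts, then only the titles of books priced above 30
tree = etree.parse('book.xml')
print(tree.xpath('//book/title/text()'))             # ['Harry Potter', 'Learning XML']
print(tree.xpath('//book[price>30]/title/text()'))   # ['Learning XML']

# JSON: json.loads/json.dumps convert between strings and Python objects
data = json.loads('{"title": "Harry Potter", "price": 29.99}')
print(data['price'])                        # 29.99
print(json.dumps(data, ensure_ascii=False))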

 

Regular expression example:

import re

m = re.match(r'\d{3}-\d{3,8}', '010-12345')
# print(dir(m))
print(m.string)
print(m.pos, m.endpos)

# grouping
print('分组')
m = re.match(r'^(\d{3})-(\d{3,8})$', '010-12345')
print(m.groups())
print(m.group(0))   # the whole match
print(m.group(1))   # first group
print(m.group(2))   # second group

# splitting
print('分割')
p = re.compile(r'\d+')   # precompiled pattern: one or more digits
print(type(p))
print(p.split('one1two3three3four4'))

t = '20:15:45'
m = re.match(r'^(0[0-9]|1[0-9]|2[0-3]|[0-9]):(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9]):(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])$', t)
print(m.groups())

Output:

010-12345
0 9
分组
('010', '12345')
010-12345
010
12345
分割
<class '_sre.SRE_Pattern'>
['one', 'two', 'three', 'four', '']
('20', '15', '45')
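
As a small extension that is not in the course notes, the same phone-number pattern can be written with named groups, which makes the extracted pieces self-describing (the group names area and number are my own):

import re

m = re.match(r'^(?P<area>\d{3})-(?P<number>\d{3,8})$', '010-12345')
print(m.group('area'))     # 010
print(m.group('number'))   # 12345
print(m.groupdict())       # {'area': '010', 'number': '12345'}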

Scraping data from an e-commerce site

Selenium can be installed directly with pip: pip install selenium

On Windows you also need a browser driver. I use the Chrome browser, the same as in the course, so the driver is chromedriver.

Here is a download address: http://docs.seleniumhq.org/download/

I keep the driver in a folder called tools.

After downloading the driver, you need to add its location to the system PATH environment variable, otherwise you will get an error.
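
If you prefer not to modify PATH, Selenium 3.x also accepts an explicit driver location when the browser is created; the path below is just an example matching my tools folder:

from selenium import webdriver

# executable_path is the Selenium 3.x way to point at a specific chromedriver binary
browser = webdriver.Chrome(executable_path='tools/chromedriver.exe')
browser.get('http://www.17huo.com/')
print(browser.title)
browser.quit()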

With the preparation done, we can start crawling 17huo.com. The goal is to scrape the title and price of every item in the 大衣 (coat) category. Quite a lot of time has passed since the course was recorded and the site has been redesigned, so I modified the course code myself; as tested it works and successfully scrapes the first three pages (loop indices 0, 1 and 2).

from selenium import webdriver
import time

browser = webdriver.Chrome()
browser.set_page_load_timeout(50)
browser.get('http://www.17huo.com/newsearch/?k=%E5%A4%A7%E8%A1%A3')
page_info = browser.find_element_by_css_selector('body > div.wrap > div.search_container > div.pagem.product_list_pager > div')
# print(page_info.text)
# page_info.text looks like: 共 40 页,每页 60 条
pages = int(page_info.text.split(' ')[1])   # second whitespace-separated token is the page count, "40"
# print(pages)
for page in range(pages):
    if page > 2:   # only scrape the first three pages
        break
    url = 'http://www.17huo.com/newsearch/?k=大衣&page=' + str(page + 1)
    browser.get(url)
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)  # give the page time to finish loading
    goods = browser.find_element_by_css_selector(
        '.book-item-list').find_elements_by_tag_name('a')
    print('%d页有%d件商品' % ((page + 1), len(goods)))
    for good in goods:
        try:
            title = good.find_element_by_css_selector('a:nth-child(1) > p:nth-child(2)').text
            # alternative selector: a:nth-child(2) > div:nth-child(3) > div:nth-child(2)
            price = good.find_element_by_css_selector('span:nth-child(1)').text
            print(title, price)
        except Exception:
            print(good.text)

Partial results:

1页有180件商品

¥ 155.00
黄格子大衣
黄格子大衣

¥ 350.00
中老年妈妈冬季仿貂绒大衣连帽女装宽松外套羊剪绒上衣
KXLCMML1308

¥ 350.00
中老年女装冬新款羊剪绒加厚仿皮草宽松外套妈妈装大衣
KXLCMML1307
Original post: https://www.cnblogs.com/xingbiaoblog/p/9019505.html