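Implementing a simple web crawler in Python 3

This post is based on 虫师's article on implementing a simple crawler in Python 2, with my own observations added.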

#coding=utf-8
import re
import urllib.request

def getHtml(url):
    # Open the URL and read the raw response body (a bytes object)
    page = urllib.request.urlopen(url)
    html = page.read()
    #print(type(html))
    # Decode the bytes into a str using the page's encoding
    html = html.decode('UTF-8')
    #print(html)
    return html

def getImg(html):
    # Capture the src of every image with class="BDE_Image" that ends in .jpg
    reg = r'img class="BDE_Image" src="(.+?\.jpg)"'
    imgre = re.compile(reg)
    #print(type(imgre))
    imglist = re.findall(imgre,html)
    #print(imglist)
    num = 0
    for imgurl in imglist:
        # Save each image as D:\img\hardaway<num>.jpg (the folder must already exist)
        urllib.request.urlretrieve(imgurl,r'D:\img\hardaway%s.jpg' %num)
        num+=1

html = getHtml("http://tieba.baidu.com/p/1569069059")
getImg(html)

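re: a module in Python's standard library, used for regular-expression operations
https://docs.python.org/3/library/re.html
urllib.request: part of the urllib package, which also ships with Python's standard library, so no separate installation is needed; used for opening URLs
https://docs.python.org/3/library/urllib.request.html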
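First, a getHtml() function is defined:
It opens the URL with urllib.request.urlopen()
It reads the data at that URL with read(), which returns bytes
It decodes those bytes into a string with decode(), using the specified encoding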
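
A minimal sketch of the same three steps, written a little more defensively. It is an illustration only, not the original script: the name fetch_html, the encoding and timeout parameters, and the choice of errors="replace" are all assumptions.

```python
import urllib.request


def fetch_html(url, encoding="utf-8", timeout=10):
    """Open url, read the raw bytes, and decode them into a str.

    fetch_html, encoding and timeout are illustrative names, not part of
    the original script.
    """
    # The with-block closes the connection even if read() raises.
    with urllib.request.urlopen(url, timeout=timeout) as page:
        raw = page.read()  # bytes
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
    return raw.decode(encoding, errors="replace")
```

Calling fetch_html("http://tieba.baidu.com/p/1569069059") would play the same role as getHtml() in the script above.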

The encoding I specify here is UTF-8, which is the charset declared in the page's source.

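Next, a getImg() function is defined to pick out the image URLs we need from the full page data.

In the example above, the encoding was hard-coded after I looked it up in the page source. Later I tried using a regular expression to match the encoding declared by charset in the page, and then decoding with the matched value.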

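import re
import urllib.request

def getHtml(url):
    page = urllib.request.urlopen(url)
    html = page.read()  # bytes
    #print(type(html))
    # str() on bytes gives its repr ("b'...'"), which a str pattern can search
    rehtml = str(html)
    #print(type(rehtml))
    # Capture the charset declared in the page's Content-Type meta tag
    reg = r'content="text/html; charset=(.+?)"'
    charsetre = re.compile(reg)
    charsetlist = re.findall(charsetre,rehtml)
    print(type(charsetlist))
    # findall() returns a list; take the first match (IndexError if none is declared)
    code = charsetlist[0]
    print(type(code))
    html = html.decode(code)
    return html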

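Here is the reasoning. html = page.read() returns a bytes object, and re.findall() cannot apply a string pattern to a bytes object.
So I defined a new variable, rehtml, and used str() to convert the value of html into a string that re.findall() can search. Note that str() on a bytes object yields its printable representation (b'...') rather than decoded text, but that is enough for finding the charset declaration.
I defined another variable, code, to hold the encoding name, because re.findall() returns a list and what I need is a string.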
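
An alternative to converting with str() is to search the raw bytes directly with a bytes pattern. The sketch below assumes that approach and is not the method used above: detect_charset, decode_page and the default parameter are hypothetical names, and the pattern is deliberately looser than the one in the listing so that it also matches the short meta charset form.

```python
import re

# A bytes pattern (rb"...") can be applied to the bytes returned by read().
CHARSET_RE = re.compile(rb'charset=["\']?([\w-]+)', re.IGNORECASE)


def detect_charset(raw, default="utf-8"):
    """Return the charset name declared in raw (bytes), or default if none is found."""
    match = CHARSET_RE.search(raw)
    if match is None:
        return default
    # The captured group is bytes; encoding names are plain ASCII.
    return match.group(1).decode("ascii")


def decode_page(raw):
    """Decode raw page bytes using the charset the page itself declares."""
    return raw.decode(detect_charset(raw))
```

The HTTP response headers may also carry the charset; page.headers.get_content_charset() returns it when the server sends one, and None otherwise.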
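Back in getImg(), the regular expression is written to match the images we want: reg = r'img class="BDE_Image" src="(.+?\.jpg)"' (the dot before jpg is escaped so that it matches a literal period).
re.compile() compiles the regular expression into a pattern object, which is more efficient when the pattern is used many times in one program.
re.findall() returns all non-overlapping matches of the pattern in the page data as a list of strings.
urllib.request.urlretrieve() downloads each image to the local disk, here into the img folder on drive D.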
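
To see what re.compile() and re.findall() return without touching the network, the sketch below runs the same kind of pattern over a small hand-written HTML snippet. The snippet, the example.com URLs, and the helper names extract_image_urls and target_paths are made up for illustration; only the pattern follows the script above, with the dot before jpg escaped.

```python
import os
import re

# Compiled once, reused for every page.
IMG_RE = re.compile(r'img class="BDE_Image" src="(.+?\.jpg)"')


def extract_image_urls(html):
    """Return the captured src values, in page order, as a list of str."""
    return IMG_RE.findall(html)


def target_paths(urls, folder, prefix="hardaway"):
    """Pair each URL with a numbered local file name inside folder."""
    return [(url, os.path.join(folder, "%s%d.jpg" % (prefix, num)))
            for num, url in enumerate(urls)]


# Hypothetical page fragment: two matching images and one that should be skipped.
sample = (
    '<img class="BDE_Image" src="http://example.com/a.jpg" width="560">'
    '<img class="other" src="http://example.com/skip.jpg">'
    '<img class="BDE_Image" src="http://example.com/b.jpg">'
)
urls = extract_image_urls(sample)
print(urls)  # ['http://example.com/a.jpg', 'http://example.com/b.jpg']
```

Each (url, path) pair from target_paths() could then be passed to urllib.request.urlretrieve(url, path), after making sure the folder exists with os.makedirs(folder, exist_ok=True).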