Python Web Crawlers (Part 1)

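I have only recently started working with Python crawlers, so this post is a short summary of what I have learned about how a spider works.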

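A crawler, or web spider, can be pictured as a spider moving across a web. The Internet is the web, and the crawler walks along it, fetching whatever resources it comes across. What it fetches is up to you.

Suppose it is fetching one page and finds a path there, that is, a hyperlink pointing to another page. It can follow that link to the next page and fetch data from it. In this way every page linked into the web is within the spider's reach, and you decide which of them it collects.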
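The link-following idea can be sketched without touching the network. The example below is only an illustration in Python 3: `FAKE_WEB`, `LinkParser`, `extract_links`, and `crawl` are hypothetical names, and the "pages" are an in-memory dictionary standing in for real downloads.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return the absolute URLs of all hyperlinks found in html."""
    parser = LinkParser()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def crawl(start_url, fetch_html, max_pages=10):
    """Visit pages breadth-first, following links and skipping URLs already seen."""
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in extract_links(url, fetch_html(url)):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Hypothetical in-memory "web" used instead of real HTTP requests.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a>',
    "http://example.com/b": '<a href="/">home</a>',
}

visited = crawl("http://example.com/", lambda u: FAKE_WEB.get(u, ""))
print(visited)
```

The `seen` set is what keeps the spider from walking in circles when pages link back to each other.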

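The basic crawler workflow:

The first step is to fetch the target page. There are three ways to request it:

Method 1: submit the request -----> download the page source ----> parse the page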

#!/usr/bin/env python2
# -*- coding: utf-8 -*-
# author: Momo time:2018/6/28

import urllib2

url = "http://baike.baidu.com/item/Python"
# Method 1: the most concise approach
print 'Method 1'

response1 = urllib2.urlopen(url)  # send the request directly
print response1.getcode()  # status code; 200 means the request succeeded
print len(response1.read())  # read the body and print its length

This is the simplest approach.
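For readers on Python 3, where `urllib2` no longer exists, a rough equivalent of the direct request might look like the sketch below. The `fetch` helper, its `timeout` default, and the choice to return `(None, b"")` on failure are assumptions made for illustration, not part of the original script.

```python
import urllib.error
import urllib.request


def fetch(url, timeout=10):
    """Return (status_code, body_bytes), or (None, b"") if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.getcode(), response.read()
    except urllib.error.URLError as exc:
        # HTTPError is a subclass of URLError, so HTTP error statuses land here too.
        print("request failed:", exc)
        return None, b""


# Example call (needs network access, so it is left commented out):
# status, body = fetch("http://baike.baidu.com/item/Python")
# print(status, len(body))
```

Passing a timeout keeps the crawler from hanging forever on a server that never answers.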

Method 2: send the request while imitating a browser (to get the page source) -----> extract the useful data ------> store it in a database or a file

#!/usr/bin/env python2
# -*- coding: utf-8 -*-
# author: Momo time:2018/6/28

import urllib2

url = "http://baike.baidu.com/item/Python"
# Method 2: add data and HTTP headers
print 'Method 2'
request = urllib2.Request(url)  # create a Request object
# To send form data (this turns the request into a POST): request.add_data('a=1')
request.add_header("User-Agent", "Mozilla/5.0")  # add an HTTP header
response2 = urllib2.urlopen(request)  # send the request and get the response
print response2.getcode()
print len(response2.read())
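In Python 3 the same idea is usually written with `urllib.request.Request`, passing the headers and the optional form data to the constructor. The sketch below only builds the request object; `build_request` and its parameters are hypothetical names chosen for this example.

```python
import urllib.parse
import urllib.request


def build_request(url, form=None, user_agent="Mozilla/5.0"):
    """Build a Request that carries a browser-like User-Agent header.

    If form is a non-empty dict, it is URL-encoded and attached as the body,
    which makes urllib send the request as a POST instead of a GET.
    """
    data = urllib.parse.urlencode(form).encode("ascii") if form else None
    return urllib.request.Request(url, data=data, headers={"User-Agent": user_agent})


request = build_request("http://baike.baidu.com/item/Python", form={"a": "1"})
print(request.get_method(), request.data)

# Sending it needs network access, so it is left commented out:
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(response.getcode(), len(response.read()))
```

Some sites reject requests that carry the default `Python-urllib` user agent, which is why a browser-like value is set here.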

Method 3: add handlers for special scenarios (cookies in this example)

#!/usr/bin/env python2
# -*- coding: utf-8 -*-
# author: Momo time:2018/6/28

import cookielib
import urllib2

url = "http://baike.baidu.com/item/Python"
print 'Method 3'
cj = cookielib.CookieJar()  # create a cookie container
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))  # build an opener that handles cookies
urllib2.install_opener(opener)  # install the opener into urllib2
response3 = urllib2.urlopen(url)  # fetch the page with the cookie-aware urllib2
print response3.getcode()
print cj
print response3.read()
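A Python 3 version of the cookie handler could look roughly like this. In Python 3 `cookielib` is named `http.cookiejar`; the `build_cookie_opener` helper is a hypothetical name, and the sketch calls `opener.open` directly instead of installing the opener globally.

```python
import http.cookiejar
import urllib.request


def build_cookie_opener():
    """Return (opener, jar); cookies set by servers are stored in jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


opener, jar = build_cookie_opener()

# Fetching a page needs network access, so it is left commented out:
# response = opener.open("http://baike.baidu.com/item/Python", timeout=10)
# print(response.getcode())
# for cookie in jar:
#     print(cookie.name, cookie.value)
```

Using the opener object directly leaves the module-level `urlopen` untouched, which is convenient when only some requests should share cookies.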

Strategies for locating data in a page

You can use regular expressions or XPath. Personally I find XPath easier to use.
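To make the comparison concrete, here is a small sketch that pulls the same image URLs out of a made-up snippet both ways. It is an illustration only: `SAMPLE` is invented, and the XPath half uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup (real pages are commonly parsed with a third-party library such as lxml instead). The regex uses `[^"]+` rather than `.+?`, so a match cannot run past the closing quote of one attribute into the next tag.

```python
import re
import xml.etree.ElementTree as ET

# Invented, well-formed sample markup: two .jpg images and one .png.
SAMPLE = (
    "<div>"
    '<img src="http://example.com/a.jpg" width="560" />'
    '<img src="http://example.com/b.png" width="560" />'
    '<img src="http://example.com/c.jpg" width="100" />'
    "</div>"
)


def jpgs_by_regex(html):
    """Find .jpg URLs by matching the raw text of the src attribute."""
    return re.findall(r'src="([^"]+\.jpg)" width', html)


def jpgs_by_xpath(html):
    """Find .jpg URLs by selecting <img> elements that have a width attribute."""
    root = ET.fromstring(html)
    return [
        img.get("src")
        for img in root.findall(".//img[@width]")
        if img.get("src", "").endswith(".jpg")
    ]


print(jpgs_by_regex(SAMPLE))
print(jpgs_by_xpath(SAMPLE))
```

The regex depends on the exact attribute order and spacing in the markup, while the XPath query depends only on the element structure, which is one reason XPath can feel easier to maintain.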

Below is a script I wrote that uses a regular expression to download images from a Baidu Tieba thread.

#!/usr/bin/env python2
# -*- coding: utf-8 -*-
# author: Momo time:2018/6/28


import urllib
import re

def get_html(url):
    # Download the page and return its source as a byte string
    page = urllib.urlopen(url)
    html_code = page.read()
    return html_code

def get_image(html_code):
    # Capture the src of every .jpg image whose src is directly followed by a width attribute
    reg = r'src="(.+?\.jpg)" width'
    reg_img = re.compile(reg)
    img_list = reg_img.findall(html_code)
    x = 0
    for img in img_list:
        # Save the images as 0.jpg, 1.jpg, ... in the current directory
        urllib.urlretrieve(img, '%s.jpg' % x)
        x += 1

print u'------- Page image downloader -------'
print u'Enter a url:',
url = raw_input()
print url
if not url:
    print u'--- No url entered, using the default ---'
    url = 'http://tieba.baidu.com/p/1753935195'
print u'---------- Fetching the page ---------'
html_code = get_html(url)
print u'---------- Downloading images ---------'
get_image(html_code)
print u'----------- Download finished -----------'
raw_input('Press Enter to exit')

Here is the same script written for Python 3:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# author: Momo time:2018/6/29

import urllib.request
import re

# url = "http://tieba.baidu.com/p/1753935195"

def get_html(url):
    # read() returns bytes in Python 3, so decode it before applying a str regex
    response = urllib.request.urlopen(url)
    html_code = response.read().decode('utf-8')
    return html_code

def get_img(html_code):
    reg = r'src="(.+?\.jpg)" width'
    reg_img = re.compile(reg)
    imglist = reg_img.findall(html_code)
    x = 0
    for img in imglist:
        urllib.request.urlretrieve(img, "tieba%s.jpg" % x)
        x += 1

print("Enter a url:")
url = input()
if not url:
    print("---------- No url entered, using the default ---------")
    url = "http://tieba.baidu.com/p/1753935195"
print("--------- Fetching the page -------------")
html_code = get_html(url)
print("--------- Downloading images -------------")
get_img(html_code)

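Note that Python 2 and Python 3 differ slightly in how they fetch a page and read its source. In Python 2 the function is urllib.urlopen and read() returns a str. In Python 3 it is urllib.request.urlopen and read() returns bytes, which have to be decoded before they can be searched with a str pattern.

Python 2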

import urllib

def get_html(url):
    page = urllib.urlopen(url)
    html_code = page.read()
    return html_code

Python 3

import urllib.request

def get_html(url):
    response = urllib.request.urlopen(url)
    html_code = response.read().decode('utf-8')
    return html_code