Sample Code for Scraping Qiushibaike with Python

Reference: http://python.jobbole.com/81351/#comment-93968

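This is mainly based on a post from Jobbole (伯乐在线), but the regular expression in that post's source code seems to be broken: I tried it several times and never got it to work. Later I noticed a commenter under the post who used BeautifulSoup instead. I gave it a try, and BeautifulSoup really is convenient to use.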
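To illustrate why BeautifulSoup feels more convenient than a hand-written regular expression, here is a minimal Python 3 sketch that runs the same kind of lookup on an inline HTML string. The markup in `SAMPLE_HTML` and the helper name `extract_text_posts` are made up for illustration; only the class names are taken from the script in this post, and the real page layout may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modeled on the class names the script searches for.
SAMPLE_HTML = """
<html><body>
<div class="article block untagged mb15">
  <h2>alice</h2>
  <div class="content">First joke</div>
  <i class="number">12</i>
</div>
<div class="article block untagged mb15">
  <h2>bob</h2>
  <div class="content">Joke with a picture</div>
  <div class="thumb"><img src="x.jpg"/></div>
  <i class="number">7</i>
</div>
</body></html>
"""


def extract_text_posts(html):
    """Return (text-only posts, number of image posts) found in html."""
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    image_count = 0
    for item in soup.find_all("div", {"class": "article block untagged mb15"}):
        # A div with class "thumb" marks a post that contains an image.
        if item.find("div", {"class": "thumb"}) is not None:
            image_count += 1
            continue
        posts.append({
            "author": item.find("h2").get_text(strip=True),
            "votes": item.find("i", {"class": "number"}).get_text(strip=True),
            "content": item.find("div", {"class": "content"}).get_text(strip=True),
        })
    return posts, image_count


posts, image_count = extract_text_posts(SAMPLE_HTML)
for post in posts:
    print(post["author"], post["votes"], post["content"])
print("image posts:", image_count)
```

Each field is located by tag name and class rather than by its position in the raw text, so small changes in whitespace or attribute order do not break the lookup the way they can with a regular expression.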

# -*- coding:utf-8 -*-

'''
Author:LeonWen
'''

# Python 2 script: it relies on urllib2 and the print statement.
import urllib2
from bs4 import BeautifulSoup

page = 1
url = 'http://www.qiushibaike.com/hot/page/' + str(page)
# Set the headers
user_agent = 'Mozilla/4.0(compatible;MSIE 5.5;Windows NT)'
headers = {'User-Agent': user_agent}
try:
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    # Name the parser explicitly so bs4 does not have to guess one
    object_bs = BeautifulSoup(response.read(), "html.parser")
    # print object_bs.prettify()
    # items is a list holding the matched posts
    items = object_bs.body.find_all("div", {"class": "article block untagged mb15"})
    # print items
    floor = 1  # running number of the text post being printed
    tag = 0    # number of posts skipped because they contain an image
    for item in items:
        if item.find("div", {"class": "thumb"}) is None:
            # A div with class=thumb marks a post that contains an image,
            # so only posts without one are printed
            author = item.find("h2")
            upNum = item.find("i", {"class": "number"})
            content = item.find("div", {"class": "content"})
            # print content.prettify()
            # print content.text
            print u"=============== Post", floor, u"======================="
            print u"Author:", author.text
            print u"Upvotes:", upNum.text
            print u"Content:", content.get_text()
            floor += 1
        else:
            tag += 1
    print u"Number of image posts:", tag
except urllib2.URLError as e:
    # HTTPError (a subclass of URLError) carries a status code
    if hasattr(e, "code"):
        print e.code
    if hasattr(e, "reason"):
        print e.reason
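The script above targets Python 2. As a rough sketch of how the request and error-handling part might look on Python 3, where `urllib2` was split into `urllib.request` and `urllib.error`, consider the following. The helper names `build_request` and `fetch_page` and the `timeout` default are assumptions made for illustration, not part of the original script.

```python
import urllib.error
import urllib.request

BASE_URL = "http://www.qiushibaike.com/hot/page/"
USER_AGENT = "Mozilla/4.0(compatible;MSIE 5.5;Windows NT)"


def build_request(page):
    """Build the request for one listing page (hypothetical helper)."""
    return urllib.request.Request(
        BASE_URL + str(page),
        headers={"User-Agent": USER_AGENT},
    )


def fetch_page(page, timeout=10):
    """Return the raw page bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(build_request(page), timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as e:
        # HTTPError is a subclass of URLError and additionally has .code
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
        return None
```

The bytes returned by `fetch_page(1)` can be passed straight to `BeautifulSoup(..., "html.parser")`, and the rest of the parsing logic carries over once the `print` statements are turned into `print()` calls.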

Original post: http://www.cnblogs.com/leonwen/p/5721843.html
