Scraping Web Content with Python

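I've recently wanted to collect data from the internet for some research. I only know a little Python, so let's look at a fairly simple implementation.

For example, I want to grab Obama's weekly addresses from http://www.putclub.com/html/radio/VOA/presidentspeech/index.html. Doing that by hand means clicking into each page, then copying and saving it, which is tedious.

Is there a way to do it all in one go? With a language as capable as Python, it can be done quickly.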

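First, let's look at the page's source code.

You can see that the information we want sits in a short URL inside each entry.

More specifically, we need to visit every address of the form http://www.putclub.com/html/radio/VOA/presidentspeech/2014/0928/91326.html, and those addresses have to be extracted from the index page above.

OK, let's start writing code.

First, open the index page and store it in content.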

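import urllib  # Python 2: urlopen lives directly in urllib

# Index page that lists the weekly addresses
url = "http://www.putclub.com/html/radio/VOA/presidentspeech/index.html"
wp = urllib.urlopen(url)
print "start download..."
content = wp.read()  # raw HTML of the index page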
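
For anyone on Python 3, where urlopen has moved to urllib.request, the same step might look roughly like the sketch below. The helper name fetch, the 10-second timeout and the utf-8 default are my own assumptions, not something the site documents; if the page uses another encoding, pass it in.

```python
# Python 3 sketch; the snippets in this post target Python 2.
from urllib.request import urlopen


def fetch(url, encoding="utf-8", timeout=10):
    """Return the decoded body of `url`.

    `fetch` is a hypothetical helper. The default encoding is an
    assumption; undecodable bytes are replaced rather than raising.
    """
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode(encoding, errors="replace")


if __name__ == "__main__":
    index_url = "http://www.putclub.com/html/radio/VOA/presidentspeech/index.html"
    print("start download...")
    content = fetch(index_url)
    print(len(content))
```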

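Next, we extract each speech.

The idea is to search for "center_box" and then take whatever lies between each "href=" and "target" that follows it.

To see why these two markers were chosen, look at the page source.

The result is each article's relative URL; prefixing it with www.putclub.com gives the article's full address.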

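print content.count("center_box")
# Drop everything before the "center_box" block
index = content.find("center_box")
content = content[index+1:]
# Take the first link: skip past href="/ (7 characters) and stop at the
# closing quote, which sits 2 characters before the next "target"
start = content.find("href=") + 7
content = content[start:content.find("target", start)-2]
filename = content  # relative path, reused later to name the output file
url = "http://www.putclub.com/" + content
print content
print url
# Download the article page itself
wp = urllib.urlopen(url)
print "start download..."
content = wp.read()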
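
The snippet above only picks up the first link after "center_box". To collect every link, the same search can be repeated in a loop. Here is a rough Python 3 sketch: extract_links is a name I made up, the sample markup is invented to resemble the index page, and it ends each link at the closing quote instead of at "target". It also assumes every link after "center_box" is an article link, which may not hold on the real page.

```python
def extract_links(html, anchor="center_box", base="http://www.putclub.com/"):
    """Return a full URL for every href that appears after `anchor`.

    Hypothetical helper: it treats every link after the anchor as an
    article link, which is an assumption about the page layout.
    """
    links = []
    pos = html.find(anchor)
    if pos == -1:
        return links
    while True:
        start = html.find('href="', pos)
        if start == -1:
            break
        start += len('href="')
        end = html.find('"', start)
        if end == -1:
            break
        # Drop the leading slash so it is not doubled when joined to base
        links.append(base + html[start:end].lstrip("/"))
        pos = end
    return links


# Invented sample markup; the second article path is made up for illustration.
sample = (
    '<div class="nav"><a href="/index.html">Home</a></div>'
    '<div class="center_box">'
    '<a href="/html/radio/VOA/presidentspeech/2014/0928/91326.html" target="_blank">A</a>'
    '<a href="/html/radio/VOA/presidentspeech/2014/0921/91000.html" target="_blank">B</a>'
    '</div>'
)

for link in extract_links(sample):
    print(link)
```

The link in the nav block is skipped because the search only starts at "center_box".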

With the article's URL in hand, we filter its content the same way.

#print content
print content.count('<div class="content"')
#content = content[content.find('<div class="content"'):]
# Keep the text from the end of the info block up to the pager
content = content[content.find("<!--info end------->"):]
content = content[:content.find('<div class="dede_pages"')-1]
# Keep only the part of the path after "presidentspeech/", e.g. 2014/0928/91326.html
filename = filename[filename.find("presidentspeech")+len("presidentspeech/"):]
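
The "cut between two markers" trick shows up twice in this script, so it could be pulled into a small function. The sketch below is one possible version: slice_between is a hypothetical name, and unlike the code above it drops the start marker and returns None when a marker is missing, whereas find() returning -1 would quietly slice the wrong range.

```python
def slice_between(text, start_marker, end_marker):
    """Return the text strictly between the two markers, or None if either is absent."""
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = text.find(end_marker, start)
    if end == -1:
        return None
    return text[start:end]


# Invented page fragment that uses the two markers from the post
page = 'header<!--info end------->\n<p>Hi</p>\n<div class="dede_pages">nav</div>'
body = slice_between(page, "<!--info end------->", '<div class="dede_pages"')
print(repr(body))
```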

Finally, save the result and print it.

# Turn the remaining slashes into dashes so the path becomes a flat file name
filename = filename.replace("/", "-")
fp = open(filename, "w")
fp.write(content)
fp.close()
print content
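
The file-name step and the save step can also be written as two small functions, sketched here in Python 3. Both filename_for and save_text are names I made up, and writing the file as utf-8 is an assumption about what you want on disk.

```python
import os


def filename_for(path, marker="presidentspeech/"):
    """Flatten the part of `path` after `marker` into a single file name."""
    idx = path.find(marker)
    tail = path[idx + len(marker):] if idx != -1 else path
    return tail.strip("/").replace("/", "-")


def save_text(directory, name, text):
    """Write `text` to directory/name (utf-8, an assumption) and return the full path."""
    full_path = os.path.join(directory, name)
    with open(full_path, "w", encoding="utf-8") as fp:
        fp.write(text)
    return full_path


print(filename_for("html/radio/VOA/presidentspeech/2014/0928/91326.html"))
# prints 2014-0928-91326.html
```

Using with closes the file even if the write fails, so no explicit fp.close() is needed.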

OK, all done! Save it as a .pyw file, and from then on a double-click is all it takes to fetch Obama's weekly address and save it.