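Web Scraping Notes, Part 1

Studying had been getting dull and I wanted some fun. After browsing Zhihu for a while, web scraping looked entertaining, so I decided to give it a go.

This tutorial is well written, so here is the link: Python爬虫学习系列教程 (a Python web scraping tutorial series)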

Key points:

1. A script for testing web pages

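# encoding=utf-8
# Python 2 script: in Python 3 these modules live in urllib.request, urllib.parse and http.cookiejar
import urllib
import urllib2
import cookielib


class urltest:
    def __init__(self):
        filename = 'cookie.txt'
        # The jar keeps cookies in memory across requests;
        # call self.cookie.save() to write them to cookie.txt
        self.cookie = cookielib.MozillaCookieJar(filename)
        self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(self.cookie))
        urllib2.install_opener(self.opener)

    def open(self, url, values=None, header=None, needresponse=True):
        # values: POST form data; header: extra request headers
        # With no form data, pass data=None so urllib2 sends a GET rather than an empty POST
        data = urllib.urlencode(values) if values else None
        request = urllib2.Request(url, data, header or {})
        response = urllib2.urlopen(request)
        t = 0
        if needresponse:
            t = response.read()  # read the body only when the caller wants it
        return t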

t = urltest()
headers = {'Host': 'www.baidu.com',
           'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:48.0) Gecko/20100101 Firefox/48.0',
           'Accept': 'text/plain, */*',
           'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
           'Accept-Encoding': 'gzip, deflate',
           'Content-Type': 'application/x-www-form-urlencoded',
           'X-Requested-With': 'XMLHttpRequest',
           'Referer': 'http://www.baidu.com',
           'Connection': 'keep-alive'}
values = {'zh': '123456789', 'mm': '123456789'}  # form fields to POST
# First request: POST the form with the custom headers; the body is not needed
t.open("http://www.baidu.com", values, headers, needresponse=False)
# Second request: reuses the cookies set by the first and prints the body
print t.open("http://www.baidu.com")

This script can be used to test web pages with cookies enabled, and it accepts custom headers and POST data.
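
The two ideas behind the script (form data decides between GET and POST, and a MozillaCookieJar can be written to a file) can be checked without touching the network. The following is only a minimal sketch: `build_request` and `cookie_roundtrip` are hypothetical helper names, the `example.com` URLs and the hand-made `session` cookie are placeholders, and the imports fall back to the Python 2 module names used in this post.

```python
from __future__ import print_function

import os
import tempfile

try:  # Python 3 module names
    from http.cookiejar import Cookie, MozillaCookieJar
    from urllib.parse import urlencode
    from urllib.request import Request
except ImportError:  # Python 2 module names, as used in this post
    from cookielib import Cookie, MozillaCookieJar
    from urllib import urlencode
    from urllib2 import Request


def build_request(url, values=None, headers=None):
    """Return a GET request when there is no form data, otherwise a POST."""
    data = urlencode(values).encode('ascii') if values else None
    return Request(url, data, headers or {})


def cookie_roundtrip(path):
    """Save one hand-made cookie to `path`, then load it into a fresh jar."""
    jar = MozillaCookieJar(path)
    jar.set_cookie(Cookie(
        0, 'session', 'abc123',       # version, name, value
        None, False,                  # port, port_specified
        'example.com', False, False,  # domain, domain_specified, domain_initial_dot
        '/', True,                    # path, path_specified
        False, 4102444800, False,     # secure, expires (year 2100), discard
        None, None, {}))              # comment, comment_url, rest
    jar.save(ignore_discard=True, ignore_expires=True)

    restored = MozillaCookieJar(path)
    restored.load(ignore_discard=True, ignore_expires=True)
    return [(c.name, c.value) for c in restored]


# Nothing is sent here: the Request objects are only built and inspected
get_req = build_request('http://www.example.com/')
post_req = build_request('http://www.example.com/login',
                         [('zh', '123456789'), ('mm', '123456789')],
                         {'User-Agent': 'Mozilla/5.0'})
print(get_req.get_method(), post_req.get_method())

cookie_path = os.path.join(tempfile.mkdtemp(), 'cookie.txt')
print(cookie_roundtrip(cookie_path))
```

Passing `data=None` matters because urllib treats any non-None body, even an empty one, as a POST.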

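2. Parsing web pages

1. Regular expressions

It is worth learning enough basic regular expressions to match and extract specific content.

Here is a script that fetches the post titles from a blog: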

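# encoding=utf-8
import re
from lxml import etree

t = urltest()  # the urltest class from section 1
text = t.open('http://www.cnblogs.com/Rainlee007/').decode('utf-8')
text = text.replace(u'\n', '')  # remove newlines so each title block sits on one line
text = text.replace(u'\t', '')  # remove tabs
text = text.replace(u'&nbsp;', '')  # remove &nbsp; entities, which are not predefined in XML
reObj = re.findall(u'<div class="postTitle">.*?</div>', text)  # non-greedy match, one string per title block
for i in reObj:
    xml = etree.XML(i)  # parse the matched fragment
    print xml[0].text  # text of the first child, the title link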

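Using regular expressions in Python is simple: re.findall() is all you need, and it returns a list of the matched strings.

For regular expression syntax, see 正则表达式30分钟入门教程 (a 30-minute introduction to regular expressions).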
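
To see what re.findall() returns without fetching a page, here is a small sketch on an inline HTML string. The markup and titles are made up for illustration. It also shows two details worth knowing: with one capture group, findall returns only the captured part, and a greedy `.*` swallows everything up to the last `</div>`.

```python
from __future__ import print_function

import re

# Made-up sample markup standing in for a downloaded page
html = u'''
<div class="postTitle">
    <a href="/p/1.html">First post</a>
</div>
<div class="postTitle">
    <a href="/p/2.html">Second post</a>
</div>
'''

# re.S lets "." match newlines, so the text does not need to be flattened first.
# The single capture group makes findall return just the link text.
pattern = re.compile(u'<div class="postTitle">.*?<a[^>]*>(.*?)</a>.*?</div>', re.S)
titles = pattern.findall(html)
print(titles)

# Without the "?", ".*" is greedy and both blocks collapse into one match
greedy = re.findall(u'<div class="postTitle">.*</div>', html, re.S)
print(len(greedy))
```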

2.lxml

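lxml handles XML and HTML well. etree.XML(xml) returns the root node; each node exposes its tag, attributes, and so on, and child nodes are accessed by index, like items of a list.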

xml='''
        <div id="footer_bottom">
            <div>
                <a href="/AboutUS.aspx">关于博客园</a>
                <a href="/ContactUs.aspx">联系我们</a>
                2004-2016
                <a href="http://www.cnblogs.com/">博客园</a>
                保留所有权利
                <a href="http://www.miitbeian.gov.cn" target="_blank">沪ICP备09004260号</a>
            </div>
            
            <div>
                <a href="https://ss.knet.cn/verifyseal.dll?sn=e131108110100433392itm000000&amp;ct=df&amp;a=1&amp;pa=0.25787803245785335" rel="nofollow" target="_blank">
                    <img id="cnnic_img" src="//common.cnblogs.com/images/cnnic.png" alt="" height="23" width="64"/>
                </a>
                <a target="_blank" href="http://www.beian.gov.cn/portal/registerSystemInfo?recordcode=31011502001144" style="display:inline-block;text-decoration:none;height:20px;line-height:20px;">
                    <img src="//common.cnblogs.com/images/ghs.png" alt=""/>
                        <span style="float:left;height:20px;line-height:20px;margin: 0 5px 0 5px; color:#939393;">
                            沪公网安备 31011502001144号
                        </span>
                </a>
            </div>
        </div>
        '''.decode('utf-8')

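from lxml import etree  # the import this snippet needs

root = etree.XML(xml)
print root.tag  # div
print root[0].tag  # div: the tag name
print root[0][1].get('href')  # /ContactUs.aspx: the value of the href attribute
print root[0][0].text  # 关于博客园: the text inside the element, before its first child
print root[0][1].tail  # 2004-2016 plus surrounding whitespace: the text after this element's end tag, before the next element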
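
The difference between .text and .tail is easy to mix up, so here is a tiny sketch on a made-up fragment. It assumes nothing beyond `fromstring`, indexing, `.text` and `.tail`, which lxml.etree shares with the standard library's xml.etree.ElementTree, so the import falls back to the standard library when lxml is not installed. `full_text` is a hypothetical helper.

```python
from __future__ import print_function

try:
    from lxml import etree
except ImportError:  # stand-in with the same calls for this example
    import xml.etree.ElementTree as etree

root = etree.fromstring('<p>Hello <b>bold</b> world<i>!</i></p>')
bold, italic = root[0], root[1]

print(repr(root.text))    # text of <p> before its first child
print(repr(bold.text))    # text inside <b>
print(repr(bold.tail))    # text after </b>, before <i>
print(repr(italic.tail))  # nothing follows </i>, so this is None


def full_text(element):
    """Rebuild all text under `element` from the .text and .tail pieces."""
    pieces = [element.text or '']
    for child in element:
        pieces.append(full_text(child))
        pieces.append(child.tail or '')
    return ''.join(pieces)


print(full_text(root))
```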

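For everything else, see the lxml API documentation.

Combining regular expressions with lxml makes it easy to extract the information you want.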
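
One way to read that combination is sketched below: a regular expression cuts out small fragments, and the parser then reads attributes and text from each one. This is useful when the page as a whole is not well-formed XML (the unclosed `<br>` here) but the fragments are. The page content and the `extract_posts` name are made up for illustration, and the import falls back to the standard library's ElementTree, which offers the same `fromstring`, `.get` and `.text` calls.

```python
from __future__ import print_function

import re

try:
    from lxml import etree
except ImportError:  # stand-in with the same calls for this example
    import xml.etree.ElementTree as etree

# Made-up page: the <br> is never closed, so the whole page is not valid XML
page = ('<html><body><div id="main">'
        '<div class="postTitle"><a href="/p/1.html">First post</a></div>'
        '<p>unrelated <br> markup</p>'
        '<div class="postTitle"><a href="/p/2.html">Second post</a></div>'
        '</div></body></html>')


def extract_posts(html):
    """Return (href, title) pairs, one per postTitle block."""
    posts = []
    for fragment in re.findall(r'<div class="postTitle">.*?</div>', html, re.S):
        link = etree.fromstring(fragment)[0]  # the <a> inside the matched <div>
        posts.append((link.get('href'), link.text))
    return posts


print(extract_posts(page))
```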

3. MySQL database

A SQL database is the natural choice for storing large amounts of data. Here is the code:

# encoding=utf8
import MySQLdb

conn = MySQLdb.connect(host='localhost', user='root', passwd='12345789', db='databaseName', charset='utf8', port=3306)
cur = conn.cursor()

# Pass values as parameters and let the driver handle quoting
cur.execute("select * from tableName where 学号=%s", ('2013141011004',))
a = cur.fetchall()  # a tuple of rows, each row a tuple of column values
print a

# One value per %s placeholder, in the same order as the SET clause
value = ['男',
 '电子信息学院',
 '电子信息工程',
 '汉族',
 '北京',
 '清水河',
 '122050104']
cur.execute("UPDATE englishcompetetion SET 性别 = %s,学院=%s,专业=%s, 民族=%s, 籍贯=%s, 校区=%s, 班级 = %s WHERE 学号 = %s", value + ['201514102037'])
conn.commit()  # the update is not stored until commit

cur.close()
conn.close()
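
The same select, update and commit flow can be tried without a MySQL server. The sketch below swaps in the standard library's sqlite3 module as a stand-in, so it is not MySQLdb code: sqlite3 uses `?` placeholders where MySQLdb uses `%s`. The `students` table, its columns and the rows are all made up for illustration.

```python
from __future__ import print_function

import sqlite3

conn = sqlite3.connect(':memory:')  # throwaway in-memory database
cur = conn.cursor()
cur.execute('CREATE TABLE students '
            '(student_id TEXT PRIMARY KEY, college TEXT, campus TEXT)')

# executemany inserts a batch of rows with one statement
rows = [('2015001', 'Electronics', 'North'),
        ('2015002', 'Physics', 'South')]
cur.executemany('INSERT INTO students VALUES (?, ?, ?)', rows)

# One parameter per placeholder, in the order they appear in the statement
fields = ('Computer Science', 'East')
student_id = '2015001'
cur.execute('UPDATE students SET college = ?, campus = ? WHERE student_id = ?',
            fields + (student_id,))
conn.commit()

cur.execute('SELECT college, campus FROM students WHERE student_id = ?',
            (student_id,))
updated = cur.fetchone()  # a single row as a tuple
cur.execute('SELECT COUNT(*) FROM students')
total = cur.fetchone()[0]
print(updated, total)

cur.close()
conn.close()
```

A mismatch between the number of placeholders and the number of values raises an error in both drivers, so it helps to build the value list right next to the statement.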