Getting Started with Web Scraping in Python

A simple scraper

from urllib.request import urlopen
html = urlopen('http://pythonscraping.com/pages/page1.html')
print(html.read())

The code above fetches the complete HTML source of the page. Note that html.read() returns bytes, not str.
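
Because read() returns raw bytes, the body usually has to be decoded before it can be treated as text. The following is a minimal sketch, not the only way to do it: fetch_text and SAMPLE_URL are hypothetical names, and a data: URL (which urlopen also accepts) stands in for a real page so the snippet runs without a network connection.

```python
from urllib.parse import quote
from urllib.request import urlopen


def fetch_text(url, default_charset='utf-8'):
    """Fetch a URL and return its body decoded to str (hypothetical helper)."""
    with urlopen(url) as response:
        # Use the charset declared in the Content-Type header when there is one
        charset = response.headers.get_content_charset() or default_charset
        return response.read().decode(charset)


# Assumption: a data: URL replaces a real page so that no network is needed
SAMPLE_URL = 'data:text/html;charset=utf-8,' + quote('<h1>An Interesting Title</h1>')

text = fetch_text(SAMPLE_URL)
print(text)
```

The same helper should work with an http:// address such as the page used above, provided the server is reachable.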

Parsing HTML documents with BeautifulSoup

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://pythonscraping.com/pages/page1.html')
# Specify html.parser explicitly; otherwise BeautifulSoup emits a
# "No parser was explicitly specified" warning
bsObj = BeautifulSoup(html.read(), 'html.parser')
print(bsObj.h1)

When the network connection is unreliable

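A call to urlopen can fail when the connection is unreliable: the page may not exist on the server, or the target server may be down. In either case the scraper raises an exception, so it needs thorough exception handling.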

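from urllib.error import HTTPError
from urllib.request import urlopen

try:
    html = urlopen('http://pythonscraping.com/pages/page1.html')
except HTTPError as e:
    print(e)
else:
    # The program continues here. Note: if the except block above already
    # returns or breaks, the else clause is unnecessary, because the code
    # after it will not run when an error occurs anyway.
    pass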

In other cases, a tag we expect may not exist:

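from urllib.error import HTTPError, URLError
from urllib.request import urlopen
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        html = urlopen(url)
    except (HTTPError, URLError) as e:
        # The URL could not be opened
        return None
    try:
        bsObj = BeautifulSoup(html.read(), 'html.parser')
        title = bsObj.body.h1
    except AttributeError as e:
        # bsObj.body is None, so accessing .h1 on it failed
        return None
    return title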

The example above handles both failures: a URL that cannot be opened and a tag that cannot be retrieved.
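
HTTPError is a subclass of URLError, so when the two need different handling, HTTPError has to be caught first. The sketch below illustrates this under some assumptions: fetch, its opener parameter, and the two fake_* functions are hypothetical names introduced only so the failure paths can be exercised without a network connection.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url, opener=urlopen):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with opener(url) as response:
            return response.read()
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500
        print('Server returned an error status:', e.code)
    except URLError as e:
        # The server could not be reached at all (DNS failure, refused connection)
        print('Could not reach the server:', e.reason)
    return None


def fake_not_found(url):
    # Hypothetical stand-in for urlopen that simulates a missing page
    raise HTTPError(url, 404, 'Not Found', None, None)


def fake_unreachable(url):
    # Hypothetical stand-in for urlopen that simulates a server that is down
    raise URLError('connection refused')


result_404 = fetch('http://example.invalid/missing', opener=fake_not_found)
result_down = fetch('http://example.invalid/', opener=fake_unreachable)
```

In normal use the opener argument is simply left out, so fetch(url) calls urlopen directly.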

Using BeautifulSoup

# Find all span tags whose class is green
nameList = bsObj.findAll('span', {'class': 'green'})
# Find the first span tag whose class is green
name = bsObj.find('span', {'class': 'green'})
# Get the text inside a tag; get_text() strips all markup and returns only the text
text = node.get_text()
# Find several kinds of tags at once
nodeList = bsObj.findAll(['h1', 'h2', 'h3'])
# Get all img tags that are descendants of the first div
imgList = bsObj.div.findAll('img')
# Get all direct children
childrenList = bsObj.find('table', {'id': 'giftList'}).children
# Get all siblings that come after the first row
siblingsList = bsObj.find('table').tr.next_siblings
# Get all siblings that come before the table
previousList = bsObj.find('table').previous_siblings
# Get the parent of an element
parent = bsObj.find('table').parent
# Get a single attribute
attr = myTag.attrs['src']
# Use a lambda expression: find the first tag that has exactly two attributes
bsObj.find(lambda tag: len(tag.attrs) == 2)
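
To see what these calls return, it helps to run them against a small inline document. The sketch below assumes a made-up SAMPLE_HTML fragment (not taken from any real page) and uses find_all, the newer spelling of findAll.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical sample markup, used only for illustration
SAMPLE_HTML = """
<html><body>
<h1>Gifts</h1>
<div><span class="green">Anna</span> met <span class="green">Pierre</span>
<span class="red">quietly</span><img src="../img/gifts/img1.jpg" alt="one"></div>
<table id="giftList">
<tr><th>Item</th></tr>
<tr><td>Vegetable basket</td></tr>
<tr><td>Russian dolls</td></tr>
</table>
</body></html>
"""

bsObj = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Text of every span whose class is green
green_names = [span.get_text() for span in bsObj.find_all('span', {'class': 'green'})]
# Names of all heading tags found
headings = [tag.name for tag in bsObj.find_all(['h1', 'h2', 'h3'])]
# src attribute of every img inside the first div
img_sources = [img.attrs['src'] for img in bsObj.div.find_all('img')]

table = bsObj.find('table', {'id': 'giftList'})
# .children and .next_siblings also yield whitespace text nodes, so keep only tags
child_rows = [child for child in table.children if isinstance(child, Tag)]
later_rows = [sib for sib in table.tr.next_siblings if isinstance(sib, Tag)]

# Tags that carry exactly two attributes (here only the img: src and alt)
two_attr_tags = bsObj.find_all(lambda tag: len(tag.attrs) == 2)

print(green_names, headings, img_sources)
print(len(child_rows), len(later_rows), [tag.name for tag in two_attr_tags])
```

The isinstance filter is a design choice for this sketch: the navigation properties return every node, including the newlines between rows, which is easy to overlook when counting results.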

Using regular expressions with BeautifulSoup

import re
images = bsObj.findAll("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
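
In a regular expression an unescaped dot matches any character, so whether the dots in such a pattern are escaped changes which src values are accepted. The following sketch compares the two spellings using only the re module; the sources list is made up for illustration.

```python
import re

# Unescaped dots: each "." matches any single character
loose = re.compile("../img/gifts/img.*.jpg")
# Escaped dots in a raw string: each "\." matches a literal dot only
strict = re.compile(r"\.\./img/gifts/img.*\.jpg")

# Hypothetical src values
sources = [
    '../img/gifts/img1.jpg',
    '../img/gifts/logo.jpg',
    'xx/img/gifts/img2xjpg',
    '../img/gifts/img3.png',
]

loose_matches = [src for src in sources if loose.search(src)]
strict_matches = [src for src in sources if strict.search(src)]

print(loose_matches)
print(strict_matches)
```

The loose pattern also accepts 'xx/img/gifts/img2xjpg', which is not a path to a .jpg file at all; the strict pattern accepts only the first entry.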

Downloading images with urllib

The main tool for this is urlretrieve, which saves the content of a URL to a local file.

from urllib.request import urlretrieve
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com")
bsObj = BeautifulSoup(html, "html.parser")
imgUrl = bsObj.find("a", {"id": "logo"}).find("img")["src"]
urlretrieve(imgUrl, "logo.png")
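
A src attribute is often a relative path, while urlretrieve needs a full URL. One possible way to handle this, sketched here with hypothetical sample values, is to resolve the src against the address of the page with urljoin before downloading.

```python
from urllib.parse import urljoin

# Hypothetical page address and src values, for illustration only
page_url = 'http://www.pythonscraping.com/pages/page3.html'

# A path relative to the page's directory
relative = urljoin(page_url, '../img/gifts/img1.jpg')
# A path relative to the site root
rooted = urljoin(page_url, '/img/logo.jpg')
# An absolute URL is returned unchanged
absolute = urljoin(page_url, 'http://example.com/a.png')

print(relative)
print(rooted)
print(absolute)
```

The resolved value can then be passed to urlretrieve in place of the raw src.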

Logging in with the requests library

import requests
params = {'firstname': 'Ryan', 'lastname': 'Mitchell'}
r = requests.post('http://pythonscraping.com/files/processing.php', data=params)
print(r.text)
# Get the authorization cookies set by the response, as a dict
print(r.cookies.get_dict())
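
To see what data=params actually sends, the request can be built and inspected without being sent. This is a minimal sketch that reuses the same form fields and URL as above and makes no network call.

```python
import requests

params = {'firstname': 'Ryan', 'lastname': 'Mitchell'}

# Build the request without sending it, so the encoded form can be inspected
request = requests.Request(
    'POST', 'http://pythonscraping.com/files/processing.php', data=params)
prepared = request.prepare()

print(prepared.method)
print(prepared.headers['Content-Type'])
print(prepared.body)
```

A dict passed as data is form-encoded, the same way a browser submits an HTML form.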