Python web scraping with Requests + bs4

1. Development environment

Beautiful Soup 4.4.0 documentation: http://beautifulsoup.readthedocs.io/zh_CN/latest/#id28

Requests : http://cn.python-requests.org/zh_CN/latest/

Python3

Requests handles the HTTP requests (GET, POST and so on). Beautiful Soup parses the tags of the returned HTML page and extracts information from them.

import requests
from bs4 import BeautifulSoup
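As a rough sketch of how the two libraries fit together: Requests downloads the page and Beautiful Soup parses the text it returns. The helper names parse_title and fetch_title, and the timeout value, are made up for this illustration and do not come from either library.

```python
import requests
from bs4 import BeautifulSoup


def parse_title(html_text):
    """Return the text of the <title> tag, or None if the page has none."""
    soup = BeautifulSoup(html_text, "html.parser")
    return soup.title.string if soup.title else None


def fetch_title(url, timeout=10):
    """Download a page with Requests and return its title (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return parse_title(response.text)
```

Calling fetch_title with any reachable URL should return that page's title string. Keeping the parsing in its own function means it can be tried on a local HTML string without touching the network.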

2. Methods

Parsing an HTML document (html_doc here) with BeautifulSoup gives you a BeautifulSoup object, which can be printed back out as a tree with standard indentation:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())
# <html>
#  <head>
#   
#   ............
#
#  </body>
# </html>
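The snippet above never shows what html_doc contains. The version below is a self-contained stand-in, assuming a shortened form of the "three sisters" sample from the Beautiful Soup documentation, which matches the outputs quoted in this post. The exact text of html_doc is an assumption.

```python
from bs4 import BeautifulSoup

# Stand-in for html_doc (an assumption: the original text does not show it).
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>.</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")
pretty = soup.prettify()  # one tag per line, indented by nesting depth
print(pretty)
```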

A few simple ways to navigate the parsed structure:

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# 'title'

soup.title.string
# "The Dormouse's story"

soup.title.parent.name
# 'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# ['title']

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

soup = BeautifulSoup(html_doc, 'html.parser')

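soup is the object BeautifulSoup builds from the parsed document (a tree, not a plain string). soup.title gives the title tag and soup.p gives the first p tag in the document. To get every matching tag rather than just the first, use the find_all function.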

find_all returns a list-like sequence, so you can loop over it and pull out what you need from each tag in turn.
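A minimal sketch of looping over a find_all result. The small HTML string and the variable names are invented for the example.

```python
from bs4 import BeautifulSoup

# Tiny made-up document with two links
html = '<p><a href="/a" id="x">A</a> <a href="/b" id="y">B</a></p>'
soup = BeautifulSoup(html, "html.parser")

links = []
for a in soup.find_all("a"):  # tags come back in document order
    links.append((a["id"], a["href"]))

print(links)
```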

get_text() returns the text inside a tag, and it works on any tag object you get from BeautifulSoup. Try print(soup.p.get_text()).
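For illustration, here is get_text() on a single tag and on a tag with nested children. The HTML strings are made up. The separator and strip arguments are optional parameters of get_text().

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title"><b>The Dormouse\'s story</b></p>', "html.parser")
title_text = soup.p.get_text()  # text of the <p>, including nested tags
print(title_text)

nested = BeautifulSoup("<div> <p>one</p> <p>two</p> </div>", "html.parser")
# Join the pieces with a space and strip surrounding whitespace from each piece
joined = nested.div.get_text(" ", strip=True)
print(joined)
```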

You can also read a tag's attributes. For example, to get the value of the href attribute of an a tag, use print(soup.a['href']). Other attributes such as class work the same way (soup.a['class']).
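A short sketch of attribute access on a single a tag. Note that class comes back as a list because it can hold several values, and that .get() is a safer choice when an attribute may be missing. The title attribute is only used here to show a missing one.

```python
from bs4 import BeautifulSoup

html = '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
a = BeautifulSoup(html, "html.parser").a

href = a["href"]        # indexing raises KeyError if the attribute is absent
classes = a["class"]    # multi-valued attribute, returned as a list
title = a.get("title")  # .get() returns None instead of raising

print(href, classes, title)
```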

Structural tags such as head can be reached the same way, through soup.head, as already mentioned above.