Python web scraping with BeautifulSoup (bs4): environment setup

Environment setup:

How to install bs4: https://blog.csdn.net/Bibabu135766/article/details/81662981
How to install requests: https://blog.csdn.net/douguangyao/article/details/77922973

requests downloads on PyPI: https://pypi.org/project/requests/#files

Uninstall pip: python -m pip uninstall pip

Install pip: https://pypi.python.org/pypi/pip#downloads
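If those links are unavailable, both packages (plus the lxml parser used in the examples below) can normally be installed straight from PyPI, e.g. with pip install beautifulsoup4 requests lxml.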

  

Introduction to using bs4: like lxml, Beautiful Soup is an HTML/XML parser; its main job is likewise to parse and extract data from HTML/XML documents.

https://www.cnblogs.com/amou/p/9184614.html

https://beautifulsoup.readthedocs.io/zh_CN/latest/
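Note that Beautiful Soup delegates the actual parsing to a backend parser. The examples below pass 'lxml'; if lxml is not installed, the standard library's 'html.parser' should work as a drop-in replacement, e.g. soup = BeautifulSoup(html, 'html.parser').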

#!/usr/bin/env python
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
# Create a Beautiful Soup object
soup = BeautifulSoup(html, 'lxml')
print(soup, "--------------------------------")
# Pretty-print the contents of the soup object
# print(soup.prettify())

# The four object types: Tag, NavigableString, BeautifulSoup, Comment
# 1. A Tag is, simply put, one of the tags in the HTML document
# print(soup.html)
print(soup.p, '---- contents of the p tag')
print(soup.p.attrs, '---- attributes of the p tag')
print(soup.p['class'], soup.p['name'])
print(soup.head)
print(soup.name, soup.head.name, '---- tag names')
# 2. NavigableString: how do we get the text inside a tag? Simple: use .string
print(soup.p.string, '---- text inside the p tag')
print(type(soup.p.string))
# 3. The BeautifulSoup object represents the content of the whole document. Most of the time it can be treated as a special Tag, so we can read its type, name and attributes
print(soup.name)
print(type(soup.name))
print(soup.attrs, '---- the document itself has no attributes')
# 4. A Comment object is a special kind of NavigableString whose output does not include the comment markers
print(soup.a)
print(soup.a.string)
print(type(soup.a.string), '---- Comment is a special kind of NavigableString')

Running the script prints each of these objects and their types.
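Because a Comment prints exactly like ordinary text, it is easy to mistake one for real content. A minimal sketch of checking the node type before using it, reusing the soup object from the script above (the variable name first_link_text is just illustrative):

from bs4.element import Comment, NavigableString

first_link_text = soup.a.string  # for this sample HTML this is the "<!-- Elsie -->" comment
if isinstance(first_link_text, Comment):  # check Comment first: it subclasses NavigableString
    print('the first <a> tag holds a comment:', first_link_text)
elif isinstance(first_link_text, NavigableString):
    print('the first <a> tag holds plain text:', first_link_text)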

Searching and regular expressions with BeautifulSoup4:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup
import re

# html is the same sample document string used in the example above
soup = BeautifulSoup(html, 'lxml')
# print(soup, '------------ the html document --------------')

print(soup.find_all('b'), '---- find the b tags')
for tag in soup.find_all(re.compile('^b')):
    print(tag.name, '---- use a regex to find all tags whose names start with b')
print(soup.find_all(id='link1'))
print(soup.find_all(text='Tillie'), '---- the text argument searches the strings in the document')
print(soup.find_all(text=["Tillie", 'Lacie']))
print(soup.find_all(text=re.compile('Dormouse')))

Running this prints the result of each find_all call.
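Besides tag names, ids and text, find_all can also filter on other attributes, and CSS selectors are available through select(). A small sketch against the same soup object (re is already imported in the script above, and the attribute values used here come from the sample HTML):

print(soup.find_all('a', class_='sister'))            # filter by CSS class (class_ avoids the keyword clash)
print(soup.find_all('a', href=re.compile('lacie')))   # match an attribute value with a regex
print(soup.find_all('a', limit=2))                    # stop after the first two matches
print(soup.select('p.story > a#link3'))               # CSS selector syntax via select()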

 

-----------------------------------------------------------------------------------------------------------------------------------------------------------------

Introduction to using requests:

http://docs.python-requests.org/zh_CN/latest/user/quickstart.html

https://cuiqingcai.com/2556.html
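requests is typically the piece that downloads the page which Beautiful Soup then parses. A minimal sketch of combining the two (the URL is only a placeholder):

import requests
from bs4 import BeautifulSoup

response = requests.get('https://example.com')   # placeholder URL
response.raise_for_status()                       # raise an exception on 4xx/5xx responses
page = BeautifulSoup(response.text, 'lxml')       # parse the downloaded HTML
print(response.status_code, page.title.string)    # HTTP status code and the page <title>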

Original article: https://www.cnblogs.com/longesang/p/10490545.html