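9.3.4 BeautifulSoup4

BeautifulSoup is an excellent Python extension library for extracting the data we are interested in from HTML or XML files, and it lets you specify which parser to use.
Install the module directly with pip install beautifulsoup4. After installation, import it with from bs4 import BeautifulSoup.
The following briefly demonstrates the features of BeautifulSoup4; for more detailed and complete material, see https://www.crummy.com/software/BeautifulSoup/bs4/doc/.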
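Because the parser is chosen by the second argument, the same input can produce different trees. The following is a minimal sketch, assuming only that bs4 is installed; the lxml branch is optional and is skipped if that separate package is missing. The variable names fragment and builtin are illustrative only.

```python
from bs4 import BeautifulSoup, FeatureNotFound

fragment = 'hello world'

# html.parser ships with Python; it leaves a bare text fragment as it is
builtin = BeautifulSoup(fragment, 'html.parser')
print(repr(str(builtin)))

# lxml is a separate install (pip install lxml); it wraps the fragment
# in <html>, <body> and <p> tags to form a complete document
try:
    print(BeautifulSoup(fragment, 'lxml'))
except FeatureNotFound:
    print('lxml is not installed; only html.parser was tried')
```

Which parser to prefer depends on the input: html.parser needs no extra dependency, while lxml is generally faster and repairs incomplete markup more aggressively.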
>>> from bs4 import BeautifulSoup
>>>
>>> # Missing tags are added and completed automatically
>>> BeautifulSoup('hello world','lxml')
<html><body><p>hello world</p></body></html>
>>>
>>> # Define a sample HTML document
>>> html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
>>>
>>> # Parse the HTML document and pretty-print it
>>> soup = BeautifulSoup(html_doc,'html.parser')
>>> print(soup.prettify())
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
>>>
>>> # Access a specific tag
>>> soup.title
<title>The Dormouse's story</title>
>>>
>>> # Tag name
>>> soup.title.name
'title'
>>>
>>> # Tag text
>>> soup.title.text
"The Dormouse's story"
>>>
>>> # Parent tag of the title tag
>>> soup.title.parent
<head><title>The Dormouse's story</title></head>
>>>
>>> soup.head
<head><title>The Dormouse's story</title></head>
>>>
>>> soup.b
<b>The Dormouse's story</b>
>>>
>>> soup.b.name
'b'
>>> soup.b.text
"The Dormouse's story"
>>>
>>> # The BeautifulSoup object itself can be treated as a tag object
>>> soup.name
'[document]'
>>>
>>> soup.body
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
>>>
>>> soup.p
<p class="title"><b>The Dormouse's story</b></p>
>>>
>>> # Tag attributes
>>> soup.p['class']
['title']
>>>
>>> soup.p.get('class')         # Another way to read a tag attribute
['title']
>>>
>>> soup.p.text
"The Dormouse's story"
>>>
>>> soup.p.contents
[<b>The Dormouse's story</b>]
>>>
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>>
>>> # All attributes of the a tag
>>> soup.a.attrs
{'class': ['sister'], 'id': 'link1', 'href': 'http://example.com/elsie'}
>>>
>>> # Find all a tags
>>> soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>>
>>> # Find <a> and <b> tags at the same time
>>> soup.find_all(['a','b'])
[<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>>
>>> import re
>>> # Find tags whose href contains a given keyword
>>> soup.find_all(href=re.compile("elsie"))
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
>>>
>>> soup.find(id='link3')
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
>>>
>>> soup.find_all('a',id='link3')
[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>>
>>> for link in soup.find_all('a'):
    print(link.text,':',link.get('href'))

Elsie : http://example.com/elsie
Lacie : http://example.com/lacie
Tillie : http://example.com/tillie
>>>
>>> print(soup.get_text())           # Return all text

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters;and their names were
Elsie,
Lacieand
Tillie;
and they lived at the bottom of a well.
...

>>>
>>> # Modify a tag attribute
>>> soup.a['id']='test_link1'
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="test_link1">Elsie</a>
>>>
>>> # Modify the tag text
>>> soup.a.string.replace_with('test_Elsie')
'Elsie'
>>>
>>> soup.a.string
'test_Elsie'
>>>
>>> print(soup.prettify())
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/elsie" id="test_link1">
    test_Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
>>>
>>>
>>> # Iterate over child nodes
>>> for child in soup.body.children:
    print(child)



<p class="title"><b>The Dormouse's story</b></p>


<p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="test_link1">test_Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>


<p class="story">...</p>


>>>
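The blank lines in the output of the last loop appear because .children yields the whitespace text nodes between tags as well as the tags themselves. If only the tags are wanted, one option is to filter by node type. The following is a minimal sketch on a shortened version of the sample document; the names all_children and tag_children are illustrative only.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened sample document (an assumption for brevity, not the full text above)
html_doc = """<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">...</p>
</body>"""

soup = BeautifulSoup(html_doc, 'html.parser')

# .children yields text nodes (here: the newlines between tags) as well as Tag nodes
all_children = list(soup.body.children)

# Keep only the real tags by checking the node type
tag_children = [child for child in soup.body.children if isinstance(child, Tag)]

for child in tag_children:
    print(child.name, child.get('class'))
```

Calling soup.body.find_all(True, recursive=False) should give the same tags without an explicit type check, since find_all matches tags only.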