Beautiful Soup

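Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the parsed document.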

import requests
from bs4 import BeautifulSoup
link="http://category.tudou.com/category/c_96_r_2019_p_1.html"
headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0'}
req=requests.get(link,headers=headers)

# Build a BeautifulSoup object and print it as a tree with standard indentation:
soup=BeautifulSoup(req.text,"lxml")
print(soup.prettify())

# Find every <link> tag in the document and print its href:
for lk in soup.find_all('link'):
    print(lk.get('href'))
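
The same search can be tried without a network request. The sketch below runs find_all on a small inline document; the HTML string is made up for illustration and is not taken from the page above. It also shows that lk.get('href') returns None for a tag that lacks the attribute, and that passing href=True limits the search to tags that have one.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this example
html = """
<head>
<link href="//static.example.com" rel="dns-prefetch"/>
<link href="/favicon.ico" rel="shortcut icon"/>
<link rel="preconnect"/>
</head>
"""

soup = BeautifulSoup(html, "html.parser")

# .get() returns None when the attribute is missing; lk["href"] would raise KeyError
all_hrefs = [lk.get("href") for lk in soup.find_all("link")]
print(all_hrefs)

# href=True keeps only the tags that actually carry an href attribute
with_href = [lk["href"] for lk in soup.find_all("link", href=True)]
print(with_href)
```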

Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers, one of which is lxml.

The main parsers, with their advantages and disadvantages, are listed below:

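Parser: Python standard library
Usage: BeautifulSoup(markup, "html.parser")
Advantages:
  • Built into Python
  • Decent speed
  • Tolerant of malformed documents
Disadvantages:
  • Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2

Parser: lxml HTML parser
Usage: BeautifulSoup(markup, "lxml")
Advantages:
  • Fast
  • Tolerant of malformed documents
Disadvantages:
  • Requires an external C library

Parser: lxml XML parser
Usage: BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml")
Advantages:
  • Fast
  • The only parser that supports XML
Disadvantages:
  • Requires an external C library

Parser: html5lib
Usage: BeautifulSoup(markup, "html5lib")
Advantages:
  • The most tolerant of malformed documents
  • Parses documents the way a web browser does
  • Produces valid HTML5
Disadvantages:
  • Slow
  • Requires installing the html5lib package (pure Python, no C extension needed)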
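
The differences in tolerance are easiest to see by feeding the same invalid markup to each parser. The following is a minimal sketch: the markup string is an illustrative choice, and lxml and html5lib are treated as optional, so a parser that is not installed is reported rather than raising an error.

```python
from bs4 import BeautifulSoup, FeatureNotFound

markup = "<a></p>"  # deliberately invalid: a closing </p> with no opening tag

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        # lxml and html5lib are third-party packages and may be missing
        results[parser] = None

for parser, output in results.items():
    print(parser, "->", output if output is not None else "(not installed)")
```

Each parser repairs the stray tag differently, so the choice of parser can change the tree that later searches run against.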

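Navigating the tree:

A tag may contain strings and other tags; these are the tag's children. Beautiful Soup provides many attributes for navigating and iterating over a tag's children.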

print(soup.head)

<head><meta content="IE=Edge" http-equiv="X-UA-Compatible"/><link href="//static.youku.com" rel="dns-prefetch"/><link href="//static.ykimg.com" rel="dns-prefetch"/><link href="//r1.ykimg.com" rel="dns-prefetch"/><link href="//r2.ykimg.com" rel="dns-prefetch"/><link href="//r3.ykimg.com" rel="dns-prefetch"/><link href="//r4.ykimg.com" rel="dns-prefetch"/><link href="//g1.ykimg.com" rel="dns-prefetch"/><link href="//g2.ykimg.com" rel="dns-prefetch"/><link href="//g3.ykimg.com" rel="dns-prefetch"/><link href="//g4.ykimg.com" rel="dns-prefetch"/><link href="//p.l.youku.com" rel="dns-prefetch"/><link href="//urchin.lstat.youku.com" rel="dns-prefetch"/><link href="//html.atm.youku.com" rel="dns-prefetch"/><meta content="text/html; charset=utf-8" http-equiv="content-type"/><meta content="zh-cn" http-equiv="content-language"/><title>剧综影漫_土豆视频</title><meta content="视频,视频分享,视频搜索,视频播放,土豆视频" name="keywords"/><meta content="土豆-中国第一视频网站,提供视频播放,视频发布,视频搜索 - 视频服务平台,提供视频播放,视频发布,视频搜索,视频分享 - 土豆视频" name="description"/><meta content="a2h28" name="data-spm"/><link href="/favicon.ico" rel="shortcut icon"/><link href="//static.youku.com/yk/lib/css/tudou.8f50c0ed37.css" rel="stylesheet"/><link href="//static.youku.com/yk/newtudou/css/pc/category/category.340b6db21c.css" rel="stylesheet"/><script>var Local={"domain":{"default":"www.youku.com","test":"test.youku.com","subscribe":"ding.youku.com","uc":"i.youku.com","video":"v.youku.com","rz":"rz.youku.com","userlive":"userlive.youku.com","esign":"hetong.youku.com","listpage":"list.youku.com","xinterest":"x.youku.com","ypartner":"yp.youku.com","interact":"hudong.pl.youku.com","creation":"mp.tudou.com","uctg":"uctg.youku.com","playlists":"playlists.youku.com","static":"static.youku.com","passport":"account.youku.com","static_ext":"static.ykimg.com","static_ext_js":"js.ykimg.com","static_ext_css":"css.ykimg.com"},"service":{"push":"push.youku.com","interact":"hudong.pl.youku.com"},"debug":false};</script><script>var require = {"baseUrl": 
"//static.youku.com/newtudou/js/"};</script><script>if(require){require.paths={"main.category": "//static.youku.com/yk/newtudou/js/pc/category/main.category.f22a91da07"};}</script><script data-main="main.category" src="//static.youku.com/yk/lib/js/base.tudou.464a1349ea.js"></script></head>

print(soup.title)
<title>剧综影漫_土豆视频</title>
print(soup.title.text)
剧综影漫_土豆视频
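
Beyond accessing a tag by name, the children of a tag can be inspected directly. The sketch below uses a small invented document (not the Tudou page) to show .contents, .children, .descendants, .string, and get_text().

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical sample markup, invented for this example
html = "<html><head><title>Demo page</title></head><body><p>Hello <b>world</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

p = soup.body.p

# .contents is a list of direct children: a string and a <b> tag
print(p.contents)

# .children iterates over the same direct children; keep only the tags
child_tags = [c.name for c in p.children if isinstance(c, Tag)]
print(child_tags)

# .descendants also walks into nested tags: 'Hello ', <b>, 'world'
descendant_count = len(list(p.descendants))
print(descendant_count)

# .string works when a tag has exactly one string child
print(soup.title.string)

# <p> has two children, so .string is None; get_text() joins all text instead
print(p.string)
print(p.get_text())
```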

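Accessing a tag by name as an attribute is a handy shortcut, and it can be chained repeatedly on tags in the tree. The code below gets the first <b> tag inside <body>: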
print(soup.body.b)
<b class="line-after"></b>
print(soup.body.b['class'])
['line-after']
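
One caveat with chained access is that a tag name that does not exist yields None, so a longer chain can fail with AttributeError. The sketch below, on an invented document, shows that soup.body.b is the first <b> anywhere under <body>, that it matches a chain of find() calls, and one way to guard against a missing tag.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this example
html = '<html><body><div><b class="line-after first">x</b></div><b>y</b></body></html>'
soup = BeautifulSoup(html, "html.parser")

# Attribute access returns the first matching descendant, even when nested
first_b = soup.body.b
print(first_b)

# class is a multi-valued attribute, so it comes back as a list
print(first_b["class"])

# The shortcut is equivalent to chained find() calls
same_b = soup.find("body").find("b")
print(first_b is same_b)

# A missing tag gives None; check before going deeper
missing = soup.body.table
cell = missing.td if missing is not None else None
print(missing, cell)
```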

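Searching for tags by CSS class is very useful, but class, the name of the attribute, is a reserved word in Python, so using class as a keyword argument is a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class with the class_ keyword argument. For example, given HTML like the following: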
<div class="hd">
<a class="" href="https://movie.douban.com/subject/1291546/">
<span class="title">霸王别姬</span>
<span class="other"> / 再见,我的妾  /  Farewell My Concubine</span>
</a>
<span class="playable">[可播放]</span>
</div>
<div class="hd">
<a class="" href="https://movie.douban.com/subject/1295644/">
<span class="title">这个杀手不太冷</span>
<span class="title"> / Léon</span>
<span class="other"> / 杀手莱昂  /  终极追杀令(台)</span>
</a>
<span class="playable">[可播放]</span>
</div>
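
The class_ search can be checked offline against this exact snippet. The sketch below parses the HTML above from a string with the built-in html.parser (an assumption made so that no extra package is needed), then compares class_ with the equivalent attrs dictionary form. It also shows that div.a.span only reaches the first <span>, while a second find_all collects every title of a film.

```python
from bs4 import BeautifulSoup

html = """
<div class="hd">
<a class="" href="https://movie.douban.com/subject/1291546/">
<span class="title">霸王别姬</span>
<span class="other"> / 再见,我的妾  /  Farewell My Concubine</span>
</a>
<span class="playable">[可播放]</span>
</div>
<div class="hd">
<a class="" href="https://movie.douban.com/subject/1295644/">
<span class="title">这个杀手不太冷</span>
<span class="title"> / Léon</span>
<span class="other"> / 杀手莱昂  /  终极追杀令(台)</span>
</a>
<span class="playable">[可播放]</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ stands in for the reserved word "class"
divs = soup.find_all("div", class_="hd")
first_titles = [div.a.span.text for div in divs]
print(first_titles)

# The attrs dictionary form matches the same tags
same_divs = soup.find_all("div", attrs={"class": "hd"})
print(len(same_divs))

# A film can carry several title spans; collect all of them per film
all_titles = [
    [s.get_text(strip=True) for s in div.find_all("span", class_="title")]
    for div in divs
]
print(all_titles)
```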


import requests
from bs4 import BeautifulSoup
link="https://movie.douban.com/top250?start=1"
headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0'}
req=requests.get(link,headers=headers)
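# Name the parser explicitly; omitting it triggers a warning and the parser picked can differ between environments
soup=BeautifulSoup(req.text,"lxml")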
div_list=soup.find_all("div",class_="hd")
for ls in div_list:
    print(ls.a.span.text)

霸王别姬
这个杀手不太冷
阿甘正传
美丽人生
泰坦尼克号
千与千寻
辛德勒的名单
盗梦空间
忠犬八公的故事
机器人总动员
三傻大闹宝莱坞
.......
 
import requests
from bs4 import BeautifulSoup
link="http://category.tudou.com/category/c_96_r_2019_p_1.html"
headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0'}
req=requests.get(link,headers=headers)
soup=BeautifulSoup(req.text,"lxml")
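# print(soup.p)
link_list=soup.find_all('a',class_='v-meta__title__link')
for ls in link_list:
    print(ls.text)            # link text (the video title)
    print(ls.attrs['href'])   # protocol-relative URL of the video page
    print(ls.attrs['title'])  # value of the title attribute
    print(ls.attrs)           # all attributes as a dict
    print(ls)                 # the whole <a> tag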
雪暴
//video.tudou.com/v/XNDIyNjAzNzg0OA==.html
雪暴
{'href': '//video.tudou.com/v/XNDIyNjAzNzg0OA==.html', 'target': 'video', 'title': '雪暴', 'class': ['v-meta__title__link'], 'data-spm': ''}
<a class="v-meta__title__link" data-spm="" href="//video.tudou.com/v/XNDIyNjAzNzg0OA==.html" target="video" title="雪暴">雪暴</a>
流浪地球
//video.tudou.com/v/XNDE0ODQ5NzczNg==.html
流浪地球
{'href': '//video.tudou.com/v/XNDE0ODQ5NzczNg==.html', 'target': 'video', 'title': '流浪地球', 'class': ['v-meta__title__link'], 'data-spm': ''}
<a class="v-meta__title__link" data-spm="" href="//video.tudou.com/v/XNDE0ODQ5NzczNg==.html" target="video" title="流浪地球">流浪地球</a>
..........................
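
The href values printed above are protocol-relative (they start with //), so they cannot be requested as-is. One possible way to turn them into absolute URLs is urljoin from the standard library, resolved against the URL of the page they came from. This sketch copies two hrefs from the output above instead of fetching the page again.

```python
from urllib.parse import urljoin

# URL of the page the links were scraped from
page_url = "http://category.tudou.com/category/c_96_r_2019_p_1.html"

# href values copied from the output above
hrefs = [
    "//video.tudou.com/v/XNDIyNjAzNzg0OA==.html",
    "//video.tudou.com/v/XNDE0ODQ5NzczNg==.html",
]

# urljoin fills in the scheme of page_url for protocol-relative links
absolute = [urljoin(page_url, h) for h in hrefs]
for url in absolute:
    print(url)
```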

Original source: https://www.cnblogs.com/playforever/p/11016189.html