Using BeautifulSoup to do a secondary search within tags - printing the contents of a NetEase Cloud Music playlist as an example

Take a NetEase Cloud Music playlist as the example: https://music.163.com/#/playlist?id=3212113629

The pitfall is the URL you actually request: it has to be https://music.163.com/playlist?id=3212113629, not the https://music.163.com/#/playlist?id=3212113629 copied straight from the address bar, because everything after the # is a URL fragment and never reaches the server.
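If you prefer, the copied address can be normalized in code; a minimal sketch (the helper name is my own, not from the original post):

def normalize_playlist_url(copied_url):
    # Drop the "#/" fragment marker so the query string actually reaches the server
    return copied_url.replace("/#/", "/", 1)

print(normalize_playlist_url("https://music.163.com/#/playlist?id=3212113629"))
# -> https://music.163.com/playlist?id=3212113629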

Getting started

First, fetch the page HTML:

import requests
from bs4 import BeautifulSoup

url = "https://music.163.com/playlist?id=3212113629"    # 注意直接复制的地址有个#,这里要去掉
demo = requests.get(url).text
soup = BeautifulSoup(demo, "html.parser")

Printing soup shows that the content we want to extract sits inside ul tags:

for ss in soup.find_all('ul'):
    print(ss)

Running this, you will find that there are many ul tags.
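A quick way to confirm how many (this check is mine, not in the original post):

print(len(soup.find_all('ul')))    # number of ul tags on the page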

So we can do a secondary, filtered search:

for ss0 in soup.find_all('ul',{"class":"f-hide"}):
    print(ss0)
or equivalently (the keyword is spelled class_ with a trailing underscore because class is a reserved word in Python):

for ss0 in soup.find_all('ul', class_="f-hide"):
    print(ss0)

 

The complete code:

# write by xdd1997  xdd2026@qq.com
# 2020-08-07

import requests
from bs4 import BeautifulSoup

url = "https://music.163.com/playlist?id=3212113629"
demo = requests.get(url).text
soup = BeautifulSoup(demo, "html.parser")

for ss0 in soup.find_all('ul',{"class":"f-hide"}):
    for ii in ss0.find_all('a'):
        print(ii.string)
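If the titles should be kept rather than just printed, a small variation works (a sketch continuing from the soup object above; songs.txt is an arbitrary file name of my choosing):

titles = []
for ul in soup.find_all('ul', {"class": "f-hide"}):
    for a in ul.find_all('a'):
        if a.string:                 # skip anchors that have no plain string content
            titles.append(a.string)

with open("songs.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(titles))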


Update: added request headers (2020-08-20)

After running it a few more times the next day, it stopped working. NetEase Cloud Music apparently checks where requests come from as an anti-scraping measure, so request headers were added.

# -*- coding: utf-8 -*-
# write by xdd1997  xdd2026@qq.com
# 2020-08-20

import requests
from bs4 import BeautifulSoup
url = "https://music.163.com/playlist?id=5138652624"   # remember to drop the '#' from the copied address

try:
    kv = {'user-agent': 'Mozilla/5.0'}      # pretend to be a regular browser to pass the source check
    r = requests.get(url, headers=kv)
    r.raise_for_status()                    # raise an exception if the status code is not 200
    r.encoding = r.apparent_encoding
except:
    print("Failed to reach the site")

demo = r.text
soup = BeautifulSoup(demo, "html.parser")
# print(soup)

index = 0
for ss in soup.find_all('ul', {"class": "f-hide"}):    # find <ul class="f-hide"> ... </ul>
    for ii in ss.find_all('a'):
        # print(ii.string)
        index = index + 1
        print(str(index) + '  ' + 'Song request: ' + ii.string)

'''
for i in soup.ul.descendants:
    print(i.string)
print('------------------------------------------')
for i in soup.ul.children:
    print(i.string)
'''

 


Reposted for learning only; no commercial use.
Reposting of the original is welcome; please attach a link to the article.
Original article: https://www.cnblogs.com/xdd1997/p/13530770.html