Attack of the Crawler 003: Scraping the Maoyan Top 100 Movies with BeautifulSoup

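BeautifulSoup

BeautifulSoup is a Python library for parsing XML and HTML. It uses the structure and attributes of a web page to parse it, so extracting data from a page takes only a few simple statements.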

Choosing a parser

BeautifulSoup relies on an underlying parser to do the actual parsing:

1. Python standard library: BeautifulSoup(text, 'html.parser')
2. lxml HTML parser: BeautifulSoup(text, 'lxml')
3. lxml XML parser: BeautifulSoup(text, 'xml')
4. html5lib: BeautifulSoup(text, 'html5lib')
The lxml HTML parser is recommended.
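
Since lxml and html5lib are third-party packages, they may not be installed everywhere. The sketch below shows one way to prefer lxml and fall back to the built-in parser. The helper name make_soup is made up for this illustration and is not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(text):
    """Hypothetical helper: prefer lxml, fall back to the standard library parser."""
    try:
        return BeautifulSoup(text, "lxml")
    except FeatureNotFound:
        # Raised by BeautifulSoup when the requested parser is not installed.
        return BeautifulSoup(text, "html.parser")


soup = make_soup("<p>Hello <b>world</b></p>")
print(soup.p.get_text())
```

Both parsers find the same p and b tags in this small input, so the rest of the code does not need to know which one was used.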

Basic usage

soup = BeautifulSoup(text, 'lxml')
soup.prettify() returns the parsed document as a string with standard indentation
soup.p.string the string attribute returns the text content of the node
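
A minimal sketch of these three calls on a small, made-up HTML string. It uses 'html.parser' only so that it runs without installing lxml; the calls are the same with 'lxml'.

```python
from bs4 import BeautifulSoup

# Made-up sample input for illustration.
text = "<html><body><p>first paragraph</p><p>second paragraph</p></body></html>"
soup = BeautifulSoup(text, "html.parser")

# prettify() returns a string with one tag per line and nested indentation.
pretty = soup.prettify()
print(pretty)

# soup.p is the first <p>; .string is its text content.
first_text = soup.p.string
print(first_text)
```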

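Node selectors

Selecting an element: soup.p returns only the first p element if there are several
Extracting attributes: soup.p.attrs returns a dictionary
soup.p['class'] returns the attribute's value, which may be a string or a list (multi-valued attributes such as class come back as a list)
Nested selection: soup.p.a.string
Related selection: child nodes via children, all descendant nodes via descendants
Related selection: parent node via parent, ancestor nodes via parents
Sibling nodes: next_sibling, previous_sibling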
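
The selectors above can be tried out on a small document. The markup below is made up for this sketch, and it is written without whitespace between tags so that next_sibling lands on a tag rather than on a text node.

```python
from bs4 import BeautifulSoup

# Made-up sample markup, deliberately without whitespace between tags.
text = ('<div id="box"><p class="a b" id="first"><a href="/x">link</a></p>'
        '<p id="second">two</p></div>')
soup = BeautifulSoup(text, "html.parser")

p = soup.p                      # first <p> only
attrs = p.attrs                 # dictionary of all attributes
classes = p["class"]            # class is multi-valued, so this is a list
p_id = p["id"]                  # id is single-valued, so this is a string
link_text = soup.p.a.string     # nested selection

# children yields direct children; descendants walks the whole subtree.
child_names = [c.name for c in soup.div.children]
descendant_tags = [d.name for d in soup.div.descendants if d.name]

parent_name = p.parent.name
ancestor_names = [a.name for a in p.parents]

sibling_id = p.next_sibling["id"]
back_again = p.next_sibling.previous_sibling

print(classes, p_id, link_text, child_names, descendant_tags)
```

In real pages the tags are usually separated by newlines, so next_sibling often returns a whitespace string first; find_next_sibling() skips over those text nodes.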

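Method selectors

find_all() finds every tag that matches the conditions and returns them in a list
find() returns the first tag that matches the conditions
CSS selectors: BeautifulSoup also provides CSS selectors. Readers who know the web well and prefer to select tags with CSS selectors can use the pyquery parsing library instead; it is not covered here.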
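
A short sketch of find_all() and find() on made-up markup, followed by one select() call as a taste of the CSS selector support mentioned above. The sample list and class names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Made-up sample markup.
text = """
<ul>
  <li class="item">one</li>
  <li class="item active">two</li>
  <li>three</li>
</ul>
"""
soup = BeautifulSoup(text, "html.parser")

all_li = soup.find_all("li")                 # every <li>, as a list
items = soup.find_all("li", class_="item")   # <li> tags whose class includes "item"
first = soup.find("li")                      # first match only
missing = soup.find("table")                 # None when nothing matches

# CSS selector equivalent for the "active" item.
active = soup.select("li.active")

print(len(all_li), len(items), first.string, missing, active[0].string)
```

Because find() returns None when nothing matches, it is worth checking the result before accessing attributes on it.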

Scraping the Maoyan Top 100 movies with BeautifulSoup

from bs4 import BeautifulSoup as bs
import requests

def get_movie_info(ret):
    soup = bs(ret.text, 'lxml')  # parse the page HTML with BeautifulSoup
    all_dd = soup.find_all('dd')  # each dd tag holds the information for one movie
    for dd in all_dd:  # iterate over the dd tags actually found instead of assuming 10
        num = dd.i.string  # rank of the current movie
        movie_infos = [num]
        for p in dd.find_all('p'):
            if p.string:   # collect the title, cast, release date, etc. in turn
                movie_infos.append(p.string.strip())

        movie = f'Rank: {movie_infos[0]}, Title: {movie_infos[1]}, {movie_infos[2]}, {movie_infos[3]}'
        print(movie)

url = 'https://maoyan.com/board/4'

for offset in range(10):
    data = {
        'offset': offset * 10  # each page lists 10 movies
    }
    ret = requests.get(url, params=data)  # fetch the page
    get_movie_info(ret)  # parse the page and print its movies
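
The parsing step can also be checked without touching the network. The sketch below is a variant that returns the fields instead of printing them, run against hypothetical markup that only mimics the dd layout the scraper expects. The function name parse_movies and the sample HTML are assumptions for this illustration, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <dd> per movie, rank in <i>, other fields in <p> tags.
SAMPLE = """
<dl>
  <dd>
    <i>1</i>
    <p><a>Movie A</a></p>
    <p>Starring: X, Y</p>
    <p>Release date: 1993-01-01</p>
  </dd>
  <dd>
    <i>2</i>
    <p><a>Movie B</a></p>
    <p>Starring: Z</p>
    <p>Release date: 1994-09-10</p>
  </dd>
</dl>
"""


def parse_movies(html):
    """Hypothetical helper: return one list of fields per <dd> tag."""
    soup = BeautifulSoup(html, "html.parser")
    movies = []
    for dd in soup.find_all("dd"):
        fields = [dd.i.string.strip()]
        # p.string is None for a <p> with several children, so skip those.
        fields += [p.string.strip() for p in dd.find_all("p") if p.string]
        movies.append(fields)
    return movies


movies = parse_movies(SAMPLE)
for rank, title, cast, release in movies:
    print(f"Rank: {rank}, Title: {title}, {cast}, {release}")
```

Returning data rather than printing it makes it easier to save the results to a file later or to verify the parser against a saved copy of a page.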