Python: learning BeautifulSoup basics by scraping a minecraft.net article

To use BeautifulSoup you first need pip, which recent Python installers already include. Then install the package:

pip install beautifulsoup4

Then import it. Because we have to open a web page, we also need urllib.request.

from bs4 import BeautifulSoup
from urllib import request

Next, set the URL and fetch the page source.

>>> url="https://minecraft.net/zh-hans/article/taking-inventory-turtle-shell"
>>> page=request.urlopen(url)
>>> src=page.read()

src now holds the page source as bytes. Printing it directly is not recommended, because IDLE may freeze on that much output.
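The fetch step can also be wrapped in a small helper. This is only a sketch under a few assumptions: the function name fetch_html, the timeout value, and the User-Agent string are all made up for illustration and do not come from the original post. It closes the connection when done and decodes the bytes with the charset the server declares, falling back to UTF-8.

```python
from urllib import request


def fetch_html(url, timeout=10.0):
    """Download url and return the response body decoded to str."""
    # The User-Agent value is an arbitrary placeholder (an assumption);
    # some sites reject urllib's default one.
    req = request.Request(url, headers={"User-Agent": "Mozilla/5.0 (tutorial sketch)"})
    with request.urlopen(req, timeout=timeout) as resp:
        raw = resp.read()
        # Use the charset from the Content-Type header if there is one.
        charset = resp.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


# Usage (needs network access):
# src = fetch_html("https://minecraft.net/zh-hans/article/taking-inventory-turtle-shell")
# print(src[:200])  # peek at the first 200 characters instead of printing everything
```

BeautifulSoup accepts either bytes or str, so the result can be passed to it the same way as the bytes returned by page.read().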

Now BeautifulSoup can get to work.

>>> Soup=BeautifulSoup(src,"html.parser")

Soup now holds the parsed document as a tree of tags.

What if I want to extract the <div class="container"> tag? Use find:

>>> Soup.find("div", class_="container")
<div class="container">
<div class="row">
<div class="col-12">
<nav class="nav justify-content-end text-smaller">
<span id="app-profile"></span>
<a class="nav-link inverted" href="/zh-hans/help/?ref=gm">帮助</a>
</nav>
</div>
</div>
</div>
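One thing worth knowing about find is that it returns None when nothing matches, so calling a method on the result can fail. The sketch below uses a small made-up HTML string (not the real minecraft.net page) to show a simple guard.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup, not the real page.
html = (
    '<div class="container"><p>first</p></div>'
    '<div class="container"><p>second</p></div>'
)
soup = BeautifulSoup(html, "html.parser")

first = soup.find("div", class_="container")       # first match only
missing = soup.find("div", class_="no-such-class")  # no match: None

# Guard against None before touching the result.
first_text = first.get_text() if first is not None else ""
missing_text = missing.get_text() if missing is not None else ""
print(repr(first_text), repr(missing_text))
```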

Clearly this is only the first div tag whose class is container. To get every matching tag, use find_all (shown here with the dd tags):

>>> Soup.find_all("dd")
[<dd>Duncan Geere</dd>, <dd>2018年10月26日</dd>]

The result is a list of tags.
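The returned list can be used like any other Python list. As a rough illustration on made-up markup (the dl structure here is an assumption, only the two dd values come from the output above), this sketch pulls the text out of each tag, uses the limit argument to cap the number of results, and shows that an unmatched search gives an empty list rather than None.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup; only the dd contents mirror the real output.
html = "<dl><dt>Author</dt><dd>Duncan Geere</dd><dt>Date</dt><dd>2018年10月26日</dd></dl>"
soup = BeautifulSoup(html, "html.parser")

all_dd = soup.find_all("dd")
texts = [dd.get_text() for dd in all_dd]

only_first = soup.find_all("dd", limit=1)  # stop after one match
none_found = soup.find_all("table")        # nothing matches: empty list

print(texts, len(only_first), len(none_found))
```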

And what if I want to match several tag names or classes at once? Pass a list:

>>> paras=Soup.find_all("div",class_=["article-paragraph","article-paragraph--image"])
>>> for para in paras:
    para.find_all(["p","img"])
    

    
[<p>In 1877, French novelist Victor Hugo wrote: “No army can withstand the strength of an idea whose time has come.” Well actually he wrote it in French, but that’s a rough translation. Oh, and he was talking about Louis-Napoleon’s coup d’état of 1851. But you know what? It’s also true about turtles!</p>, <p>When our developers were putting together the Update Aquatic, they knew there was one idea whose time, indeed, had come. It was the brainchild of Reddit user billyK_.</p>, <p>In March 2015 he <a href="https://www.reddit.com/r/Minecraft/comments/2y38ka/mojangive_got_the_perfect_alternative_to_boats/">posted</a> a suggestion for an alternative to boats on the Minecraft subreddit. His solution was turtles.</p>, <p>“For the past 2.5 years, I've pushed gently for turtles to be added,” he <a href="https://www.reddit.com/r/Minecraft/comments/7veimo/turtle_shoutout_to_billyk_as_suspected_theres_a/">wrote</a> earlier this year. “With Update Aquatic happening with no mention of turtles, I pushed a tiny bit harder.” Sure enough, turtles were added soon afterwards. Not only did his idea make him legendary within the Minecraft Reddit community, but it also won him a <a href="https://www.reddit.com/r/Minecraft/comments/7veimo/turtle_shoutout_to_billyk_as_suspected_theres_a/">unique cape</a>. “Many people can suggest Minecraft features, but few become walking Reddit legends,” said Jens Bergensten, Mojang’s Chief Creative Officer no less!</p>]
[<img alt="" class="img-fluid article-image-carousel__image" src="https://community-content-assets.minecraft.net/upload/styles/small/s3/2764065b1322a8d621c597ec2971e9a3-recipe.jpg" title=""/>]
[<p>Turtle shells are a wearable item that let players breathe a little longer underwater. Wearing a turtle shell in a helmet slot, while out of water or in a column of bubbles, will give the player a “water breathing” status effect, which only starts counting down when the player submerges. I guess the extra air is stored in the top of the shell somewhere, but who knows how those turtles weave their bubbly magic?</p>, <p>Turtle shells also give the player two armour points, which is the same amount provided by iron, gold and chain helmets – and a little less than diamond. Aside from the free water breathing, the other big benefit of a turtle shell over a regular iron helmet is durability – a turtle shell will take almost twice as many hits before breaking.</p>, <p>Oh, and the final thing you can do with turtle shells is to use them as brewing ingredients. Mixing a turtle shell with an awkward potion in a brewing stand will get you a Potion of the Turtle Master, which gives you a Resistance III status effect, but also Slowness IV. Never fear, though – slow and steady wins the race. </p>]
[<img alt="" class="img-fluid article-image-carousel__image" src="https://community-content-assets.minecraft.net/upload/styles/small/s3/b4e1641c8ecbbfdf6a363385d2687c9e-realworld.jpg" title=""/>]
[<p>You’re probably wondering how you get one of these magical turtle shells, and your first impulse might be to kill some turtles. DON’T DO IT! Not only are many turtle species endangered, but turtles don’t actually drop turtle shells when they die. Instead, you need to piece together your own turtle shell from the bits of shell – called scutes – that are dropped when a baby turtle grows up into an adult turtle. You can also use those scutes to repair a bashed-up old helmet.</p>, <p>So where do you get baby turtles from? Well, there’s a chance you might find one on a beach, but it’s much easier to breed your own by feeding two turtles seagrass and letting nature take its course. Soon enough, you’ll be looking at a cluster of turtle eggs, which you’ll need to keep safe until they hatch. If they’re broken – by hitting them with a tool, standing on them, letting blocks fall onto them, or allowing hostile mobs to get nearby – then they’ll be destroyed without dropping anything. Keep those babies safe!</p>, <p>If you’re reading this and thinking that you have some great ideas that would make Minecraft better, then here's some good advice from billyK_</p>, <p>“If there's anything to learn from this, it's that if you have a really good idea that the community gets behind, it could one day be in the game,” he <a href="https://www.reddit.com/r/Minecraft/comments/7veimo/turtle_shoutout_to_billyk_as_suspected_theres_a/">wrote</a>. “Just honestly keep expectations in check and don't bug the devs on it. That's like a death sentence for your idea.”</p>, <p>He’s not wrong! But bonus takeaway here: don’t hurt turtles!</p>]
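A list passed to find_all means "any of these": a tag matches if its name, or one of its classes, equals any item in the list. The sketch below checks this on a tiny made-up document; the sidebar div and the file name a.jpg are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup reusing the class names from the article.
html = """
<div class="article-paragraph"><p>one</p></div>
<div class="article-paragraph--image"><img src="a.jpg"/></div>
<div class="sidebar"><p>skip me</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Matches divs carrying either class; the sidebar div is left out.
blocks = soup.find_all("div", class_=["article-paragraph", "article-paragraph--image"])

# Inside each block, match either tag name.
names = [[child.name for child in block.find_all(["p", "img"])] for block in blocks]
print(names)
```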

Combining these gives the body content, paragraphs and images, in document order:

>>> sentences=[]
>>> for para in paras:
    for sentence in para.find_all(["p","img"]):
        sentences.append(sentence)
>>> for s in sentences:
    print(s)
    

    
<p>In 1877, French novelist Victor Hugo wrote: “No army can withstand the strength of an idea whose time has come.” Well actually he wrote it in French, but that’s a rough translation. Oh, and he was talking about Louis-Napoleon’s coup d’état of 1851. But you know what? It’s also true about turtles!</p>
<p>When our developers were putting together the Update Aquatic, they knew there was one idea whose time, indeed, had come. It was the brainchild of Reddit user billyK_.</p>
<p>In March 2015 he <a href="https://www.reddit.com/r/Minecraft/comments/2y38ka/mojangive_got_the_perfect_alternative_to_boats/">posted</a> a suggestion for an alternative to boats on the Minecraft subreddit. His solution was turtles.</p>
<p>“For the past 2.5 years, I've pushed gently for turtles to be added,” he <a href="https://www.reddit.com/r/Minecraft/comments/7veimo/turtle_shoutout_to_billyk_as_suspected_theres_a/">wrote</a> earlier this year. “With Update Aquatic happening with no mention of turtles, I pushed a tiny bit harder.” Sure enough, turtles were added soon afterwards. Not only did his idea make him legendary within the Minecraft Reddit community, but it also won him a <a href="https://www.reddit.com/r/Minecraft/comments/7veimo/turtle_shoutout_to_billyk_as_suspected_theres_a/">unique cape</a>. “Many people can suggest Minecraft features, but few become walking Reddit legends,” said Jens Bergensten, Mojang’s Chief Creative Officer no less!</p>
<img alt="" class="img-fluid article-image-carousel__image" src="https://community-content-assets.minecraft.net/upload/styles/small/s3/2764065b1322a8d621c597ec2971e9a3-recipe.jpg" title=""/>
<p>Turtle shells are a wearable item that let players breathe a little longer underwater. Wearing a turtle shell in a helmet slot, while out of water or in a column of bubbles, will give the player a “water breathing” status effect, which only starts counting down when the player submerges. I guess the extra air is stored in the top of the shell somewhere, but who knows how those turtles weave their bubbly magic?</p>
<p>Turtle shells also give the player two armour points, which is the same amount provided by iron, gold and chain helmets – and a little less than diamond. Aside from the free water breathing, the other big benefit of a turtle shell over a regular iron helmet is durability – a turtle shell will take almost twice as many hits before breaking.</p>
<p>Oh, and the final thing you can do with turtle shells is to use them as brewing ingredients. Mixing a turtle shell with an awkward potion in a brewing stand will get you a Potion of the Turtle Master, which gives you a Resistance III status effect, but also Slowness IV. Never fear, though – slow and steady wins the race. </p>
<img alt="" class="img-fluid article-image-carousel__image" src="https://community-content-assets.minecraft.net/upload/styles/small/s3/b4e1641c8ecbbfdf6a363385d2687c9e-realworld.jpg" title=""/>
<p>You’re probably wondering how you get one of these magical turtle shells, and your first impulse might be to kill some turtles. DON’T DO IT! Not only are many turtle species endangered, but turtles don’t actually drop turtle shells when they die. Instead, you need to piece together your own turtle shell from the bits of shell – called scutes – that are dropped when a baby turtle grows up into an adult turtle. You can also use those scutes to repair a bashed-up old helmet.</p>
<p>So where do you get baby turtles from? Well, there’s a chance you might find one on a beach, but it’s much easier to breed your own by feeding two turtles seagrass and letting nature take its course. Soon enough, you’ll be looking at a cluster of turtle eggs, which you’ll need to keep safe until they hatch. If they’re broken – by hitting them with a tool, standing on them, letting blocks fall onto them, or allowing hostile mobs to get nearby – then they’ll be destroyed without dropping anything. Keep those babies safe!</p>
<p>If you’re reading this and thinking that you have some great ideas that would make Minecraft better, then here's some good advice from billyK_</p>
<p>“If there's anything to learn from this, it's that if you have a really good idea that the community gets behind, it could one day be in the game,” he <a href="https://www.reddit.com/r/Minecraft/comments/7veimo/turtle_shoutout_to_billyk_as_suspected_theres_a/">wrote</a>. “Just honestly keep expectations in check and don't bug the devs on it. That's like a death sentence for your idea.”</p>
<p>He’s not wrong! But bonus takeaway here: don’t hurt turtles!</p>
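The nested loop with append can also be written as a single list comprehension. This is just an alternative spelling of the same idea, shown on made-up markup so it runs without network access.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup with the same class names as the article.
html = """
<div class="article-paragraph"><p>first</p><p>second</p></div>
<div class="article-paragraph--image"><img src="a.jpg"/></div>
<div class="article-paragraph"><p>third</p></div>
"""
soup = BeautifulSoup(html, "html.parser")
paras = soup.find_all("div", class_=["article-paragraph", "article-paragraph--image"])

# Outer loop over blocks, inner loop over the p/img tags in each block.
sentences = [node for para in paras for node in para.find_all(["p", "img"])]
tag_names = [node.name for node in sentences]
print(tag_names)
```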

For the body content, str() returns the element with its markup intact. If you do not want the tags, use get_text():

>>> str(sentences[2])
'<p>In March 2015 he <a href="https://www.reddit.com/r/Minecraft/comments/2y38ka/mojangive_got_the_perfect_alternative_to_boats/">posted</a> a suggestion for an alternative to boats on the Minecraft subreddit. His solution was turtles.</p>'


>>> sentences[2].get_text()
'In March 2015 he posted a suggestion for an alternative to boats on the Minecraft subreddit. His solution was turtles.'
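get_text() also takes a separator and a strip flag, and the combination has a small trap: strip=True trims every text fragment before joining, so with the default empty separator the words around a link get glued together. The sketch below shows this on a shortened, made-up version of the paragraph above.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in: a shortened paragraph with extra whitespace.
html = "<p>  In March 2015 he <a href='#'>posted</a> a suggestion.  </p>"
p = BeautifulSoup(html, "html.parser").find("p")

raw = p.get_text()                    # keeps the surrounding whitespace
glued = p.get_text(strip=True)        # fragments stripped, joined with ""
spaced = p.get_text(" ", strip=True)  # fragments stripped, joined with " "
normalized = " ".join(raw.split())    # collapse whitespace without bs4 options

print(repr(glued))
print(repr(spaced))
```

Joining with a space can leave a stray space before punctuation that directly follows a link, so collapsing whitespace on the plain get_text() result is often the safer choice.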

Attributes of an img tag can be read from attrs:

>>> sentences[4].attrs['src']
'https://community-content-assets.minecraft.net/upload/styles/small/s3/2764065b1322a8d621c597ec2971e9a3-recipe.jpg'
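There are a few equivalent ways to read an attribute. In this sketch (the example.com URL and the data-missing attribute name are placeholders, not from the article), indexing the tag directly is shorthand for going through attrs, get() avoids a KeyError when the attribute is absent, and class comes back as a list because it can hold several values.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in tag; the URL is a placeholder.
html = (
    '<img alt="" class="img-fluid article-image-carousel__image" '
    'src="https://example.com/recipe.jpg" title=""/>'
)
img = BeautifulSoup(html, "html.parser").find("img")

src_via_attrs = img.attrs["src"]
src_via_index = img["src"]           # shorthand for img.attrs["src"]
missing = img.get("data-missing")    # None instead of a KeyError
classes = img["class"]               # multi-valued attribute: a list

print(src_via_index, missing, classes)
```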
Original post: https://www.cnblogs.com/KakagouLT/p/9911471.html