Web scraping with BeautifulSoup

Reading from an HTML file

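from bs4 import BeautifulSoup

html_doc = "file_path"  # replace with the path to your HTML file
# The with-block closes the file automatically; utf-8 is an assumption, adjust it to match the file.
with open(html_doc, "r", encoding="utf-8") as html_file:
    html_handle = html_file.read()
soup = BeautifulSoup(html_handle, "html.parser")  # choose the parser
print(soup)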
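
Once the document is parsed, the soup object can be searched for tags. The following is a minimal sketch of that step, assuming a small made-up HTML string (`sample_html` is hypothetical and stands in for the contents of a real file). It uses the `find()` and `find_all()` methods of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, standing in for the contents of a real file.
sample_html = """
<html><body>
  <div id="blog-calendar" class="sidebar">July 2019</div>
  <a href="/p/1.html">First post</a>
  <a href="/p/2.html">Second post</a>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching tag, or None if nothing matches.
calendar = soup.find(id="blog-calendar")
calendar_text = calendar.get_text(strip=True)

# find_all() returns every matching tag; attributes are read like dict keys.
links = [a["href"] for a in soup.find_all("a")]

print(calendar_text)
print(links)
```

Reading attributes directly from the tag objects is usually less fragile than converting the result to a string and searching it with a regular expression.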

Reading from a web page

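import re

import requests
from bs4 import BeautifulSoup

url = "http://www.cnblogs.com/j-c-y/p/11129345.html"
page = requests.get(url).text  # download the page and take its body as text
soup = BeautifulSoup(page, "html.parser")  # choose the parser
result = soup.find_all(id="blog-calendar")  # find the elements with this id
print(result)
r = re.findall(r'".*"', str(result))  # double-quoted substrings (greedy match)
print(len(r))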
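
A note on the regular-expression step: assuming the pattern is meant to be `".*"` (anything between double quotes), the `.*` is greedy, so on a single line it matches from the first quote to the last one. The sketch below compares it with the non-greedy form `".*?"` on a hypothetical tag string. The `text` value is made up for illustration and is not taken from the real page.

```python
import re

# Hypothetical tag string, similar in shape to what str(result) might contain.
text = '<div id="blog-calendar" style="display:none"></div>'

# Greedy: one match spanning from the first quote to the last quote.
greedy = re.findall(r'".*"', text)

# Non-greedy: one match per quoted value.
lazy = re.findall(r'".*?"', text)

print(len(greedy), greedy)
print(len(lazy), lazy)
```

If the goal is to count or collect individual quoted attribute values, the non-greedy form is probably the one to use.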
Original article: https://www.cnblogs.com/j-c-y/p/11454855.html