[Big Data Application Technology] Assignment 5 | Understanding How Web Crawlers Work

The requirements for this assignment come from: https://edu.cnblogs.com/campus/gzcc/GZCC-16SE2/homework/2851

1. How web crawlers work

2. Understanding the crawler development process

1) Briefly explain how a browser works

2) Fetch website data with the requests library

We can call requests.get(url) to fetch the HTML source of a campus news page.

The code is as follows:

# Import the requests library
import requests

url = 'http://news.gzcc.cn/html/2019/xibusudi_0329/11097.html'
news = requests.get(url)
# Decode the response body as UTF-8 so the Chinese text prints correctly
news.encoding = 'utf-8'
print(news.text)

Output:
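
A note on the `news.encoding = 'utf-8'` line in the code above: when a response's Content-Type header is a text type that declares no charset, requests falls back to ISO-8859-1 when decoding `.text`, which garbles Chinese text. Whether this site's headers actually omit the charset is an assumption here. The sketch below only reproduces the effect with plain bytes, using a made-up sample string and no network access.

```python
# Hypothetical page fragment; the real page content is not reproduced here.
sample = '校园新闻'

# A server sends bytes; here the page is assumed to be encoded as UTF-8.
raw = sample.encode('utf-8')

# Decoding with the wrong codec yields mojibake instead of raising an error.
wrong = raw.decode('iso-8859-1')
# Decoding with the matching codec recovers the original text.
right = raw.decode('utf-8')

print(repr(wrong))
print(right)
```

Setting `news.encoding` before reading `news.text` tells requests which codec to use. `news.apparent_encoding`, which guesses the codec from the response body, is another option when the encoding is not known in advance.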

3) Understanding web pages

Write a simple HTML file that contains several tags, classes, and ids.

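4) Parse the web page with Beautiful Soup

Calling BeautifulSoup(html_sample,'html.parser') parses that HTML file into a DOM tree.

The select method (CSS selectors) is then used to locate data.

The code, applied to the campus news page: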
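
As an illustration, assuming a small page like the following (the `html_sample` content, including its tag names, classes, and ids, is made up for this sketch), parsing and the three basic selectors might look like this:

```python
from bs4 import BeautifulSoup

# Hypothetical sample page: every tag, class, and id here is made up.
html_sample = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1 id="title">Hello crawler</h1>
    <p class="intro">First paragraph.</p>
    <p class="intro">Second paragraph.</p>
    <a class="link" href="http://example.com/a">Link A</a>
    <a class="link" href="http://example.com/b">Link B</a>
  </body>
</html>
"""

# Parse the string into a tree of elements using the built-in parser.
soup = BeautifulSoup(html_sample, 'html.parser')

# select() returns a list of matching elements (empty if nothing matches).
paragraphs = [p.text for p in soup.select('p')]      # by tag name
links = [a['href'] for a in soup.select('.link')]    # by class name
heading = soup.select('#title')[0].text              # by id

print(paragraphs)
print(links)
print(heading)
```

Because select() returns a list even for an id selector, indexing with [0] raises IndexError when the element is missing, so checking the list before indexing is safer on real pages.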

import requests
from bs4 import BeautifulSoup

url = 'http://news.gzcc.cn/html/2019/xibusudi_0329/11097.html'
news = requests.get(url)
news.encoding = 'utf-8'
newSoup = BeautifulSoup(news.text, 'html.parser')

# Find elements by tag name
newSpan = newSoup.select('span')
print('Elements with the span tag:')
print(newSpan)

# Find elements by class name
newInfo = newSoup.select('.show-info')
print('Elements with class=show-info:')
print(newInfo)

# Find the element by id and take its text
newContent = newSoup.select('#content')[0].text
print('Text of the element with id=content:')
print(newContent)

The output is as follows:

3. Extract the title, publication time, and publishing unit of a campus news article

url = 'http://news.gzcc.cn/html/2019/xiaoyuanxinwen_0320/11029.html'

The code is as follows:

import re
import requests
from bs4 import BeautifulSoup

url = 'http://news.gzcc.cn/html/2019/xiaoyuanxinwen_0320/11029.html'
news = requests.get(url)
news.encoding = 'utf-8'
newSoup = BeautifulSoup(news.text, 'html.parser')

# Title
title = newSoup.select('.show-title')[0].text
print('Title: ' + title)

# Split the info line on whitespace once; fields carry labels such as
# '发布时间' (publication time) and '来源' (source)
info = newSoup.select('.show-info')[0].text.split()

# Publication time: the date follows the label in field 0, the time is field 1.
# Split on the colon instead of calling lstrip(), which strips a set of
# characters rather than a prefix.
newDate = re.split('[:：]', info[0], maxsplit=1)[1]
newTime = info[1]
newDateTime = newDate + ' ' + newTime
print('Published: ' + newDateTime)

# Publishing unit: field 4 is the source field
source = re.split('[:：]', info[4], maxsplit=1)[1]
print('Source: ' + source)

Output:
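
Picking fields by their position after a whitespace split breaks if the page adds or reorders a field. A slightly more defensive sketch is to match each label with a regular expression and turn the timestamp into a `datetime`. The `info_text` string below is a made-up stand-in for the `.show-info` text (the timestamp and names in it are placeholders, and the label format is assumed from the code above), and `parse_info` is a hypothetical helper, not part of any library.

```python
import re
from datetime import datetime

# Made-up stand-in for newSoup.select('.show-info')[0].text
info_text = '发布时间:2019-03-20 11:19:54  作者:张三  审核:李四  来源:某某学院  点击:次'

def parse_info(text):
    """Extract the publication time and source from an info line."""
    # Accept both the ASCII and the full-width colon after each label.
    time_match = re.search(
        r'发布时间[:：]\s*(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})', text)
    source_match = re.search(r'来源[:：]\s*(\S+)', text)
    published = None
    if time_match:
        # Convert the matched text into a datetime object.
        published = datetime.strptime(time_match.group(1), '%Y-%m-%d %H:%M:%S')
    source = source_match.group(1) if source_match else None
    return {'published': published, 'source': source}

info = parse_info(info_text)
print(info['published'], info['source'])
```

Returning None for a missing field, rather than raising, lets the caller decide how to handle pages whose info line has a different layout.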