Crawling all the campus news

Assignment source: https://edu.cnblogs.com/campus/gzcc/GZCC-16SE1/homework/3002

0. Get the click count from the news URL and wrap it into a function

  • newsUrl
  • newsId (re.search())
  • clickUrl (str.format())
  • requests.get(clickUrl)
  • re.search() / .split()
  • str.lstrip(), str.rstrip()
  • int
  • wrap the above into a function (see the sketch after this list)
  • also wrap getting the news publication time and its type conversion into a function
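A minimal sketch of step 0. The click-count API URL is the one used in the core code below; the cleanup with split/lstrip/rstrip assumes the API returns a JavaScript snippet whose last call looks like .html('N');

import re
import requests

def getClickCount(newsUrl):
    # pull the numeric news id out of the article URL with re.search()
    newsId = re.search(r'(\d+)\.html', newsUrl).group(1)
    clickUrl = 'http://oa.gzcc.cn/api.php?op=count&id={}&modelid=80'.format(newsId)
    resText = requests.get(clickUrl).text
    # the count sits inside the last .html('...') call; clean it with split/lstrip/rstrip, convert with int()
    return int(resText.split('.html')[-1].lstrip("('").rstrip("');"))

The publication-time counterpart of this step appears as newsdate() in the core code below.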

1. Get the news details from a news URL: a dictionary, anews
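A rough outline of anews; the key names are illustrative, and the full working version appears as newspages() in the core code below.

import requests
from bs4 import BeautifulSoup

def anews(url):
    res = requests.get(url)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    newsDetail = {}
    newsDetail['title'] = soup.select('.show-title')[0].text
    newsDetail['newsDate'] = newsdate(soup.select('.show-info')[0].text)  # newsdate() as in the core code
    newsDetail['clickCount'] = getClickCount(url)  # from the step-0 sketch
    return newsDetail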

2. Get the news URLs from a list-page URL: list append(dict), alist
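One possible shape of alist, using the same CSS selectors as the core code's pageUrl():

def alist(listUrl):
    res = requests.get(listUrl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    newsList = []
    for news in soup.select('li'):
        if len(news.select('.news-list-title')) > 0:  # only <li> items that hold a news entry
            newsUrl = news.select('a')[0]['href']
            newsDict = anews(newsUrl)        # details of one article (a dict)
            newsDict['newsUrl'] = newsUrl
            newsList.append(newsDict)        # append(dict)
    return newsList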

3. Generate the URLs of all list pages and fetch all the news: list extend(list), allnews

* Each student crawls 10 list pages, starting from the last digit of their student ID (see the sketch below)
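A sketch of this step; the list-page URL pattern and the starting page (3) are taken from the core code below, so adjust the range to the last digit of your own student ID.

import time
import random

allnews = []
for i in range(3, 13):  # 10 list pages
    listUrl = 'http://news.gzcc.cn/html/xiaoyuanxinwen/{}.html'.format(i)
    allnews.extend(alist(listUrl))   # extend(list)
    time.sleep(random.random() * 3)  # crawl interval, see step 4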

4. Set a reasonable crawl interval

import time

import random

time.sleep(random.random()*3)

5. Do simple data processing with pandas and save the result

Save to a csv or excel file

newsdf.to_csv(r'F:\duym\爬虫\gzccnews.csv')
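For example (allnews from step 3; the file paths and the newsDate column name are just illustrations carried over from the sketches above):

import pandas as pd

newsdf = pd.DataFrame(allnews)
print(newsdf.sort_values(by='newsDate', ascending=False).head())  # simple processing: newest first
newsdf.to_csv(r'F:\duym\爬虫\gzccnews.csv', encoding='utf-8')
# or save as an Excel file:
# newsdf.to_excel(r'F:\duym\爬虫\gzccnews.xlsx')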


Core code:

import re
import time
import random
import requests
from datetime import datetime
from bs4 import BeautifulSoup
import pandas as pd

news_List = []  # collects one dict per news article

def pageUrl(url):
    res = requests.get(url)  # choose the requests method according to the page's request info
    res.encoding = 'utf-8'  # decode as UTF-8 to fix garbled crawl results
    soup = BeautifulSoup(res.text, 'html.parser')
    for news in soup.select('li'):
        if len(news.select(".news-list-title")) > 0:    # keep only the <li> items that hold a news entry
            news_href = news.select("a")[0]['href']
            news_desc = news.select(".news-list-description")[0].text
            news_dict = newspages(news_href)
            news_dict['url'] = news_href  # store the article URL
            news_dict['desc'] = news_desc
            news_List.append(news_dict)



# Use a regular expression to pull the news id out of the URL and build the click-count API URL
def click(news_url):
    num = re.findall(r'\d+', news_url)[-1]
    url = 'http://oa.gzcc.cn/api.php?op=count&id={}&modelid=80'.format(num)
    res_click = requests.get(url).text  # plain text response, no HTML parsing needed
    # the total click count sits inside the last .html('...') call; strip the punctuation and convert to int
    res_click = int(res_click.split('.html')[-1].lstrip("('").rstrip("');"))
    return res_click

# Slice the date out of the show-info text and convert it to a datetime object
def newsdate(showinfo):
    # showinfo is the text of the class="show-info" block; split it and parse the timestamp with datetime
    soup1 = showinfo.split()
    date = soup1[0].split(':')[1]  # "发布时间:YYYY-MM-DD" -> keep only the date after the full-width colon
    news_time = soup1[1]
    DateTime = datetime.strptime(date + ' ' + news_time, '%Y-%m-%d %H:%M:%S')
    return DateTime

# Crawl the title, publication date and click count of one news article
def newspages(url):
    newsdetail = {}
    res = requests.get(url)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    newsdetail['newstitle'] = soup.select('.show-title')[0].text
    showinfo = soup.select('.show-info')[0].text
    newsdetail['newsDATE'] = newsdate(showinfo)  # converted to a datetime object
    newsdetail['newsclick'] = click(url)
    return newsdetail


for i in range(3, 13):  # crawl 10 list pages, starting from list page 3
    time.sleep(random.random() * 3)  # random crawl interval
    url = "http://news.gzcc.cn/html/xiaoyuanxinwen/{}.html".format(i)
    pageUrl(url)

for i in news_List:
    print(i)
news_CSV = pd.DataFrame(news_List)
news_CSV.to_csv(r'D:\work.csv', encoding='utf-8')

     

Run screenshot:

The generated work.csv:

Original post: https://www.cnblogs.com/lqscmz/p/10709356.html