爬虫大作业

通过爬取2345电影网的电影信息，通过电影的类型和评分分别生成相应的词云进行数据分析

一、准备过程

首先打开2345电影网的热播电影区，网址是https://dianying.2345.com/list/------.html

在这里可以通过审查模式看到第一页的详细信息，而目的则是通过爬取热播页面的每个电影的评分和类型来分析最近影迷的观影需求

环境如下：

　　python3.6.2　　　　PyCharm

　　Windows7　　第三方库（jieba,wordcloud，bs4，Requests，re，wordcloud)

二、代码

1.用requests库和BeautifulSoup库，爬取电影网当前页面的每个电影的电影名、评分、简介、地区、年代、链接等，将获取电影详情的代码定义成一个函数 def getNewDetail(newsUrl):

import  requests
import re
from bs4 import BeautifulSoup
from datetime import datetime
import pandas
new={}

def getNewsDetail(newsUrl):
   resd = requests.get(newsUrl)
   resd.encoding = 'gb2312'
   soupd = BeautifulSoup(resd.text, 'html.parser')
   print(newsUrl)
   news = {}
   news['title']=soupd.select('.tit')[0].select('h1')[0].text     # 标题
   news['emSorce']=float(soupd.select('.tit')[0].select('.pTxt')[0].text.split('ue60e')[0].rstrip('分'))   #评分
   # s=soupd.select('.li_4')[1].text.split()[1:]
   # if(len(s)>0):
   #    str=''
   #    for i in s:
   #       str=str+i
   #    print(str)
   # writeNewsDetail(str)
   news['content:']=soupd.select('.sAll')[0].text.strip()    #内容
   s = soupd.select('.wholeTxt')[0].text
   news['area'] = s[s.find('地区：'):s.find('语言：')].split()[1]  # 地区

   news['tit']=soupd.select('.li_4')[3].text.strip().split()[1]  #年代
   if len(soupd.select('.li_4')[3].text)>0:
      news['tit']=soupd.select('.li_4')[3].text.strip().split()[1]
   else:
      news['tit']=='none'
   new[soupd.select('.tit')[0].select('h1')[0].text ]= news['emSorce']

   return news

2.取出一个电影列表页的全部电影包装成函数def getListPage(pageUrl):

def getListPage(pageUrl):
   res = requests.get(pageUrl)
   res.encoding = 'gb2312'
   soup = BeautifulSoup(res.text, 'html.parser')
   newslist = []
   for i in soup.select('.v_tb')[0].select('li'):
      if (len(i.select('a')) > 0):
         newsUrl = 'https:' + i.select('a')[0].attrs['href']
         newslist.append(getNewsDetail(newsUrl))
   return newslist

3.获取总的页数，算出电影总页数包装成函数def getPageN():

def getPageN():
   res = requests.get('https://dianying.2345.com/list/-------.html')
   res.encoding = 'gb2312'
   soup = BeautifulSoup(res.text, 'html.parser')
   pagenumber=int(soup.select('.v_page')[0].text.split('...')[1].rstrip('>')[0:3])
   return pagenumber

4.获取全部电影的详情信息（爬取页面前5页，原因在下方）

newstotal=[]
pageUrl='https://dianying.2345.com/list/-------.html'
newstotal.extend(getListPage(pageUrl))
n=getPageN()
for i in range(2,4):
   listPageUrl = 'https://dianying.2345.com/list/-------{}.html'.format(i)
   newstotal.extend(getListPage(listPageUrl))

df=pandas.DataFrame(newstotal)
dfs=df.sort_index(by='emSorce',ascending=False)
dfsn=dfs[['title','emSorce']]
dfsn.to_excel('gzcc.xlsx',encoding='utf-8')

将爬取到所有信息通过pandas根据评分排序，然后只爬取'title'和'emScore'两列的信息，并保存至excel表中

将爬取到的电影类别通过构造方法writeNewsDetail(content)写入到文本gzccnews.txt中，通过jieba分词，结果存至jieduo.txt中

def writeNewsDetail(content):
    f=open('gzccnews.txt','a',encoding='utf-8')
    f.write(content)
    f.close()

import jieba
f=open('gzccnews.txt','r',encoding="UTF-8")
str1=f.read()
f.close()
str2=list(jieba.cut(str1))
countdict = {}
for i in str2:
    countdict[i] = str2.count(i)
dictList = list(countdict.items())
dictList.sort(key = lambda x:x[1],reverse = True)
f = open("E:/jieduo.txt", "a")
for i in range(30):
 f.write('
' + dictList[i][0] + " " + str(dictList[i][1]))
f.close()

三、生成词云

通过导入wordcloud的包，来生成词云

import wordcloud
from PIL import Image,ImageSequence
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud,ImageColorGenerator

image= Image.open('./63f7756ec660349a67874cfae7fc8643.jpg')
graph = np.array(image)
font=r'C:WindowsFontssimhei.TTF'
wc = WordCloud(font_path=font,background_color='White',max_words=50,mask=graph)
wc.generate_from_frequencies(countdict)
image_color = ImageColorGenerator(graph)
plt.imshow(wc)
plt.axis("off")
plt.show()

选择的图片：

生成词云的结果：

1.通过评分生成：

2.通过类型生成：

四、遇到的问题及解决方案

1.在爬取电影信息的时候，爬取电影年代的时候，会因为当前列表中当前列爬到某一处时，列表中的值没有，会报index out of range，上网搜索并不是下标越界的原因，这时候就尝试对当前的空值进行判断，如下图

# news['tit']=soupd.select('.li_4')[3].text.strip().split()[1]  #年代
   # if len(soupd.select('.li_4')[3].text.strip().split())==1:
   #    news['tit']=soupd.select('.li_4')[3].text.strip().split()[1]
   # else:
   #    news['tit']=='none'

在加了判断后，可以输出一部分，但是在爬取所有页的时候还是会报错，找不到相应的错误信息，只好爬取前面近期的几页进行分析，一共140多部电影信息

2.在导入wordcloud这个包的时候，会遇到很多问题

首先通过使用pip install wordcloud这个方法在全局进行包的下载，可是最后会报错误error: Microsoft Visual C++ 14.0 is required. Get it with “Microsoft Visual C++ Build Tools”: http://landinghub.visualstudio.com/visual-cpp-build-tools

这需要我们去下载VS2017中的工具包，但是网上说文件较大，所以放弃。

之后尝试去https://www.lfd.uci.edu/~gohlke/pythonlibs/#wordcloud下载whl文件，然后安装。

下载对应的python版本进行安装，如我的就下载wordcloud-1.4.1-cp36-cp36m-win32.whl,wordcloud-1.4.1-cp36-cp36m-win_amd64

两个文件都放到项目目录中，两种文件都尝试安装

通过cd到这个文件的目录中，通过pip install wordcloud-1.4.1-cp36-cp36m-win_amd64,进行导入

但是两个尝试后只有win32的能导入，64位的不支持，所以最后只能将下好的wordcloud放到项目lib中，在Pycharm中import wordcloud,最后成功

五、数据分析与结论

通过对词云的查看，可以看出最近一年的影迷对于电影类型为动作、喜剧、爱情、犯罪的题材的电影喜欢，而对恐怖、历史、灾难等题材的电影选择很少，这说明观看电影选择的大多数是有关有趣一点的，而对于偏阴暗面的电影少选择，这样在拍摄电影时可以通过受众程度来拍摄。

在对最近热播的电影生成的词云来看，评分高的电影跟类型有很大的关系，所以影迷可以通过选择评分高的电影进行观看。

在这次作业中，我了解并实现如何爬取一个网站的有用信息，如何对爬取的信息分析并得到结论，但是对于大数据技术深度的技术并不了解，基础的知识还需要我不断加深巩固。

六、所有代码

# 大数据大作业
# 爬取2345电影网中的电影评分最多的电影

import  requests
import re
from bs4 import BeautifulSoup
from datetime import datetime
import pandas
new={}

def writeNewsDetail(content):
    f=open('gzccnews.txt','a',encoding='utf-8')
    f.write(content)
    f.close()
def getNewsDetail(newsUrl):
   resd = requests.get(newsUrl)
   resd.encoding = 'gb2312'
   soupd = BeautifulSoup(resd.text, 'html.parser')
   # print(newsUrl)
   news = {}
   news['title']=soupd.select('.tit')[0].select('h1')[0].text     # 标题
   news['emSorce']=float(soupd.select('.tit')[0].select('.pTxt')[0].text.split('ue60e')[0].rstrip('分'))   #评分
   s=soupd.select('.li_4')[1].text.split()[1:]
   if(len(s)>0):
      str=''
      for i in s:
         str=str+i
      # print(str)
   writeNewsDetail(str)
   # news['content:']=soupd.select('.sAll')[0].text.strip()    #内容
   # s = soupd.select('.wholeTxt')[0].text
   # news['area'] = s[s.find('地区：'):s.find('语言：')].split()[1]  # 地区

   # news['tit']=soupd.select('.li_4')[3].text.strip().split()[1]  #年代
   # if len(soupd.select('.li_4')[3].text)>0:
   #    news['tit']=soupd.select('.li_4')[3].text.strip().split()[1]
   # else:
   #    news['tit']=='none'
   new[soupd.select('.tit')[0].select('h1')[0].text ]= news['emSorce']

   return news
   # print(news)

def getListPage(pageUrl):
   res = requests.get(pageUrl)
   res.encoding = 'gb2312'
   soup = BeautifulSoup(res.text, 'html.parser')
   newslist = []
   for i in soup.select('.v_tb')[0].select('li'):
      if (len(i.select('a')) > 0):
         newsUrl = 'https:' + i.select('a')[0].attrs['href']
         newslist.append(getNewsDetail(newsUrl))
   return newslist

def getPageN():
   res = requests.get('https://dianying.2345.com/list/-------.html')
   res.encoding = 'gb2312'
   soup = BeautifulSoup(res.text, 'html.parser')
   pagenumber=int(soup.select('.v_page')[0].text.split('...')[1].rstrip('>')[0:3])
   return pagenumber

newstotal=[]
pageUrl='https://dianying.2345.com/list/-------.html'
newstotal.extend(getListPage(pageUrl))
n=getPageN()
for i in range(2,4):
   listPageUrl = 'https://dianying.2345.com/list/-------{}.html'.format(i)
   getListPage(listPageUrl)

# getListPage(pageUrl)
# df=pandas.DataFrame(newstotal)
# dfs=df.sort_index(by='emSorce',ascending=False)
# dfsn=dfs[['title','emSorce']]
# dfsn.to_excel('gzcc.xlsx',encoding='utf-8')

# import jieba
# f=open('gzccnews.txt','r',encoding="UTF-8")
# str1=f.read()
# f.close()
# str2=list(jieba.cut(str1))
# countdict = {}
# for i in str2:
#     countdict[i] = str2.count(i)
# dictList = list(countdict.items())
# dictList.sort(key = lambda x:x[1],reverse = True)
# f = open("E:/jieduo.txt", "a")
# for i in range(30):
#  f.write('
' + dictList[i][0] + " " + str(dictList[i][1]))
# f.close()

import wordcloud
from PIL import Image,ImageSequence
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud,ImageColorGenerator
image= Image.open('./63f7756ec660349a67874cfae7fc8643.jpg')
graph = np.array(image)
font=r'C:WindowsFontssimhei.TTF'
wc = WordCloud(font_path=font,background_color='White',max_words=50,mask=graph)
wc.generate_from_frequencies(new)
image_color = ImageColorGenerator(graph)
plt.imshow(wc)
plt.axis("off")
plt.show()