Comprehensive Web Crawler Assignment

The assignment requirements come from https://edu.cnblogs.com/campus/gzcc/GZCC-16SE1/homework/3159

The previously saved data can be read back with pandas: see the last post, "Crawl all campus news and save it to CSV".
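A minimal sketch of reloading that CSV with pandas (the file name gzccnews.csv is an assumption; use whatever path the previous post saved to):

import pandas as pd

# Hypothetical file name; point this at the CSV saved in the previous post
newsdf = pd.read_csv('gzccnews.csv')
print(newsdf.head())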

I. Save the crawled content to an sqlite3 database

import sqlite3

# Save the crawled news DataFrame (newsdf) into an sqlite3 database
with sqlite3.connect('gzccnewsdb.sqlite') as db:
    newsdf.to_sql('gzccnews', con=db)

# Read the table back from sqlite3 into a DataFrame
with sqlite3.connect('gzccnewsdb.sqlite') as db:
    df2 = pd.read_sql_query('SELECT * FROM gzccnews', con=db)
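Note that to_sql fails by default if the gzccnews table already exists; when re-running the crawler, the if_exists parameter can be set explicitly, for example:

# Variant: overwrite the existing table on re-runs instead of failing
with sqlite3.connect('gzccnewsdb.sqlite') as db:
    newsdf.to_sql('gzccnews', con=db, if_exists='replace', index=False)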

Save to a MySQL database

import pandas as pd
import pymysql
from sqlalchemy import create_engine

# Connection string: replace user, passwd, host and port with your own MySQL settings
conInfo = "mysql+pymysql://user:passwd@host:port/gzccnews?charset=utf8"
engine = create_engine(conInfo, encoding='utf-8')

# Write the crawled records to the news table, appending if it already exists
df = pd.DataFrame(allnews)
df.to_sql(name='news', con=engine, if_exists='append', index=False)
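To confirm the write succeeded, the table can be read back through the same engine; a minimal check, assuming the news table created by the to_sql call above:

# Read the news table back from MySQL into a DataFrame
df_check = pd.read_sql_query('SELECT * FROM news', con=engine)
print(df_check.shape)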
II. Comprehensive crawler assignment
1. Choose a hot topic or a topic you are interested in.
2. Choose the object and scope of the crawl.
3. Understand the restrictions and constraints of the crawl target.
4. Crawl the relevant content.
5. Do data analysis and text analysis.
6. Write an article that includes an explanation, the technical key points, the data, a graphical presentation and discussion of the data analysis, and a graphical presentation and discussion of the text analysis.
7. Publish the article publicly.
The topic I crawled is the Hupu League of Legends board. As a JR (a Hupu regular), I spend a fair amount of time on the Hupu app. The main code is shown below.

First, the function that creates the crawler: it requests a page, detects the page's encoding, and returns a BeautifulSoup object:
          
import requests
import chardet
from bs4 import BeautifulSoup

def creat_bs(url):
    result = requests.get(url)
    # Detect the page's encoding and apply it to the response object
    e = chardet.detect(result.content)['encoding']
    result.encoding = e
    c = result.content
    soup = BeautifulSoup(c, 'lxml')
    return soup
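A quick usage example (the board URL is illustrative, not taken from the original post):

# Illustrative call: fetch one Hupu board page and parse it
soup = creat_bs('https://bbs.hupu.com/lol')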
          
Next, the function that builds the set of page URLs to fetch:
          
def build_urls(prefix, suffix):
    # Combine a URL prefix with each suffix to get the full list of page URLs
    urls = []
    for item in suffix:
        url = prefix + item
        urls.append(url)
    return urls
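For instance, assuming the board pages follow a prefix-plus-page-number pattern (the exact pattern here is an assumption, not confirmed by the original post):

# Hypothetical pagination pattern for the board's page URLs
page_suffix = [str(i) for i in range(1, 11)]
page_urls = build_urls('https://bbs.hupu.com/lol-', page_suffix)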
The crawling function, which extracts thread titles and links from a board page:
          
          
import numpy as np

def find_title_link(soup):
    # Extract thread titles and links, keeping a random sample of roughly 10% of the threads
    titles = []
    links = []
    try:
        container = soup.find('div', {'class': 'container_padd'})
        ajaxtable = container.find('form', {'id': 'ajaxtable'})
        page_list = ajaxtable.find_all('li')
        for page in page_list:
            titlelink = page.find('a', {'class': 'truetit'})
            if titlelink is None:
                continue
            # Some titles are wrapped in a <b> tag
            if not titlelink.text.strip():
                title = titlelink.find('b').text
            else:
                title = titlelink.text
            # Keep the thread with probability ~0.1
            if np.random.uniform(0, 1) > 0.90:
                link = titlelink.get('href')
                titles.append(title)
                links.append(link)
    except AttributeError:
        print('have no value')
    return titles, links
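The title_group and reply_group variables used below are not defined in the snippets shown in the original post; a minimal sketch of how the titles could be collected, reusing the hypothetical page_urls built above:

# Hypothetical driver loop: visit each board page and accumulate sampled titles/links
title_group = []
link_group = []
for url in page_urls:
    soup = creat_bs(url)
    titles, links = find_title_link(soup)
    title_group.extend(titles)
    link_group.extend(links)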
Combine the scraped data and save it:
          
          
# Concatenate all titles and replies into one big string
wordlist = ''
for title in title_group:
    wordlist += title

for reply in reply_group:
    wordlist += reply

def savetxt(wordlist):
    # Write the combined text to a UTF-8 encoded file
    with open('wordlist.txt', 'wb') as f:
        f.write(wordlist.encode('utf8'))

savetxt(wordlist)
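reply_group (the thread replies) is likewise not built in the snippets above; a hedged sketch of one way it could be collected from the sampled thread links, assuming replies live in elements with class 'floor_box' (the real selector on Hupu's pages should be checked before use):

# Hypothetical reply collection: both the URL join and the CSS class are assumptions
reply_group = []
for link in link_group:
    thread_soup = creat_bs('https://bbs.hupu.com' + link)
    for floor in thread_soup.find_all('td', {'class': 'floor_box'}):
        reply_group.append(floor.get_text(strip=True))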
Creating the word cloud:
          
          
import jieba

# Load a custom user dictionary so game-specific terms are segmented correctly
jieba.load_userdict('user_dict.txt')
wordlist_af_jieba = jieba.cut_for_search(wordlist)
wl_space_split = ' '.join(wordlist_af_jieba)

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

# Merge the built-in stopwords with a custom stopword file
stopwords = set(STOPWORDS)
with open('stopwords.txt', 'r', encoding='utf-8') as fstop:
    for eachWord in fstop:
        stopwords.add(eachWord.strip())

wc = WordCloud(font_path=r'C:\Windows\Fonts\STHUPO.ttf', background_color='black',
               max_words=200, width=700, height=1000, stopwords=stopwords,
               max_font_size=100, random_state=30)
wc.generate(wl_space_split)
wc.to_file('hupu_pubg2.png')
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()

Screenshot of the crawled posts:

Word cloud:

Original post: https://www.cnblogs.com/lenkay/p/10836278.html