千千音乐项目

千千音乐项目

github:https://github.com/Norni/spider_project/tree/master/qianqianyinyue


 

1、千千音乐概述

1.1 目的

  • 通过在程序入口,输入歌手的名字,便能获取该歌手的所有歌曲的详细信息

  • 预览

    python run.py
    # 数据通过文件返回,文件名字为run.py同级目录下song_list.txt文件中
    # 需要说明的是,该链接自带时间戳,隔天会失效
    {
    "song_name": "演员",
    "song_id": "242078437",
    "song_href": "http://music.taihe.com/song/242078437",
    "author_name": "薛之谦",
    "author_tinguid": "2517",
    "author_url": "http://music.taihe.com/artist/2517",
    "song_album": "初学者",
    "song_publish_date": "2016-07-18",
    "song_publish_company": "海蝶(天津)文化传播有限公司",
    "download_url": "http://audio04.dmhmusic.com/71_53_T10040589078_128_4_1_0_sdk-cpm/cn/0206/M00/90/77/ChR47F1_nqiAfD0hAD_MGBybIdk026.mp3?xcode=aefbd591c37efa806a6c2b65cae142a973974a6",
    "song_lrc_link": "http://qukufile2.qianqian.com/data2/lrc/bed1fcb36f51259eefab8ba6d95f524f/672457403/672457403.lrc",
    "mv_download_url_info": {},
    "song_mv_id": null
    }
    ********************
    {
    "song_name": "你还要我怎样",
    "song_id": "100575177",
    "song_href": "http://music.taihe.com/song/100575177",
    "author_name": "薛之谦",
    "author_tinguid": "2517",
    "author_url": "http://music.taihe.com/artist/2517",
    "song_album": "意外",
    "song_publish_date": "2013-11-11",
    "song_publish_company": "华宇世博音乐文化(北京)有限公司",
    "download_url": "http://audio04.dmhmusic.com/71_53_T10038986648_128_4_1_0_sdk-cpm/cn/0208/M00/E5/61/ChR46119DrGAW4d4AEvErRDwLyg867.mp3?xcode=6b725779223e2de86a6bbb3ac7a1959393d994b",
    "song_lrc_link": "http://qukufile2.qianqian.com/data2/lrc/0a6ef3d9a86dd4f1aa782a114d9f288e/672463341/672463341.lrc",
    "mv_download_url_info": {},
    "song_mv_id": null
    }
    ......

     

1.2 开发环境

  • 抓包分析平台:Window

  • 抓包分析软件:Chrome

  • 代码开发平台:Linux

  • 代码开发软件:pycharm

  • 技术栈

    • 发送数据请求:requests

    • 数据存储:mongodb

    • 数据处理:lxml, re, json, pymongo, logging

2、项目设计

2.1 流程设计

2.2 项目流程概述

  1. 项目目录

  2. 项目工作流程概述

    • bin模块

      • run.py为程序入口,通过run.py能够调用core.spiders中的各个爬虫模块,实现功能

      • log.log为日志记录

      • song_list.txt为写入的歌手数据,为用户可视化的效果

    • core模块

      • spiders为爬虫模块

        • artist_total.py获取所有的歌手信息,并存入mongodb数据库

        • author_info_update更新歌手的信息,包括生日,简介等

        • song_total.py获取单个歌手的所有歌曲信息

        • song_info_total_update.py更新歌手的歌曲信息,包括下载链接,mv链接(如果有)

    • utils模块

      • log.py提供log的功能

      • mongo_pool.py操作数据库的功能

      • random_useragent.py提供随机的user_agent

    • settings.py文件

      提供配置

3、模型设计

3.1 歌手模型

  • 效果图

  

   

  • 字段解释

    • author_name歌手名字

    • author_url歌手主页

    • author_tinguid歌手的id

    • author_birthday歌手生日

    • author_constellation歌手星座

    • author_from_area歌手国籍

    • author_gender歌手性别

    • author_hot歌手热度

    • author_image_url歌手图片的url

    • author_intro歌手简介

    • author_share_num歌手主页被分享的次数

    • author_songs_total歌手的总歌曲数量

    • author_stature歌手的身高

    • auhtor_weight歌手的体重

3.2 歌曲模型

  • 效果图

  

  

  • 字段解释

    • author_info歌手的信息

      • author_name歌手名字

      • author_tinguid歌手id

      • author_url歌手主页url

      • 案例

    • song_list歌手的所有歌曲

      • song_name歌曲名字

      • song_id歌曲id

      • song_href歌曲主页

      • author_name歌手名字

      • author_tinguid歌手id

      • author_url歌手主页

      • song_album歌曲所属专辑

      • song_publish_date歌曲发行时间

      • song_publish_company歌曲发行公司

      • download_url歌曲下载地址

      • song_lrc_link歌曲的歌词链接

      • mv_download_url_info歌曲mv的下载信息

      • song_mv_id歌曲mv的id

      • 案例

4、settings.py

  • 说明

    • 实现全局的配置

  • 代码

import logging

# 配置MONGO_URL
MONGO_URL = "mongodb://127.0.0.1:27017"

# 配置log
LOG_FMT = "%(asctime)s %(filename)s [line:%(lineno)d] %(levelname)s:%(message)s"
LOG_DATE_FMT = "%D-%m-%d %H:%M:%S"
LOG_FILENAME = 'log.log'
LOG_LEVEL = logging.DEBUG

# 配置user_agent
USER_AGENT_LIST = [
   "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET4.0C; .NET4.0E; rv:11.0) like Gecko",
   "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60",
   "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
   "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
   "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0",
   "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
   "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36",
   "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
   "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133",
   "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36",
   "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0",
   "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
   "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36"
]

5、utils模块

5.1 random_useragent.py

  • 说明

    • 实现随机生成一个user_agent的功能

    • 通过property装饰方法,使其可以方便调用

  • 代码

    import random
    from qianqianyinyue.settings import USER_AGENT_LIST


    class RandomUserAgent(object):
       def __init__(self):
           self.user_agent_list = USER_AGENT_LIST

       @property
       def get_random_user_agent(self):
           return random.choice(self.user_agent_list)


    random_user_agent = RandomUserAgent().get_random_user_agent

    if __name__ == "__main__":
       # obj = RandomUserAgent()
       data = {
           "a": random_user_agent
      }
       print(data)

5.2 log.py

  • 说明

    • 文件存储

      • 方便后续查看日志

    • 控制台输出

      • 方便及时发现数据的变动

  • 代码

    import logging
    import sys
    from qianqianyinyue.settings import LOG_FMT, LOG_LEVEL, LOG_DATE_FMT, LOG_FILENAME


    class Logger(object):
       def __init__(self):
           self._logger = logging.getLogger()
           self.formatter = logging.Formatter(fmt=LOG_FMT, datefmt=LOG_DATE_FMT)
           self._logger.addHandler(hdlr=self.get_file_handler(filename=LOG_FILENAME))
           self._logger.addHandler(hdlr=self.get_console_handler())
           self._logger.setLevel(level=LOG_LEVEL)

       def get_file_handler(self, filename):
           file_handler = logging.FileHandler(filename=filename, encoding='utf8')
           file_handler.setFormatter(fmt=self.formatter)
           return file_handler

       def get_console_handler(self):
           console_handler = logging.StreamHandler(stream=sys.stdout)
           console_handler.setFormatter(fmt=self.formatter)
           return console_handler

       @property
       def logger(self):
           return self._logger


    logger = Logger().logger

    if __name__ == "__main__":
       logger.info("哈哈哈")

5.3 mongo_pool.py

  • 说明

    • 将集合设置为变量collection,方便实例化时选择相应的数据

    • 实现了

      • 插入功能,并记录日志

      • 更新功能,并记录日志

        • 条件查询更新

      • 查找单条数据功能

        • 通过特定条件,实现查找单条数据

      • 查找所有数据

        • 返回一个生成器

  • 代码

    from pymongo import MongoClient
    from qianqianyinyue.settings import MONGO_URL
    from qianqianyinyue.utils.log import logger


    class MongoPool(object):

       def __init__(self, collection):
           self.mongo_client = MongoClient(MONGO_URL)
           self.collections = self.mongo_client['qianqianyinyue'][collection]

       def __del__(self):
           self.mongo_client.close()

       def insert_one(self, document):
           self.collections.insert_one(document)
           logger.info("插入了新的数据:{}".format(document))

       def update_one(self, conditions, document):
           self.collections.update_one(filter=conditions, update={"$set": document})
           logger.info("更新了->{}<-的数据".format(conditions))

       def find_one(self, conditions):
           collections = self.collections.find(filter=conditions)
           for item in collections:
               item.pop('_id')
               return item

       def find_all(self):
           collections = self.collections.find()
           for item in collections:
               item.pop('_id')
               yield item

6、core模块

6.0 url分析

  • 首页url分析

    • url:http://music.taihe.com/artist

      • 返回一个html_str

      • 请求体

        • GET请求

        • User-Agent

      • 目标数据

        • author_name

        • author_url

        • author_tinguid

  • 歌手主页url分析

    • url:http://music.taihe.com/artist/2517

      • 请求体

        • GET请求

        • Referer:http://music.taihe.com/aritst

      • 目标数据

        • song_name

        • song_url

        • song_id

        • song_mv_url

        • song_mv_Id

    • url:http://music.taihe.com/data/user/getsongs?start=15&size=15&ting_uid=2517

      • 说明

        • 这是ajax返回的第二页数据,start为当前页数减1后乘15,size固定值,ting_uid为歌手的author_tinguid

        • 返回的数据在data字段中,为xml格式字符串,需要re或lxml来提取数据

        • 如何获取最大页数?通过歌手主页的首页,拿到页面下标的最大值,然后通过该值构造下一页的请求

      • 请求体

        • GET请求

        • Referer:http://music.taihe.com/artist/2517

        • User-Agent

    • url:http://music.taihe.com/data/tingapi/v1/restserver/ting?method=baidu.ting.artist.getInfo&from=web&tinguid=2517

      • 说明

        • 每一个歌手的author_tinguid对应一个json数据

        • 如果同一时刻,发送的请求过多,会封禁IP

          • 设置time.sleep(),能够多拿到一些数据

          • 我将时间控制在1到3秒之间,拿到了更多的数据,如果不是为了立刻拿到数据,不妨将时间设置的长一点,即拿到了数据,也降低了被对方服务器发现的可能

            time.sleep(random.uniform(1,3))
  • 歌曲主页url分析

    • url:http://music.taihe.com/song/242078437

      • 请求体

        • User-Agent

      • 目标数据

        • 所属的专辑

        • 发行时间

        • 发行公司

    • url:http://musicapi.taihe.com/v1/restserver/ting?method=baidu.ting.song.playAAC&format=jsonp&songid=242078437&from=web

      • 请求体

        • Referer:http://music.taihe.com/song/242078437

        • User-Agent

      • 目标数据

        • 下载链接

        • 歌词链接

    • url:http://music.taihe.com/mv/601422013

      • 目标数据

        • mv的idsong_mv_id

    • url:http://musicapi.taihe.com/v1/restserver/ting?method=baidu.ting.mv.playMV&mv_id=XXXX

      • 说明

        • 通过来源url,用正则表达式,解析出mv_id,需要注意的是数据库中的mv_id,可能提取的与目标mv_id是不相符的,如果用那个mv_id,可能提取不到数据

      • 目标数据

        • 下载mv的url

6.1 artist_total.py

  • 代码

    import requests
    from lxml import etree
    from qianqianyinyue.utils.mongo_pool import MongoPool
    from qianqianyinyue.utils.random_useragent import random_user_agent
    from qianqianyinyue.utils.log import logger


    class ArtistTotal(object):
       """
          预先执行的程序,获取该网站曲库,所有的音乐作者的信息
          信息包括:
              作者id: 对应数据库字段--->>>“author_tinguid”
              作者的首页url:对应数据库字段--->>>“author_url”
              作者的名字:对应数据库字段--->>>“author_name”
      """
       def __init__(self):
           self.mongo_pool = MongoPool(collection='artist')
           self.url = "http://music.taihe.com/artist"
           self.user_agent = random_user_agent

       def get_response(self, url, user_agent):
           headers = {
               "User-Agent": user_agent
          }
           try:
               response = requests.get(url=url, headers=headers)
               if response.ok:
                   return response.content.decode()
           except Exception as e:
               logger.warning(e)

       def parse_html_str(self, html_str):
           html_str = etree.HTML(html_str)
           b_li_list = html_str.xpath("//ul[@class="container"]/li[position()>1]")
           for b_li in b_li_list:
               s_li = b_li.xpath('./ul/li')
               for li in s_li:
                   item = dict()
                   item["author_name"] = li.xpath('./a/@title')[0] if li.xpath('./a/@title') else None
                   author_url = "http://music.taihe.com"+li.xpath('./a/@href')[0] if li.xpath('./a/@href') else "-1"
                   item["author_url"] = author_url
                   item["author_tinguid"] = author_url.rsplit('/', 1)[-1]
                   yield item

       def insert_to_mongodb(self, documents):
           for document in documents:
               self.mongo_pool.insert_one(document=document)

       def run(self):
           response = self.get_response(url=self.url, user_agent=self.user_agent)
           author_infos = self.parse_html_str(html_str=response)
           self.insert_to_mongodb(documents=author_infos)


    if __name__ == "__main__":
       obj = ArtistTotal()
       obj.run()

6.2 author_info_update.py

  • 代码

    import requests
    import json
    import time
    import random
    from qianqianyinyue.utils.mongo_pool import MongoPool
    from qianqianyinyue.utils.random_useragent import random_user_agent
    from qianqianyinyue.utils.log import logger
    from pprint import pprint
    
    
    class AuthorInfoUpdate(object):
        """更新数据库中,作者的其他相关信息"""
        def __init__(self):
            self.mongo_pool = MongoPool(collection='artist')
            self.url = "http://music.taihe.com/data/tingapi/v1/restserver/ting?method=baidu.ting.artist.getInfo&from=web&tinguid={}"
            self.user_agent = random_user_agent
    
        def get_response(self, url, user_agent):
            headers = {
                "User-agent": user_agent
            }
            try:
                response = requests.get(url=url, headers=headers)
                # print(response.text)
                if response.ok:
                    response = json.loads(response.content.decode())
                    if "name" in response.keys():
                        return response
                    return dict()
            except Exception as e:
                logger.warning(e)
            time.sleep(random.uniform(1, 3))
    
        def parse_data(self, data):
            item = dict()
            item['author_tinguid'] = data['ting_uid']
            item['author_name'] = data['name']
            item['author_share_num'] = data['share_num']
            item['author_hot'] = data['hot']
            item['author_from_area'] = data['country']
            item['author_gender'] = "男" if data['gender'] == "0" else "女"
            item['author_image_url'] = data['avatar_s1000']
            item["author_intro"] = data['intro']
            item['author_birthday'] = data['birth']
            item['author_constellation'] = data['constellation']
            item['author_stature'] = data['stature']
            item['author_weight'] = data['weight']
            item['author_songs_total'] = data['songs_total']
            return item
    
        def update_data_to_mongodb(self, item):
            conditions = {
                "author_name": item['author_name'],
                'author_tinguid': item['author_tinguid']
            }
            self.mongo_pool.update_one(conditions=conditions, document=item)
    
        def run(self):
            artists = self.mongo_pool.find_all()
            for artist in artists:
                ting_uid = artist['author_tinguid']
                if "author_intro" in artist.keys():
                    pass
                else:
                    url = self.url.format(ting_uid)
                    dict_data = self.get_response(url=url, user_agent=self.user_agent)
                    if dict_data:
                        item = self.parse_data(data=dict_data)
                        self.update_data_to_mongodb(item=item)
    
    
    if __name__ == "__main__":
        obj = AuthorInfoUpdate()
        obj.run()
    

6.3 song_total.py

  • 代码

    import requests
    from lxml import etree
    import json
    import time
    import random
    from qianqianyinyue.utils.mongo_pool import MongoPool
    from qianqianyinyue.utils.random_useragent import random_user_agent
    from qianqianyinyue.utils.log import logger
    from pprint import pprint
    
    
    class SongTotal(object):
        """根据作者的名字,获取作者的所有音乐作品"""
    
        def __init__(self):
            self.mongo_pool_ = MongoPool(collection="artist")
            self.mongo_pool = MongoPool(collection="songs")
            self.user_agent = random_user_agent
    
        def get_author_info(self, author_name):
            conditions = {"author_name": author_name}
            author_info = self.mongo_pool_.find_one(conditions)
            if author_info:
                author_info = {
                    "author_name": author_info['author_name'],
                    "author_tinguid": author_info['author_tinguid'],
                    "author_url": author_info['author_url']
                }
                return author_info
            return None
    
        def get_requests(self, url, referer=None):
            headers = {
                "User-Agent": self.user_agent,
                "Referer": referer,
            }
            try:
                response = requests.get(url=url, headers=headers)
                if response.ok:
                    html_str = response.content.decode()
                    return html_str
                return None
            except Exception as e:
                logger.warning(e)
            time.sleep(random.uniform(0, 2))
    
        def parse_one_li(self, li):
            item = dict()
            song_name = li.xpath('./div[contains(@class,"songlist-title")]/span[@class="songname"]/a[1]/@title')
            item["song_name"] = song_name[0] if song_name else None
            song_href = li.xpath('./div[contains(@class,"songlist-title")]/span[@class="songname"]/a[1]/@href')
            song_id = song_href[0].rsplit('/', 1)[-1] if song_href else None
            item["song_id"] = song_id
            item['song_href'] = "http://music.taihe.com" + song_href[0] if song_href else "-1"
            # mv_id获取的有误
            song_mv_href = li.xpath('./div[contains(@class,"songlist-title")]/span[@class="songname"]/a[2]/@href')
            if song_mv_href:
                song_mv_id = song_href[0].rsplit('/', 1)[-1]
                item["song_mv_id"] = song_mv_id
                item['song_mv_href'] = "http://music.taihe.com" + song_mv_href[0]
            return item
    
        def parse_firstpage_html_str(self, html_str):
            html_str = etree.HTML(html_str)
            page_count = html_str.xpath(
                '//div[@class="list-box song-list-box active"]//div[@class="page_navigator-box"]//div[@class="page-inner"]/a[last()-1]/text()')
            page_count = int(page_count[0]) if page_count else -1
            li_list = html_str.xpath('//div[contains(@class,"song-list-wrap")]/ul/li')
            song_list = list()
            for li in li_list:
                item = self.parse_one_li(li=li)
                song_list.append(item)
            return song_list, page_count
    
        def parse_nextpage_html_str(self, data):
            data = json.loads(data)
            html_str = data["data"]["html"] if data else None
            if html_str:
                html_str = etree.HTML(html_str)
                li_list = html_str.xpath('//div[contains(@class, "song-list")]/ul/li')
                song_list = list()
                for li in li_list:
                    item = self.parse_one_li(li=li)
                    song_list.append(item)
                return song_list
    
        def parse_song_list(self, song_list, author_info):
            # 处理歌曲,构造存储结构
            song_ = {
                "author_info": author_info,
                "song_list": [],
            }
            for song in song_list:
                song.update(author_info)
                song_['song_list'].append(song)
            return song_
    
        def construct_url_list(self, ting_uid, page_count):
            base_url = "http://music.taihe.com/data/user/getsongs?start={}&size=15" + "&ting_uid={}".format(
                ting_uid)
            url_list = [base_url.format(i * 15) for i in range(1, page_count)]
            return url_list
    
        def insert_song_to_mongodb(self, song_):
            conditions = {"author_info.author_tinguid": song_["author_info"]["author_tinguid"]}
            song = self.mongo_pool.find_one(conditions)
            if not song:
                self.mongo_pool.insert_one(document=song_)
    
        def run(self, author_name=None):
            author_info = self.get_author_info(author_name)
            if author_info is None:
                print("该歌手不存在,请确认无误后再输入")
            else:
                # 验证数据库,有没有这个歌手的数据
                song__ = self.mongo_pool.find_one({'author_info.author_tinguid': author_info['author_tinguid']})
                if song__:
                    print("歌手数据已存在,无需再发送请求")
                else:
                    firstpage_html_str = self.get_requests(url=author_info["author_url"])
                    if firstpage_html_str:
                        # 获取首页歌曲信息
                        song_list, page_count = self.parse_firstpage_html_str(html_str=firstpage_html_str)
                        # 该页面采用ajax接口,需要设计下一页url
                        url_list = self.construct_url_list(author_info['author_tinguid'], page_count)
                        for url in url_list:
                            # 发送单页请求
                            nextpage_html_data = self.get_requests(url=url, referer=author_info["author_url"])
                            song_list_next = self.parse_nextpage_html_str(data=nextpage_html_data)
                            song_list.extend(song_list_next)
                        song_ = self.parse_song_list(song_list, author_info)
                        self.insert_song_to_mongodb(song_)
    
    
    if __name__ == "__main__":
        obj = SongTotal()
        obj.run(author_name="许嵩")
    

6.4 song_info_total_update.py

  • 代码

    import requests
    import time
    import random
    import json
    import re
    from lxml import etree
    from qianqianyinyue.utils.mongo_pool import MongoPool
    from qianqianyinyue.utils.log import logger
    from qianqianyinyue.utils.random_useragent import random_user_agent
    from pprint import pprint
    
    
    class SongInfoTotalUpdate(object):
        """
        根据作者的名字,更新该歌手的所有歌曲信息
        添加的信息有:
            发行日期
            发行公司
            歌曲下载链接及相关
            mv下载链接及相关
        """
        def __init__(self):
            self.mongo_pool = MongoPool(collection="songs")
            self.user_agent = random_user_agent
    
        def get_song_list(self, author_name):
            conditions = {"author_info.author_name": author_name}
            song_info = self.mongo_pool.find_one(conditions)
            song_list = song_info['song_list']
            return song_list
    
        def get_requests(self, url, referer=None):
            headers = {
                "User-Agent": self.user_agent,
                "Referer": referer,
            }
            try:
                response = requests.get(url=url, headers=headers)
                if response.ok:
                    html_str = response.content.decode()
                    return html_str
                return None
            except Exception as e:
                logger.warning(e)
            time.sleep(random.uniform(0, 2))
    
        def parse_html_str(self, html_str):
            html_str = etree.HTML(html_str)
            song_album = html_str.xpath('//div[@class="song-info-box fl"]/p[contains(@class, "album")]/a/text()')
            song_album = song_album[0] if song_album else None
            song_publish = html_str.xpath('//div[@class="song-info-box fl"]/p[contains(@class,"publish")]/text()')
            song_publish_date = song_publish[0].split(r":", 1)[-1].strip() if song_publish else None
            song_company = html_str.xpath('//div[@class="song-info-box fl"]/p[contains(@class,"company")]/text()')
            song_publish_company = song_company[0].split(r":", 1)[-1].strip() if song_company else None
            return song_album, song_publish_date, song_publish_company
    
        def parse_data(self, data):
            data = json.loads(data)
            if "bitrate" in data.keys():
                download_url = data['bitrate']['file_link']
                song_lrc_link = data['songinfo']['lrclink']
                return download_url, song_lrc_link
            return None, None
    
        def parse_mv_data(self, data):
            data = json.loads(data)
            mv_download_info = list()
            if "result" in data.keys():
                if "files" in data['result'].keys():
                    file_info = data['result']['files']
                    for k, v in file_info.items():
                        item = dict()
                        item['definition_name'] = v["definition_name"]
                        item['file_link'] = v['file_link']
                        mv_download_info.append(item)
            return mv_download_info
    
        def parse_mv_html_str(self, mv_html_str):
            mv_id = re.findall(r'href="/playmv/(.*?)"', mv_html_str, re.S)
            return mv_id[0] if mv_id else None
    
        def update_songs_info(self, info):
            conditions = {"author_info.author_tinguid": info["author_tinguid"]}
            songs_info = self.mongo_pool.find_one(conditions)
            for song_ in songs_info['song_list']:
                if info["song_id"] == song_['song_id']:
                    song_.update(info)
            self.mongo_pool.update_one(conditions, songs_info)
    
        def get_content(self, song):
            global song_album, song_publish_date, song_publish_company
            mv_download_url_info = dict()
            song_href = song['song_href']
            html_str = self.get_requests(url=song_href)
            if html_str:
                song_album, song_publish_date, song_publish_company = self.parse_html_str(html_str)
            url = "http://musicapi.taihe.com/v1/restserver/ting?method=baidu.ting.song.playAAC&format=jsonp&songid={}&from=web".format(
                song['song_id'])
            data = self.get_requests(url=url, referer=song['song_href'])
            download_url, song_lrc_link = self.parse_data(data)
            if song.get("song_mv_href"):
                mv_html_str = self.get_requests(url=song['song_mv_href'])
                mv_id = self.parse_mv_html_str(mv_html_str)
                if mv_id:
                    song["mv_id"] = mv_id
                    mv_data_url = "http://musicapi.taihe.com/v1/restserver/ting?method=baidu.ting.mv.playMV&mv_id={}".format(
                        song['mv_id'])
                    mv_data = self.get_requests(url=mv_data_url, referer=song["song_mv_href"])
                    mv_download_url_info = self.parse_mv_data(mv_data)
            info = {
                "song_album": song_album,
                "song_publish_date": song_publish_date,
                "song_publish_company": song_publish_company,
                "download_url": download_url,
                "song_lrc_link": song_lrc_link,
                "mv_download_url_info": mv_download_url_info,
                'song_id': song['song_id'],
                "song_mv_id": song['song_mv_id'] if song.get('song_mv_id') else None,
                "author_tinguid": song["author_tinguid"],
            }
            return info
    
        def run(self, author_name=None):
            if author_name:
                song_list = self.get_song_list(author_name)
                for song in song_list:
                    info = self.get_content(song)
                    self.update_songs_info(info)
            else:
                print("请输入歌手名字")
    
    
    if __name__ == "__main__":
        obj = SongInfoTotalUpdate()
        obj.run(author_name="许嵩")
    

7、bin模块

启动模块,负责串联功能模块

7.1 run.py

  • 代码

    import json
    import os
    from qianqianyinyue.utils.mongo_pool import MongoPool
    # from qianqianyinyue.utils.log import logger
    from qianqianyinyue.core.spiders.artist_total import ArtistTotal
    from qianqianyinyue.core.spiders.author_info_update import AuthorInfoUpdate
    from qianqianyinyue.core.spiders.song_total import SongTotal
    from qianqianyinyue.core.spiders.song_info_total_update import SongInfoTotalUpdate
    
    
    class ProcessConsole():
        def __init__(self):
            self.mongo_pool_ = MongoPool(collection="artist")
            self.mongo_pool = MongoPool(collection="songs")
    
        def _save_info_to_file(self, conditions):
            song_info = self.mongo_pool.find_one(conditions)
            if song_info:
                with open('song_list.txt', 'w+', encoding='utf8') as f:
                    for song in song_info["song_list"]:
                        f.write(json.dumps(song, ensure_ascii=False, indent=2))
                        f.write(os.linesep)
                        f.write('*'*20)
                        f.write(os.linesep)
                return "写入文件完成"
            return "写入文件出错:未找到该数据"
    
        def _check_artist_data(self):
            artist = self.mongo_pool_.find_all()
            if not artist:
                ArtistTotal().run()
                AuthorInfoUpdate().run()
                return "数据加载完成,欢迎使用"
            return "数据加载完成,欢迎使用"
    
        def _check_author_by_author_name(self, author_name):
            conditions = {"author_name": author_name}
            artist = self.mongo_pool_.find_one(conditions)
            if artist:
                conditions = {"author_info.author_name": author_name}
                if self.mongo_pool.find_one(conditions):
                    print("找到了该歌手数据,正在更新数据,请稍后查收文件song_list.txt")
                    SongInfoTotalUpdate().run(author_name)
                    self._save_info_to_file(conditions)
                else:
                    print("找到了该歌手数据,正在下载数据,请稍后查收文件song_list.txt")
                    SongTotal().run(author_name)
                    SongInfoTotalUpdate().run(author_name)
                    self._save_info_to_file(conditions)
            else:
                print("数据库中,无法找到该歌手,请核对无误后再输入")
    
        def run(self):
            print("欢迎使用歌曲查找系统,该系统能通过歌手名字,查找该歌手的所有歌曲信息")
            print("数据加载中...请稍等...")
            message = self._check_artist_data()
            print(message)
            author_name = input("请输入要查找的歌手名字:")
            self._check_author_by_author_name(author_name)
    
    
    if __name__ == "__main__":
        ProcessConsole().run()

 

8、总结

  • 总结

    • 通过chrome实现抓包分析,查找数据来源入口

    • 歌手罗列页返回一个html_str文件,通过lxml和re可提取相关信息

    • 在歌手详情页,会返回两个数据,一个数据返回包括简介类的信息,一个数据返回包括发行公司类的信息

      • 经过研究,歌手详情页首页返回一个htm_str数据,通过这个html_str能够拿到相应的发行公司数据,所以我直接通过的该html_str拿到的发行公司类的数据

      • 点击下一页时,发起的ajax请求,通过构造相应的url,可拿到相应的数据,注意返回的json数据中,data内返回的是xml格式字符串,可通过lxml格式化后,提取数据。

      • 将提取到的歌手简介类的信息,更新到mongodb相应的集合中

      • 将提取到的歌曲信息,格式化处理后,存储到mongodb相应的集合中

      • 以上,拿到该歌手所有的歌曲信息和该歌手的个人信息

    • 歌曲详情页,返回json数据,其中包括歌曲的下载链接,和歌词的数据链接

    • 用获取到的mv_url,来到mv首页,能够拿到mv的数据链接

    • 最后通过bin模块下的run.py来实现与用户的交互

  • 改进

    • 可以增加协程来减少数据获取的时间

    • 该域名下,如果爬取速度过快,会封禁ip,所以可以增加proxy反爬虫中间件

    • 用户目前只能通过歌手,来获取歌曲信息,可以细化到单个歌曲的获取及变更

      • 方案一,通过song_id,来构造请求,只更新局部,速度比较快

      • 方案二,利用现接口,然后更改数据库查找方式,但这样也意味着获取资源的速度的变慢

    • 构造web-api,可视化操作呈现

原文地址:https://www.cnblogs.com/nuochengze/p/13322625.html