Qzone (QQ空间) Feed Crawler

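Author: 虚静
Link: https://zhuanlan.zhihu.com/p/24656161
Source: Zhihu
Copyright belongs to the author. For commercial reprints, contact the author for permission; for non-commercial reprints, credit the source.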

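A few notes first:

The title means a crawler that fetches the Qzone feed (动态), not a "dynamic crawler" aimed at Qzone; the Chinese word 动态 can be read either way.
Here "Qzone feed" means specifically shuoshuo (说说) posts.
The program logs in with a cookie. If you want to know how a crawler can log in with a QQ number and password, you can close this page now.
The program is written in Python 3 (specifically Python 3.5); the only third-party library it needs is requests.
How to get the code is explained at the end.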
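
Since the program relies on a cookie copied from a logged-in browser session instead of a password login, the idea can be sketched roughly as below. The helper name build_headers, the User-Agent value, and the placeholder cookie string are assumptions for illustration; they are not taken from the author's code.

```python
def build_headers(cookie, referer=None):
    """Build request headers that carry a browser cookie string.

    `cookie` is the raw "name=value; name2=value2" text copied from the
    browser's developer tools. Hypothetical helper, not the author's code.
    """
    headers = {
        'Cookie': cookie,
        'User-Agent': 'Mozilla/5.0',  # placeholder value (assumption)
    }
    if referer is not None:
        headers['Referer'] = referer
    return headers


if __name__ == '__main__':
    # Placeholder cookie; a real one comes from a logged-in browser session.
    print(build_headers('uin=o0123456; p_skey=PLACEHOLDER'))
```

A dict built this way can be passed as the headers argument of requests.get, which is how the code excerpts in this article send their requests.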

－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－

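The program has three main parts, one for each step of the crawl.

1. Get information on all QQ friends

This is done indirectly. First set the access permission of your Qzone to "QQ friends only".

After you click Save, a notice appears at the top: "Under the current permission, XXX friends can access your space", as in the screenshot above. Now open the developer tools (F12) and switch to the JavaScript monitoring panel. Click the underlined words in the screenshot, and the browser sends a GET request. In Firebug it looks like this:

Its response consists of your friends' names and QQ numbers in a nearly JSON format. get_my_friends.py in the crawler fetches this content; its main code is: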

    # Needs "import requests" at module level.
    def get_friends(self):
        position = 0
        referer = 'http://qzs.qq.com/qzone/v8/pages/setting/visit_v8.html'
        self.headers['Referer'] = referer
        while True:
            # Each request returns one page of friends, starting at offset
            url = self.base_url + '&offset=' + str(position)

            print("\tDealing with position\t%d." % position)
            res = requests.get(url, headers=self.headers)
            html = res.text
            with open('friends/offset' + str(position) + '.json', 'w') as f:
                f.write(html)

            # Once every friend has been fetched, "uinlist" is an empty list
            if '"uinlist":[]' in html:
                print("Get friends Finish")
                break

            position += 50

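2. Extract all friends' QQ numbers

This step is really just text (string) processing. It takes the files saved in the previous step, extracts each friend's QQ number and name, and saves them in one file named qqnumber.inc. Because the content is already close to a dictionary, a little trimming turns it into a dict that can be processed further. The script for this is get_qq_number.py; its main code is: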

import json
import os

def exact_qq_number(self):
    friendsFiles = [x for x in os.listdir('friends') if x.endswith("json")]

    qqnumber_item = []
    for each_file in friendsFiles:
        with open('friends/' + each_file) as f:
            source = f.read()
        # Cut off the non-JSON text around the payload (fixed offsets)
        # and drop newlines before parsing
        con_dict = source[75:-4].replace('\n', '')
        con_json = json.loads(con_dict)
        friends_list = con_json['uinlist']

        # Get each item from friends list, each item is a dict
        for item in friends_list:
            qqnumber_item.append(item)

    # Saved as the str() of a list of dicts, not as JSON
    with open('qqnumber.inc', 'w') as qqfile:
        qqfile.write(str(qqnumber_item))
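
The fixed offsets in source[75:-4] only work while the text around the JSON payload keeps exactly the same length. A more tolerant alternative, sketched here as an assumption and not as the author's method, is to cut at the first '{' and the last '}'. The function name strip_wrapper and the sample string are made up for illustration; the real response wrapper may differ, and uinlist may sit deeper inside the parsed object than it does in this sample.

```python
import json


def strip_wrapper(text):
    """Parse the JSON object embedded in `text`.

    Cuts at the first '{' and the last '}' instead of using fixed offsets.
    Assumes the payload is a single JSON object (hypothetical helper).
    """
    start = text.find('{')
    end = text.rfind('}')
    if start == -1 or end < start:
        raise ValueError('no JSON object found')
    return json.loads(text[start:end + 1])


if __name__ == '__main__':
    # Made-up sample; the real wrapper text is not shown in the article.
    sample = 'callback({"uinlist": [{"name": "x"}]});'
    print(strip_wrapper(sample)['uinlist'])
```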

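3. Fetch each friend's feed posts (shuoshuo)

Fetching a friend's posts works much like step 1. Open the developer tools (F12) and stay on the default All tab. Then open the friend's Qzone and go to their shuoshuo page. In the request list you will find a request whose URL contains emotion_cgi_msglist; as the name suggests, it carries the data we want. We can then replicate that request, fetch the response, and save it. get_moods.py in the crawler does this.

This file contains two classes: Get_moods_start() and Get_moods(). The latter sends the HTTP requests, receives the responses, and saves them; the former passes QQ numbers to the latter's method, controls the loop, and handles exceptions. The main method of Get_moods() is: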

import sys
import time
# util is the project's own util.py

def get_moods(self, qqnumber):
    '''Use cookie and header to get moods file and save it to result folder with QQnumber name'''

    referer = 'http://user.qzone.qq.com/' + qqnumber
    self.headers['Referer'] = referer

    # Create a folder named after the QQ number to hold its result files
    util.check_path('mood_result/' + qqnumber)

    # Get the target URL, without the position argument.
    url_base = util.parse_moods_url(qqnumber)
    pos = 0
    key = True

    while key:
        print("\tDealing with position:\t%d" % pos)
        url = url_base + "&pos=%d" % pos
        res = self.session.get(url, headers=self.headers)
        con = res.text
        with open('mood_result/' + qqnumber + '/' + str(pos), 'w') as f:
            f.write(con)

        # No more posts to fetch
        if '''"msglist":null''' in con:
            key = False

        # Cannot access...
        if '''"msgnum":0''' in con:
            with open('crawler_log.log', 'a') as log_file:
                log_file.write("%s Cannot access..\n" % qqnumber)
            key = False

        # Cookie expired
        if '''"subcode":-4001''' in con:
            with open('crawler_log.log', 'a') as log_file:
                log_file.write('Cookie Expired! Time is %s\n' % time.ctime())
            sys.exit()

        pos += 20
        time.sleep(5)

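The results are saved in a folder named mood_result, which contains one subfolder per friend, named after that friend's QQ number; each friend's post files are stored in the corresponding subfolder.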
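
To feed QQ numbers into get_moods, the list saved in step 2 has to be read back first. Because qqnumber.inc is written with str() on a list of dicts, it holds a Python literal, not JSON, so json.loads may reject it. A minimal sketch of reading it back, assuming the file contains only that literal, is shown here; load_friend_records is a hypothetical name, and which dict key holds the QQ number depends on the response and is not shown in this article.

```python
import ast


def load_friend_records(path='qqnumber.inc'):
    """Read back the list of dicts that step 2 saved with str().

    ast.literal_eval only evaluates literals, so it is a safer choice
    than eval() for this kind of file (hypothetical helper).
    """
    with open(path) as f:
        records = ast.literal_eval(f.read())
    if not isinstance(records, list):
        raise ValueError('expected a list of friend records')
    return records


if __name__ == '__main__':
    for record in load_friend_records():
        print(record)
```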

－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－

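Other notes

The program has two more files, util.py and main.py. The latter is the entry point; the former holds shared helpers, such as getting the cookie, generating the g_tk value needed when sending HTTP requests, and building URLs. A few words on g_tk.

In steps 1 and 3 above, the URL parameters of each HTTP request include a g_tk value, which is generated from the value of the p_skey field in the cookie. You can find the algorithm by inspecting the JS files with F12 while logging in to Qzone. It is in the JS file named qzfl_v8_2.1.57. Because the file is large, nearly 6,000 lines, you will not find it by just reading the response in Firebug; search within the response, or open the file on its own in the browser to see its full content. There you will find how g_tk is computed:

Do not be misled by the name hash here: in Python, hash() is a built-in function, but in this JS code it is just a variable name. In this crawler it is implemented like this: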

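def get_g_tk():
    ''' make g_tk value'''

    # cookie is the raw cookie string, defined at module level
    pskey_start = cookie.find('p_skey=')
    pskey_end = cookie.find(';', pskey_start)
    if pskey_end == -1:
        # p_skey is the last field, so there is no trailing ';'
        pskey_end = len(cookie)
    p_skey = cookie[pskey_start+7: pskey_end]

    h = 5381

    for s in p_skey:
        h += (h << 5) + ord(s)

    return h & 2147483647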

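It relies mainly on a left shift and a bitwise AND: each step computes h * 33 + ord(s), and the final & 2147483647 keeps the low 31 bits. The result is a hash of p_skey, not a strictly unique value.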
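
To make the calculation easier to check by hand, here is a small self-contained sketch that separates the cookie parsing from the hashing. The function names extract_p_skey and g_tk are hypothetical; the arithmetic mirrors get_g_tk above.

```python
def extract_p_skey(cookie):
    """Pull the p_skey value out of a raw cookie string."""
    for field in cookie.split(';'):
        name, _, value = field.strip().partition('=')
        if name == 'p_skey':
            return value
    raise KeyError('p_skey not found in cookie')


def g_tk(p_skey):
    """Hash p_skey with the same arithmetic as get_g_tk."""
    h = 5381
    for ch in p_skey:
        h += (h << 5) + ord(ch)   # same as h = h * 33 + ord(ch)
    return h & 0x7FFFFFFF         # 2147483647, keeps the low 31 bits


if __name__ == '__main__':
    print(g_tk(''))    # 5381: no characters, so the seed is returned
    print(g_tk('a'))   # 177670 = 5381 * 33 + 97
```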

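Finally

As the later part of the step 3 code shows, if a friend's Qzone is not open to you, their posts cannot be fetched: the request still returns a response, but its main content is empty.

If the cookie expires, the program logs it and exits automatically. My run took 15 hours, requested post files for 494 friends, and sent over 11,000 requests (each request yields one file, so my results folder holds that many files); the cookie did not expire and Qzone did not block the crawler. Oh, and to avoid triggering anti-crawling measures, the program pauses 5 seconds after each request. (That is why it is so slow, and why I did not dare use multithreading.)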
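
The run time reported above follows almost entirely from the pauses. As a rough check, taking 11,000 as the request count (the article says "over 11,000"), the sleeping alone accounts for about 15 hours. The function name estimated_hours is made up for this illustration.

```python
def estimated_hours(num_requests, delay_seconds):
    """Lower bound on run time: counts only the sleeps, not network time."""
    return num_requests * delay_seconds / 3600


if __name__ == '__main__':
    print(round(estimated_hours(11000, 5), 1))  # 15.3
```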

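The post files you end up with for all your friends still need to be parsed to pull out whatever information you want. This program only fetches the raw data; it does not process it.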
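
As a starting point for that processing, the sketch below collects the msglist entries from one friend's folder. It is not part of the author's program. It assumes that each saved file wraps a single JSON object, that msglist sits at the top level of that object, and that the files are named by their numeric pos offset as in step 3; load_msglist is a hypothetical name.

```python
import json
import os


def load_msglist(folder):
    """Collect the msglist entries from every file in one friend's folder."""
    posts = []
    # File names are the numeric pos offsets, so sort them as integers
    names = [n for n in os.listdir(folder) if n.isdigit()]
    for name in sorted(names, key=int):
        with open(os.path.join(folder, name)) as f:
            text = f.read()
        start, end = text.find('{'), text.rfind('}')
        if start == -1 or end < start:
            continue  # no JSON payload in this file; skip it
        data = json.loads(text[start:end + 1])
        posts.extend(data.get('msglist') or [])  # msglist may be null
    return posts
```

Calling load_msglist on a path such as mood_result/ followed by a QQ number would return that friend's entries as a list of dicts, in the order they were fetched.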

GitHub code link: QQzone_crawler