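Cracking Maoyan Movies' Obfuscated Digits (Scraping Ratings, Box Office, and Ticket Prices)

title: Cracking Maoyan Movies' Obfuscated Digits (Scraping Ratings, Box Office, and Ticket Prices)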
toc: true
date: 2018-07-01 22:05:27
categories:

methods

tags:

Web scraping
Python

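Background

While scraping Maoyan Movies, I found that the ratings, box-office figures, and ticket prices in the downloaded pages are not actual digits but strings of codes such as uf5fb, which have to be decoded.

These codes are regenerated randomly on every request, and so is their mapping to the digits 0-9.

Solution

Download the dynamically generated font file and work out the mapping from it.

Approach

First, locate the URL of the dynamic font file (inside the style tag within the head tag):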

<style>
    @font-face {
      font-family: stonefont;
      src: url('//vfile.meituan.net/colorstone/e954129d5204b4e8c783c95f7da4c2733168.eot');
      src: url('//vfile.meituan.net/colorstone/e954129d5204b4e8c783c95f7da4c2733168.eot?#iefix') format('embedded-opentype'),
           url('//vfile.meituan.net/colorstone/8f497cdb4e39d1f3dcbafa28a486aea42076.woff') format('woff');
    }

    .stonefont {
      font-family: stonefont;
    }
  </style>

The .woff file is the one we need.

The download code is as follows (using Scrapy):

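import urllib.request

# Download the font file. `sel` is the Scrapy selector for the page response.
font_url = sel.xpath('/html/head/style/text()').extract()[0]
# The last url(...) in the @font-face rule is the .woff file.
# The URL is protocol-relative, so prepend a scheme.
font_url = 'http:' + font_url[font_url.rfind('url') + 5:font_url.find('woff') + 4]
print(font_url)
woff_path = 'tmp.woff'
with urllib.request.urlopen(font_url) as f:
    data = f.read()
with open(woff_path, 'wb') as out:
    out.write(data)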
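
The slicing above relies on the exact positions of 'url' and 'woff' in the stylesheet text. A regular expression is one possible, slightly more defensive alternative. The sketch below is only an illustration: `extract_woff_url` and its `scheme` parameter are hypothetical names that do not appear in the original code, and the sample CSS is the style block shown earlier.

```python
import re

def extract_woff_url(style_text, scheme="http:"):
    """Return the URL of the first .woff file referenced in a CSS snippet, or None."""
    match = re.search(r"url\(\s*['\"]?([^'\")\s]+\.woff)['\"]?\s*\)", style_text)
    if match is None:
        return None
    url = match.group(1)
    # Protocol-relative URLs (starting with //) need a scheme before they can be fetched.
    return scheme + url if url.startswith("//") else url

css = """
@font-face {
  font-family: stonefont;
  src: url('//vfile.meituan.net/colorstone/e954129d5204b4e8c783c95f7da4c2733168.eot');
  src: url('//vfile.meituan.net/colorstone/e954129d5204b4e8c783c95f7da4c2733168.eot?#iefix') format('embedded-opentype'),
       url('//vfile.meituan.net/colorstone/8f497cdb4e39d1f3dcbafa28a486aea42076.woff') format('woff');
}
"""
print(extract_woff_url(css))
# http://vfile.meituan.net/colorstone/8f497cdb4e39d1f3dcbafa28a486aea42076.woff
```

Returning None when nothing matches lets the caller decide how to handle a page whose stylesheet has changed, instead of silently building a wrong URL.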

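Use TTFont (from fontTools) to convert the woff file to an xml file:

from fontTools.ttLib import TTFont

font1 = TTFont('tmp.woff')
font1.saveXML('tmp.xml')

The xml file contains what looks like a mapping: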

<GlyphOrder>
    <!-- The 'id' attribute is only for humans; it is ignored when parsed. -->
    <GlyphID id="0" name="glyph00000"/>
    <GlyphID id="1" name="x"/>
    <GlyphID id="2" name="uniF753"/>
    <GlyphID id="3" name="uniEA72"/>
    <GlyphID id="4" name="uniEE4E"/>
    <GlyphID id="5" name="uniECE6"/>
    <GlyphID id="6" name="uniE140"/>
    <GlyphID id="7" name="uniF4B0"/>
    <GlyphID id="8" name="uniE1B7"/>
    <GlyphID id="9" name="uniF245"/>
    <GlyphID id="10" name="uniE488"/>
    <GlyphID id="11" name="uniE6DA"/>
</GlyphOrder>

However, decoding with this mapping produces the wrong digits, so GlyphOrder is not the mapping we need.

Further down in the xml file is the glyph outline data:

<TTGlyph name="uniF245" xMin="0" yMin="0" xMax="508" yMax="716">
  <contour>
    <pt x="323" y="0" on="1"/>
    <pt x="323" y="171" on="1"/>
    <pt x="13" y="171" on="1"/>
    <pt x="13" y="252" on="1"/>
    <pt x="339" y="716" on="1"/>
    <pt x="411" y="716" on="1"/>
    <pt x="411" y="252" on="1"/>
    <pt x="508" y="252" on="1"/>
    <pt x="508" y="171" on="1"/>
    <pt x="411" y="171" on="1"/>
    <pt x="411" y="0" on="1"/>
  </contour>
  <contour>
    <pt x="323" y="252" on="1"/>
    <pt x="323" y="575" on="1"/>
    <pt x="99" y="252" on="1"/>
  </contour>
  <instructions/>
</TTGlyph>

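At this point it occurred to me that however the unicode code points change, the rendered shape of each digit stays the same, so the glyph data is the place to start:

Each digit 0-9 has its own TTGlyph data. First, analyze one font file whose mapping is known and record the glyph data for 0-9. Then, for every newly downloaded dynamic font file, compare its glyph data against those reference glyphs to recover its mapping.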
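
The idea can be sketched in a few lines before going through the details. This is a simplified illustration under stated assumptions, not the implementation used in this post: it uses the standard-library ElementTree rather than minidom, `glyph_fingerprints` and `match_glyphs` are hypothetical helper names, and the two tiny XML documents contain made-up outlines standing in for real digit shapes.

```python
import xml.etree.ElementTree as ET

def glyph_fingerprints(ttx_xml):
    """Map each TTGlyph name to a hashable tuple of its (x, y, on) points."""
    root = ET.fromstring(ttx_xml)
    return {
        glyph.get("name"): tuple(
            (pt.get("x"), pt.get("y"), pt.get("on")) for pt in glyph.iter("pt")
        )
        for glyph in root.iter("TTGlyph")
    }

def match_glyphs(known_xml, known_digits, new_xml):
    """Return {new glyph name: digit} for glyphs whose outline equals a known one."""
    known = glyph_fingerprints(known_xml)
    outline_to_digit = {
        known[name]: digit for name, digit in known_digits.items() if name in known
    }
    return {
        name: outline_to_digit[outline]
        for name, outline in glyph_fingerprints(new_xml).items()
        if outline in outline_to_digit
    }

# Toy data: the outlines are invented, only the structure mirrors the real xml.
KNOWN = """<ttFont><glyf>
<TTGlyph name="uniE184"><contour><pt x="0" y="0" on="1"/><pt x="5" y="9" on="1"/></contour></TTGlyph>
<TTGlyph name="uniE80B"><contour><pt x="1" y="1" on="1"/><pt x="7" y="3" on="0"/></contour></TTGlyph>
</glyf></ttFont>"""
NEW = """<ttFont><glyf>
<TTGlyph name="uniF245"><contour><pt x="1" y="1" on="1"/><pt x="7" y="3" on="0"/></contour></TTGlyph>
<TTGlyph name="uniEA72"><contour><pt x="0" y="0" on="1"/><pt x="5" y="9" on="1"/></contour></TTGlyph>
</glyf></ttFont>"""

print(match_glyphs(KNOWN, {"uniE184": "4", "uniE80B": "3"}, NEW))
# {'uniF245': '3', 'uniEA72': '4'}
```

Turning each outline into a tuple makes it usable as a dictionary key, so every new glyph is resolved with a single lookup instead of being compared against every reference glyph in turn.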

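First we need an xml file with a known mapping to serve as the reference; name it data.xml. Then open the corresponding woff file in the Baidu font editor to read off its mapping. (I had already deleted the woff file that my data.xml came from, so the screenshot here shows the mapping of an arbitrary woff file and may differ from the mapping used in the code below.)

Create the dictionary for the mapping in data.xml: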

data_dict = {"uniE184":"4","uniE80B":"3","uniF22E":"8","uniE14C":"0",
		"uniF5FB":"6","uniEE59":"5","uniEBD3":"1","uniED85":"7","uniECB8":"2","uniE96A":"9"}

Comparing glyph data means parsing the xml files, so we create a few xml helper functions:

Get the value of a given attribute of a node:

def getValue(node, attribute):
	return node.attributes[attribute].value

Glyph data is stored in TTGlyph tags. This function returns all glyph nodes in an xml file:

import os
import xml.dom.minidom as xmldom

def getTTGlyphList(xml_path):
	dataXmlfilepath = os.path.abspath(xml_path)
	dataDomObj = xmldom.parse(dataXmlfilepath)
	dataElementObj = dataDomObj.documentElement
	dataTTGlyphList = dataElementObj.getElementsByTagName('TTGlyph')
	return dataTTGlyphList

A function that checks whether two TTGlyph nodes hold the same data:

def isEqual(ttglyph_a, ttglyph_b):
	# Two glyphs match when they have the same points in the same order.
	a_pt_list = ttglyph_a.getElementsByTagName('pt')
	b_pt_list = ttglyph_b.getElementsByTagName('pt')
	if len(a_pt_list) != len(b_pt_list):
		return False
	for pt_a, pt_b in zip(a_pt_list, b_pt_list):
		for attr in ('x', 'y', 'on'):
			if getValue(pt_a, attr) != getValue(pt_b, attr):
				return False
	return True

===============================================

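With the helper functions in place, we can continue:

Because the unicode code points are regenerated on every request, we also need to know which code points the new font uses for 0-9. For convenience, these are read from the GlyphOrder mentioned above (the one whose ordering is not the real mapping), giving a dictionary whose keys are the glyph names (uniXXXX) and whose values are numbers that serve only as placeholders for now: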

decode_dict = dict(enumerate(font1.getGlyphOrder()[2:]))
decode_dict = dict(zip(decode_dict.values(),decode_dict.keys()))
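
To make the result of these two lines concrete, the sketch below applies them to the glyph names from the GlyphOrder dump shown earlier. The list is typed in by hand here instead of being read from a font file, which is an assumption made so the example runs without fontTools.

```python
# Glyph names copied from the GlyphOrder dump shown earlier.
glyph_order = ["glyph00000", "x", "uniF753", "uniEA72", "uniEE4E", "uniECE6",
               "uniE140", "uniF4B0", "uniE1B7", "uniF245", "uniE488", "uniE6DA"]

# Same two steps as in the post: index -> name, then inverted to name -> index.
decode_dict = dict(enumerate(glyph_order[2:]))
decode_dict = dict(zip(decode_dict.values(), decode_dict.keys()))

# A one-step equivalent of the two lines above.
assert decode_dict == {name: i for i, name in enumerate(glyph_order[2:])}

print(decode_dict["uniF753"], decode_dict["uniF245"])  # 0 7
```

The values are simply positions in the glyph order. For example, uniF245 gets 7 here, although the outline listed earlier for that glyph appears, from reading its points, to be the digit 4. This is why the dictionary still has to be corrected with the glyph data.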

Get the glyph nodes from data.xml (known mapping) and from the new dynamic font file:

dataTTGlyphList = getTTGlyphList("data.xml")
tmpTTGlyphList = getTTGlyphList("tmp.xml")

Update the mapping dictionary using the glyph data:

decode_dict = refresh(decode_dict,tmpTTGlyphList,dataTTGlyphList)

The update function is implemented as follows:

def refresh(decode_dict, ttGlyphList_a, ttGlyphList_data):
	# Known mapping for the reference font (data.xml).
	data_dict = {"uniE184":"4","uniE80B":"3","uniF22E":"8","uniE14C":"0",
		"uniF5FB":"6","uniEE59":"5","uniEBD3":"1","uniED85":"7","uniECB8":"2","uniE96A":"9"}
	for ttglyph_data in ttGlyphList_data:
		data_name = getValue(ttglyph_data, 'name')
		if data_name in data_dict:
			for ttglyph_a in ttGlyphList_a:
				if isEqual(ttglyph_a, ttglyph_data):
					# Same outline, so the new glyph name stands for the same digit.
					decode_dict[getValue(ttglyph_a, 'name')] = data_dict[data_name]
					break
	return decode_dict

To handle decimals, add a mapping for the decimal point:

decode_dict['.'] = '.'

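Implement the decode function (it takes the mapping dictionary and a value to decode, and returns the decoded result, e.g. 15.6):

def decode(decode_dict, code):
	_lst_unicode = []
	# repr() turns the private-use characters into \uXXXX escapes.
	# The separator must be written "\\u": a bare "\u" is a syntax error in Python 3.
	for item in repr(code).split("\\u"):
		_lst_unicode.append("uni" + item[:4].upper())
		if item[4:]:
			_lst_unicode.append(item[4:])
	# Drop the first and last entries, which come from the quotes added by repr().
	_lst_unicode = _lst_unicode[1:-1]
	result = "".join([str(decode_dict[i]) for i in _lst_unicode])
	return result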
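
As an alternative to splitting the repr of the string, the value can also be decoded one character at a time. The sketch below is a hedged variant rather than the code used in this post: `decode_chars` is a hypothetical name, and the small `mapping` dictionary reuses two entries from data_dict purely for illustration.

```python
def decode_chars(decode_dict, code):
    """Decode character by character; characters without a mapping are kept as-is."""
    out = []
    for ch in code:
        name = "uni%04X" % ord(ch)  # e.g. '\uf5fb' -> 'uniF5FB'
        out.append(str(decode_dict.get(name, ch)))
    return "".join(out)

mapping = {"uniF5FB": "6", "uniE184": "4"}
print(decode_chars(mapping, "\uf5fb.\ue184"))  # 6.4
```

Because unmapped characters pass through unchanged, this variant needs no separate entry for the decimal point, and a trailing unit character after the digits is preserved.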

==================================================

Link to the full code