hualala's Python journey into quantitative finance

The first step in quantitative finance: data statistics and analysis.

The textbook I chose is Python for Data Analysis, published by O'Reilly.


Practical examples

1. Processing the 1.usa.gov data from bit.ly

  1) Data: http://www.usa.gov/About/developer-resources/1usagov.shtml

    The data is in the common JSON format, one record per line.

  2) Converting the JSON into dictionaries

    Note: I saved the data locally as a TXT file before processing it. The delimiters around each line need to be stripped, and because some lines carry a BOM character, that character has to be removed as well. The resulting dictionaries are then read into a list.

import json

from collections import defaultdict
from collections import Counter

records = []
for line in open("haha6.txt", encoding = "utf8"):
    line = line.strip()
    if line.startswith("\ufeff"):
        line = line.encode("utf8")[3:].decode("utf8") # strip the BOM character
    line = json.loads(line)
    records.append(line)

print(records[0])

#output:
#The first record looks like this:
#{'u': 'http://today.lbl.gov/2016/06/24/saudi-minister-of-energy-visits-lab-on-june-20/#main',
#'_id': '27e6808c-3750-e5ac-002a-cfb577e72a48', 'r': 'direct', 'sl': '2963Ceb', 'h': '2963Ceb',
#'k': '', 'a': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML', 'c': 'FR',
#'hc': 1466804416, 'nk': 0, 'll': [48.8582, 2.3387], 'g': '2963Fqo',
#'t': 1467187377, 'hh': '1.usa.gov', 'l': 'anonymous', 'i': '', 'tz': 'Europe/Paris'}
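As an aside, if the BOM only occurs at the very start of the file, Python's built-in utf-8-sig codec strips it automatically, making the manual check above unnecessary. A minimal sketch, assuming the same haha6.txt file:

with open("haha6.txt", encoding = "utf-8-sig") as f:
    # utf-8-sig silently removes a leading BOM while decoding
    records = [json.loads(line) for line in f if line.strip()]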

  3) Finding all the time zones and counting them

time_zones = [rec["tz"] for rec in records if "tz" in rec]

## Count the time zones, i.e. tally one key across the dicts in the list
#Method 1: plain dict
def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts
counts = get_counts(time_zones)
print(counts["America/New_York"])

#Method 2: defaultdict
def get_counts1(sequence):
    counts = defaultdict(int)
    for x in sequence:
        counts[x] += 1
    return counts
counts = get_counts1(time_zones)
print(counts["America/New_York"])

#output: 353
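As a quick sanity check (a toy list of my own, not from the dataset), the two helpers return the same tallies:

sample = ["a", "b", "a", "c", "a", "b"]
print(get_counts(sample))   # {'a': 3, 'b': 2, 'c': 1}
print(get_counts1(sample))  # defaultdict(<class 'int'>, {'a': 3, 'b': 2, 'c': 1})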

  4) Extracting the top ten time zones and their counts

#Method 1: sort (count, tz) pairs and take the largest n
def top_counts(count_dict, n = 10):
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    value_key_pairs.sort()
    return value_key_pairs[-n:]
print(top_counts(counts))
#Method 2: collections.Counter
counts = Counter(time_zones)
print(counts.most_common(10))
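Note that method 1 sorts ascending, so the busiest time zone is the last element of the returned list, while Counter.most_common already yields (tz, count) pairs in descending order. For comparison, a small sketch of the same top ten via sorted():

top10 = sorted(counts.items(), key = lambda kv: kv[1], reverse = True)[:10]
print(top10)  # (tz, count) pairs, largest count first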

  5) Simplifying with pandas: count the time zones and draw a bar chart of the top ten

#Count the time zones with pandas
from pandas import DataFrame
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

frame = DataFrame(records)
#print(frame)
#tz_counts = frame["tz"].value_counts()
#print(tz_counts[:10])
clean_tz = frame["tz"].fillna("missing") # fill in missing values
clean_tz[clean_tz == ""] = "unknown" # relabel empty strings
tz_counts = clean_tz.value_counts()
tz_counts[:10].plot(kind = "barh", rot=0)
plt.show()

#output: a horizontal bar chart of the top ten time zones
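The two cleanup steps above can also be written as one method chain with Series.replace, which yields the same counts; a sketch:

tz_counts = (frame["tz"]
             .fillna("missing")       # NaN -> "missing"
             .replace("", "unknown")  # empty string -> "unknown"
             .value_counts())
print(tz_counts[:10])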

2. Processing the MovieLens dataset

  1) Data: http://www.grouplens.org/node/73

    The data is split across three files (the :: delimiter is illustrated right after this list):

    - Users file, e.g.: 1::F::1::10::48067

    - Ratings file, e.g.: 1::1193::5::978300760

    - Movies file, e.g.: 1::Toy Story (1995)::Animation|Children's|Comedy
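Each field is separated by a double colon, which is why the loading code below passes sep = "::". A quick illustration on one user record (my own check, not part of the pipeline):

fields = "1::F::1::10::48067".split("::")
print(fields)  # ['1', 'F', '1', '10', '48067'] -> user_id, gender, age, occupation, zip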

  2) Loading the files into tables

import pandas as pd

usernames = ["user_id", "gender", "age", "occupation", "zip"]
users = pd.read_table("ml-1m/users.dat", sep = "::", header = None, names = usernames, engine = "python")

rnames = ["user_id", "movie_id", "rating", "timestamp"]
ratings = pd.read_table("ml-1m/ratings.dat", sep = "::", header = None, names = rnames, engine = "python")

movienames = ["movie_id", "title", "genres"]
movies = pd.read_table("ml-1m/movies.dat", sep = "::", header = None, names = movienames, engine = "python")
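To confirm the files parsed as intended, it helps to peek at the first few rows of each table; a quick check, assuming the ml-1m/ folder layout above:

print(users.head())
print(ratings.head())
print(movies.head())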

  3) Merging the three tables

data = pd.merge(pd.merge(ratings, users), movies)
#print(data)
print(data.iloc[0])
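By default pd.merge joins on the column names the two frames share: ratings and users share user_id, and the intermediate result shares movie_id with movies. The same merge with the keys spelled out explicitly, as a sketch:

data = pd.merge(pd.merge(ratings, users, on = "user_id"), movies, on = "movie_id")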

  4) Computing each movie's mean rating by gender

mean_rating = data.pivot_table("rating", index = "title", columns = "gender", aggfunc = "mean")
print(mean_rating[:5])
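The pivot table is equivalent to grouping by (title, gender) and unstacking gender into columns, which makes the aggregation step easier to see; a sketch that produces the same frame:

mean_rating = data.groupby(["title", "gender"])["rating"].mean().unstack()
print(mean_rating[:5])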