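An implementation of the naive cube algorithm described in this paper. Written in Python, it really does come out looking almost like the pseudocode =.=
The input looks roughly like this. The columns are, in order: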
index userid country state city topic category product sales
1 400141 3 78 3427 3 59 4967 4670.08
2 783984 1 34 9 1 5 982 5340.9
3 4945 1 47 1658 1 7 363 3065.37
4 468352 2 57 2410 2 37 3688 9561.13
5 553471 1 25 550 1 13 1476 3596.72
6 649149 1 9 234 1 12 1456 2126.29
...
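The output format is as follows: for each set of attrs (identified by column position rather than by name) and each combination of their values, output the measure computed over the corresponding group. Here the measure is the number of distinct userids in the group.
<attr> <attr> <attr> ...|<value> <value> ... <measure>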
mapper:
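#!/usr/bin/env python
import sys
from itertools import product


def seq(start, end):
    # Every prefix of the column range start..end, from the empty
    # prefix up to the full range: [], [start], ..., [start..end].
    # list() keeps the `a + b` below working on Python 3 as well.
    return [list(range(start, i)) for i in range(start, end + 2)]


def read_input(file):
    for line in file:
        yield line.split()


def main():
    data = read_input(sys.stdin)
    # All cube regions: a prefix of the location columns (2-4)
    # combined with a prefix of the topic columns (5-7).
    C = [a + b for a, b in product(seq(2, 4), seq(5, 7))]
    for e in data:
        for R in C:
            k = [e[i] for i in R]
            # Emit "<attrs>|<values> <userid>"; e[1] is the userid.
            print("%s|%s %s" % (' '.join(str(i) for i in R), ' '.join(k), e[1]))


if __name__ == "__main__":
    main()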
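To make it easier to see which groups the mapper actually emits, here is a small sketch that just enumerates the regions. It assumes Python 3, and the names `prefixes` and `cube_regions` are made up for illustration (they mirror `seq` and `C` in the mapper). The column names come from the input description above.

```python
from itertools import product


def prefixes(start, end):
    """Return every prefix of the column range start..end, including the empty one."""
    return [list(range(start, i)) for i in range(start, end + 2)]


def cube_regions():
    """Combine location prefixes (columns 2-4) with topic prefixes (columns 5-7)."""
    return [a + b for a, b in product(prefixes(2, 4), prefixes(5, 7))]


if __name__ == "__main__":
    names = "index userid country state city topic category product sales".split()
    for region in cube_regions():
        # Print the column positions next to their names.
        print(region, [names[i] for i in region])
```

Each dimension contributes 4 prefixes (including the empty one), so there are 4 x 4 = 16 regions. Only prefixes are generated, so a region like [3] (state without country) never appears.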
reducer:
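#!/usr/bin/env python
from itertools import groupby
from operator import itemgetter
import sys


def read_input(file):
    for line in file:
        # The key itself contains spaces, so split on the last space only:
        # "<attrs>|<values> <userid>" -> ["<attrs>|<values>", "<userid>"].
        yield line.rstrip().rsplit(' ', 1)


def main():
    data = read_input(sys.stdin)
    # groupby only merges adjacent records, so the input must be sorted by key.
    for key, group in groupby(data, itemgetter(0)):
        ids = set(uid for _, uid in group)
        print("%s %d" % (key, len(ids)))


if __name__ == "__main__":
    main()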
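For a quick sanity check without a cluster, the whole thing can be simulated in one process. This is only a rough sketch, assuming Python 3: an in-memory `sorted()` stands in for the framework's shuffle/sort, and `map_phase`, `reduce_phase` and `run` are hypothetical names, not part of the scripts above. The sample rows are the six shown in the input section.

```python
from itertools import groupby, product

SAMPLE = """\
1 400141 3 78 3427 3 59 4967 4670.08
2 783984 1 34 9 1 5 982 5340.9
3 4945 1 47 1658 1 7 363 3065.37
4 468352 2 57 2410 2 37 3688 9561.13
5 553471 1 25 550 1 13 1476 3596.72
6 649149 1 9 234 1 12 1456 2126.29
"""


def prefixes(start, end):
    """Every prefix of the column range start..end, including the empty one."""
    return [list(range(start, i)) for i in range(start, end + 2)]


REGIONS = [a + b for a, b in product(prefixes(2, 4), prefixes(5, 7))]


def map_phase(lines):
    """Yield (group key, userid) for every row and every region."""
    for line in lines:
        e = line.split()
        if not e:
            continue  # skip blank lines
        for region in REGIONS:
            key = "%s|%s" % (" ".join(str(i) for i in region),
                             " ".join(e[i] for i in region))
            yield key, e[1]


def reduce_phase(pairs):
    """Count distinct userids per key; sorted() plays the role of the shuffle."""
    result = {}
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        result[key] = len(set(uid for _, uid in group))
    return result


def run(text):
    return reduce_phase(map_phase(text.splitlines()))


if __name__ == "__main__":
    for key, count in sorted(run(SAMPLE).items()):
        print(key, count)
```

With the six sample rows, the empty region `|` covers all 6 users, and `2|1` (country 1) covers 4 of them.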
Picking Python for a course project means you get to play with all sorts of clever tricks for shortening code. So much fun...