Python Study Notes (2): Extracting Character Relationships from "Train to Busan" with Python

Reference: http://www.jianshu.com/p/3bd06f8816d7

 
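How the project works:
  This experiment uses simple co-occurrence to extract a character relationship network from plain text with Python, then visualizes the resulting network in Gephi. The basic principle of co-occurrence networks is described next. (A short English introduction to co-occurrence networks is linked in the references at the end.)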
 
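Basic principle of co-occurrence networks:
   Co-occurrence between entities is a statistics-based extraction method. Characters who are closely related tend to appear together in many consecutive passages of the text. Given the entities (personal names) that appear in the text, we count how often, and in what proportion, different entities appear together, then set a threshold: if the value exceeds the threshold, the entities are considered to be related in some way.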
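The principle above can be sketched on toy data. This is a minimal illustration rather than the code used in this post: the paragraphs, the names A/B/C, and the threshold value are all made up, and each paragraph is assumed to be already reduced to the list of names found in it. The sketch counts each pair once per paragraph, which is one possible way to define "appearing together".

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: each inner list holds the names found in one paragraph.
paragraphs = [
    ["A", "B"],
    ["A", "B", "C"],
    ["B", "A"],
    ["C"],
]

THRESHOLD = 2  # assumed cut-off; pairs seen more often than this count as related


def count_cooccurrences(paragraphs):
    """Count, for every unordered pair of names, the paragraphs they share."""
    counts = Counter()
    for names in paragraphs:
        # sorted(set(...)) yields each pair once per paragraph, in a fixed order
        for pair in combinations(sorted(set(names)), 2):
            counts[pair] += 1
    return counts


counts = count_cooccurrences(paragraphs)
related = {pair: n for pair, n in counts.items() if n > THRESHOLD}
print(related)
```

Raising THRESHOLD keeps only the strongest ties, which tends to make the final graph easier to read.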
 
Preparation:
  1. Environment: Windows, Python 3.6
  2. Module: jieba  https://github.com/fxsjy/jieba
  3. Gephi software

 Name dictionary:  http://labfile.oss.aliyuncs.com/courses/677/dict.txt 

Chinese script of "Train to Busan":  http://labfile.oss.aliyuncs.com/courses/677/busan.txt
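The name dictionary is meant to be loaded as a jieba user dictionary. jieba documents that format as one entry per line: the word, then an optional frequency, then an optional part-of-speech tag, separated by spaces. A minimal sketch for checking such a file before loading it is shown below. The helper name parse_userdict and the sample entries are hypothetical, and the sketch assumes the file is UTF-8 and that words contain no spaces.

```python
import os
import tempfile


def parse_userdict(path):
    """Parse a jieba-style user dictionary into (word, freq, tag) tuples.

    freq and tag are None when a line omits them.
    """
    entries = []
    # utf-8-sig also accepts files that start with a byte-order mark
    with open(path, encoding="utf-8-sig") as f:
        for lineno, raw in enumerate(f, start=1):
            parts = raw.split()
            if not parts:
                continue  # skip blank lines
            word, rest = parts[0], parts[1:]
            freq = int(rest.pop(0)) if rest and rest[0].isdecimal() else None
            tag = rest.pop(0) if rest else None
            if rest:
                raise ValueError(f"line {lineno}: unexpected extra fields {rest}")
            entries.append((word, freq, tag))
    return entries


# Made-up sample entries, written to a temporary file for the demonstration.
sample = "石宇 10 nr\n秀安 nr\n盛京\n"
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", encoding="utf-8", delete=False
) as tmp:
    tmp.write(sample)

entries = parse_userdict(tmp.name)
os.remove(tmp.name)
print(entries)
```

Entries tagged nr (personal name) are the ones the extraction step relies on, so a quick pass like this can reveal names that were left untagged.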

  

Code:

# -*- coding: utf-8 -*-
import codecs

import jieba
import jieba.posseg as pseg

names = {}          # name -> number of appearances
relationships = {}  # name -> {other name -> number of co-occurrences}
lineNames = []      # names found in each paragraph

# count names
# Raw strings keep the backslashes in Windows paths literal
# (otherwise "\f" in "\fushan.txt" would be read as a form feed).
jieba.load_userdict(r"D:\ResearchContent\Exercise_Programm\PythonExercise\Python\dict.txt")  # load the name dictionary
with codecs.open(r"D:\ResearchContent\Exercise_Programm\PythonExercise\Python\fushan.txt", "r", "utf8") as f:
    for line in f:
        poss = pseg.cut(line)   # segment the line and tag each word's part of speech
        lineNames.append([])    # start a name list for the newly read paragraph
        for w in poss:
            # A word shorter than 2 characters, or not tagged "nr", is not treated as a name
            if w.flag != "nr" or len(w.word) < 2:
                continue
            lineNames[-1].append(w.word)  # add this character to the current paragraph
            if names.get(w.word) is None:
                names[w.word] = 0
                relationships[w.word] = {}
            names[w.word] += 1  # this character has appeared one more time

# explore relationships
for line in lineNames:          # for each paragraph
    for name1 in line:
        for name2 in line:      # every pair of characters in the paragraph
            if name1 == name2:
                continue
            if relationships[name1].get(name2) is None:
                # first time the two appear together: create the entry
                relationships[name1][name2] = 1
            else:
                # the two have appeared together one more time
                relationships[name1][name2] = relationships[name1][name2] + 1

# output
with codecs.open("busan_node.txt", "w", "gbk") as f:
    f.write("Id Label Weight\r\n")
    for name, times in names.items():
        f.write(name + " " + name + " " + str(times) + "\r\n")

with codecs.open("busan_edge.txt", "w", "gbk") as f:
    f.write("Source Target Weight\r\n")
    for name, edges in relationships.items():
        for v, w in edges.items():
            if w > 3:  # keep only pairs that co-occur more than 3 times
                f.write(name + " " + v + " " + str(w) + "\r\n")


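Because the nested loop in the script visits every pair in both orders, the relationships dictionary stores each tie twice (A to B and B to A) with the same count, so the edge file lists every edge in both directions. If a single undirected edge per pair is preferred, the entries can be merged before writing. The sketch below is one possible way to do that, not part of the original script: the sample data, the function names, and the choice of comma-separated output are assumptions.

```python
import csv
import io

# Hypothetical sample in the same shape as the relationships dict built by the script.
relationships = {
    "A": {"B": 5, "C": 2},
    "B": {"A": 5},
    "C": {"A": 2},
}


def undirected_edges(relationships, min_weight=3):
    """Collapse symmetric A->B / B->A entries into one edge per unordered pair."""
    edges = {}
    for source, targets in relationships.items():
        for target, weight in targets.items():
            pair = tuple(sorted((source, target)))
            # Both directions carry the same count; max() is only a safeguard.
            edges[pair] = max(edges.get(pair, 0), weight)
    # Keep pairs above the threshold, as the script does with "w > 3".
    return sorted((a, b, w) for (a, b), w in edges.items() if w > min_weight)


def write_edges_csv(edges, out):
    """Write edges with a Source,Target,Weight header row."""
    writer = csv.writer(out)
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)


edges = undirected_edges(relationships)
buffer = io.StringIO()  # stands in for a real output file in this demonstration
write_edges_csv(edges, buffer)
print(buffer.getvalue())
```

With edges merged this way, the graph would be imported into Gephi as undirected so that each pair is drawn once.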
 

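References:

Short English introduction to co-occurrence networks: https://forec.github.io/2016/10/03/co-occurrence-structure-capture/

Chinese word segmentation in Python with jieba: http://www.cnblogs.com/kaituorensheng/p/3595879.html

Explanation of import ... as: https://www.zhihu.com/question/20871904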


Original post: https://www.cnblogs.com/jiawang/p/6155186.html