8. Clustering the Best Movies

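We will cluster 100 popular movies, using their synopses as the raw data. IMDb, the Internet Movie Database (www.imdb.com), is an online database that provides extensive details about movies, video games, and television shows. It aggregates reviews and synopses for movies and TV shows and hosts several curated lists. The raw data comes from the list named "Top 100 Greatest Movies of All Time (The Ultimate List)" at https://www.imdb.com/list/ls055592025/ and can be downloaded from https://github.com/brandomr/document_cluster.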

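We parsed and cleaned the data, and added synopses for the few movies that lacked one in the original data. These synopses and movie descriptions come from Wikipedia. After parsing, the data was stored in a data frame and saved to the file movie_data.csv: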

$ wget https://mirror.shileizcc.com/wiki_Resources/python/text_analysis/movie_data.csv

The cluster analysis loads and uses the data in this file. First, load the movie data and look at its contents, as the following code shows:

import pandas as pd
import numpy as np
 
movie_data = pd.read_csv('movie_data.csv')
 
print (movie_data.head())
 
 
movie_titles = movie_data['Title'].tolist()
movie_synopses = movie_data['Synopsis'].tolist()
 
print('Movie:', movie_titles[0])
print('Movie Synopsis:', movie_synopses[0][:1000])

Output:

                      Title                                           Synopsis
0             The Godfather  In late summer 1945, guests are gathered for t...
1  The Shawshank Redemption  In 1947, Andy Dufresne (Tim Robbins), a banker...
2          Schindler's List  The relocation of Polish Jews from surrounding...
3               Raging Bull  The film opens in 1964, where an older and fat...
4                Casablanca  In the early years of World War II, December 1...
Movie: The Godfather
Movie Synopsis: In late summer 1945, guests are gathered for the wedding reception of Don Vito Corleone's daughter Connie (Talia Shire) and Carlo Rizzi (Gianni Russo). Vito (Marlon Brando), the head of the Corleone Mafia family, is known to friends and associates as "Godfather." He and Tom Hagen (Robert Duvall), the Corleone family lawyer, are hearing requests for favors because, according to Italian tradition, "no Sicilian can refuse a request on his daughter's wedding day." One of the men who asks the Don for a favor is Amerigo Bonasera, a successful mortician and acquaintance of the Don, whose daughter was brutally beaten by two young men because she refused their advances; the men received minimal punishment. The Don is disappointed in Bonasera, who'd avoided most contact with the Don due to Corleone's nefarious business dealings. The Don's wife is godmother to Bonasera's shamed daughter, a relationship the Don uses to extract new loyalty from the undertaker. The Don agrees to have his men punish

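As you can see, the movie titles and their corresponding synopses are now loaded into a data frame and stored in variables. The output above also shows one sample movie and part of its synopsis. The core idea is to use these synopses as raw input to cluster the movies into groups. We will extract features from the synopses and cluster them with unsupervised learning algorithms. The movie titles serve only to label the data; they are useful when we visualize the clusters and report their statistics. The input to the clustering algorithms is the set of features extracted from the synopses. Before covering each clustering algorithm, we run the same kind of normalization and feature extraction used earlier: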

from normalization import normalize_corpus
from utils import build_feature_matrix
 
# normalize corpus
norm_movie_synopses = normalize_corpus(movie_synopses,
                                       lemmatize=True,
                                       only_text_chars=True)
 
# extract tf-idf features
vectorizer, feature_matrix = build_feature_matrix(norm_movie_synopses,
                                                  feature_type='tfidf',
                                                  min_df=0.24, max_df=0.85,
                                                  ngram_range=(1, 2))

If the following error occurs:

...
ModuleNotFoundError: No module named 'HTMLParser'

change the import in normalization.py to:

from html.parser import HTMLParser
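A related pitfall can be sketched as follows. The contents of normalization.py are not shown here, so it is only an assumption that the module decodes HTML entities through the parser's unescape method. That method was removed in Python 3.9, and the standard-library function html.unescape is its replacement.

```python
from html.parser import HTMLParser  # Python 3 location of the class
import html

parser = HTMLParser()  # the class still exists and can be instantiated

# HTMLParser.unescape() was removed in Python 3.9; html.unescape() replaces it.
# Assumption: normalization.py needs this kind of entity decoding.
decoded = html.unescape("Tom &amp; Jerry &lt;1940&gt;")
print(decoded)  # Tom & Jerry <1940>
```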

Once feature extraction runs, inspect the feature matrix:

# view number of features
print(feature_matrix.shape)
 
# get feature names
# scikit-learn >= 1.0; older releases use vectorizer.get_feature_names()
feature_names = vectorizer.get_feature_names_out().tolist()
 
# print sample features
print(feature_names[:20])

Output:

(100, 307)
['able', 'accept', 'across', 'act', 'agree', 'alive', 'allow', 'alone', 'along', 'already', 'although', 'always', 'another', 'anything', 'apartment', 'appear', 'approach', 'arm', 'army', 'around']

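We kept only text tokens in the normalized text and extracted TF-IDF features for unigrams and bigrams, using min_df and max_df so that each feature occurs in at least 24% and at most 85% of the documents. The result has 100 rows, one per movie, with 307 features each; some sample features appear in the output above. With the features and documents ready, we can move on to cluster analysis.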
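To make the effect of min_df and max_df concrete, here is a minimal sketch on a toy corpus. The internals of build_feature_matrix are not shown in this article, so using scikit-learn's TfidfVectorizer directly is an assumption, and the four documents and the thresholds are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the normalized synopses (hypothetical data).
docs = [
    "family father son kill",
    "army soldier kill officer",
    "family love child life",
    "soldier army war family",
]

# min_df / max_df given as fractions: keep terms found in at least 50% and
# at most 80% of the documents; ngram_range=(1, 2) also generates bigrams.
vectorizer = TfidfVectorizer(min_df=0.5, max_df=0.8, ngram_range=(1, 2))
matrix = vectorizer.fit_transform(docs)

print(matrix.shape)                    # (4, 4)
print(sorted(vectorizer.vocabulary_))  # ['army', 'family', 'kill', 'soldier']
```

Every bigram here occurs in only one document, so min_df removes all of them along with the rare unigrams; only terms shared by two or three documents survive.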

k-means clustering

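The k-means algorithm is a centroid-based clustering model that tries to partition the data into groups, or clusters, of equal variance. It minimizes a criterion called inertia, also known as the within-cluster sum of squares. A major drawback, shared with all centroid-based models, is that the number of clusters k must be specified in advance. It is nonetheless probably the most popular clustering algorithm, widely used because it is easy to apply and scales to large amounts of data.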

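We can now define k-means formally. Suppose we have a dataset X with N data points, or samples, and want to group them into K clusters, where K is a user-specified parameter. The k-means algorithm partitions the N data points into K disjoint clusters C_k, each described by the mean of the samples it contains. These means are the cluster centroids µ_k; they are not required to be actual data points among the N samples in X. The algorithm chooses the centroids and builds the clusters so that the inertia, the within-cluster sum of squares, is minimized:

\min \sum_{i=1}^{K} \sum_{x_n \in C_i} \lVert x_n - \mu_i \rVert^2

where C_i is a cluster and µ_i its centroid, for i ∈ {1, 2, ..., K}. Lloyd's algorithm is one solution to this problem. It is an iterative procedure with the following steps: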

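1. Choose K initial centroids µ_k by picking K random samples from the dataset X.
2. Update the clusters by assigning each data point to its nearest centroid. Mathematically, C_k = {x_n : ||x_n - µ_k|| <= ||x_n - µ_l|| for all l}, where C_k denotes a cluster.
3. Recompute the centroid of each cluster from the data points assigned to it in step 2:

\mu_k = \frac{1}{|C_k|} \sum_{x_n \in C_k} x_n

where µ_k denotes the centroid.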

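Steps 2 and 3 are repeated until their results no longer change. One caveat: although the optimization is guaranteed to converge, it may converge to a local minimum. In practice the algorithm is therefore run several times with different initializations, and the results are compared or, if needed, averaged. Both convergence and the local minimum reached depend heavily on how the centroids are initialized in step 1. One remedy is to run the algorithm with several random initializations and average the results. Another is the k-means++ scheme available in scikit-learn, which initializes the centroids far apart from one another and has proven very effective.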
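The three steps can be sketched in a few lines of NumPy. This is a toy illustration, not the scikit-learn implementation: the function name lloyd_kmeans, the four sample points, and the plain random initialization (rather than k-means++) are all assumptions made for the example.

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, seed=0):
    """Toy Lloyd iteration; returns (centroids, labels, inertia)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct samples as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments and centroids no longer change
        centroids = new_centroids
    # Inertia: sum of squared distances from each point to its centroid.
    inertia = float(((X - centroids[labels]) ** 2).sum())
    return centroids, labels, inertia

# Two obvious groups of two points each (hypothetical data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids, labels, inertia = lloyd_kmeans(X, k=2)
print(labels, inertia)
```

On this data every choice of initial centroids ends with the two nearby pairs grouped together and an inertia of 1.0; real data is less forgiving, which is why several initializations are tried.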

from sklearn.cluster import KMeans
 
def k_means(feature_matrix, num_clusters=5):
    km = KMeans(n_clusters=num_clusters,
                max_iter=10000)
    km.fit(feature_matrix)
    clusters = km.labels_
    return km, clusters
 
num_clusters = 5   
km_obj, clusters = k_means(feature_matrix=feature_matrix,
                           num_clusters=num_clusters)
movie_data['Cluster'] = clusters

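This snippet defines a k_means function and uses it to cluster the movies on the TF-IDF features of their synopses. It assigns each movie the cluster label produced by the analysis and stores it in the 'Cluster' column of the movie_data data frame. Note that k is set to 5 in this analysis. The following snippet shows the number of movies in each of the 5 clusters: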

from collections import Counter
# get the total number of movies per cluster
c = Counter(clusters)
print (c.items())

Output:

dict_items([(2, 34), (1, 24), (0, 30), (3, 7), (4, 5)])

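As expected, there are 5 cluster labels, from 0 to 4, and each has some movies assigned to it; the count is the second element of each tuple in the output above. Next, we define a few functions to extract detailed cluster information, print the results, and visualize the clusters. First, a function that extracts the key information from the cluster analysis: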

def get_cluster_data(clustering_obj, movie_data,
                     feature_names, num_clusters,
                     topn_features=10):
    cluster_details = {} 
    # get cluster centroids
    ordered_centroids = clustering_obj.cluster_centers_.argsort()[:, ::-1]
    # get key features for each cluster
    # get movies belonging to each cluster
    for cluster_num in range(num_clusters):
        cluster_details[cluster_num] = {}
        cluster_details[cluster_num]['cluster_num'] = cluster_num
        key_features = [feature_names[index]
                        for index
                        in ordered_centroids[cluster_num, :topn_features]]
        cluster_details[cluster_num]['key_features'] = key_features
         
        movies = movie_data[movie_data['Cluster'] == cluster_num]['Title'].values.tolist()
        cluster_details[cluster_num]['movies'] = movies
     
    return cluster_details

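The function is straightforward. It extracts the key features of each cluster, that is, the features with the largest weights in the cluster centroid, which matter most in defining the cluster. It also retrieves the titles of the movies in each cluster and stores everything in a dictionary.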
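The argsort()[:, ::-1] idiom at the heart of get_cluster_data is easy to check on a tiny example. The centroid values and feature names below are made up for illustration.

```python
import numpy as np

# Hypothetical centroids for 2 clusters over 3 features.
feature_names = ['army', 'family', 'police']
centers = np.array([[0.1, 0.9, 0.3],
                    [0.7, 0.2, 0.5]])

# argsort sorts ascending; reversing the columns gives descending weight.
ordered = centers.argsort()[:, ::-1]

# Take the 2 heaviest features of each cluster.
top2 = [[feature_names[i] for i in row[:2]] for row in ordered]
print(top2)  # [['family', 'police'], ['army', 'police']]
```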

Next, we define a function that takes this data structure and prints it in a readable form:

def print_cluster_data(cluster_data):
    # print cluster details
    for cluster_num, cluster_details in cluster_data.items():
        print ('Cluster {} details:'.format(cluster_num))
        print ('-'*20)
        print ('Key features:', cluster_details['key_features'])
        print ('Movies in this cluster:')
        print (', '.join(cluster_details['movies']))
        print ('='*40)

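Before analyzing the k-means results, we also need a function to visualize the clusters. We are dealing with a multidimensional feature space and unstructured text data, so plotting the numeric feature vectors directly would not be very meaningful. Techniques such as principal component analysis (PCA) or multidimensional scaling (MDS) reduce the dimensionality so that the clusters can be visualized in two or three dimensions. In this implementation we use MDS.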

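MDS is a nonlinear dimensionality reduction method that gives a better view of the data in a lower-dimensional space. Its core idea is to work from a distance matrix holding the distances between the data points. Here the distances are derived from cosine similarity. MDS tries to build a low-dimensional representation of the high-dimensional feature vectors such that the distances between data points, obtained from cosine similarity in the high-dimensional space, remain roughly the same in the low-dimensional representation.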
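The idea of preserving distances can be tried on a toy matrix. This is a sketch under assumptions: the four feature rows are invented, and the MDS parameters mirror the ones used in this article, so their names may differ across scikit-learn versions.

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical 4-document, 3-feature matrix standing in for TF-IDF features.
features = np.array([[1.0, 0.1, 0.0],
                     [0.9, 0.2, 0.0],
                     [0.0, 0.1, 1.0],
                     [0.1, 0.0, 0.9]])

# Cosine distance: 0 for identical directions, larger for dissimilar rows.
# Clipping guards against tiny negative values from floating-point rounding.
distance = np.clip(1 - cosine_similarity(features), 0, None)
np.fill_diagonal(distance, 0)

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
positions = mds.fit_transform(distance)

# Pairwise distances in the 2-D embedding, to compare with the input.
embedded = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=2)
print(np.round(distance, 2))
print(np.round(embedded, 2))
```

Rows 0 and 1 point in nearly the same direction, as do rows 2 and 3, so each pair should stay close in the embedding while the two pairs stay far apart.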

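The scikit-learn implementation of MDS offers two kinds of algorithms: metric and non-metric. We use the metric algorithm here, because the input matrix between movies is built from a distance measure based on cosine similarity. Mathematically, MDS can be defined as follows. Let S be the similarity matrix between the data points (movies), obtained by applying cosine similarity to the feature matrix, and let X be the coordinates of the n input data points (movies) in the low-dimensional space. The disparities \hat{d}_{ij} are usually some optimal transformation of the similarity values, and may even be the raw values themselves. The objective function of MDS, called the stress, compares the disparities with the embedded distances d_{ij}(X) and is defined as \sum_{i<j} (d_{ij}(X) - \hat{d}_{ij})^2. The following function implements MDS-based cluster visualization: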

import matplotlib.pyplot as plt
from sklearn.manifold import MDS
from sklearn.metrics.pairwise import cosine_similarity
import random
from matplotlib.font_manager import FontProperties
 
def plot_clusters(num_clusters, feature_matrix,
                  cluster_data, movie_data,
                  plot_size=(16,8)):
    # generate random color for clusters                 
    def generate_random_color():
        color = '#%06x' % random.randint(0, 0xFFFFFF)
        return color
    # define markers for clusters   
    markers = ['o', 'v', '^', '<', '>', '8', 's', 'p', '*', 'h', 'H', 'D', 'd']
    # build cosine distance matrix
    cosine_distance = 1 - cosine_similarity(feature_matrix)
    # dimensionality reduction using MDS
    mds = MDS(n_components=2, dissimilarity="precomputed",
              random_state=1)
    # get coordinates of clusters in new low-dimensional space
    plot_positions = mds.fit_transform(cosine_distance) 
    x_pos, y_pos = plot_positions[:, 0], plot_positions[:, 1]
    # build cluster plotting data
    cluster_color_map = {}
    cluster_name_map = {}
    for cluster_num, cluster_details in cluster_data.items():
        # assign cluster features to unique label
        cluster_color_map[cluster_num] = generate_random_color()
        cluster_name_map[cluster_num] = ', '.join(cluster_details['key_features'][:5]).strip()
    # map each unique cluster label with its coordinates and movies
    cluster_plot_frame = pd.DataFrame({'x': x_pos,
                                       'y': y_pos,
                                       'label': movie_data['Cluster'].values.tolist(),
                                       'title': movie_data['Title'].values.tolist()
                                        })
    grouped_plot_frame = cluster_plot_frame.groupby('label')
    # set plot figure size and axes
    fig, ax = plt.subplots(figsize=plot_size)
    ax.margins(0.05)
    # plot each cluster using co-ordinates and movie titles
    for cluster_num, cluster_frame in grouped_plot_frame:
         # parentheses are required for a conditional expression split over two lines
         marker = (markers[cluster_num] if cluster_num < len(markers)
                   else np.random.choice(markers, size=1)[0])
         ax.plot(cluster_frame['x'], cluster_frame['y'],
                 marker=marker, linestyle='', ms=12,
                 label=cluster_name_map[cluster_num],
                 color=cluster_color_map[cluster_num], mec='none')
         ax.set_aspect('auto')
         # hide ticks and tick labels; recent matplotlib expects booleans, not 'off'
         ax.tick_params(axis='x', which='both', bottom=False, top=False,
                        labelbottom=False)
         ax.tick_params(axis='y', which='both', left=False, right=False,
                        labelleft=False)
    fontP = FontProperties()
    fontP.set_size('small')   
    ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.01), fancybox=True,
              shadow=True, ncol=5, numpoints=1, prop=fontP)
    #add labels as the film titles
    for index in range(len(cluster_plot_frame)):
        # .ix was removed from pandas; .loc works with the default integer index
        ax.text(cluster_plot_frame.loc[index, 'x'],
                cluster_plot_frame.loc[index, 'y'],
                cluster_plot_frame.loc[index, 'title'], size=8)
    # show the plot          
    plt.show()

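The function is long, but the comments explain each step. It first builds a similarity matrix from the cosine similarity between documents and converts it to cosine distances, then uses MDS to reduce the high-dimensional feature space to two dimensions. It then plots the clusters with matplotlib, applying some formatting to make the result easier to read. The function is generic and works with any clustering algorithm, whatever number of clusters it produces. In the plot, each cluster has its own color, marker, and label, and the legend identifies each cluster by its key features. The plot itself shows every movie with the color and marker of its cluster.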

We can now use these functions to analyze the k-means results. The following snippet prints the detailed results of the k-means clustering:

cluster_data =  get_cluster_data(clustering_obj=km_obj,
                                 movie_data=movie_data,
                                 feature_names=feature_names,
                                 num_clusters=num_clusters,
                                 topn_features=5)        
 
print_cluster_data(cluster_data)

Output:

Cluster 0 details:
--------------------
Key features: ['apartment', 'car', 'house', 'police', 'mother']
Movies in this cluster:
One Flew Over the Cuckoo's Nest, Citizen Kane, Psycho, Sunset Blvd., Vertigo, E.T. the Extra-Terrestrial, Singin' in the Rain, It's a Wonderful Life, Rocky, A Streetcar Named Desire, The Philadelphia Story, Ben-Hur, The Apartment, High Noon, Goodfellas, The French Connection, Annie Hall, Tootsie, Fargo, Close Encounters of the Third Kind, Nashville, The Graduate, The Maltese Falcon, A Clockwork Orange, Taxi Driver, Wuthering Heights, Double Indemnity, Rear Window, The Third Man, North by Northwest
========================================
Cluster 1 details:
--------------------
Key features: ['soldier', 'men', 'kill', 'army', 'officer']
Movies in this cluster:
Schindler's List, Casablanca, Gone with the Wind, Lawrence of Arabia, Star Wars, 2001: A Space Odyssey, The Bridge on the River Kwai, Some Like It Hot, Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb, Apocalypse Now, Gandhi, The Lord of the Rings: The Return of the King, From Here to Eternity, Saving Private Ryan, Unforgiven, Raiders of the Lost Ark, Patton, The Good, the Bad and the Ugly, Butch Cassidy and the Sundance Kid, Platoon, Dances with Wolves, The Deer Hunter, All Quiet on the Western Front, Shane
========================================
Cluster 2 details:
--------------------
Key features: ['father', 'family', 'brother', 'kill', 'son']
Movies in this cluster:
The Godfather, The Shawshank Redemption, Raging Bull, The Godfather: Part II, On the Waterfront, Forrest Gump, West Side Story, The Silence of the Lambs, 12 Angry Men, Amadeus, Gladiator, To Kill a Mockingbird, An American in Paris, The Best Years of Our Lives, My Fair Lady, Doctor Zhivago, Braveheart, The Treasure of the Sierra Madre, The Pianist, The Exorcist, City Lights, The King's Speech, It Happened One Night, A Place in the Sun, Mr. Smith Goes to Washington, Rain Man, Giant, The Grapes of Wrath, The Green Mile, American Graffiti, Pulp Fiction, Stagecoach, Rebel Without a Cause, Yankee Doodle Dandy
========================================
Cluster 3 details:
--------------------
Key features: ['water', 'attempt', 'show', 'board', 'decide']
Movies in this cluster:
The Wizard of Oz, Titanic, Chinatown, Jaws, Network, The African Queen, Mutiny on the Bounty
========================================
Cluster 4 details:
--------------------
Key features: ['child', 'become', 'life', 'love', 'eventually']
Movies in this cluster:
The Sound of Music, Midnight Cowboy, Out of Africa, Good Will Hunting, Terms of Endearment
========================================

Plot the clusters:

plot_clusters(num_clusters=num_clusters,
              feature_matrix=feature_matrix,
              cluster_data=cluster_data,
              movie_data=movie_data,
              plot_size=(16,8))

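The output above shows the key features of each cluster and the movies in it. Each cluster is described by a theme defined by its dominant features. For example, The Godfather and The Godfather: Part II fall in the same cluster, whose key features include 'father', 'family', 'brother', and 'son'. Movies such as Star Wars, The Lord of the Rings: The Return of the King, The Deer Hunter, and Saving Private Ryan are grouped under features such as 'soldier', 'army', 'officer', and 'kill'. Because k-means starts from random centroids, the exact grouping can vary from run to run. Considering that the only data used for clustering was a few paragraphs of synopsis per movie, the results are quite interesting.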

Affinity propagation clustering

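The k-means algorithm is very popular, but it has one drawback: the user must define the number of clusters in advance. What if the data actually contains more or fewer clusters? There are ways to assess cluster quality and estimate the best value of k; interested readers can look into the elbow method and the silhouette coefficient, two popular approaches. Here we discuss a different algorithm, affinity propagation (AP), which is based on the concept of "message passing" between data points and builds clusters from inherent properties of the data without any prior assumption about the number of clusters.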
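As a brief aside on choosing k, the silhouette coefficient can be sketched as follows. The three synthetic blobs are hypothetical data chosen so that the right answer is obvious; on the movie features the scores would be far less clear-cut.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical data: three well-separated 2-D blobs of 20 points each.
rng = np.random.default_rng(0)
blob_centers = np.array([[0, 0], [10, 0], [0, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in blob_centers])

scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Silhouette close to 1 means compact, well-separated clusters.
    scores[k] = silhouette_score(X, km.labels_)
    # Inertia always shrinks as k grows; the elbow method looks for the bend.
    print(k, round(km.inertia_, 1), round(scores[k], 3))

best_k = max(scores, key=scores.get)
print('best k by silhouette:', best_k)
```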

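AP creates clusters by passing messages between data points until convergence. The whole dataset is represented by a small number of exemplars, samples chosen as representatives of the others. These exemplars are analogous to the centroids obtained from k-means or k-medoids. The messages sent between data points indicate how suitable one data point is to serve as the exemplar of another. They are updated at every iteration until convergence, and the final exemplars represent the clusters. Keep in mind that one drawback of this method is its computational cost: messages are passed between every pair of data points in the dataset, so convergence can take considerable time on large datasets.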
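For orientation, scikit-learn ships an implementation of AP that can be tried on a toy similarity matrix. This is a minimal sketch under assumptions: the six feature rows are invented, and feeding a precomputed cosine similarity matrix is one possible setup, not necessarily the one used later for the movie data.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical feature rows: two groups pointing in different directions.
features = np.array([[1.0, 0.1], [0.9, 0.2], [1.0, 0.0],
                     [0.1, 1.0], [0.2, 0.9], [0.0, 1.0]])

# Pairwise similarities are the input that the messages are computed from.
similarity = cosine_similarity(features)

# No number of clusters is passed; the algorithm selects the exemplars itself.
ap = AffinityPropagation(affinity='precomputed', random_state=0)
labels = ap.fit_predict(similarity)

print('labels:', labels)
print('exemplar rows:', ap.cluster_centers_indices_)
```

The exemplars reported in cluster_centers_indices_ are row indices of actual data points, which is the key difference from the averaged centroids of k-means.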

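We can now define the steps of the AP algorithm (following Wikipedia and the scikit-learn documentation). Suppose we have a dataset X with n data points, X = {x₁, x₂, ..., x_n}, and let sim(x, y) be a similarity function that measures the similarity between two points x and y. We again use cosine similarity here. AP iterates by performing two message-passing steps, as follows: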