Predicting sentiment from product reviews

The goal of this first notebook is to explore logistic regression and feature engineering.

In this notebook you will use product review data from Amazon.com to predict whether the sentiments about a product (from its reviews) are positive or negative.

First, let's take a look at what the data looks like.

Next, we preprocess the review data.

We start with two steps: 1) Remove punctuation

import string

def remove_punctuation(text):
    # Python 2 str.translate: delete every character in string.punctuation
    return text.translate(None, string.punctuation)

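2) Count word frequencies

This operation is so common that most text-processing libraries already provide it, so we can simply call it directly. In graphlab, for example: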

graphlab.text_analytics.count_words(review_without_punctuation)
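
For readers without graphlab, the two steps above can be sketched with the Python 3 standard library alone. The helper name count_words, the lowercasing, and the whitespace tokenization are assumptions of this sketch; graphlab's own tokenization may differ in details.

```python
import string
from collections import Counter

def remove_punctuation(text):
    # Python 3 counterpart of the Python 2 call above:
    # build a table that deletes every ASCII punctuation character.
    return text.translate(str.maketrans('', '', string.punctuation))

def count_words(text):
    # Hypothetical helper: lowercase, split on whitespace, count tokens.
    return dict(Counter(text.lower().split()))

review = "Great product, great price!"
word_count = count_words(remove_punctuation(review))
print(word_count)
```

The result is a dictionary from word to count, which is the same shape of data that the word_count column holds for each review.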

Let's see what the data looks like after these two steps:

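This should look familiar: we performed a similar operation earlier in the text clustering and retrieval tasks. The only change is a new word-count column (word_count).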

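Now let's turn to the ratings. A classification task requires labeled samples, and for review data we treat the rating as the label. Our data has five rating classes, from 1 to 5. To keep things simple, we treat ratings above 3 as positive and ratings below 3 as negative, and we discard ratings equal to 3 for now. This turns the problem into binary classification.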
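
The labeling rule can be sketched in a few lines of plain Python. The function name rating_to_sentiment and the sample ratings are made up for illustration; the +1 / -1 encoding is assumed here because it matches the labels used in the prediction code later on.

```python
def rating_to_sentiment(rating):
    # Ratings above 3 are positive (+1), ratings below 3 are negative (-1).
    # A rating of exactly 3 is neutral and returns None so it can be dropped.
    if rating > 3:
        return +1
    if rating < 3:
        return -1
    return None

ratings = [5, 3, 1, 4, 2, 3]  # hypothetical sample ratings
labels = [rating_to_sentiment(r) for r in ratings]
sentiments = [s for s in labels if s is not None]
print(sentiments)
```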

Let's look at the data again after this step.

With the preparatory work done, we need to split the data into a training set and a test set (again, many packages implement this).

train_data, test_data = products.random_split(.8, seed=1)

Here I again simply call a graphlab function, which splits the samples into 80% training data and 20% test data.
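
One common way to implement a seeded random split is to assign each row independently to the training set with the given probability. The sketch below does that in plain Python; it is an illustration of the idea, not a description of how graphlab implements random_split, and the row data is made up.

```python
import random

def random_split(rows, fraction, seed):
    # Assign each row to the first part with probability `fraction`,
    # so the resulting sizes are only approximately fraction / (1 - fraction).
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second

rows = list(range(1000))  # hypothetical stand-in for the review rows
train_rows, test_rows = random_split(rows, 0.8, seed=1)
print(len(train_rows), len(test_rows))
```

Fixing the seed matters for a walkthrough like this one, because it keeps the reported numbers comparable from run to run.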

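Next we train the model on the training set. Put simply, this means learning the function that maps an input x to an output y (finding the expression y = f(x)).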

For now I again use a ready-made graphlab function, which takes a single call:

sentiment_model = graphlab.logistic_classifier.create(train_data,
                                                      target = 'sentiment',
                                                      features=['word_count'],
                                                      validation_set=None)
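
To make the idea of learning the mapping from x to y concrete, here is a minimal sketch of logistic regression fitted by plain batch gradient ascent on word-count features. It is not graphlab's solver: the function train_logistic, the step size, the iteration count, and the toy reviews are all assumptions for illustration, and there is no regularization.

```python
from math import exp

def train_logistic(examples, step_size=0.1, iterations=200):
    # examples: list of (word_count dict, label) pairs, label in {+1, -1}.
    # Returns (intercept, weights) fitted by batch gradient ascent
    # on the log-likelihood.
    intercept, weights = 0.0, {}
    for _ in range(iterations):
        grad_intercept, grad_weights = 0.0, {}
        for counts, label in examples:
            score = intercept + sum(weights.get(w, 0.0) * c
                                    for w, c in counts.items())
            prob_positive = 1.0 / (1.0 + exp(-score))
            # error = indicator(label is +1) - P(label is +1 | review)
            error = (1.0 if label == +1 else 0.0) - prob_positive
            grad_intercept += error
            for w, c in counts.items():
                grad_weights[w] = grad_weights.get(w, 0.0) + error * c
        intercept += step_size * grad_intercept
        for w, g in grad_weights.items():
            weights[w] = weights.get(w, 0.0) + step_size * g
    return intercept, weights

# Hypothetical toy training set: (word counts, sentiment label)
toy_train = [
    ({'great': 1, 'love': 1}, +1),
    ({'great': 2}, +1),
    ({'terrible': 1, 'broke': 1}, -1),
    ({'terrible': 2}, -1),
]
intercept, weights = train_logistic(toy_train)
print(weights['great'] > 0, weights['terrible'] < 0)
```

Words that appear only in positive reviews end up with positive coefficients, and words that appear only in negative reviews end up with negative ones.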

Let's look at the result.

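Let's count how many words have a positive weight and how many have a negative weight (the code below counts a weight of exactly 0 as positive):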

weights = sentiment_model.coefficients

num_positive_weights = len(weights[weights['value'] >= 0])
num_negative_weights = len(weights[weights['value'] < 0])

print "Number of positive weights: %s " % num_positive_weights
print "Number of negative weights: %s " % num_negative_weights
print num_positive_weights + num_negative_weights

The results are as follows:

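Now we can start making predictions.

The model gives us a score for each review, and the scores range over the entire real line.

We then treat scores greater than 0 as positive samples and scores less than 0 as negative samples (in the code below, a score of exactly 0 falls into the negative class).

The corresponding code is: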

hat_y = scores.apply(lambda score : +1 if score > 0 else -1)
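
For logistic regression, the score of a review is the intercept plus the sum of each word's coefficient times its count. The sketch below computes such scores and applies the hard threshold in plain Python. The coefficients and reviews are hand-picked for illustration and do not come from the trained model.

```python
def score(counts, weights, intercept=0.0):
    # Linear score: intercept plus the sum of coefficient * word count.
    # Words without a learned coefficient contribute nothing.
    return intercept + sum(weights.get(w, 0.0) * c for w, c in counts.items())

# Hypothetical coefficients, chosen by hand for illustration only.
toy_weights = {'great': 1.5, 'terrible': -2.0}
reviews = [{'great': 2}, {'terrible': 1, 'great': 1}, {'okay': 1}]

scores = [score(r, toy_weights) for r in reviews]
# Same rule as above: +1 if the score is strictly positive, otherwise -1.
hat_y = [+1 if s > 0 else -1 for s in scores]
print(scores, hat_y)
```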

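The procedure above is a hard split. Can we classify based on probabilities instead?

The sigmoid function maps the entire real line to the interval (0, 1), so it can turn a score into a probability. In code: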

from math import exp

def sigmoid(inX):
    return 1.0 / (1 + exp(-inX))
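
One practical caveat: with math.exp, the direct formula raises OverflowError for very negative scores, because exp(-inX) becomes too large. A possible variant, shown here as a sketch rather than as what graphlab does internally, branches on the sign of the score so that exp never receives a large positive argument. The sample scores are made up.

```python
from math import exp

def sigmoid(score):
    # Branch on the sign so exp() is only ever called on a non-positive value.
    if score >= 0:
        return 1.0 / (1.0 + exp(-score))
    z = exp(score)
    return z / (1.0 + z)

scores = [3.0, -0.5, 0.0, -1000.0]  # hypothetical scores
probabilities = [sigmoid(s) for s in scores]
# Thresholding the probability at 0.5 matches thresholding the score at 0.
hat_y = [+1 if p > 0.5 else -1 for p in probabilities]
print(probabilities, hat_y)
```

Since the sigmoid of 0 is exactly 0.5, classifying by probability above 0.5 gives the same labels as classifying by score above 0.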

After computing the probabilities, we sort the reviews by probability, for example with topk (https://turi.com/products/create/docs/generated/graphlab.SFrame.topk.html?highlight=topk#graphlab.SFrame.topk).
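
Outside graphlab, the same kind of ranking can be sketched with the standard library's heapq module. The review names and probabilities below are invented for illustration.

```python
import heapq

# Hypothetical (review, predicted probability of positive sentiment) pairs
predictions = [
    ('review A', 0.91),
    ('review B', 0.12),
    ('review C', 0.99),
    ('review D', 0.47),
]

def by_probability(pair):
    # Sort key: the predicted probability
    return pair[1]

most_positive = heapq.nlargest(2, predictions, key=by_probability)
most_negative = heapq.nsmallest(2, predictions, key=by_probability)
print(most_positive)
print(most_negative)
```

Reading the most positive and most negative reviews is a quick sanity check on whether the model's probabilities match intuition.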

Evaluating the model

Of course, we need a baseline to compare against.

It is quite common to use the majority class classifier as a baseline (or reference) model for comparison with your classifier model. The majority class classifier predicts the majority class for all data points. At the very least, you should healthily beat the majority class classifier; otherwise, the model is (usually) pointless.
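
A minimal sketch of this baseline in plain Python follows. The helper names accuracy and majority_class and the label lists are assumptions for illustration; in practice the labels would come from train_data and test_data.

```python
from collections import Counter

def accuracy(predictions, labels):
    # Fraction of predictions that equal the true labels.
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

def majority_class(labels):
    # Most frequent label in the given list.
    return Counter(labels).most_common(1)[0][0]

train_labels = [+1, +1, +1, -1, +1, -1]  # hypothetical training labels
test_labels = [+1, -1, +1, +1]           # hypothetical test labels

# The baseline always predicts the majority class seen in training.
baseline_label = majority_class(train_labels)
baseline_predictions = [baseline_label] * len(test_labels)
baseline_accuracy = accuracy(baseline_predictions, test_labels)
print(baseline_label, baseline_accuracy)
```

The classifier's test accuracy can then be compared against baseline_accuracy; a model that does not clearly exceed it has learned little beyond the class balance.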