[Translation] Day 20: Stanford CoreNLP

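Introduction

For today's entry in the 30-day challenge, I decided to learn how to perform sentiment analysis with the Stanford CoreNLP Java API. A few days ago I wrote about doing sentiment analysis in Python with the TextBlob API. I built an application that reports the sentiment of tweets matching a given set of keywords, so let's look at what the application does.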

The application

Today's demo is hosted on OpenShift at http://sentiments-t20.rhcloud.com/. It has two features.

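First, if you enter a list of Twitter search terms, it shows the sentiment of the latest 20 tweets for each term. You need to tick the checkbox shown in the screenshot to enable this. Positive tweets are shown in green and negative tweets in red.
Second, it can run sentiment analysis on a piece of text that you type in, as shown in the screenshot.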

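What is Stanford CoreNLP?

Stanford CoreNLP is a Java natural language processing library that bundles Stanford's NLP tools: the part-of-speech (POS) tagger, the named entity recognizer (NER), the parser, the coreference resolution system, and the sentiment analysis tools. It also ships model files for analyzing English.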

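Prerequisites

Basic Java knowledge. Install the latest Java Development Kit (JDK); either OpenJDK 7 or Oracle JDK 7 will do. OpenShift supports OpenJDK 6 and 7.
Download the Stanford CoreNLP package from the official website.
Sign up for an OpenShift account. OpenShift is completely free, and Red Hat gives every user three free Gears to run applications on. At present this allocation adds up to 1.5 GB of memory and 3 GB of disk space per user.
Install the rhc client tool on your machine. rhc is a Ruby gem, so you need Ruby 1.8.7 or later. To install it, run sudo gem install rhc. If it is already installed, make sure it is the latest version; to update it, run sudo gem update rhc. For more help with the rhc command-line tool, see https://www.openshift.com/developers/rhc-client-tools-install.
Set up your OpenShift account with the rhc setup command. It helps you create a namespace and uploads your SSH keys to the OpenShift server.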

GitHub repository

The code for today's demo is on GitHub: day20-stanford-sentiment-analysis-demo.

Getting SentimentsApp up and running quickly

Start by creating the demo application, named sentimentsapp.

$ rhc create-app sentimentsapp jbosseap --from-code=https://github.com/shekhargulati/day20-stanford-sentiment-analysis-demo.git

If you have access to medium gears, you can use the following command instead.

$ rhc create-app sentimentsapp jbosseap -g medium --from-code=https://github.com/shekhargulati/day20-stanford-sentiment-analysis-demo.git

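This creates an application container, called a gear, and sets up the required SELinux policies and cgroup configuration. OpenShift also sets up a private git repository for you and clones it to your local machine, then propagates the DNS to the outside world. The application is available at http://sentimentsapp-{domain-name}.rhcloud.com/. Replace {domain-name} with your own unique OpenShift domain name (sometimes called a namespace).

The application also needs four environment variables that belong to a Twitter application. Create a new Twitter application at https://dev.twitter.com/apps/new, then set the four environment variables as follows.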

$ rhc env set TWITTER_OAUTH_ACCESS_TOKEN=<please enter value> -a sentimentsapp
$ rhc env set TWITTER_OAUTH_ACCESS_TOKEN_SECRET=<please enter value> -a sentimentsapp
$ rhc env set TWITTER_OAUTH_CONSUMER_KEY=<please enter value> -a sentimentsapp
$ rhc env set TWITTER_OAUTH_CONSUMER_SECRET=<please enter value> -a sentimentsapp

Now restart the application so that the server can see the environment variables.

$ rhc restart-app --app sentimentsapp
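
Once the application has restarted, a quick check from Java can confirm that all four variables are visible. The following is a minimal sketch and not part of the demo: the TwitterEnvCheck class and its missing method are hypothetical names introduced here for illustration. Only the variable names come from the commands above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not in the demo): reports which Twitter variables are unset. */
final class TwitterEnvCheck {

    // The four variable names set with "rhc env set" above.
    static final String[] REQUIRED = {
        "TWITTER_OAUTH_ACCESS_TOKEN",
        "TWITTER_OAUTH_ACCESS_TOKEN_SECRET",
        "TWITTER_OAUTH_CONSUMER_KEY",
        "TWITTER_OAUTH_CONSUMER_SECRET"
    };

    /** Returns the required names that are absent or blank in the given environment. */
    static List<String> missing(Map<String, String> env) {
        List<String> result = new ArrayList<String>();
        for (String name : REQUIRED) {
            String value = env.get(name);
            if (value == null || value.trim().isEmpty()) {
                result.add(name);
            }
        }
        return result;
    }
}
```

Calling TwitterEnvCheck.missing(System.getenv()) at startup and logging a non-empty result would surface a forgotten variable before the first Twitter request fails.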

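Under the hood

Start by adding the Maven dependencies for stanford-corenlp and twitter4j to pom.xml. Use version 3.3.0 of stanford-corenlp, because that is the version in which the sentiment analysis API was added.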

<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>3.3.0</version>
</dependency>

<dependency>
    <groupId>org.twitter4j</groupId>
    <artifactId>twitter4j-core</artifactId>
    <version>[3.0,)</version>
</dependency>

The twitter4j dependency is needed for the Twitter search.

Update the following properties in pom.xml to move the Maven project to Java 7.

<maven.compiler.source>1.7</maven.compiler.source>
<maven.compiler.target>1.7</maven.compiler.target>

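Now update the Maven project: right-click the project and choose Maven > Update Project.

Enabling CDI

We will use CDI for dependency injection. CDI, or Contexts and Dependency Injection, is a Java EE 6 feature that enables dependency injection in Java EE 6 projects. It defines a type-safe dependency injection mechanism for Java EE, and almost any POJO can be injected as a CDI bean.

Create a new file named beans.xml in the src/main/webapp/WEB-INF folder with the following content.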

<beans xmlns="http://java.sun.com/xml/ns/javaee" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/beans_1_0.xsd">
</beans>

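Searching Twitter for keywords

Now create a class named TwitterSearch that uses Twitter4J to search for a keyword. The API needs the Twitter application's configuration parameters; instead of hard-coding them, we read the values from environment variables.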

import java.util.Collections;
import java.util.List;

import twitter4j.Query;
import twitter4j.QueryResult;
import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterException;
import twitter4j.TwitterFactory;
import twitter4j.conf.ConfigurationBuilder;

public class TwitterSearch {

    public List<Status> search(String keyword) {
        // Read the OAuth credentials from the environment variables set with rhc env set.
        ConfigurationBuilder cb = new ConfigurationBuilder();
        cb.setDebugEnabled(true).setOAuthConsumerKey(System.getenv("TWITTER_OAUTH_CONSUMER_KEY"))
                .setOAuthConsumerSecret(System.getenv("TWITTER_OAUTH_CONSUMER_SECRET"))
                .setOAuthAccessToken(System.getenv("TWITTER_OAUTH_ACCESS_TOKEN"))
                .setOAuthAccessTokenSecret(System.getenv("TWITTER_OAUTH_ACCESS_TOKEN_SECRET"));
        TwitterFactory tf = new TwitterFactory(cb.build());
        Twitter twitter = tf.getInstance();
        // Exclude retweets, replies, and tweets that carry links or images.
        Query query = new Query(keyword + " -filter:retweets -filter:links -filter:replies -filter:images");
        query.setCount(20);
        query.setLocale("en");
        query.setLang("en");
        try {
            QueryResult queryResult = twitter.search(query);
            return queryResult.getTweets();
        } catch (TwitterException e) {
            // Print the failure and fall through to return an empty list.
            e.printStackTrace();
        }
        return Collections.emptyList();
    }
}

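In the code above we filter the search results so that no retweets, replies, tweets with links, or tweets with images are returned, because we want tweets that consist of text.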
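
To make the filter list easier to read and change, the search string can be assembled from a list of excluded tweet kinds. This is only a sketch: TweetQueries and textOnly are hypothetical names, and trimming the keyword is an addition that the original code does not make. The four -filter: operators are the ones already used in TwitterSearch.

```java
/** Hypothetical helper (not in the demo): builds the search string used in TwitterSearch. */
final class TweetQueries {

    // Tweet kinds excluded by the search operators in the TwitterSearch code.
    private static final String[] EXCLUDED = {"retweets", "links", "replies", "images"};

    /** Appends one "-filter:<kind>" operator per excluded kind to the trimmed keyword. */
    static String textOnly(String keyword) {
        StringBuilder query = new StringBuilder(keyword.trim());
        for (String kind : EXCLUDED) {
            query.append(" -filter:").append(kind);
        }
        return query.toString();
    }
}
```

The result of TweetQueries.textOnly(keyword) could then be passed to new Query(...) in place of the concatenated string.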

Sentiment analyzer

Next we create a class named SentimentAnalyzer that performs sentiment analysis on a single tweet.

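import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.rnn.RNNCoreAnnotations; // package name as of CoreNLP 3.3.0
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.util.CoreMap;

public class SentimentAnalyzer {

    public TweetWithSentiment findSentiment(String line) {
        // Note: the pipeline is rebuilt, and its models reloaded, on every call.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, parse, sentiment");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        int mainSentiment = 0;
        if (line != null && line.length() > 0) {
            int longest = 0;
            Annotation annotation = pipeline.process(line);
            // Use the class predicted for the longest sentence as the tweet's sentiment.
            for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
                Tree tree = sentence.get(SentimentCoreAnnotations.AnnotatedTree.class);
                int sentiment = RNNCoreAnnotations.getPredictedClass(tree);
                String partText = sentence.toString();
                if (partText.length() > longest) {
                    mainSentiment = sentiment;
                    longest = partText.length();
                }
            }
        }
        // Classes run from 0 (very negative) to 4 (very positive).
        // Skip neutral (2) and out-of-range values.
        if (mainSentiment == 2 || mainSentiment > 4 || mainSentiment < 0) {
            return null;
        }
        // toCss is not shown in this post; judging by its name, it maps the class to a CSS class.
        return new TweetWithSentiment(line, toCss(mainSentiment));
    }
}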

We copied the englishPCFG.ser.gz and sentiment.ser.gz models into the src/main/resources/edu/stanford/nlp/models/lexparser and src/main/resources/edu/stanford/nlp/models/sentiment folders, respectively.
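
The selection rule inside findSentiment can be studied without loading any CoreNLP models. The sketch below isolates it under stated assumptions: LongestSentenceRule, Scored, mainSentiment, and isReported are hypothetical names, and the per-sentence classes are supplied by hand rather than predicted by the library.

```java
import java.util.List;

/** Hypothetical, library-free restatement of the rule used in findSentiment. */
final class LongestSentenceRule {

    /** One sentence together with the class (0 to 4) predicted for it. */
    static final class Scored {
        final String text;
        final int sentimentClass;

        Scored(String text, int sentimentClass) {
            this.text = text;
            this.sentimentClass = sentimentClass;
        }
    }

    /** Returns the class of the longest sentence, or 0 when the list is empty. */
    static int mainSentiment(List<Scored> sentences) {
        int result = 0;
        int longest = 0;
        for (Scored sentence : sentences) {
            if (sentence.text.length() > longest) {
                result = sentence.sentimentClass;
                longest = sentence.text.length();
            }
        }
        return result;
    }

    /** Mirrors the final check: neutral (2) and out-of-range classes are dropped. */
    static boolean isReported(int sentimentClass) {
        return sentimentClass != 2 && sentimentClass >= 0 && sentimentClass <= 4;
    }
}
```

One consequence worth noticing: because the starting value is 0, an empty input falls into the most negative class instead of being skipped. Starting from 2 (neutral) would be one way to avoid that, if it is not the intended behavior.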

Creating SentimentsResource

Finally, create the JAX-RS resource class.

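import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import javax.inject.Inject;
import javax.ws.rs.GET;
import javax.ws.rs.Produces;
import javax.ws.rs.QueryParam;
import javax.ws.rs.core.MediaType;

import twitter4j.Status;

// A JAX-RS root resource also needs a class-level @Path annotation; its value is not shown in this post.
public class SentimentsResource {

    @Inject
    private SentimentAnalyzer sentimentAnalyzer;

    @Inject
    private TwitterSearch twitterSearch;

    @GET
    @Produces(value = MediaType.APPLICATION_JSON)
    public List<Result> sentiments(@QueryParam("searchKeywords") String searchKeywords) {
        List<Result> results = new ArrayList<>();
        if (searchKeywords == null || searchKeywords.length() == 0) {
            return results;
        }

        // Split on commas, normalize, and de-duplicate the keywords.
        Set<String> keywords = new HashSet<>();
        for (String keyword : searchKeywords.split(",")) {
            keywords.add(keyword.trim().toLowerCase());
        }
        // Keep at most three keywords; HashSet order is unspecified, so which three is arbitrary.
        if (keywords.size() > 3) {
            keywords = new HashSet<>(new ArrayList<>(keywords).subList(0, 3));
        }
        for (String keyword : keywords) {
            List<Status> statuses = twitterSearch.search(keyword);
            System.out.println("Found statuses ... " + statuses.size());
            List<TweetWithSentiment> sentiments = new ArrayList<>();
            for (Status status : statuses) {
                TweetWithSentiment tweetWithSentiment = sentimentAnalyzer.findSentiment(status.getText());
                // findSentiment returns null for neutral tweets, which are left out.
                if (tweetWithSentiment != null) {
                    sentiments.add(tweetWithSentiment);
                }
            }

            Result result = new Result(keyword, sentiments);
            results.add(result);
        }
        return results;
    }
}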

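The code above does the following:

First it checks that searchKeywords is neither null nor empty, then splits it into individual keywords and keeps at most three search terms.
Then, for each search term, it finds matching tweets and runs sentiment analysis on them.
Finally, it returns the results to the user.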
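
The keyword handling in the first step can be sketched on its own. This is a variation under stated assumptions, not the demo's code: KeywordParser and parse are hypothetical names, a LinkedHashSet is swapped in so that the first three keywords the user typed are the ones kept, empty entries are skipped, and Locale.ROOT is passed to toLowerCase to make lowercasing independent of the server's locale.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not in the demo): normalizes the searchKeywords parameter. */
final class KeywordParser {

    static final int MAX_KEYWORDS = 3;

    /** Splits on commas, trims, lowercases, de-duplicates, and keeps the first three. */
    static List<String> parse(String searchKeywords) {
        List<String> result = new ArrayList<String>();
        if (searchKeywords == null || searchKeywords.length() == 0) {
            return result;
        }
        // LinkedHashSet removes duplicates while preserving insertion order.
        Set<String> unique = new LinkedHashSet<String>();
        for (String keyword : searchKeywords.split(",")) {
            String normalized = keyword.trim().toLowerCase(Locale.ROOT);
            if (!normalized.isEmpty()) {
                unique.add(normalized);
            }
        }
        for (String keyword : unique) {
            if (result.size() == MAX_KEYWORDS) {
                break;
            }
            result.add(keyword);
        }
        return result;
    }
}
```

With the plain HashSet used in the resource class, which three keywords survive the cut is unspecified; preserving insertion order makes the result predictable for the user.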

That's it for today. Keep the feedback coming.

Original post: https://www.openshift.com/blogs/day-20-stanford-corenlp-performing-sentiment-analysis-of-twitter-using-java