Identifying implicit relationships

Answering natural-language questions often involves identifying hidden associations and implicit relationships.

The unifying theme for answering common-bond questions and missing-link questions is the need to identify concepts that are closely related to those given in the question.

In IBM Watson, we developed a recursive spreading-activation algorithm, which identifies related concepts based on a collection of heterogeneous underlying data resources. To compute the degree of relatedness between concepts, Watson’s spreading-activation process leverages both linked data extracted from a Web corpus and lexical and syntactic resources derived from large text corpora.

For common-bond questions, spreading activation is applied to each entity given in the question, and the most relevant and prominent concept related to all entities is selected as the answer.

For missing-link questions, the spreading-activation process is used to score the degree of relatedness between an identified missing link and a candidate answer.

Spreading activation for concept expansion:

Spreading activation refers to the idea that concepts in a semantic network may be activated through their connections with already active concepts based on a certain spreading strategy. This process allows us to identify concepts closely related to a given concept and to score the relatedness between two concepts.
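As a rough illustration of the idea, the sketch below spreads activation outward from one concept over a small weighted graph. The graph representation, the decay factor, and the name spread_activation are assumptions made for this example; the notes do not specify Watson's actual spreading strategy.

```python
from collections import defaultdict


def spread_activation(graph, source, depth, fan, decay=0.5):
    """Return {concept: activation} for concepts reached from `source`.

    graph  -- dict mapping a concept to {neighbor: weight} (hypothetical format)
    depth  -- number of hops to spread
    fan    -- how many of the strongest neighbors to follow per concept
    decay  -- assumed attenuation applied at every hop
    """
    activation = defaultdict(float)
    frontier = {source: 1.0}
    for _ in range(depth):
        next_frontier = defaultdict(float)
        for node, energy in frontier.items():
            # Follow only the `fan` strongest connections of this node.
            neighbors = sorted(graph.get(node, {}).items(),
                               key=lambda kv: kv[1], reverse=True)[:fan]
            total = sum(weight for _, weight in neighbors)
            if total == 0:
                continue
            for neighbor, weight in neighbors:
                if neighbor == source:
                    continue  # do not re-activate the starting concept
                next_frontier[neighbor] += energy * decay * weight / total
        for node, energy in next_frontier.items():
            activation[node] += energy
        frontier = next_frontier
    return dict(activation)


# Toy graph with made-up weights.
toy_graph = {"a": {"b": 3, "c": 1}, "b": {"d": 1}}
one_hop = spread_activation(toy_graph, "a", depth=1, fan=2)
two_hops = spread_activation(toy_graph, "a", depth=2, fan=2)
```

With depth 1 only the direct neighbors of "a" are activated; with depth 2 the concept "d" is also reached through "b", at a lower activation.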

Traditionally, concepts are represented in a semantic network where concept nodes are related to one another via certain types of relations, such as is-a and part-of.

However, rather than relying on manually created semantic networks to represent relatedness, Watson uses naturally occurring texts and measures concept relatedness on the basis of how frequently concepts co-occur with one another under specific conditions in these texts.

Using an n-gram corpus

An efficient approach is to represent the corpus as a collection of n-grams that occur more often than a minimum frequency, along with the frequency of each n-gram. Stemming and stop-word removal can further reduce the size of the n-gram corpus. To support spreading activation in Watson, we built a 5-gram corpus with frequency counts from Watson’s primary unstructured sources, which include Wikipedia and the Gigaword corpus.

Our n-gram-based spreading-activation implementation uses Lucene and includes application programming interfaces to support the retrieval of frequently collocated terms given a term and the computation of semantic similarity given two terms. For the former, given term t, the most frequent 5-grams that include t are retrieved from the corpus. All terms in the retrieved 5-grams are sorted by their total frequencies in those 5-grams, and the top f (indicated by the fan size) most frequent terms are returned. For the latter, the normalized Google distance (NGD) metric was used to compute the semantic distance between two given terms based on the underlying n-gram corpus.
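The two operations can be sketched in plain Python over an in-memory dictionary. This does not use Lucene; the toy n-grams, their counts, the max_grams cutoff, and the function names are assumptions for illustration. The NGD formula follows Cilibrasi and Vitanyi, with n-gram frequencies substituted for search-engine page counts, and the way term frequencies are derived from n-gram counts here is a simplification.

```python
import math
from collections import Counter

# Toy n-gram corpus: tuple of terms -> frequency (made-up data, shorter than 5-grams).
NGRAMS = {
    ("ford", "pardoned", "nixon"): 40,
    ("nixon", "resigned", "office"): 25,
    ("ford", "motor", "company"): 60,
}


def collocated_terms(term, ngrams, fan, max_grams=100):
    """Return the `fan` terms that co-occur most often with `term`."""
    # Retrieve the most frequent n-grams that contain the term.
    hits = sorted(((g, f) for g, f in ngrams.items() if term in g),
                  key=lambda gf: gf[1], reverse=True)[:max_grams]
    # Total the frequency of every other term across the retrieved n-grams.
    totals = Counter()
    for gram, freq in hits:
        for other in gram:
            if other != term:
                totals[other] += freq
    return [t for t, _ in totals.most_common(fan)]


def ngd(x, y, ngrams):
    """Normalized Google distance between x and y (lower means more related)."""
    fx = sum(f for g, f in ngrams.items() if x in g)
    fy = sum(f for g, f in ngrams.items() if y in g)
    fxy = sum(f for g, f in ngrams.items() if x in g and y in g)
    n = sum(ngrams.values())
    if fxy == 0:
        return math.inf  # the terms never co-occur
    lx, ly = math.log(fx), math.log(fy)
    return (max(lx, ly) - math.log(fxy)) / (math.log(n) - min(lx, ly))


top_for_ford = collocated_terms("ford", NGRAMS, fan=2)
ford_nixon = ngd("ford", "nixon", NGRAMS)
ford_office = ngd("ford", "office", NGRAMS)
```

Because NGD is a distance, smaller values indicate stronger relatedness; terms that never share an n-gram get an infinite distance in this sketch.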

Using the PRISMATIC knowledge base

A resource frequently used in Watson is PRISMATIC, a knowledge base of extracted frames and slots based on syntactic and semantic relationships.

One type of syntactic frame is the SVO (subject–verb–object) frame; for instance, the sentence “Ford pardoned Nixon in 1974” results in an SVO tuple, i.e., (Ford, pardon, Nixon). PRISMATIC provides quick access to statistics over these tuples. In particular, one can abstract out any element of a tuple and ask for the count of tuples matching that abstraction, e.g., the SVO query (Ford, ?v, ?o), where ?v and ?o are unbound variables, will provide a count of all SVO tuples for which Ford is the subject.
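The abstraction query can be illustrated with a small in-memory table of tuple counts. This is only a sketch of the query semantics: the counts are invented, None stands in for an unbound variable such as ?v or ?o, and count_matching is a hypothetical helper, not PRISMATIC's interface.

```python
from collections import Counter

# Hypothetical SVO tuple counts (subject, verb, object); the numbers are made up.
SVO_COUNTS = Counter({
    ("Ford", "pardon", "Nixon"): 12,
    ("Ford", "succeed", "Nixon"): 7,
    ("Ford", "sign", "bill"): 3,
    ("Nixon", "visit", "China"): 9,
})


def count_matching(pattern, counts):
    """Sum the counts of all tuples matching `pattern`.

    A None slot in the pattern is unbound and matches any value.
    """
    return sum(
        count
        for tup, count in counts.items()
        if all(slot is None or slot == value for slot, value in zip(pattern, tup))
    )


ford_as_subject = count_matching(("Ford", None, None), SVO_COUNTS)
ford_with_nixon = count_matching(("Ford", None, "Nixon"), SVO_COUNTS)
```

The second query, which leaves only the verb unbound, counts how often two concepts are syntactically connected and can serve as a co-occurrence frequency for them.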

We use PRISMATIC similarly to the way we use n-grams, estimating the degree of relatedness between two concepts by how often they co-occur.

The core difference is that the n-gram component counts related words that appear lexically near each other, whereas the syntactic component counts related words that are syntactically connected.

Using Wikipedia links

We analyzed Wikipedia documents and the targets of links within each document and noted that the target document titles typically represent concepts closely related to the source document title.

Application to common-bond questions

Common-bond questions seek the hidden relationship among multiple entities.

Common-bond candidate generation

To maximize candidate recall, Watson’s common-bond candidate generator identifies concepts that are closely related to each entity individually and considers the union of all such concepts as possible candidates. We set the depth parameter d in our spreading-activation process to 1 and empirically determined the value of the fan size f to be 50, to balance the recall and the number of candidates produced.

The spreading-activation process is invoked on each entity given in the question. Because, for most questions, the common bond can be found in lexical proximity to the given entities, we used only the n-gram corpus to identify the terms most frequently collocated with each given entity and proposed them as candidates.
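Candidate generation as described here amounts to taking the union of the top f collocates of each entity. The sketch below assumes the ranked collocates have already been retrieved and are passed in as a dictionary; the entities, the collocate lists, and the name generate_candidates are hypothetical.

```python
def generate_candidates(entities, ranked_collocates, fan=50):
    """Union of the top `fan` collocated terms of every entity (depth 1)."""
    candidates = set()
    for entity in entities:
        # ranked_collocates[entity] is assumed to be sorted, most frequent first.
        candidates.update(ranked_collocates.get(entity, [])[:fan])
    return candidates


# Made-up collocate rankings for two question entities.
ranked = {
    "feet": ["arches", "toes", "shoes"],
    "eyebrows": ["arches", "brows"],
}
candidates = generate_candidates(["feet", "eyebrows"], ranked, fan=2)
```

Taking the union rather than the intersection favors recall: a concept related to only one entity still becomes a candidate and is left for the scorer to rank down.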

Common-bond answer scorer

Each candidate proposed by the common-bond candidate generator is scored on the basis of its semantic relatedness to each given entity using our spreading-activation process. Specifically, we compute an NGD similarity score using our n-gram corpus.

The scores representing the candidate’s semantic relatedness to the given entities are multiplied together to represent the overall goodness of the candidate as a common-bond answer. This score is posted as a feature for the candidate answer to be considered in the final merging and ranking process.
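A minimal sketch of the multiplication step follows. It assumes a relatedness function that returns higher values for more closely related terms; how an NGD distance is turned into such a similarity is not stated in these notes, so the example uses an invented lookup table instead.

```python
import math


def common_bond_score(candidate, entities, relatedness):
    """Multiply the candidate's relatedness to each question entity."""
    return math.prod(relatedness(candidate, entity) for entity in entities)


# Made-up similarity values in [0, 1]; missing pairs count as unrelated.
SIMILARITY = {
    ("arches", "feet"): 0.8,
    ("arches", "eyebrows"): 0.5,
    ("toes", "feet"): 0.9,
}


def lookup(candidate, entity):
    return SIMILARITY.get((candidate, entity), 0.0)


entities = ["feet", "eyebrows"]
arches_score = common_bond_score("arches", entities, lookup)
toes_score = common_bond_score("toes", entities, lookup)
```

Multiplying, rather than summing, means a candidate that is unrelated to even one entity scores near zero, which matches the requirement that the common bond be related to all entities.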

(How to score?)

Application to missing-link questions

For an entity to be a good missing link, there are two necessary conditions: It must be highly related to concepts in the question, and it must be ruled out as a possible correct answer. We applied an empirically selected threshold on the score produced by this model to select a small number of entities that are most highly associated with the question.
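The two conditions can be sketched as a simple filter. The notes do not say how an entity is determined to be ruled out as an answer or what the threshold value is, so both are passed in as inputs here; the scores, the entity names, and select_missing_links are assumptions for illustration.

```python
def select_missing_links(association_scores, ruled_out_as_answer, threshold):
    """Keep entities that are strongly associated with the question
    and are known not to be the answer, strongest first."""
    kept = [
        (entity, score)
        for entity, score in association_scores.items()
        if score >= threshold and entity in ruled_out_as_answer
    ]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [entity for entity, _ in kept]


# Made-up association scores for entities related to a question.
scores = {"Mount Everest": 0.92, "George Mallory": 0.85, "Nepal": 0.30}
ruled_out = {"Mount Everest", "Nepal"}
links = select_missing_links(scores, ruled_out, threshold=0.5)
```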

Candidate generation using missing links

In the second iteration, new search queries are produced by augmenting each existing query with a missing link.
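One simple reading of this step, assuming queries are plain keyword strings and that augmenting means appending the missing link as an extra term, is sketched below. The actual query syntax is not given in these notes, and augment_queries is a hypothetical name.

```python
def augment_queries(queries, missing_links):
    """Produce one new query per (existing query, missing link) pair."""
    return [f"{query} {link}" for query in queries for link in missing_links]


new_queries = augment_queries(
    ["first to reach summit"],
    ["Mount Everest", "Himalayas"],
)
```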

Missing-link answer scorer

We believe that semantic-relatedness scorers through identified missing links are more likely to successfully capture the intended relationships expressed between the question and the correct answer.

The goal of the semantic-relatedness scorers is to determine, for each candidate answer, the degree of relatedness between the candidate answer and some concept in the question through an identified missing link.
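A scorer of this kind can be sketched as follows. It assumes a relatedness function between a missing link and a candidate, and it takes the strongest link as the candidate's score; that aggregation rule, the lookup table, and the function names are assumptions, not details given in these notes.

```python
def missing_link_score(candidate, missing_links, relatedness):
    """Score a candidate by its strongest relatedness to any missing link."""
    return max(
        (relatedness(link, candidate) for link in missing_links),
        default=0.0,  # no missing links identified
    )


# Made-up relatedness values between missing links and candidate answers.
RELATEDNESS = {
    ("Mount Everest", "Edmund Hillary"): 0.9,
    ("Mount Everest", "Neil Armstrong"): 0.1,
}


def lookup(link, candidate):
    return RELATEDNESS.get((link, candidate), 0.0)


hillary = missing_link_score("Edmund Hillary", ["Mount Everest"], lookup)
armstrong = missing_link_score("Neil Armstrong", ["Mount Everest"], lookup)
```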

References:

LUCENE: Apache Lucene. [Online]. Available: http://lucene.apache.org

Normalized Google distance (NGD): R. Cilibrasi and P. Vitanyi, "Automatic meaning discovery using Google," in Kolmogorov Complexity and Applications, 2006. [Online]. Available: http://homepages.cwi.nl/~paulv/papers/amdug.pdf

Candidate generation strategies:

J. Chu-Carroll, J. Fan, B. K. Boguraev, D. Carmel, D. Sheinwald, and C. Welty, "Finding needles in the haystack: Search and candidate generation," IBM J. Res. & Dev., vol. 56, no. 3/4, Paper 6, pp. 6:1–6:12, May/Jul. 2012.