

Natural Language Inference: Using Attention



Fig. 1.  This section feeds pretrained GloVe embeddings into an architecture based on attention and MLPs for natural language inference.

1. The Model



Fig. 2.  Natural language inference using attention mechanisms.


from d2l import mxnet as d2l

import mxnet as mx

from mxnet import autograd, gluon, init, np, npx

from mxnet.gluon import nn


1.1. Attending



def mlp(num_hiddens, flatten):

    net = nn.Sequential()


    net.add(nn.Dense(num_hiddens, activation='relu', flatten=flatten))


    net.add(nn.Dense(num_hiddens, activation='relu', flatten=flatten))

    return net


class Attend(nn.Block):

    def __init__(self, num_hiddens, **kwargs):

        super(Attend, self).__init__(**kwargs)

        self.f = mlp(num_hiddens=num_hiddens, flatten=False)

    def forward(self, A, B):

        # Shape of A/B: (batch_size, #words in sequence A/B, embed_size)

        # Shape of f_A/f_B: (batch_size, #words in sequence A/B, num_hiddens)

        f_A = self.f(A)

        f_B = self.f(B)

        # Shape of e: (batch_size, #words in sequence A, #words in sequence B)

        e = npx.batch_dot(f_A, f_B, transpose_b=True)

        # Shape of beta: (batch_size, #words in sequence A, embed_size), where

        # sequence B is softly aligned with each word (axis 1 of beta) in

        # sequence A

        beta = npx.batch_dot(npx.softmax(e), B)

        # Shape of alpha: (batch_size, #words in sequence B, embed_size),

        # where sequence A is softly aligned with each word (axis 1 of alpha)

        # in sequence B

        alpha = npx.batch_dot(npx.softmax(e.transpose(0, 2, 1)), A)

        return beta, alpha
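
To make the shapes in `Attend` more concrete, here is a minimal sketch of soft alignment for a single premise/hypothesis pair in plain Python. It assumes raw dot products as attention scores (that is, the identity in place of the learned MLP `f`), and the helper names `softmax`, `dot`, and `soft_align` as well as the toy vectors are illustrative only; they are not part of the d2l code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(u, v))

def soft_align(A, B):
    """For each vector in A, return a weighted average of the vectors in B.

    Assumption: scores are plain dot products (identity in place of f).
    """
    dim = len(B[0])
    aligned = []
    for a in A:
        weights = softmax([dot(a, b) for b in B])  # one weight per token in B
        aligned.append([sum(w * b[k] for w, b in zip(weights, B))
                        for k in range(dim)])
    return aligned

# Toy "embeddings": 2 premise tokens, 3 hypothesis tokens, embed_size = 2
A = [[1.0, 0.0], [0.0, 1.0]]
B = [[5.0, 0.0], [0.0, 5.0], [0.0, 0.0]]

beta = soft_align(A, B)   # B softly aligned with each token of A
alpha = soft_align(B, A)  # A softly aligned with each token of B
```

As in the batched code, `beta` has one row per token of `A` and `alpha` has one row per token of `B`; each row is a weighted average of the other sequence's vectors.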

1.2. Comparing


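In the comparing step, we feed the concatenation (operator [⋅,⋅]) of words from one sequence and aligned words from the other sequence into a function g (an MLP):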

class Compare(nn.Block):

    def __init__(self, num_hiddens, **kwargs):

        super(Compare, self).__init__(**kwargs)

        self.g = mlp(num_hiddens=num_hiddens, flatten=False)

    def forward(self, A, B, beta, alpha):

        V_A = self.g(np.concatenate([A, beta], axis=2))

        V_B = self.g(np.concatenate([B, alpha], axis=2))

        return V_A, V_B
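
As a rough illustration of the comparing step, the sketch below concatenates each token vector with its aligned vector and passes the result through a stand-in for `g`. The function `toy_g` (an elementwise ReLU) is a hypothetical placeholder for the learned MLP, chosen only so that the example runs without a deep learning framework; the real `g` maps to `num_hiddens` features instead.

```python
def toy_g(x):
    """Hypothetical stand-in for the MLP g: elementwise ReLU."""
    return [max(0.0, v) for v in x]

def compare(A, beta, g=toy_g):
    """Apply g to the concatenation [a_i, beta_i] for every token i."""
    # For Python lists, a + b concatenates along the feature axis.
    return [g(a + b) for a, b in zip(A, beta)]

A = [[1.0, -2.0], [0.5, 3.0]]       # (#words in A, embed_size)
beta = [[-1.0, 4.0], [2.0, -0.5]]   # same shape as A
V_A = compare(A, beta)              # one comparison vector per token of A
```

Because `g` is applied to each token independently, the number of comparison vectors always equals the number of tokens in the sequence.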

1.3. Aggregating


class Aggregate(nn.Block):

    def __init__(self, num_hiddens, num_outputs, **kwargs):

        super(Aggregate, self).__init__(**kwargs)

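        self.h = mlp(num_hiddens=num_hiddens, flatten=True)

        # Output layer mapping the hidden features to num_outputs classes

        self.h.add(nn.Dense(num_outputs))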


    def forward(self, V_A, V_B):

        # Sum up both sets of comparison vectors

        V_A = V_A.sum(axis=1)

        V_B = V_B.sum(axis=1)

        # Feed the concatenation of both summarization results into an MLP

        Y_hat = self.h(np.concatenate([V_A, V_B], axis=1))

        return Y_hat
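
The summation in `Aggregate` can be sketched in plain Python as follows. The helper `sum_over_tokens` and the toy values are assumptions made for illustration; the point is only that summing over the token axis yields a fixed-size vector regardless of sequence length or token order.

```python
def sum_over_tokens(V):
    """Sum comparison vectors over the token axis (axis 1 in the batched code)."""
    return [sum(col) for col in zip(*V)]

V_A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 tokens, 2 features
V_B = [[0.5, 0.5]]                          # 1 token, 2 features

v_A = sum_over_tokens(V_A)   # [9.0, 12.0]
v_B = sum_over_tokens(V_B)   # [0.5, 0.5]

# Concatenation has length 4, independent of the two sequence lengths.
features = v_A + v_B
```

The concatenated `features` vector is what the final MLP `h` would receive for this example.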

1.4. Putting All Things Together


class DecomposableAttention(nn.Block):

    def __init__(self, vocab, embed_size, num_hiddens, **kwargs):

        super(DecomposableAttention, self).__init__(**kwargs)

        self.embedding = nn.Embedding(len(vocab), embed_size)

        self.attend = Attend(num_hiddens)

        self.compare = Compare(num_hiddens)

        # There are 3 possible outputs: entailment, contradiction, and neutral

        self.aggregate = Aggregate(num_hiddens, 3)

    def forward(self, X):

        premises, hypotheses = X

        A = self.embedding(premises)

        B = self.embedding(hypotheses)

        beta, alpha = self.attend(A, B)

        V_A, V_B = self.compare(A, B, beta, alpha)

        Y_hat = self.aggregate(V_A, V_B)

        return Y_hat

2. Training and Evaluating the Model


2.1. Reading the Dataset


batch_size, num_steps = 256, 50

train_iter, test_iter, vocab = d2l.load_data_snli(batch_size, num_steps)

read 549367 examples

read 9824 examples

2.2. Creating the Model


embed_size, num_hiddens, ctx = 100, 200, d2l.try_all_gpus()

net = DecomposableAttention(vocab, embed_size, num_hiddens)

net.initialize(init.Xavier(), ctx=ctx)

glove_embedding = d2l.TokenEmbedding('glove.6b.100d')

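embeds = glove_embedding[vocab.idx_to_token]

# Initialize the embedding layer with the pretrained GloVe vectors

net.embedding.weight.set_data(embeds)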


2.3. Training and Evaluating the Model



def split_batch_multi_inputs(X, y, ctx_list):

    """Split multi-input X and y into multiple devices specified by ctx"""

    X = list(zip(*[gluon.utils.split_and_load(

        feature, ctx_list, even_split=False) for feature in X]))

    return (X, gluon.utils.split_and_load(y, ctx_list, even_split=False))
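
The `list(zip(*...))` idiom in `split_batch_multi_inputs` regroups per-input shards into per-device tuples. The sketch below shows the same transposition with strings standing in for array shards; the names and the choice of two devices are purely illustrative.

```python
# Each input (premises, hypotheses) has been split into one shard per device.
shards_per_feature = [
    ['premises@dev0', 'premises@dev1'],
    ['hypotheses@dev0', 'hypotheses@dev1'],
]

# zip(*...) transposes the nesting: one (premises, hypotheses) tuple per device.
shards_per_device = list(zip(*shards_per_feature))
```

Each tuple in `shards_per_device` then has the `(premises, hypotheses)` form that the model's `forward` method unpacks.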


lr, num_epochs = 0.001, 4

trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': lr})

loss = gluon.loss.SoftmaxCrossEntropyLoss()

d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, ctx,

               split_batch_multi_inputs)


loss 0.511, train acc 0.798, test acc 0.823

9402.9 examples/sec on [gpu(0), gpu(1)]


2.4. Using the Model



def predict_snli(net, vocab, premise, hypothesis):

    premise = np.array(vocab[premise], ctx=d2l.try_gpu())

    hypothesis = np.array(vocab[hypothesis], ctx=d2l.try_gpu())

    label = np.argmax(net([premise.reshape((1, -1)),

                           hypothesis.reshape((1, -1))]), axis=1)

    return ('entailment' if label == 0 else 'contradiction' if label == 1

            else 'neutral')


predict_snli(net, vocab, ['he', 'is', 'good', '.'], ['he', 'is', 'bad', '.'])


3. Summary

  • The decomposable attention model consists of three steps for predicting the logical relationships between premises and hypotheses: attending, comparing, and aggregating.
  • With attention mechanisms, we can align words in one text sequence to every word in the other, and vice versa. Such alignment is soft: each aligned representation is a weighted average, where ideally large weights are associated with the words to be aligned.
  • The decomposition trick applies the MLP f to each word separately rather than to each pair of words, which leads to a more desirable linear complexity, instead of quadratic complexity, in the number of applications of f when computing attention weights.
  • We can use pretrained word embeddings as the input representation for downstream natural language processing tasks such as natural language inference.
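
The complexity point about the decomposition trick can be checked with a small counting experiment. This is a sketch under stated assumptions: `f` here is a toy function that doubles its input (not the learned MLP), and the pairwise variant simply re-evaluates `f` for every pair of tokens to mimic a scoring function that does not decompose.

```python
calls = {'f': 0}

def f(x):
    """Toy stand-in for the MLP f; counts how often it is evaluated."""
    calls['f'] += 1
    return [2.0 * v for v in x]

def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(u, v))

A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # m = 3 tokens
B = [[1.0, 2.0], [3.0, 4.0]]              # n = 2 tokens

# Decomposed form: evaluate f once per token, then take dot products.
f_A = [f(a) for a in A]
f_B = [f(b) for b in B]
e = [[dot(fa, fb) for fb in f_B] for fa in f_A]
decomposed_calls = calls['f']             # m + n evaluations of f

# Pairwise form: re-evaluate f for every (i, j) pair.
calls['f'] = 0
e_pairwise = [[dot(f(a), f(b)) for b in B] for a in A]
pairwise_calls = calls['f']               # 2 * m * n evaluations of f
```

Both variants produce the same score matrix, but the decomposed form evaluates `f` only m + n times. The dot products themselves still cover all m × n pairs; only the number of applications of `f` becomes linear.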