Deep Learning for Natural Language Processing: Vector Space Models

Three types of vector space models:

term–document

word–context

pair–pattern

Semantics: the meaning of a word, a phrase, a sentence, or any text in human language, and the study of such meaning

Key properties:

Information is extracted automatically from a corpus, saving manual effort.

Similarity can be measured between words, phrases, and documents.

The Term–Document Matrix: row vectors correspond to terms, column vectors correspond to documents

Bag (multiset): a collection that may contain duplicate elements. A collection of bags is represented as a matrix X in which each column x_:j corresponds to a bag, each row x_i: corresponds to a unique member, and an element x_ij is the frequency of the i-th member in the j-th bag
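A minimal sketch of building such a term–document matrix from a toy corpus; the two documents and the whitespace tokenization are illustrative assumptions:

```python
from collections import Counter

# Toy corpus: each document is one bag of words (duplicates allowed).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# One row per unique term (type), one column per document.
terms = sorted({t for d in docs for t in d.split()})

# x_ij = frequency of the i-th term in the j-th document.
X = [[Counter(d.split())[t] for d in docs] for t in terms]

for t, row in zip(terms, X):
    print(f"{t:>4}: {row}")
```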

word–context

The distributional hypothesis in linguistics is that words that occur in similar contexts tend to have similar meanings
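A sketch of filling a word–context matrix from running text by counting context words inside a symmetric window; the window size of 2 and the toy sentence are assumptions for illustration:

```python
from collections import defaultdict

tokens = "the quick brown fox jumps over the lazy dog".split()
window = 2  # context = words within +/-2 positions (an assumed setting)

# cooc[word][context_word] = co-occurrence count (row = word, column = context).
cooc = defaultdict(lambda: defaultdict(int))
for i, w in enumerate(tokens):
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    for j in range(lo, hi):
        if j != i:
            cooc[w][tokens[j]] += 1

print(dict(cooc["fox"]))  # contexts observed around "fox"
```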

pair–pattern

mason : stone
carpenter : wood

Both pairs co-occur with patterns such as "X cuts Y" and "X works with Y"

The extended distributional hypothesis is that patterns that co-occur with similar pairs tend to have similar meanings

The latent relation hypothesis is that pairs of words that co-occur in similar patterns tend to have similar semantic relations

Attributional similarity (word–context): sim_a(a, b) ∈ R

Relational similarity (pair–pattern): sim_r(a : b, c : d) ∈ R
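Both similarity functions are commonly realized as the cosine of the angle between the corresponding matrix rows; a minimal sketch with hypothetical context counts:

```python
import math

def cosine(u, v):
    # Cosine of the angle between two frequency vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical word–context rows for two words.
dog = [3, 0, 1, 2]
cat = [2, 0, 2, 1]
print(cosine(dog, cat))  # sim_a(dog, cat); in [0, 1] for nonnegative counts
```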

A token is a single instance of a symbol, whereas a type is a general class of tokens

Statistical semantics hypothesis

If units of text have similar vectors in a text frequency matrix, then they tend to have similar meanings

Bag of words hypothesis

If documents and pseudodocuments (queries) have similar column vectors in a term–document matrix, then they tend to have similar meanings

Distributional hypothesis

If words have similar row vectors in a word–context matrix, then they tend to have similar meanings

Extended distributional hypothesis

If patterns have similar column vectors in a pair–pattern matrix, then they tend to express similar semantic relations

Latent relation hypothesis

If word pairs have similar row vectors in a pair–pattern matrix, then they tend to have similar semantic relations

Linguistic Processing for Vector Space Models

1. Tokenize the raw text: decide what constitutes a term and how to extract terms from raw text

Challenges include punctuation (e.g., don’t, Jane’s, and/or), hyphenation (e.g., state-of-the-art versus state of the art), and recognizing multi-word terms (e.g., Barack Obama and ice hockey)

2. Normalize the raw text: convert superficially different strings of characters to the same form

Example: case folding (converting all characters to lower case)

3. Annotate the raw text: mark identical strings of characters as being different (e.g., tagging fly as a verb versus fly as a noun)
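A rough sketch of the three steps as one pipeline; the regex tokenizer and the fake part-of-speech tag used for annotation are illustrative choices rather than anything the text prescribes:

```python
import re

def tokenize(text):
    # Keep word-internal apostrophes and hyphens (don't, state-of-the-art).
    return re.findall(r"[A-Za-z]+(?:['\-][A-Za-z]+)*", text)

def normalize(tokens):
    # Case folding: map superficially different strings to one form.
    return [t.lower() for t in tokens]

def annotate(tokens):
    # Toy annotation: mark identical strings as different terms; a real
    # system would attach tags produced by a part-of-speech tagger.
    return [f"{t}/NN" for t in tokens]

raw = "Jane's dog can't judge state-of-the-art NLP."
print(annotate(normalize(tokenize(raw))))
```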

Mathematical Processing for Vector Space Models

1. Generate a matrix of frequencies

First, scan sequentially through the corpus, recording events and their frequencies in a hash table, a database, or a search engine index. Second, use the resulting data structure to generate the frequency matrix, using a sparse matrix representation
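A sketch of that two-pass procedure, with a Python dict standing in for the hash table and scipy (an assumed dependency) providing the sparse representation:

```python
from collections import Counter
from scipy.sparse import coo_matrix

docs = [["apple", "banana", "apple"], ["banana", "cherry"]]

# Pass 1: scan the corpus, recording (term, doc) events in a hash table.
counts = Counter((term, j) for j, doc in enumerate(docs) for term in doc)

# Pass 2: turn the recorded events into a sparse term–document matrix.
vocab = {t: i for i, t in enumerate(sorted({t for d in docs for t in d}))}
triples = [(vocab[t], j, c) for (t, j), c in counts.items()]
rows, cols, vals = zip(*triples)
X = coo_matrix((vals, (rows, cols)), shape=(len(vocab), len(docs)))
print(X.toarray())
```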

2. Adjust the weights of the elements in the matrix

Common term-weighting schemes:

the tf-idf (term frequency × inverse document frequency) family of weighting functions

length normalization

Pointwise Mutual Information (PMI); a known problem is that PMI is biased toward infrequent events
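Sketches of both weighting families; the add-k smoothing in the PPMI variant is one common way to damp PMI's bias toward infrequent events, not the paper's specific recipe, and numpy is an assumed dependency:

```python
import numpy as np

def tfidf(X):
    # X: term-by-document count matrix; returns tf-idf weights.
    tf = X / np.maximum(X.sum(axis=0, keepdims=True), 1)  # term frequency
    df = (X > 0).sum(axis=1, keepdims=True)               # document frequency
    idf = np.log(X.shape[1] / np.maximum(df, 1))          # inverse doc freq
    return tf * idf

def ppmi(C, k=1.0):
    # C: word-by-context count matrix; positive PMI with add-k smoothing.
    C = C + k                                   # smoothing damps rare events
    total = C.sum()
    pw = C.sum(axis=1, keepdims=True) / total   # P(word)
    pc = C.sum(axis=0, keepdims=True) / total   # P(context)
    pmi = np.log((C / total) / (pw * pc))
    return np.maximum(pmi, 0.0)                 # clip negatives: PPMI

X = np.array([[4.0, 0.0, 1.0], [1.0, 3.0, 0.0]])
print(tfidf(X))
print(ppmi(X))
```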

3. Smooth the matrix to reduce the amount of random noise and to fill in some of the zero elements in a sparse matrix

Singular Value Decomposition (SVD)

SVD serves four purposes here: uncovering latent meaning, noise reduction, capturing high-order co-occurrence, and sparsity reduction
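A minimal sketch of smoothing by truncated SVD: keep only the top-k singular values and reconstruct; numpy and the choice k=2 are assumptions:

```python
import numpy as np

X = np.array([[2.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 2.0, 1.0],
              [0.0, 1.0, 1.0, 2.0]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                                   # keep the top-k latent dimensions
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# X_k is the best rank-k approximation of X in the least-squares sense:
# small singular directions (noise) are discarded and zeros get filled in.
print(np.round(X_k, 2))
```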

Optimizations and Parallelization for Similarity Computation

Sparse-matrix multiplication: the similarity computation splits into three parts, the values nonzero only in X, the values nonzero only in Y, and the values nonzero in both X and Y; only the last part contributes to a dot product, so only coordinates that are nonzero in both vectors need to be visited
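A sketch of this with scipy's sparse formats (an assumed dependency), where the multiplication only touches coordinates that are nonzero in both operands:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Row-sparse word vectors; zero entries are never stored or touched.
X = csr_matrix(np.array([[3.0, 0.0, 0.0, 1.0, 0.0],
                         [0.0, 2.0, 0.0, 1.0, 0.0],
                         [1.0, 0.0, 0.0, 0.0, 4.0]]))

# All pairwise dot products via sparse-matrix multiplication: only the
# positions that are nonzero in both rows contribute to each sum.
G = (X @ X.T).toarray()
norms = np.sqrt(G.diagonal())
print(G / np.outer(norms, norms))  # cosine similarity matrix
```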

Distributed processing with MapReduce (e.g., Hadoop)

Randomized algorithms for dimension reduction (e.g., random projection)
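A sketch of one such randomized method, Gaussian random projection in the Johnson–Lindenstrauss style; the dimensions and scaling are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1000, 5000))             # 1000 vectors in 5000 dimensions

d = 300                                  # assumed target dimension
R = rng.standard_normal((5000, d)) / np.sqrt(d)
X_low = X @ R                            # pairwise distances roughly preserved

# Quick sanity check: compare one pairwise distance before and after.
print(np.linalg.norm(X[0] - X[1]), np.linalg.norm(X_low[0] - X_low[1]))
```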

machine learning

Original post: https://www.cnblogs.com/learnmuch/p/5971268.html