倒排索引

倒排索引（inverted index）

常被成为反向索引、置入文档和反向档案，是一种索引方法，被用来存储在全文搜索下某个单词在一个文档

或者一组文档中的存储位置的映射。是文档检索系统中最常用的数据结构。

例如：

下面是要被索引的文本：

T0 = "it is what it is"

T1 = "what is it"

T2 = "it is a banana"

生成的倒排索引可以表示为下面所示：

"a" = {(2,2)}

"banana" = {(2,3)}

"is" = {(0,1),(0,4),(1,1),(2,1)}

"it" = {(0,0),(0,3),(1,2),(2,0)}

"what" = {(0,2),(1,0)}

我们可以得到这些完全反向索引，有（文档位置、查询单词所在文档中位置）组成的成对数据。

同样，文档位置、和查询单词所在文档中位置，都从零开始计算。

所以，"banana":{(2,3)}表示 banana在第三个文档中的第四个单词位置。

=====例子如下：

DATA:存储正向索引

word_index:存储倒排索引，每个空格分隔的单词作为key，

　　　　　　value是list结果，通过list.append方法，依次添加相应单词在文本文件中的位置()。

　　　　　　单词位置使用（行中index+所在行号）的形式表示。　　

#coding:utf-8
import sys

DATA = {}
word_index = {}# query->(line_no,word_index)

#using rever_index
#使用倒排结果
def check_index(sentense):
    query = sentense.split(' ')
    for v in query:
        if word_index.has_key(v)==True:
            #print word_index[v],"####",v
            for index_lineno in word_index[v]:  #['0.0','2,1','2,3']
                #print index_lineno
                print DATA[int(index_lineno.split('.')[1])]




if __name__ =="__main__":
    # 生成倒排
    line_num = 0
    for line in sys.stdin:
        line = line.strip(' 
')
        fields = line.split(' ')

        DATA[line_num] = line

        for i, val in enumerate(fields):
            if word_index.has_key(val) == False:
                word_index[val] = []
            word_index[val].append(".".join(
                [str(i), str(line_num)]))
        line_num += 1

    print word_index
    print DATA

    print "=====test query"
    queries = "it is example"
    print ("####input search sentense:%s",queries)
    print "####search result is :"
    check_index(queries)
    print "done=========="
    sys.exit(0)

=====

input.data 文本文件：

it is what it is
what is it
it is a banana
from your second example
When I run the algo using some sample
What am I doing wrong ?

======运行结果：

{'What': ['0.5'], 'doing': ['3.5'], 'is': ['1.0', '4.0', '1.1', '1.2'], 
'some': ['6.4'], 'it': ['0.0', '3.0', '2.1', '0.2'], 'sample': ['7.4'], 
'second': ['2.3'], 'your': ['1.3'], 'what': ['2.0', '0.1'], 'from': ['0.3'], 
'banana': ['3.2'], '?': ['5.5'], 'run': ['2.4'], 'I': ['1.4', '2.5'], 
'When': ['0.4'], 'wrong': ['4.5'], 'using': ['5.4'], 'a': ['2.2'], 
'am': ['1.5'], 'algo': ['4.4'], 'the': ['3.4'], 'example': ['3.3']}

{0: 'it is what it is', 1: 'what is it', 2: 'it is a banana', 
　　3: 'from your second example', 4: 'When I run the algo using some sample', 
　　5: 'What am I doing wrong ?'}
=====test query
('####input search sentense:%s', 'it is example')
####search result is :
it is what it is
it is what it is
what is it
it is a banana
it is what it is
it is what it is
what is it
it is a banana
from your second example
done==========