Lucene.net Series, Part 5: Search (Part 1)

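The earlier parts of this series were all about building indexes. Now it is time to put those indexes to use and run searches. Thanks to Lucene's well-designed architecture, a few lines of code are enough to add search to an application. Let's start with the classes used most often when searching.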

Searching for a specific term

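When a search completes, it returns Hits, a result set sorted by score. The score measures how closely a document matches the query; as with Google, every page gets a score and the results are ordered by it. And just as with Google, you will not look at every result; you may look only at the first one. So Hits does not return all the matching documents themselves, only references to them, and through such a reference you can retrieve the actual document. The reason is easy to see: returning the matching documents directly would mean far too much data, and many of the results you would never even look at. Would you read past page 10 of Google's search results?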
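To make the point about references concrete, here is a minimal sketch that reads only the first few entries of a Hits object. It assumes the same Lucene.Net 1.x-era API used in the rest of this article. RAMDirectory, Hits.Score, the HitsDemo class and its sample strings do not appear in the article's own example; treat them as assumptions made for illustration.

```csharp
using System;
using Lucene.Net.Analysis;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

// Hypothetical demo class, not part of the article's test fixture.
public class HitsDemo
{
    // Returns the "subject" value of at most maxResults top-scoring documents.
    public static string[] TopSubjects(int maxResults)
    {
        // Build a tiny in-memory index (RAMDirectory is assumed to be available).
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new SimpleAnalyzer(), true);
        string[] subjects = { "ant junit", "junit mock", "junit" };
        foreach (string s in subjects)
        {
            Document doc = new Document();
            doc.Add(Field.Text("subject", s));
            writer.AddDocument(doc);
        }
        writer.Close();

        IndexSearcher searcher = new IndexSearcher(dir);
        Hits hits = searcher.Search(new TermQuery(new Term("subject", "junit")));

        // Hits knows how many documents matched, but we only ask for the first few.
        int top = Math.Min(maxResults, hits.Length());
        string[] result = new string[top];
        for (int i = 0; i < top; i++)
        {
            // The Document is fetched only when Doc(i) is called.
            Document d = hits.Doc(i);
            Console.WriteLine(hits.Score(i) + "  " + d.Get("subject"));
            result[i] = d.Get("subject");
        }
        searcher.Close();
        return result;
    }

    public static void Main()
    {
        TopSubjects(2);
    }
}
```

All three sample documents match, yet only two are ever loaded; the third stays an unread reference.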

The following example gives a brief introduction to searching.

First, build the index:

using System;
using Lucene.Net.Analysis;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using NUnit.Framework;

namespace dotLucene.inAction.BasicSearch
{
    [TestFixture]
    public class BaseIndexingTestCase
    {
        // Sample data: element i of every array belongs to document i.
        protected String[] keywords = {"1930110994", "1930110995"};
        protected String[] unindexed = {"Java Development with Ant", "JUnit in Action"};
        protected String[] unstored = {
            "we have ant and junit",
            "junit use a mock,ant is also",
        };
        protected String[] text1 = {
            "ant junit",
            "junit mock"
        };
        protected String[] text2 = {
            "200206",
            "200309"
        };
        protected String[] text3 = {
            "/Computers/Ant", "/Computers/JUnit"
        };

        protected Directory dir;

        // Runs before every test: recreates the index in the "index" folder.
        [SetUp]
        protected void Init()
        {
            string indexDir = "index";
            dir = FSDirectory.GetDirectory(indexDir, true);
            AddDocuments(dir);
        }

        // Adds one document per sample row, then optimizes and closes the writer.
        protected void AddDocuments(Directory dir)
        {
            IndexWriter writer = new IndexWriter(dir, GetAnalyzer(), true);
            for (int i = 0; i < keywords.Length; i++)
            {
                Document doc = new Document();
                doc.Add(Field.Keyword("isbn", keywords[i]));
                doc.Add(Field.UnIndexed("title", unindexed[i]));
                doc.Add(Field.UnStored("contents", unstored[i]));
                doc.Add(Field.Text("subject", text1[i]));
                doc.Add(Field.Text("pubmonth", text2[i]));
                doc.Add(Field.Text("category", text3[i]));
                writer.AddDocument(doc);
            }
            writer.Optimize();
            writer.Close();
        }

        // SimpleAnalyzer by default; "pubmonth" and "category" use WhitespaceAnalyzer.
        protected virtual Analyzer GetAnalyzer()
        {
            PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(
                new SimpleAnalyzer());
            analyzer.AddAnalyzer("pubmonth", new WhitespaceAnalyzer());
            analyzer.AddAnalyzer("category", new WhitespaceAnalyzer());
            return analyzer;
        }

        // The [Test] methods shown in the rest of this article are members of this class.
    }
}

This code relies on some knowledge of Analyzers, which will be covered in a later part of the series.


Next, use TermQuery to search for a single Term (you can think of a Term as a word):

[Test]
public void Term()
{
    // Open a searcher over the index directory created in Init().
    IndexSearcher searcher = new IndexSearcher(dir);

    // Only the first document has "ant" in its subject field.
    Term t = new Term("subject", "ant");
    Query query = new TermQuery(t);
    Hits hits = searcher.Search(query);
    Assert.AreEqual(1, hits.Length(), "JDwA");

    // Both documents have "junit" in their subject field.
    t = new Term("subject", "junit");
    hits = searcher.Search(new TermQuery(t));
    Assert.AreEqual(2, hits.Length());

    searcher.Close();
}

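Simplifying queries with QueryParser

Clearly, with so many kinds of queries (AND/OR relationships and other, more complex queries, introduced below), you do not want to hand-write the matching XXXQuery for each of them. Lucene has already thought of this: with the QueryParser class you simply write an ordinary search expression, and Lucene converts it internally.

The process resembles querying a database. We are used to writing SQL statements, but inside the database they still have to be converted, because the engine does not execute SQL text directly; it works on a query syntax tree.

Let's look at an example.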

[Test]
public void TestQueryParser()
{
    // Open a searcher over the index directory created in Init().
    IndexSearcher searcher = new IndexSearcher(dir);

    // Must contain junit and ant, must not contain mock.
    Query query = QueryParser.Parse("+JUNIT +ANT -MOCK",
        "contents",
        new SimpleAnalyzer());
    Hits hits = searcher.Search(query);
    Assert.AreEqual(1, hits.Length());
    Document d = hits.Doc(0);
    Assert.AreEqual("Java Development with Ant", d.Get("title"));

    // Either term is enough, so both documents match.
    query = QueryParser.Parse("mock OR junit",
        "contents",
        new SimpleAnalyzer());
    hits = searcher.Search(query);
    Assert.AreEqual(2, hits.Length(), "JDwA and JIA");

    searcher.Close();
}

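As the code above shows, we do not need to construct a specific XXXQuery for each kind of query. The static Parse method of the QueryParser class conveniently converts a readable query expression into the various query objects that Lucene uses internally. One thing to note: we passed a SimpleAnalyzer to Parse, so the query text is transformed before the search runs. Here, for example, uppercase words such as JUNIT are converted to lowercase, which is why the search finds matches (the terms in the index are lowercase). If you replace SimpleAnalyzer with WhitespaceAnalyzer, the search finds nothing. The details will be explained later.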
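One way to see what the analyzer does to a query is to print the parsed query. The sketch below parses the same expression with both analyzers. It assumes Query.ToString(string field) is available in this Lucene.Net version; AnalyzerEffectDemo and Describe are hypothetical names introduced only for this illustration.

```csharp
using System;
using Lucene.Net.Analysis;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

// Hypothetical demo class, not part of the article's test fixture.
public class AnalyzerEffectDemo
{
    // Parses the expression against the "contents" field with the given analyzer
    // and returns the string form of the resulting query.
    public static string Describe(string expression, Analyzer analyzer)
    {
        Query query = QueryParser.Parse(expression, "contents", analyzer);
        return query.ToString("contents");
    }

    public static void Main()
    {
        string expression = "+JUNIT +ANT -MOCK";

        // SimpleAnalyzer lowercases terms, so they can match the lowercase index terms.
        Console.WriteLine(Describe(expression, new SimpleAnalyzer()));

        // WhitespaceAnalyzer only splits on whitespace and keeps the original case.
        Console.WriteLine(Describe(expression, new WhitespaceAnalyzer()));
    }
}
```

The first line printed should show lowercased terms and the second should keep the uppercase ones. Uppercase terms cannot match the lowercase terms stored in the index.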

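+A +B means that A and B must both be present, -C means that C must be absent, and A OR B means that at least one of A and B must be present. The detailed query rules are as follows: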

Here, title and the other fields refer to the field names you used when building the index.
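For comparison, the +, - and OR operators described above can also be written out by hand as a BooleanQuery, which is roughly the work QueryParser saves you. This is a sketch that assumes the three-argument BooleanQuery.Add(query, required, prohibited) overload of the Lucene.Net 1.x line. BooleanQueryDemo and its method names are hypothetical.

```csharp
using Lucene.Net.Index;
using Lucene.Net.Search;

// Hypothetical demo class, not part of the article's test fixture.
public class BooleanQueryDemo
{
    // Hand-built counterpart of "+junit +ant -mock" on the "contents" field.
    public static Query RequiredAndProhibited()
    {
        BooleanQuery query = new BooleanQuery();
        // Arguments after the clause: required, prohibited.
        query.Add(new TermQuery(new Term("contents", "junit")), true, false);  // +junit
        query.Add(new TermQuery(new Term("contents", "ant")), true, false);    // +ant
        query.Add(new TermQuery(new Term("contents", "mock")), false, true);   // -mock
        return query;
    }

    // Hand-built counterpart of "mock OR junit": neither clause is required.
    public static Query Either()
    {
        BooleanQuery query = new BooleanQuery();
        query.Add(new TermQuery(new Term("contents", "mock")), false, false);
        query.Add(new TermQuery(new Term("contents", "junit")), false, false);
        return query;
    }
}
```

No analyzer runs when a query is built this way, so the terms must be written exactly as they are stored in the index, which here means lowercase.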