[Reading Notes] Full-Text Search for a Website in Just 37 Lines of Code


English title: DotLucene: Full-Text Search for Your Intranet or Website using 37 Lines of Code

Original article: http://www.codeproject.com/KB/aspnet/DotLuceneSearch.aspx

Author: Dan Letecky

Online demo

Sample download (index files included)

 

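DotLucene, a .NET port of Lucene, is a capable full-text search engine. This article shows how to build full-text search for a website around a core of just 37 lines of code.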

Create the index:

// The third argument (true) creates a new index, overwriting any existing one.
IndexWriter writer =
   new IndexWriter(directory, new StandardAnalyzer(), true);

 

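Add a document to the index:

public void AddHtmlDocument(string path)
{
    Document doc = new Document();

    // Read the page and strip its markup so that only plain text is indexed.
    string rawText;
    using (StreamReader sr =
       new StreamReader(path, System.Text.Encoding.Default))
    {
        rawText = parseHtml(sr.ReadToEnd());
    }

    // "text" is tokenized and indexed, but not stored in the index.
    doc.Add(Field.UnStored("text", rawText));
    // "path" is stored and indexed as a single untokenized value.
    doc.Add(Field.Keyword("path", path));
    writer.AddDocument(doc);
}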
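The parseHtml helper called above is not listed in these notes. As a rough sketch of what such a helper might do, the version below strips tags with regular expressions and decodes entities. The class name HtmlText and the exact cleanup rules are assumptions made for illustration, not the original author's code, and WebUtility.HtmlDecode requires a newer .NET runtime than the one the article targeted.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Hypothetical stand-in for the parseHtml helper used by AddHtmlDocument.
    public static string ParseHtml(string html)
    {
        // Drop script and style blocks entirely, including their contents.
        string text = Regex.Replace(
            html,
            @"<(script|style)\b[^>]*>.*?</\1\s*>",
            " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Replace each remaining tag with a space so adjacent words stay separated.
        text = Regex.Replace(text, @"<[^>]*>", " ");

        // Decode entities such as &nbsp; and &amp;.
        text = WebUtility.HtmlDecode(text);

        // Collapse whitespace runs (including non-breaking spaces) to one space.
        return Regex.Replace(text, @"[\s\u00A0]+", " ").Trim();
    }
}
```

A regular expression is not a full HTML parser, so unusual markup can slip through. For building a search index over ordinary pages that is usually an acceptable trade-off.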

 

Optimize and save the index:

writer.Optimize();
writer.Close();
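Before the index is optimized and closed, every page of the site has to be passed to AddHtmlDocument. The notes do not show that traversal, so the following is only a sketch under assumptions: the SiteCrawler class, its method names, and the choice of .htm and .html extensions are all hypothetical, and the indexing call is supplied as a callback so the sketch does not depend on DotLucene itself.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class SiteCrawler
{
    // Recursively collects files under root whose extension is .htm or .html.
    public static List<string> FindHtmlFiles(string root)
    {
        var result = new List<string>();
        foreach (string file in Directory.GetFiles(root, "*", SearchOption.AllDirectories))
        {
            string ext = Path.GetExtension(file);
            if (ext.Equals(".htm", StringComparison.OrdinalIgnoreCase) ||
                ext.Equals(".html", StringComparison.OrdinalIgnoreCase))
            {
                result.Add(file);
            }
        }
        return result;
    }

    // Passes every HTML file under root to the supplied callback
    // (for example AddHtmlDocument) and returns how many files were visited.
    public static int IndexSite(string root, Action<string> addDocument)
    {
        List<string> files = FindHtmlFiles(root);
        foreach (string path in files)
        {
            addDocument(path);
        }
        return files.Count;
    }
}
```

With this in place, a full indexing run could look like SiteCrawler.IndexSite(siteRoot, AddHtmlDocument), followed by the Optimize and Close calls above, where siteRoot is a placeholder for the folder that holds the pages.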

 

Open the index for searching:

IndexSearcher searcher = new IndexSearcher(directory);

 

Run the search:

// q holds the query string; it is parsed against the "text" field.
Query query =
   QueryParser.Parse(q, "text", new StandardAnalyzer());
Hits hits = searcher.Search(query);
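Because q is parsed as Lucene query syntax, raw user input that contains characters such as parentheses or quotes may fail to parse or be read as operators. One possible precaution, sketched below, is to backslash-escape those characters before parsing. The QueryText class is hypothetical, and the character list follows the special characters named in Lucene's query syntax documentation. Whether a particular DotLucene release honors every escape is an assumption to verify.

```csharp
using System.Text;

public static class QueryText
{
    // Characters that carry special meaning in Lucene's query syntax.
    private const string Special = "+-&|!(){}[]^\"~*?:\\";

    // Prefixes each special character with a backslash so the query
    // parser treats it as literal text instead of an operator.
    public static string Escape(string input)
    {
        var sb = new StringBuilder(input.Length * 2);
        foreach (char c in input)
        {
            if (Special.IndexOf(c) >= 0)
            {
                sb.Append('\\');
            }
            sb.Append(c);
        }
        return sb.ToString();
    }
}
```

Escaping everything also disables phrase quotes and wildcards, so this suits a plain keyword box better than an advanced search form.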

 

The variable hits is a collection of the matching documents. The following code iterates over the results and stores them in a DataTable.

 

DataTable dt = new DataTable();
dt.Columns.Add("path", typeof(string));
dt.Columns.Add("sample", typeof(string));

for (int i = 0; i < hits.Length(); i++)
{
    // Get the document from the index.
    Document doc = hits.Doc(i);

    // Only the path can be read back from the index;
    // the text was indexed but not stored there.
    DataRow row = dt.NewRow();
    row["path"] = doc.Get("path");

    dt.Rows.Add(row);
}

 

Highlight the query terms:

QueryHighlightExtractor highlighter =
  new QueryHighlightExtractor(query, new StandardAnalyzer(),
                         "<B>", "</B>");

 

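While building the results, the following code extracts only the parts of each document that are relevant to the query:

for (int i = 0; i < hits.Length(); i++)
{
    // ...
    // Re-read the page from disk, because its text is not stored in the index.
    // "path" is the field stored by AddHtmlDocument.
    string plainText;
    using (StreamReader sr =
      new StreamReader(doc.Get("path"),
                  System.Text.Encoding.Default))
    {
        plainText = parseHtml(sr.ReadToEnd());
    }
    row["sample"] =
       highlighter.GetBestFragments(plainText, 80, 2, "...");
    // ...
}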
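To make the idea of a highlighted fragment concrete, here is a deliberately simplified stand-in. It is not how QueryHighlightExtractor works internally: it handles a single literal term, takes one window of text around the first match, and wraps matches in the same <B> tags. The NaiveHighlighter class and its parameters are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NaiveHighlighter
{
    // Returns a window of at most `size` characters around the first
    // occurrence of `term`, with each occurrence inside the window
    // wrapped in <B> tags. Returns an empty string when there is no match.
    public static string Fragment(string text, string term, int size)
    {
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(term))
        {
            return "";
        }

        int hit = text.IndexOf(term, StringComparison.OrdinalIgnoreCase);
        if (hit < 0)
        {
            return "";
        }

        // The window must be at least as long as the term itself.
        size = Math.Max(size, term.Length);

        // Center the window on the match, clamped to the text boundaries.
        int start = Math.Max(0, hit - (size - term.Length) / 2);
        int length = Math.Min(size, text.Length - start);
        string window = text.Substring(start, length);

        // Wrap every case-insensitive occurrence of the term in the window.
        string marked = Regex.Replace(
            window,
            Regex.Escape(term),
            m => "<B>" + m.Value + "</B>",
            RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

        // Mark truncation at either end with an ellipsis.
        return (start > 0 ? "..." : "") + marked +
               (start + length < text.Length ? "..." : "");
    }
}
```

The real highlighter works from the parsed query and the analyzer rather than a literal string, so it can handle multi-term queries and pick the best-scoring fragments. This sketch only illustrates the shape of the output.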

 

 


 

Related resources:

DotLucene download

DotLucene online demo

DotLucene demo notes

DotLucene documentation

Original post: https://www.cnblogs.com/aspnetx/p/1013163.html