Lucene 2.9.2 + PanGu Segment (盘古分词) 2.3.1, Part 1: Getting Started with a Simple Index and Search (original post)

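Seeing is believing:

[Screenshot: QQ截图20110826113830]

PS: the screenshot above shows that the Chinese text was segmented correctly and that the search returned hits.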

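Note: if you want to learn Lucene properly, read Lucene in Action, 2nd edition. Also, 2.9.2 deprecates many methods from earlier versions, so don't bother reading old code samples.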

Here is the code:

Building the index
  1. public static void IndexFile(this IndexWriter writer, IO.FileInfo file)
  2. {
  3.     var watch = new Stopwatch();
  4.     var startTime = DateTime.Now;
  5.     watch.Start();
  6.     Console.WriteLine("Indexing  {0}", file.Name);
  7.     writer.AddDocument(file.GetDocument());
  8.     watch.Stop();
  9.     var timeSpan = DateTime.Now - startTime;
  10.     Console.WriteLine("Indexing Completed! Cost time {0}[{1}]", timeSpan.ToString("c"), watch.ElapsedMilliseconds);
  11.  
  12.   }
  13.  
  14. public static Document GetDocument(this IO.FileInfo file)
  15. {
  16.     var doc = new Document();
  17.     doc.Add(new Field("contents", new IO.StreamReader(file.FullName)));
  18.     doc.Add(new Field("filename", file.Name,
  19.     Field.Store.YES, Field.Index.ANALYZED));
  20.     doc.Add(new Field("fullpath", file.FullName,
  21.     Field.Store.YES, Field.Index.NOT_ANALYZED));
  22.     return doc;
  23. }

Output

Indexing Scott.txt
Indexing Completed! Cost time 00:00:02.4231386[2423]
Indexing 黄金瞳.txt
Indexing Completed! Cost time 00:00:00.0860049[85]
There are 2 doc Indexed!
Index Exit!

Code notes:

Line 14: GetDocument builds the Document for a file. Document is one of Lucene's core classes; here is its definition:

The Document class represents a collection of fields. Think of it as a virtual document—
a chunk of data, such as a web page, an email message, or a text file—that you
want to make retrievable at a later time. Fields of a document represent the document
or metadata associated with that document. The original source (such as a database
record, a Microsoft Word document, a chapter from a book, and so on) of
document data is irrelevant to Lucene. It’s the text that you extract from such binary
documents, and add as a Field instance, that Lucene processes. The metadata (such
as author, title, subject and date modified) is indexed and stored separately as fields
of a document.

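If you don't care about the details, think of a Document as one row in a database table. What a query returns is also a set of Document objects, that is, rows.

Building an index, then, means adding many such records to Lucene.

Lines 19 and 21: the first argument, Field.Store.YES, needs no explanation. The second argument on line 21, Field.Index.NOT_ANALYZED, does not mean the field cannot be searched. It means the whole field value is indexed as a single term without being segmented, so it only matches a search for the complete value.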
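To make the ANALYZED / NOT_ANALYZED difference concrete, here is a small toy model in plain C#. It is not the Lucene.Net API: the ToyIndex class, its method names, the separator list, and the C:\docs\Scott.txt path are all made up for illustration, and the "analyzer" is just a lower-casing split on punctuation (a real analyzer such as PanGu segments text very differently). It only sketches the idea that an analyzed field is stored as several terms while a non-analyzed field is stored as one.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy model only, NOT Lucene.Net: shows which terms end up in the index
// for an "analyzed" field versus a "not analyzed" field.
public sealed class ToyIndex
{
    // Assumed separators for the toy tokenizer; a real analyzer decides this.
    private static readonly char[] Separators = { ' ', '.', '_', '-', '\\', '/', ':' };

    // (field, term) -> ids of the documents that contain the term in that field
    private readonly Dictionary<(string Field, string Term), SortedSet<int>> postings =
        new Dictionary<(string Field, string Term), SortedSet<int>>();

    // analyzed == true : lower-case the value and split it into several terms
    // analyzed == false: index the whole value, unchanged, as a single term
    public void AddField(int docId, string field, string value, bool analyzed)
    {
        string[] terms = analyzed
            ? value.ToLowerInvariant().Split(Separators, StringSplitOptions.RemoveEmptyEntries)
            : new[] { value };

        foreach (string term in terms)
        {
            var key = (field, term);
            if (!postings.TryGetValue(key, out SortedSet<int> docs))
            {
                docs = new SortedSet<int>();
                postings[key] = docs;
            }
            docs.Add(docId);
        }
    }

    // Exact term lookup: returns the ids of documents holding this term in this field.
    public int[] Search(string field, string term)
    {
        return postings.TryGetValue((field, term), out SortedSet<int> docs)
            ? docs.ToArray()
            : new int[0];
    }

    // Small walk-through using a hypothetical file path.
    public static void Demo()
    {
        var index = new ToyIndex();
        index.AddField(0, "filename", "Scott.txt", analyzed: true);
        index.AddField(0, "fullpath", @"C:\docs\Scott.txt", analyzed: false);

        // Analyzed field: a single token is enough to find the document.
        Console.WriteLine(index.Search("filename", "scott").Length);

        // Not analyzed: a fragment of the value finds nothing...
        Console.WriteLine(index.Search("fullpath", "Scott.txt").Length);

        // ...but the complete, unchanged value does.
        Console.WriteLine(index.Search("fullpath", @"C:\docs\Scott.txt").Length);
    }
}
```

Calling ToyIndex.Demo() prints the number of hits for each of the three lookups. The point is the same as in the note above: a non-analyzed field is still searchable, but only by its full value.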

Searching
public ActionResult Index(string keyWord)
{
    var originalKeyWords = keyWord;
    ViewBag.TotalResult = 0;
    ViewBag.Results = new List<KeyValuePair<string, string>>();
    if (string.IsNullOrEmpty(keyWord))
    {
        ViewBag.Message = "Welcome Today!";
        return View("Index");
    }

    var q = keyWord;

    // Open the index in read-only mode.
    var search = new IndexSearcher(_indexDir, true);
    // q = GetKeyWordsSplitBySpace(q, new PanGuTokenizer());

    // Parse the keyword as a query on the "contents" field, analyzed with PanGu.
    var queryParser = new QueryParser(Lucene.Net.Util.Version.LUCENE_29, "contents", new PanGuAnalyzer(false));
    var query = queryParser.Parse(q);
    var hits = search.Search(query, 100); // search.Search(bq, 100);

    var recCount = hits.totalHits;
    ViewBag.TotalResult = recCount;

    // Debug output: the score explanation for each document,
    // followed by a list of every term in the index.
    for (int d = 0; d < search.MaxDoc(); d++)
    {
        ViewBag.Explain += search.Explain(query, d).ToHtml();

        var termReader = search.GetIndexReader().Terms();
        ViewBag.Explain += "<ul >";
        do
        {
            // Term() is null until the enumerator has been advanced once.
            if (termReader.Term() != null)
                ViewBag.Explain += string.Format("<li>{0}</li>", termReader.Term().Text());
        } while (termReader.Next());
        ViewBag.Explain += "</ul>";
    }

    // Load the stored "filename" field of each hit.
    foreach (var hit in hits.scoreDocs)
    {
        try
        {
            var doc = search.Doc(hit.doc);
            var fileName = doc.Get("filename");
            // fileName = highlighter.GetBestFragment(originalKeyWords, fileName);
            // var contents = GetBestFragment(originalKeyWords, new StreamReader(doc.Get("fullpath"), Encoding.GetEncoding("gb2312")));
            (ViewBag.Results as List<KeyValuePair<string, string>>)
                .Add(new KeyValuePair<string, string>(fileName, string.Empty));
        }
        catch (Exception exc)
        {
            Response.Write(exc.Message);
            throw;
        }
    }

    search.Close();

    ViewBag.Message = string.Format("????{0}", keyWord);
    return View("Index");
}

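Follow-up posts will continue with this code, with comments added inline. Writing the explanations outside the code puts them too far from what they describe, and it is tiring as well.

Original post: https://www.cnblogs.com/jinzhao/p/2154229.html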