Thoughts prompted by the Lucene demo

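The org.apache.lucene.demo.IndexFiles class indexes files recursively. Once an IndexWriter has been constructed, Documents can be added to it, and that is where the index is actually built. The code walks each directory; because a directory may itself contain directories, the walk is depth-first, recursing until it reaches the files at the leaf nodes (ordinary files with an extension, such as my.txt). The traversal is implemented by the following method: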

static void indexDocs(IndexWriter writer, File file)
    throws IOException {
  // Only process entries that can be read
  if (file.canRead()) {
    if (file.isDirectory()) { // file is a directory (it may hold files, subdirectories, or nothing)
      String[] files = file.list(); // names of all entries in the directory, subdirectories included
      // list() returns null if an I/O error occurs
      if (files != null) {
        for (int i = 0; i < files.length; i++) { // recurse into each entry (depth-first)
          indexDocs(writer, new File(file, files[i]));
        }
      }
    } else { // leaf node: an ordinary file rather than a directory, so index it
      System.out.println("adding " + file);
      try {
        writer.addDocument(FileDocument.Document(file));
      }
      catch (FileNotFoundException fnfe) {
        // deliberately ignored: skip files that cannot be opened
      }
    }
  }
}
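
The traversal order is easier to see without Lucene in the picture. The following is a minimal sketch using only the JDK; LeafFileWalker, collectLeafFiles and sampleTreeLeafCount are hypothetical names that do not appear in the demo. It follows the same shape as indexDocs, but collects the leaf files into a list where indexDocs would call writer.addDocument.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the Lucene demo.
final class LeafFileWalker {

    /** Depth-first walk: descend into directories and collect readable plain files. */
    static List<File> collectLeafFiles(File file) {
        List<File> leaves = new ArrayList<>();
        walk(file, leaves);
        return leaves;
    }

    private static void walk(File file, List<File> leaves) {
        if (!file.canRead()) {
            return; // same guard as indexDocs: skip unreadable entries
        }
        if (file.isDirectory()) {
            String[] names = file.list(); // entry names; null on an I/O error
            if (names != null) {
                for (String name : names) {
                    // A subdirectory is fully explored before the next sibling is visited.
                    walk(new File(file, name), leaves);
                }
            }
        } else {
            leaves.add(file); // leaf: indexDocs would call writer.addDocument(...) here
        }
    }

    /** Builds a small temporary tree and returns how many leaf files the walk finds. */
    static int sampleTreeLeafCount() {
        try {
            Path root = Files.createTempDirectory("walk-demo");
            Path sub = Files.createDirectory(root.resolve("sub"));
            Files.createDirectory(root.resolve("empty")); // an empty directory contributes nothing
            Files.write(root.resolve("a.txt"), "a".getBytes());
            Files.write(sub.resolve("b.txt"), "b".getBytes());
            Files.write(sub.resolve("c.txt"), "c".getBytes());
            return collectLeafFiles(root.toFile()).size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sampleTreeLeafCount()); // three files are created above
    }
}
```

Each recursive call finishes a whole subdirectory before returning, which is what makes the walk depth-first rather than breadth-first.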

This line in the code above:

writer.addDocument(FileDocument.Document(file));

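actually does a lot of work. Each time the recursion reaches a leaf node, it has an ordinary file rather than a directory, for example myWorld.txt. A fairly involved sequence of operations is then carried out on that file: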

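First, the File object f constructed for myWorld.txt is used to obtain details about the file, such as its path and last-modified time. These details are used to build several Field objects, the Fields are aggregated into a Document, and finally the Document is added to the IndexWriter. The IndexWriter then tokenizes and filters the information held in the Fields of these Documents so that it can be searched.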

The source code of the org.apache.lucene.demo.FileDocument class is as follows:

package org.apache.lucene.demo;

import java.io.File;
import java.io.FileReader;
import org.apache.lucene.document.DateTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class FileDocument {
  public static Document Document(File f)
      throws java.io.FileNotFoundException {
    // Create a new, empty Document
    Document doc = new Document();
    // Build several Fields from the File f and add them all to the Document.
    // First, a Field built from the path of f:
    // "path" is the Field's name, which is how the Field is looked up later.
    // Field.Store.YES stores the value; Field.Index.UN_TOKENIZED (renamed NOT_ANALYZED
    // in Lucene 2.4) indexes the value without tokenizing it, so it stays searchable
    doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.UN_TOKENIZED));
    // A Field holding the last-modified time, at minute resolution
    doc.add(new Field("modified",
        DateTools.timeToString(f.lastModified(), DateTools.Resolution.MINUTE),
        Field.Store.YES, Field.Index.UN_TOKENIZED));
    // A Field whose value is read from a Reader: the text is tokenized and indexed but
    // not stored. The Reader must stay open until the Document has been added to the index
    doc.add(new Field("contents", new FileReader(f)));
    return doc;
  }

  private FileDocument() {}
}
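
The "modified" field stores the timestamp as a string rather than a number. As far as I understand it, DateTools.timeToString at Resolution.MINUTE produces a GMT string of the form yyyyMMddHHmm; treat that format as an assumption and check it against the DateTools Javadoc. The sketch below reproduces that shape with the JDK only (MinuteStamp and toMinuteString are hypothetical names, not Lucene API) to show why such strings are convenient as un-analyzed terms: their lexicographic order matches chronological order.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical stand-in for illustration; this is not Lucene's DateTools.
final class MinuteStamp {

    /** Formats a millisecond timestamp as a GMT yyyyMMddHHmm string (seconds are dropped). */
    static String toMinuteString(long millis) {
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMddHHmm", Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("GMT"));
        return format.format(new Date(millis));
    }

    public static void main(String[] args) {
        String epoch = toMinuteString(0L);
        String oneMinuteLater = toMinuteString(60_000L);
        System.out.println(epoch);          // 197001010000
        System.out.println(oneMinuteLater); // 197001010001
        // String comparison agrees with time order.
        System.out.println(epoch.compareTo(oneMinuteLater) < 0); // true
    }
}
```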

As the code above shows, Field is central to indexing, so it is worth understanding thoroughly.

The Field class defines two very useful enums, Store and Index, which control how a Field is stored and indexed.

/** Specifies whether and how a field should be stored. */
public static enum Store {
  /** Store the original field value in the index. This is useful for short texts
   * like a document's title which should be displayed with the results. The
   * value is stored in its original form, i.e. no analyzer is used before it is
   * stored.
   */
  YES {
    @Override
    public boolean isStored() { return true; }
  },
  /** Do not store the field value in the index. */
  NO {
    @Override
    public boolean isStored() { return false; }
  };
  public abstract boolean isStored();
}

/** Specifies whether and how a field should be indexed. */
public static enum Index {
  /** Do not index the field value. This field can thus not be searched,
   * but one can still access its contents provided it is
   * {@link Field.Store stored}. */
  NO {
    @Override
    public boolean isIndexed() { return false; }
    @Override
    public boolean isAnalyzed() { return false; }
    @Override
    public boolean omitNorms() { return true; }
  },
  /** Index the tokens produced by running the field's
   * value through an Analyzer. This is useful for
   * common text. */
  ANALYZED {
    @Override
    public boolean isIndexed() { return true; }
    @Override
    public boolean isAnalyzed() { return true; }
    @Override
    public boolean omitNorms() { return false; }
  },
  /** Index the field's value without using an Analyzer, so it can be searched.
   * As no analyzer is used the value will be stored as a single term. This is
   * useful for unique Ids like product numbers.
   */
  NOT_ANALYZED {
    @Override
    public boolean isIndexed() { return true; }
    @Override
    public boolean isAnalyzed() { return false; }
    @Override
    public boolean omitNorms() { return false; }
  },
  /** Expert: Index the field's value without an Analyzer,
   * and also disable the storing of norms. Note that you
   * can also separately enable/disable norms by calling
   * {@link Field#setOmitNorms}. No norms means that
   * index-time field and document boosting and field
   * length normalization are disabled. The benefit is
   * less memory usage as norms take up one byte of RAM
   * per indexed field for every document in the index,
   * during searching. Note that once you index a given
   * field <i>with</i> norms enabled, disabling norms will
   * have no effect. In other words, for this to have the
   * above described effect on a field, all instances of
   * that field must be indexed with NOT_ANALYZED_NO_NORMS
   * from the beginning. */
  NOT_ANALYZED_NO_NORMS {
    @Override
    public boolean isIndexed() { return true; }
    @Override
    public boolean isAnalyzed() { return false; }
    @Override
    public boolean omitNorms() { return true; }
  },
  /** Expert: Index the tokens produced by running the
   * field's value through an Analyzer, and also
   * separately disable the storing of norms. See
   * {@link #NOT_ANALYZED_NO_NORMS} for what norms are
   * and why you may want to disable them. */
  ANALYZED_NO_NORMS {
    @Override
    public boolean isIndexed() { return true; }
    @Override
    public boolean isAnalyzed() { return true; }
    @Override
    public boolean omitNorms() { return true; }
  };
  // ... (the rest of the Index enum is omitted from this excerpt)
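
To make the difference between the Index constants concrete, here is a toy model using only the JDK. ToyIndexDemo, ToyIndex and terms are hypothetical names, and the "analyzer" is just lowercasing plus whitespace splitting, which is far simpler than a real Lucene Analyzer. It uses the same constant-specific method bodies as the enum above, and shows that an analyzed value becomes several terms, a non-analyzed value becomes exactly one term, and a non-indexed value produces none.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Toy model for illustration; not Lucene's Field.Index.
final class ToyIndexDemo {

    /** Names mirror three of the real constants; the behavior is a simplification. */
    enum ToyIndex {
        NO {
            @Override
            List<String> terms(String value) {
                return Collections.emptyList(); // not indexed: nothing to search on
            }
        },
        ANALYZED {
            @Override
            List<String> terms(String value) {
                // Stand-in "analysis": lowercase, then split on whitespace.
                List<String> out = new ArrayList<>();
                for (String token : value.toLowerCase(Locale.ROOT).split("\\s+")) {
                    if (!token.isEmpty()) {
                        out.add(token);
                    }
                }
                return out;
            }
        },
        NOT_ANALYZED {
            @Override
            List<String> terms(String value) {
                return Collections.singletonList(value); // the whole value is one term
            }
        };

        /** The searchable terms this mode would produce for a field value. */
        abstract List<String> terms(String value);
    }

    public static void main(String[] args) {
        String value = "My World.txt";
        for (ToyIndex mode : ToyIndex.values()) {
            System.out.println(mode + " -> " + mode.terms(value));
        }
    }
}
```

This is why the demo indexes "path" without analysis: a path is looked up as a whole, so it should remain a single term.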

The Field class has one more nested type, declared as follows:

public static enum TermVector {
  /** Do not store term vectors.
   */
  NO {
    @Override
    public boolean isStored() { return false; }
    @Override
    public boolean withPositions() { return false; }
    @Override
    public boolean withOffsets() { return false; }
  },
  /** Store the term vectors of each document. A term vector is a list
   * of the document's terms and their number of occurrences in that document. */
  YES {
    @Override
    public boolean isStored() { return true; }
    @Override
    public boolean withPositions() { return false; }
    @Override
    public boolean withOffsets() { return false; }
  },
  /**
   * Store the term vector + token position information
   *
   * @see #YES
   */
  WITH_POSITIONS {
    @Override
    public boolean isStored() { return true; }
    @Override
    public boolean withPositions() { return true; }
    @Override
    public boolean withOffsets() { return false; }
  },
  /**
   * Store the term vector + Token offset information
   *
   * @see #YES
   */
  WITH_OFFSETS {
    @Override
    public boolean isStored() { return true; }
    @Override
    public boolean withPositions() { return false; }
    @Override
    public boolean withOffsets() { return true; }
  },
  /**
   * Store the term vector + Token position and offset information
   *
   * @see #YES
   * @see #WITH_POSITIONS
   * @see #WITH_OFFSETS
   */
  WITH_POSITIONS_OFFSETS {
    @Override
    public boolean isStored() { return true; }
    @Override
    public boolean withPositions() { return true; }
    @Override
    public boolean withOffsets() { return true; }
  };
  // ... (the rest of the TermVector enum is omitted from this excerpt)

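This enum concerns terms: it controls whether term vectors are stored for a field and, if so, whether they include token positions and offsets.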
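
The Javadoc above describes a term vector as the list of a document's terms with their occurrence counts, optionally extended with token positions and character offsets. The following JDK-only sketch shows what that information looks like for a short text. ToyTermVector, Occurrence and build are hypothetical names, tokens are simply whitespace-separated, and nothing here reflects how Lucene stores term vectors internally.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model for illustration; not Lucene's term vector implementation.
final class ToyTermVector {

    /** One occurrence of a term: its token position and its start/end character offsets. */
    static final class Occurrence {
        final int position;
        final int startOffset;
        final int endOffset;

        Occurrence(int position, int startOffset, int endOffset) {
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        @Override
        public String toString() {
            return "pos=" + position + " offsets=[" + startOffset + "," + endOffset + ")";
        }
    }

    /** Maps each term to its occurrences; the list size is the term frequency. */
    static Map<String, List<Occurrence>> build(String text) {
        Map<String, List<Occurrence>> vector = new LinkedHashMap<>();
        Matcher matcher = Pattern.compile("\\S+").matcher(text);
        int position = 0;
        while (matcher.find()) {
            Occurrence occurrence = new Occurrence(position, matcher.start(), matcher.end());
            vector.computeIfAbsent(matcher.group(), term -> new ArrayList<>()).add(occurrence);
            position++;
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, List<Occurrence>> vector = build("to be or not to be");
        for (Map.Entry<String, List<Occurrence>> entry : vector.entrySet()) {
            // Term, frequency, then the per-occurrence detail.
            System.out.println(entry.getKey() + " x" + entry.getValue().size() + " " + entry.getValue());
        }
    }
}
```

Read against the enum: YES corresponds roughly to keeping only the terms and their counts, WITH_POSITIONS adds the position value, WITH_OFFSETS adds the two offsets, and WITH_POSITIONS_OFFSETS keeps everything shown here.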

Before Lucene 3.0, Store, Index, and TermVector were declared as static nested classes; starting with 3.0 they are enums.

A Field's value can also take several types. The Field class supports four: String, Reader, byte[], and TokenStream.

Next comes constructing a Field object, for which you should look at its constructors; there are nine of them.

Also note the declaration of the Field class:

public final class Field extends AbstractField implements Fieldable, Serializable

This shows that you should also get to know its superclass, AbstractField. The fields of AbstractField are as follows:

protected String name = "body";
protected boolean storeTermVector = false;
protected boolean storeOffsetWithTermVector = false;
protected boolean storePositionWithTermVector = false;
protected boolean omitNorms = false;
protected boolean isStored = false;
protected boolean isIndexed = true;
protected boolean isTokenized = true;
protected boolean isBinary = false;
protected boolean isCompressed = false;
protected boolean lazy = false;
protected float boost = 1.0f;
protected Object fieldsData = null;

In addition, Field implements the Fieldable interface, which adds methods for inspecting and managing a Field within its Document.