1.5.6 Filters

  Filters consume a TokenStream, so a filter must follow a tokenizer or another filter in an analyzer chain.

<fieldType name="text" class="solr.TextField">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        ...
    </analyzer>
</fieldType>

  The class attribute names a factory class that will instantiate a filter object. Filter factory classes must implement the org.apache.solr.analysis.TokenFilterFactory interface. Like tokenizers, filters are instances of TokenStream and thus are producers of tokens. Unlike tokenizers, a filter also consumes tokens from a TokenStream.

<fieldType name="semicolonDelimited" class="solr.TextField">
    <analyzer type="query">
        <tokenizer class="solr.PatternTokenizerFactory" pattern="; " />
        <filter class="solr.LengthFilterFactory" min="2" max="7" />
    </analyzer>
</fieldType>

The filter factories included in Solr 4.7.2 are described below. For more information, see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters.

ASCII Folding Filter

  This filter converts alphabetic, numeric, and symbolic Unicode characters that are not in the Basic Latin Unicode block (the first 127 ASCII characters) to their ASCII equivalents, if one exists. It converts characters from the following Unicode blocks:

C1 Controls and Latin-1 Supplement
Latin Extended-A
Latin Extended-B
Latin Extended Additional
Latin Extended-C
Latin Extended-D
IPA Extensions
Phonetic Extensions
Phonetic Extensions Supplement
General Punctuation
Superscripts and Subscripts
Enclosed Alphanumerics
Dingbats
Supplemental Punctuation
Alphabetic Presentation Forms
Halfwidth and Fullwidth Forms

Factory class: solr.ASCIIFoldingFilterFactory

Arguments: None

Example:

<analyzer>
    <filter class="solr.ASCIIFoldingFilterFactory" />
</analyzer>

In: "á" (Unicode character 00E1)

Out: "a" (ASCII character 97)
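The folding behavior can be approximated in a few lines of Python. This is an illustrative sketch, not the Lucene implementation (which uses explicit per-character mapping tables); Unicode NFKD decomposition followed by dropping combining marks covers the common accented-Latin cases:

```python
import unicodedata

def ascii_fold(text):
    # Decompose characters (e.g. "á" -> "a" + combining acute accent),
    # then drop the combining marks and any remaining non-ASCII bytes.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(ascii_fold("á"))  # -> "a"
```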

Beider-Morse Filter

  Implements the Beider-Morse Phonetic Matching (BMPM) algorithm, which allows identification of similar names even when they are spelled differently or written in different languages.

  Factory class: solr.BeiderMorseFilterFactory

  Arguments:

    nameType: The type of names. Valid values are GENERIC, ASHKENAZI, or SEPHARDIC. If not processing Ashkenazi or Sephardic names, use GENERIC.

    ruleType: The type of rules to apply. Valid values are APPROX or EXACT.

    concat: Defines whether multiple possible matches should be combined with the pipe character ("|").

    languageSet: The language set to use. The value "auto" allows the filter to identify the language; a comma-separated list is also supported.

  Example:

<analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC"
        ruleType="APPROX" concat="true" languageSet="auto">
    </filter>
</analyzer>

Classic Filter

  This filter takes the output of the Classic Tokenizer and strips periods from acronyms and "'s" from possessives.

  Factory class: solr.ClassicFilterFactory

  Arguments: None

  Example:

<analyzer>
    <tokenizer class="solr.ClassicTokenizerFactory" />
    <filter class="solr.ClassicFilterFactory" />
</analyzer>

  In: "I.B.M. cat's can't"

  Tokenizer to Filter: "I.B.M", "cat's", "can't"

  Out: "IBM", "cat", "can't"
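The behavior shown above can be sketched in plain Python. This is a rough approximation that recognizes acronyms by their shape; the real filter relies on the token types assigned by the Classic Tokenizer rather than on a regex:

```python
import re

def classic_filter(token):
    # Strip periods from acronym-shaped tokens ("I.B.M" -> "IBM").
    if re.fullmatch(r"(?:\w\.)+\w?\.?", token):
        token = token.replace(".", "")
    # Strip a trailing possessive "'s" ("cat's" -> "cat").
    if token.endswith("'s"):
        token = token[:-2]
    return token
```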

Common Grams Filter

  This filter creates word shingles by combining common tokens such as stop words with regular tokens. This is useful for creating phrase queries that contain common words, such as "the cat". Solr normally ignores stop words in queried phrases, so searching for "the cat" would return all matches for the word "cat".

  Factory class: solr.CommonGramsFilterFactory

  Arguments:

    words: The path to a file of common words in .txt format, such as stopwords.txt.

    format: (optional) If the common-word list has been formatted for Snowball, you can specify format="snowball" so Solr can read the file.

    ignoreCase: (boolean) If true, the filter ignores the case of words when comparing them to the common-word file. The default is false.

  Example:

<analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"
        ignoreCase="true" />
</analyzer>

  In: "the Cat"

  Tokenizer to Filter: "the", "Cat"

  Out: "the_cat"
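A simplified sketch of the gram-building step: pairs of adjacent tokens are joined with an underscore whenever either member is a common word. (The actual filter also keeps the unigrams in the stream at their original positions.)

```python
def common_grams(tokens, common_words, ignore_case=True):
    grams = []
    for left, right in zip(tokens, tokens[1:]):
        l = left.lower() if ignore_case else left
        r = right.lower() if ignore_case else right
        if l in common_words or r in common_words:
            grams.append(l + "_" + r)
    return grams

print(common_grams(["the", "Cat"], {"the"}))  # -> ["the_cat"]
```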

Collation Key Filter

  Collation allows sorting of text in a language-sensitive way. It is usually used for sorting, but can also be used with advanced searches. It is covered in more detail in the Unicode Collation section.

Edge N-Gram Filter

  This filter generates edge n-gram tokens of sizes within the given range.

  Factory class: solr.EdgeNGramFilterFactory

  Arguments:

    minGramSize: (integer, default 1) The minimum gram size.

    maxGramSize: (integer, default 1) The maximum gram size.

  Example:

    Default behavior:

<analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.EdgeNGramFilterFactory" />
</analyzer>

  In: "four score and twenty"
  Tokenizer to Filter: "four", "score", "and", "twenty"
  Out: "f", "s", "a", "t"

  Example:

    A range of 1 to 4:

<analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
        maxGramSize="4" />
</analyzer>

  In: "four score"
  Tokenizer to Filter: "four", "score"
  Out: "f", "fo", "fou", "four", "s", "sc", "sco", "scor"

  Example:

    A range of 4 to 6:

<analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="4"  maxGramSize="6" />
</analyzer>

  In: "four score and twenty"
  Tokenizer to Filter: "four", "score", "and", "twenty"
  Out: "four", "scor", "score", "twen", "twent", "twenty"
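The gram generation itself is easy to reproduce. A minimal sketch (prefix-only, matching the filter's default leading-edge behavior):

```python
def edge_ngrams(token, min_size=1, max_size=1):
    # Emit prefixes of the token from min_size up to max_size characters;
    # tokens shorter than min_size produce no grams.
    return [token[:n] for n in range(min_size, min(max_size, len(token)) + 1)]
```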

English Minimal Stem Filter

  This filter stems plural English words to their singular form.

  Factory class: solr.EnglishMinimalStemFilterFactory

  Arguments: None

  Example:

<analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.EnglishMinimalStemFilterFactory" />
</analyzer>

  In: "dogs cats"
  Tokenizer to Filter: "dogs", "cats"
  Out: "dog", "cat"

Hyphenated Words Filter

  This filter reconstructs hyphenated words that have been tokenized as two tokens because of a line break or other intervening whitespace. If a token ends with a hyphen, it is joined with the following token and the hyphen is discarded. Note that for this filter to work properly, the upstream tokenizer must not remove trailing hyphen characters. This filter is generally only useful at index time.

  Factory class: solr.HyphenatedWordsFilterFactory

  Arguments: None

  Example:

<analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.HyphenatedWordsFilterFactory" />
</analyzer>

  In: "A hyphen- ated word"
  Tokenizer to Filter: "A", "hyphen-", "ated", "word"
  Out: "A", "hyphenated", "word"
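The rejoining logic is straightforward to sketch: a token ending in a hyphen is buffered and glued to the next token:

```python
def rejoin_hyphenated(tokens):
    out, pending = [], ""
    for tok in tokens:
        if tok.endswith("-"):
            pending += tok[:-1]        # buffer the fragment, drop the hyphen
        elif pending:
            out.append(pending + tok)  # glue the fragment to the next token
            pending = ""
        else:
            out.append(tok)
    if pending:                        # trailing fragment with no partner
        out.append(pending)
    return out
```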

Keep Words Filter

  This filter discards all tokens except those that are listed in the given word list. This is the inverse of the Stop Words Filter, and can be useful for building a specialized index for a constrained vocabulary.

  Factory class: solr.KeepWordFilterFactory

  Arguments:

    words: (required) The path to a text file containing the list of words to keep, one per line. This may be an absolute path or a simple filename.

    ignoreCase: (true/false) If true, the comparison ignores case; the keep list is assumed to contain only lowercase words. The default is false.

  Example:

    Where keepwords.txt contains:

      happy
      funny
      silly

<analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" />
</analyzer>

  In: "Happy, sad or funny"
  Tokenizer to Filter: "Happy", "sad", "or", "funny"

  Out: "funny"

  Example:

    Same keepwords.txt, with ignoreCase="true":

<analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="true"/>
</analyzer>

  In: "Happy, sad or funny"
  Tokenizer to Filter: "Happy", "sad", "or", "funny"
  Out: "Happy", "funny"

  

  Example:

    Using LowerCaseFilterFactory before filtering for keep words, with no ignoreCase flag:

<analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" />
</analyzer>

  In: "Happy, sad or funny"
  Tokenizer to Filter: "Happy", "sad", "or", "funny"
  Filter to Filter: "happy", "sad", "or", "funny"
  Out: "happy", "funny"
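All three variants reduce to one small function. A sketch, with the keep list assumed lowercase when ignore_case is set, as in the real filter:

```python
def keep_words(tokens, keep, ignore_case=False):
    if ignore_case:
        return [t for t in tokens if t.lower() in keep]
    return [t for t in tokens if t in keep]

keep = {"happy", "funny", "silly"}
print(keep_words(["Happy", "sad", "or", "funny"], keep))                    # -> ["funny"]
print(keep_words(["Happy", "sad", "or", "funny"], keep, ignore_case=True))  # -> ["Happy", "funny"]
```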

KStem Filter

  KStem is an alternative to the Porter Stem Filter for developers looking for a less aggressive stemmer. KStem was written by Bob Krovetz and ported to Lucene by Sergio Guzman-Lara (UMASS Amherst). This stemmer is only appropriate for English language text.

  Factory class: solr.KStemFilterFactory

  Arguments: None

  Example:

<analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.KStemFilterFactory" />
</analyzer>

  In: "jump jumping jumped"

  Tokenizer to Filter: "jump", "jumping", "jumped"

  Out: "jump", "jump", "jump"

Length Filter

  This filter passes tokens whose length falls within the min/max limits specified; all other tokens are discarded.

  Factory class: solr.LengthFilterFactory

  Arguments:

    min: (integer, required) The minimum token length; tokens shorter than this are discarded.

    max: (integer, required, must be >= min) The maximum token length; tokens longer than this are discarded.

  Example:

<analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LengthFilterFactory" min="3" max="7" />
</analyzer>

  In: "turn right at Albuquerque"

  Tokenizer to Filter: "turn", "right", "at", "Albuquerque"

  Out: "turn", "right"
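As a sketch, this filter is just a range check on the token length:

```python
def length_filter(tokens, min_len, max_len):
    # Keep only tokens whose length is within [min_len, max_len].
    return [t for t in tokens if min_len <= len(t) <= max_len]
```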

Lower Case Filter

  Converts any uppercase letters in a token to the equivalent lowercase letters; all other characters are left unchanged.

  Factory class: solr.LowerCaseFilterFactory

  Arguments: None

  Example:

<analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
</analyzer>

  In: "Down With CamelCase"

  Tokenizer to Filter: "Down", "With", "CamelCase"

  Out: "down", "with", "camelcase"

N-Gram Filter

  This filter generates n-gram tokens of sizes within the given range.

  Factory class: solr.NGramFilterFactory

  Arguments:

    minGramSize: (integer, default 1) The minimum gram size.

    maxGramSize: (integer, default 2) The maximum gram size.

  Example:

    Default behavior:

<analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.NGramFilterFactory" />
</analyzer>

  In: "four score"
  Tokenizer to Filter: "four", "score"
  Out: "f", "o", "u", "r", "fo", "ou", "ur", "s", "c", "o", "r", "e", "sc", "co", "or", "re"

  

  Example:

    A range of 1 to 4:

<analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="4" />
</analyzer>

  In: "four score"

  Tokenizer to Filter: "four", "score"

  Out: "f", "fo", "fou", "four", "s", "sc", "sco", "scor"

  Example:

    A range of 3 to 5:

<analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="5" />
</analyzer>

  In: "four score"

  Tokenizer to Filter: "four", "score"

  Out: "fou", "four", "our", "sco", "scor", "score", "cor", "core", "ore"
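A sketch of the gram generation. The emission order varies between Lucene versions; the version below emits by start position and then gram size, which matches the last two examples above:

```python
def ngrams(token, min_size=1, max_size=2):
    grams = []
    for start in range(len(token)):
        for size in range(min_size, max_size + 1):
            if start + size <= len(token):
                grams.append(token[start:start + size])
    return grams
```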

Numeric Payload Token Filter

  This filter adds a numeric floating point payload value to tokens that match a given type. Refer to the Javadoc for the org.apache.lucene.analysis.Token class for more information about token types and payloads.

  Factory class: solr.NumericPayloadTokenFilterFactory

  Arguments:

    payload: (required) A floating point value that will be added to all matching tokens.

    typeMatch: (required) A token type name string; tokens with a matching type name will have their payload set to the above floating point value.

  Example:

<analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.NumericPayloadTokenFilterFactory" payload="0.75" typeMatch="word" />
</analyzer>

  In: "bing bang boom"

  Tokenizer to Filter: "bing", "bang", "boom"

  Out: "bing"[0.75], "bang"[0.75], "boom"[0.75]

Pattern Replace Filter

  This filter applies a regular expression to each token and, for those that match, substitutes the given replacement string in place of the matched pattern. Tokens which do not match are passed through unchanged.

  Factory class: solr.PatternReplaceFilterFactory

  Arguments:

    pattern: (required) The regular expression to test against each token, as per java.util.regex.Pattern.

    replacement: (required) A string to substitute in place of the matched pattern. This string may contain references to capture groups in the pattern; see the Javadoc for java.util.regex.Matcher.

    replace: ("all" or "first", default "all") Indicates whether all occurrences of the pattern in the token should be replaced, or only the first.

  Example:

    Simple string replace:

<analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.PatternReplaceFilterFactory" pattern="cat" replacement="dog" />
</analyzer>

  In: "cat concatenate catycat"

  Tokenizer to Filter: "cat", "concatenate", "catycat"

  Out: "dog", "condogenate", "dogydog"

  Example:

    String replacement, replacing only the first occurrence in each token:

<analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.PatternReplaceFilterFactory" pattern="cat"
        replacement="dog" replace="first" />
</analyzer>

  In: "cat concatenate catycat"

  Tokenizer to Filter: "cat", "concatenate", "catycat"

  Out: "dog", "condogenate", "dogycat"

  Example:

    A more complex pattern: tokens that start with non-numeric characters and end with digits will have an underscore inserted before the digits; otherwise the token is passed through unchanged.

<analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.PatternReplaceFilterFactory" pattern="(\D+)(\d+)$"
        replacement="$1_$2" />
</analyzer>

  In: "cat foo1234 9987 blah1234foo"

  Tokenizer to Filter: "cat", "foo1234", "9987", "blah1234foo"

  Out: "cat", "foo_1234", "9987", "blah1234foo"
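Java and Python agree on this pattern's syntax, so the last example can be reproduced directly. Note the group-reference syntax differs: Java's replacement uses $1 while Python's re uses \1. A sketch:

```python
import re

def pattern_replace(tokens, pattern, replacement, first_only=False):
    count = 1 if first_only else 0  # re.sub: count=0 means replace all occurrences
    return [re.sub(pattern, replacement, t, count=count) for t in tokens]

tokens = ["cat", "foo1234", "9987", "blah1234foo"]
print(pattern_replace(tokens, r"(\D+)(\d+)$", r"\1_\2"))
# -> ["cat", "foo_1234", "9987", "blah1234foo"]
```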

Phonetic Filter

  This filter creates tokens using one of the phonetic encoding algorithms in the org.apache.commons.codec.language package.

  Factory class: solr.PhoneticFilterFactory

  Arguments:

    encoder: (required) The name of the encoder to use. The name must be one of the following (case insensitive): "DoubleMetaphone", "Metaphone", "Soundex", "RefinedSoundex", "Caverphone", or "ColognePhonetic".

    inject: (true/false) If true (the default), then new phonetic tokens are added to the stream; otherwise, tokens are replaced with their phonetic equivalent. Setting this to false will enable phonetic matching, but the exact spelling of the target word may not match.

    maxCodeLength: (integer) The maximum length of the code generated by the encoder.

  Example:

    Default behavior for DoubleMetaphone encoding:

<analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" />
</analyzer>

  In: "four score and twenty"

  Tokenizer to Filter: "four"(1), "score"(2), "and"(3), "twenty"(4)

  Out: "four"(1), "FR"(1), "score"(2), "SKR"(2), "and"(3), "ANT"(3), "twenty"(4), "TNT"(4)

  Example:

    Discard the original token (inject="false"):

<analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="false" />
</analyzer>

  In: "four score and twenty"

  Tokenizer to Filter: "four"(1), "score"(2), "and"(3), "twenty"(4)

  Out: "FR"(1), "SKR"(2), "ANT"(3), "TWNT"(4)

  Example:

    Default Soundex encoder:

<analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.PhoneticFilterFactory" encoder="Soundex" />
</analyzer>

  In: "four score and twenty"

  Tokenizer to Filter: "four"(1), "score"(2), "and"(3), "twenty"(4)

  Out: "four"(1), "F600"(1), "score"(2), "S600"(2), "and"(3), "A530"(3), "twenty"(4), "T530"(4)
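Soundex itself is simple enough to sketch. The version below follows the classic algorithm (first letter kept; b/f/p/v→1, c/g/j/k/q/s/x/z→2, d/t→3, l→4, m/n→5, r→6; vowels reset duplicate suppression, h and w do not) and reproduces the codes shown above; the real filter delegates to Apache Commons Codec:

```python
CODES = {c: d for letters, d in [("bfpv", "1"), ("cgjkqsxz", "2"),
                                 ("dt", "3"), ("l", "4"), ("mn", "5"), ("r", "6")]
         for c in letters}

def soundex(word):
    word = word.lower()
    prev = CODES.get(word[0], "")
    digits = []
    for ch in word[1:]:
        if ch in "hw":
            continue              # h and w do not reset the previous code
        code = CODES.get(ch)
        if code and code != prev:
            digits.append(code)
        prev = code or ""         # vowels reset duplicate suppression
    # First letter, up to three digits, padded with zeros.
    return (word[0].upper() + "".join(digits) + "000")[:4]
```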

Remove Duplicates Token Filter

  This filter removes duplicate tokens in the stream. Tokens are considered duplicates if they have the same text and the same position value.

  Factory class: solr.RemoveDuplicatesTokenFilterFactory

  Arguments: None

  Example:

    One situation where RemoveDuplicatesTokenFilterFactory is useful is combining a synonym file with a stemming filter. In the example below, the synonym file contains the line: Television, Televisions, TV, TVs

<analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" />
    <filter class="solr.EnglishMinimalStemFilterFactory" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>

  In: "Watch TV"

  Tokenizer to Synonym Filter: "Watch"(1) "TV"(2)
  Synonym Filter to Stem Filter: "Watch"(1) "Television"(2) "Televisions"(2) "TV"(2) "TVs"(2)
  Stem Filter to Remove Dups Filter: "Watch"(1) "Television"(2) "Television"(2) "TV"(2) "TV"(2)

  Out: "Watch"(1) "Television"(2) "TV"(2)
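A sketch of the dedup rule: a token is dropped when an identical (text, position) pair has already been seen:

```python
def remove_duplicates(tokens):
    # tokens: list of (text, position) pairs
    seen, out = set(), []
    for text, pos in tokens:
        if (text, pos) not in seen:
            seen.add((text, pos))
            out.append((text, pos))
    return out
```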

Reversed Wildcard Filter

  This filter reverses tokens to provide faster leading wildcard and prefix queries. Tokens without wildcards are not reversed.

  Factory class: solr.ReversedWildcardFilterFactory

  Arguments:

    withOriginal: (boolean) If true, the filter produces both the original and the reversed token at the same position. If false, it produces only the reversed token.

    maxPosAsterisk: (integer, default 2) The maximum position of the asterisk wildcard ('*') that triggers reversal of the query term. Terms with an asterisk beyond this position are not reversed.

    maxPosQuestion: (integer, default 1) The maximum position of the question mark wildcard ('?') that triggers reversal of the query term. To reverse only pure suffix queries (queries with a single leading asterisk), set this to 0 and maxPosAsterisk to 1.

    maxFractionAsterisk: (float, default 0.0) An additional parameter that triggers reversal if the asterisk position is less than this fraction of the query token length.

    minTrailing: (integer, default 2) The minimum number of trailing characters in the query token after the last wildcard character. For good performance this should be set to a value larger than 1.

  Example:

<analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
        maxPosAsterisk="2" maxPosQuestion="1" minTrailing="2"
        maxFractionAsterisk="0" />
</analyzer>

  In: "*foo *bar"

  Tokenizer to Filter: "*foo", "*bar"

  Out: "oof*", "rab*"

Shingle Filter

  This filter constructs shingles, which are token n-grams, from the token stream. It combines runs of tokens into a single token.

  Factory class: solr.ShingleFilterFactory

  Arguments:

    minShingleSize: (integer, default 2) The minimum number of tokens per shingle.

    maxShingleSize: (integer, default 2, must be >= 2) The maximum number of tokens per shingle.

    outputUnigrams: (true/false, default true) If true, then each individual token is also included at its original position.

    outputUnigramsIfNoShingles: (true/false, default false) If true, then individual tokens will be output if no shingles are possible.

    tokenSeparator: (string, default " ") The string to use when joining adjacent tokens to form a shingle.

  Example:

    Default behavior:

<analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.ShingleFilterFactory" />
</analyzer>

  In: "To be, or what?"

  Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "what"(4)

  Out: "To"(1), "To be"(1), "be"(2), "be or"(2), "or"(3), "or what"(3), "what"(4)

  Example:

    A shingle size of four, with the original tokens excluded:

<analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.ShingleFilterFactory" maxShingleSize="4" outputUnigrams="false" />
</analyzer>

  In: "To be, or not to be."

  Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "not"(4), "to"(5), "be"(6)

  Out: "To be"(1), "To be or"(1), "To be or not"(1), "be or"(2), "be or not"(2), "be or not to"(2), "or not"(3), "or not to"(3), "or not to be"(3), "not to"(4), "not to be"(4), "to be"(5)
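A sketch of shingle construction that reproduces both outputs above. Positions are omitted; the real filter also assigns position increments:

```python
def shingles(tokens, max_size=2, separator=" ", output_unigrams=True):
    out = []
    for i, tok in enumerate(tokens):
        if output_unigrams:
            out.append(tok)
        # Join runs of 2..max_size adjacent tokens starting at position i.
        for size in range(2, max_size + 1):
            if i + size <= len(tokens):
                out.append(separator.join(tokens[i:i + size]))
    return out
```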

Standard Filter

  This filter removes dots from acronyms and the substring "'s" from the end of tokens. This filter depends on the tokens being tagged with the appropriate term-type to recognize acronyms and words with apostrophes.

  Factory class: solr.StandardFilterFactory

  Arguments: None

  Note: This filter is no longer operational in Solr versions after 3.1.


Stop Filter

  This filter discards tokens that appear in the given stop words list.

  Factory class: solr.StopFilterFactory

  Arguments:

    words: (optional) The path to a file containing the stop words, one per line.

    format: (optional) If the stop word list has been formatted for Snowball, specify format="snowball" so Solr can read the file.

    ignoreCase: (true/false, default false) Ignore the case of words when comparing them to the stop word list.

  Example:

    Case-sensitive matching; capitalized words are not stopped, and token positions skip the stopped words:

<analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.StopFilterFactory" words="stopwords.txt" />
</analyzer>

  In: "To be or what?"

  Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "what"(4)

  Out: "To"(1), "what"(4)

  

  Example:

    Same stopwords.txt, with ignoreCase="true":

<analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.StopFilterFactory" words="stopwords.txt"
        ignoreCase="true" />
</analyzer>

  In: "To be or what?"

  Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "what"(4)

  Out: "what"(4)
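A sketch covering both examples; each surviving token keeps its original position, so positions skip over the stopped words just as shown:

```python
def stop_filter(tokens, stopwords, ignore_case=False):
    # tokens: list of (text, position) pairs
    if ignore_case:
        stopwords = {w.lower() for w in stopwords}
    return [(t, p) for t, p in tokens
            if (t.lower() if ignore_case else t) not in stopwords]

stream = [("To", 1), ("be", 2), ("or", 3), ("what", 4)]
print(stop_filter(stream, {"to", "be", "or"}))                    # -> [("To", 1), ("what", 4)]
print(stop_filter(stream, {"to", "be", "or"}, ignore_case=True))  # -> [("what", 4)]
```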

Synonym Filter

  This filter does synonym mapping. Each token is looked up in the list of synonyms and, if a match is found, the synonym is emitted in place of the token.

  Factory class: solr.SynonymFilterFactory

  Arguments:

    synonyms: (required) The path to a synonyms file. There are two ways to specify synonym mappings:

    • A comma-separated list of words. If the token matches any of the words, then all the words in the list are substituted, which will include the original token.
    • Two comma-separated lists of words with the symbol "=>" between them. If the token matches any word on the left, then the list on the right is substituted. The original token will not be included unless it also appears in the list on the right.

  Example:

    Where the synonyms file mysynonyms.txt contains:

couch,sofa,divan
teh => the
huge,ginormous,humungous => large
small => tiny,teeny,weeny

<analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.SynonymFilterFactory" synonyms="mysynonyms.txt" />
</analyzer>

  In: "teh small couch"

  Tokenizer to Filter: "teh"(1), "small"(2), "couch"(3)

  Out: "the"(1), "tiny"(2), "teeny"(2), "weeny"(2), "couch"(3), "sofa"(3), "divan"(3)
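Parsing and applying the two mapping forms can be sketched as follows. Expansion here is per-token; the real filter also handles multi-word synonyms and position increments:

```python
def parse_synonyms(lines):
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if "=>" in line:
            # Left-hand words map to the right-hand list only.
            left, right = line.split("=>")
            targets = [w.strip() for w in right.split(",")]
            for word in left.split(","):
                mapping[word.strip()] = targets
        else:
            # Every word in the group maps to the whole group.
            group = [w.strip() for w in line.split(",")]
            for word in group:
                mapping[word] = group
    return mapping

def apply_synonyms(tokens, mapping):
    out = []
    for tok in tokens:
        out.extend(mapping.get(tok, [tok]))
    return out

rules = parse_synonyms(["couch,sofa,divan", "teh => the",
                        "huge,ginormous,humungous => large",
                        "small => tiny,teeny,weeny"])
print(apply_synonyms(["teh", "small", "couch"], rules))
# -> ["the", "tiny", "teeny", "weeny", "couch", "sofa", "divan"]
```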

  Example:

<analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.SynonymFilterFactory" synonyms="mysynonyms.txt" />
</analyzer>

  In: "teh ginormous, humungous sofa"

  Tokenizer to Filter: "teh"(1), "ginormous"(2), "humungous"(3), "sofa"(4)

  Out: "the"(1), "large"(2), "large"(3), "couch"(4), "sofa"(4), "divan"(4)


Token Offset Payload Filter

  This filter adds the numeric character offsets of the token as the payload value for that token.

  Factory class: solr.TokenOffsetPayloadTokenFilterFactory

  Arguments: None

  Example:

<analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.TokenOffsetPayloadTokenFilterFactory" />
</analyzer>

  In: "bing bang boom"

  Tokenizer to Filter: "bing", "bang", "boom"

  Out: "bing"[0,4], "bang"[5,9], "boom"[10,14]

Trim Filter

  This filter trims leading and/or trailing whitespace from tokens. Most tokenizers break tokens at whitespace, so this filter is most often used for special situations.

  Factory class: solr.TrimFilterFactory

  Arguments: None

  Example:

<analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="," />
    <filter class="solr.TrimFilterFactory" />
</analyzer>

  In: "one, two , three ,four "

  Tokenizer to Filter: "one", " two ", " three ", "four "

  Out: "one", "two", "three", "four"

Type As Payload Filter

  This filter adds the token's type, as an encoded byte sequence, as its payload.

  Factory class: solr.TypeAsPayloadTokenFilterFactory

  Arguments: None

  Example:

<analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.TypeAsPayloadTokenFilterFactory" />
</analyzer>

  In: "Pay Bob's I.O.U."

  Tokenizer to Filter: "Pay", "Bob's", "I.O.U."

  Out: "Pay"[<ALPHANUM>], "Bob's"[<APOSTROPHE>], "I.O.U."[<ACRONYM>]

Source: https://www.cnblogs.com/a198720/p/4302735.html