ES Analyzers in Detail

一、Tokenization

1、Purpose: ① word segmentation (splitting text into terms)

      ② normalization (improves recall: the proportion of relevant results a search can actually find)

2、Analyzer components

① character filter: preprocessing before tokenization (strips useless characters, HTML tags, etc., and applies conversions such as & => and, 《Elasticsearch》 => Elasticsearch)

  A、HTML Strip Character Filter (type: html_strip)

    escaped_tags: HTML tags that should be kept rather than stripped

PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip",
          "escaped_tags": ["a"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      }
    }
  }
}
Test the analyzer:

  GET my_index/_analyze
  {
    "analyzer": "my_analyzer",
    "text": "liuyucheng <a><b>edu</b></a>"
  }
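
With the keyword tokenizer the whole filtered text comes back as a single token: <b> is stripped, while the escaped <a> tag survives. The response should look roughly like this (a sketch, offsets approximate):

{
  "tokens": [
    {
      "token": "liuyucheng <a>edu</a>",
      "start_offset": 0,
      "end_offset": 28,
      "type": "word",
      "position": 0
    }
  ]
}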


  B、Mapping Character Filter (type: mapping)

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "٠ => 0",
            "١ => 1",
            "٢ => 2",
            "٣ => 3",
            "٤ => 4",
            "٥ => 5",
            "٦ => 6",
            "٧ => 7",
            "٨ => 8",
            "٩ => 9"
          ]
        }
      }
    }
  }
}
Test the analyzer:

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My license plate is ٢٥٠١٥"
}

  C、Pattern Replace Character Filter: regex-based replacement (type: pattern_replace)

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": ["my_char_filter"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\d+)-(?=\d)",
          "replacement": "$1_"
        }
      }
    }
  }
}
Test the analyzer:

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My credit card is 123-456-789"
}
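
The pattern joins hyphen-separated digit groups with underscores before tokenization ("123-456-789" becomes "123_456_789"), so the standard tokenizer should emit roughly these token values:

My, credit, card, is, 123_456_789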

② tokenizer: splits the character stream into individual tokens

③ token filter: tense conversion, case conversion, synonym expansion, filler-word handling, etc.

        e.g.: has => have,  him => he,  apples => apple,  the/oh/a => dropped

  A、Case: lowercase token filter

GET _analyze
{
  "tokenizer" : "standard",
  "filter" : ["lowercase"],
  "text" : "THE Quick FoX JUMPs"
}

GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "condition",
      "filter": [ "lowercase" ],
      "script": {
        "source": "token.getTerm().length() < 5"
      }
    }
  ],
  "text": "THE QUICK BROWN FOX"
}
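
The second request wraps lowercase in a condition token filter: only tokens whose term is shorter than 5 characters are lowercased. THE and FOX (3 characters each) are converted, while the 5-character QUICK and BROWN pass through unchanged, so the expected token values are roughly:

the, QUICK, BROWN, fox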

  B、Stop words: stopwords token filter

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer":{
          "type":"standard",
          "stopwords":"_english_"
        }
      }
    }
  }
}
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Teacher Ma is in the restroom"
}
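
The standard analyzer lowercases everything, and with stopwords set to _english_ it drops "is", "in" and "the", so the expected token values are roughly:

teacher, ma, restroom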

  C、Tokenizer: standard

GET /my_index/_analyze
{
  "text": "江山如此多娇,小姐姐哪里可以撩",
  "analyzer": "standard"
}
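
The standard tokenizer has no Chinese dictionary, so it falls back to one token per Chinese character (the comma is dropped), roughly:

江, 山, 如, 此, 多, 娇, 小, 姐, 姐, 哪, 里, 可, 以, 撩

This is exactly why the dedicated Chinese analyzers in section 二 are needed.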

  D、Custom analyzer: setting type to custom tells Elasticsearch that we are defining a custom analyzer. Compare this with the way built-in analyzers are configured: there, type is set to the name of a built-in analyzer, such as standard or simple. (Note that the punctuation tokenizer declared below is defined but unused; the analyzer references the standard tokenizer.)

PUT /test_analysis
{
  "settings": {
    "analysis": {
      "char_filter": {
        "test_char_filter": {
          "type": "mapping",
          "mappings": [
            "& => and",
            "| => or"
          ]
        }
      },
      "filter": {
        "test_stopwords": {
          "type": "stop",
          "stopwords": ["is","in","at","the","a","for"]
        }
      },
      "tokenizer": {
        "punctuation": { 
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [
            "html_strip",
            "test_char_filter"
          ],
          "tokenizer": "standard",
          "filter": ["lowercase","test_stopwords"]
        }
      }
    }
  }
}

GET /test_analysis/_analyze
{
  "text": "Teacher ma & zhang also thinks [mother's friends] is good | nice!!!",
  "analyzer": "my_analyzer"
}
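
Walking through the chain: html_strip finds no tags, the mapping filter turns & into "and" and | into "or", the standard tokenizer drops the brackets and punctuation, lowercase runs, and the custom stop filter removes "is". The expected token values are roughly:

teacher, ma, and, zhang, also, thinks, mother's, friends, good, or, nice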

  E、Specify the analyzer when creating the mapping (the my_type mapping type used here is pre-7.x syntax; on ES 7+, use PUT /test_analysis/_mapping without a type)

PUT /test_analysis/_mapping/my_type
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "test_analysis"
    }
  }
}
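
As a quick check that the field picked up the analyzer, _analyze can resolve it from the mapping via the field parameter (the sample text here is made up for illustration):

GET /test_analysis/_analyze
{
  "field": "content",
  "text": "The teacher & the student"
}

Expected token values: teacher, and, student (& is mapped to "and", while both occurrences of "the" are removed by the custom stop filter).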

二、Chinese Analyzers

(1) Chinese analyzers:

  ① IK analyzer: the ES installation path must not contain Chinese characters or spaces

  1) Download: https://github.com/medcl/elasticsearch-analysis-ik

  2) Create the plugin folder: cd your-es-root/plugins/ && mkdir ik

  3) Unzip the plugin into your-es-root/plugins/ik

  4) Restart ES

  ② Two analyzers (compared in the example after this list)

  1) ik_max_word: fine-grained segmentation

  2) ik_smart: coarse-grained segmentation
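
Once IK is installed, the two can be compared directly with _analyze. Per the plugin's README, ik_max_word exhaustively emits every dictionary word it finds (中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 共和国, 共和, 国, 国歌), while ik_smart keeps only the coarsest split (中华人民共和国, 国歌):

GET /_analyze
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国国歌"
}

GET /_analyze
{
  "analyzer": "ik_smart",
  "text": "中华人民共和国国歌"
}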

  ③ IK file descriptions

  1) IKAnalyzer.cfg.xml: IK configuration file

  2) Main dictionary: main.dic

  3) English stop words: stopword.dic (stop words are not added to the inverted index)

  4) Special dictionaries:

  1. quantifier.dic: measure words and units
  2. suffix.dic: suffixes
  3. surname.dic: Chinese surnames (the "Hundred Family Surnames")
  4. preposition.dic: modal particles and other function words

  5) Custom dictionaries: e.g. trending words such as 857, emmm..., 渣女, 舔屏, 996

  6) Hot updates:

  1. Modify the IK analyzer's source code
  2. Use the hot-update scheme IK supports natively: deploy a web server that exposes an HTTP endpoint and signals dictionary changes via the Last-Modified and ETag HTTP response headers (see the config sketch below)
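
For option 2, the remote dictionary locations are configured in IKAnalyzer.cfg.xml. A minimal sketch, assuming a hypothetical endpoint http://your-host/hot_dict that returns one word per line in UTF-8 and updates its Last-Modified / ETag headers whenever the word list changes:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
  <comment>IK Analyzer extension configuration</comment>
  <!-- local extension dictionaries (semicolon-separated paths) -->
  <entry key="ext_dict"></entry>
  <entry key="ext_stopwords"></entry>
  <!-- remote dictionary: IK polls this URL and reloads it when the
       Last-Modified or ETag response header changes (hypothetical endpoint) -->
  <entry key="remote_ext_dict">http://your-host/hot_dict</entry>
  <entry key="remote_ext_stopwords"></entry>
</properties>
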
Original article: https://www.cnblogs.com/lyc-code/p/13686642.html