Elasticsearch Hello World (Part 2)

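Let's start by testing tokenization, especially Chinese word segmentation, which has long been a pain point for traditional databases such as MySQL and SQL Server.

Open a browser and go to http://localhost:5601, click Dev Tools, and enter the following in the Console pane: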

POST _analyze
{
  "analyzer": "standard",
  "text":"Hello World ElasticSearch"
}

The response appears in the right-hand pane:

{
  "tokens": [
    {
      "token": "hello",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "world",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "elasticsearch",
      "start_offset": 12,
      "end_offset": 25,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
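
The same request can also be sent without Kibana. Below is a minimal Python sketch using only the standard library. It assumes the Elasticsearch REST API is listening on http://localhost:9200 (that address is my assumption, only the Kibana URL appears in this post), and the function names are made up for illustration.

```python
import json
import urllib.request

# Assumption: Elasticsearch's REST API is reachable here; adjust for your setup.
ES_URL = "http://localhost:9200"


def build_analyze_request(analyzer, text, base_url=ES_URL):
    """Build the same POST _analyze request that the Console sends."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=base_url + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_tokens(response):
    """Return just the token strings from a parsed _analyze response."""
    return [item["token"] for item in response["tokens"]]


def analyze(analyzer, text, base_url=ES_URL):
    """Send the request and return the token strings (needs a running node)."""
    request = build_analyze_request(analyzer, text, base_url)
    with urllib.request.urlopen(request) as reply:
        return extract_tokens(json.load(reply))
```

With a node running, analyze("standard", "Hello World ElasticSearch") should give back the three tokens shown in the response above.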

Everything looks fine so far. Now let's add some Chinese and see what happens.

POST _analyze
{
  "analyzer": "standard",
  "text":"ElasticSearch是一个很不错的全文检索软件。"
}

The result:

{
  "tokens": [
    {
      "token": "elasticsearch",
      "start_offset": 0,
      "end_offset": 13,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "是",
      "start_offset": 13,
      "end_offset": 14,
      "type": "<IDEOGRAPHIC>",
      "position": 1
    },
    {
      "token": "一",
      "start_offset": 14,
      "end_offset": 15,
      "type": "<IDEOGRAPHIC>",
      "position": 2
    },
    {
      "token": "个",
      "start_offset": 15,
      "end_offset": 16,
      "type": "<IDEOGRAPHIC>",
      "position": 3
    },
    {
      "token": "很",
      "start_offset": 16,
      "end_offset": 17,
      "type": "<IDEOGRAPHIC>",
      "position": 4
    },
    {
      "token": "不",
      "start_offset": 17,
      "end_offset": 18,
      "type": "<IDEOGRAPHIC>",
      "position": 5
    },
    {
      "token": "错",
      "start_offset": 18,
      "end_offset": 19,
      "type": "<IDEOGRAPHIC>",
      "position": 6
    },
    {
      "token": "的",
      "start_offset": 19,
      "end_offset": 20,
      "type": "<IDEOGRAPHIC>",
      "position": 7
    },
    {
      "token": "全",
      "start_offset": 20,
      "end_offset": 21,
      "type": "<IDEOGRAPHIC>",
      "position": 8
    },
    {
      "token": "文",
      "start_offset": 21,
      "end_offset": 22,
      "type": "<IDEOGRAPHIC>",
      "position": 9
    },
    {
      "token": "检",
      "start_offset": 22,
      "end_offset": 23,
      "type": "<IDEOGRAPHIC>",
      "position": 10
    },
    {
      "token": "索",
      "start_offset": 23,
      "end_offset": 24,
      "type": "<IDEOGRAPHIC>",
      "position": 11
    },
    {
      "token": "软",
      "start_offset": 24,
      "end_offset": 25,
      "type": "<IDEOGRAPHIC>",
      "position": 12
    },
    {
      "token": "件",
      "start_offset": 25,
      "end_offset": 26,
      "type": "<IDEOGRAPHIC>",
      "position": 13
    }
  ]
}
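
As far as I understand it, the standard analyzer follows the Unicode text segmentation rules and has no Chinese dictionary, so it cannot tell where one word ends and the next begins. The Python sketch below is only a rough imitation of that behavior, not the real tokenizer: runs of ASCII letters and digits become one lowercased token, and each ideograph in the basic CJK block becomes a token of its own. The pattern and function name are my own.

```python
import re

# Rough approximation only: a run of ASCII letters/digits becomes one token,
# and every CJK ideograph in the basic block becomes its own token.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff]")


def rough_standard_tokens(text):
    """Return (token, start_offset, end_offset) tuples, lowercased."""
    return [
        (match.group().lower(), match.start(), match.end())
        for match in TOKEN_PATTERN.finditer(text)
    ]


tokens = rough_standard_tokens("ElasticSearch是一个很不错的全文检索软件。")
for token, start, end in tokens:
    print(token, start, end)
```

For this sentence the sketch yields the same 14 tokens and offsets as the response above, which is the point: no dictionary, no words.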

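This is clearly unacceptable: every Chinese character is split into its own token, which makes the result basically unusable for search. After some googling, it seems there is also an analyzer option called chinese, but in my test it produced the same result. Further searching shows that people generally use the smartcn or IK Analyzer plugin here. Some articles and books recommend IK Analyzer, but they are all based on older versions of ES, and when I looked at IK Analyzer's GitHub page it appeared to be abandoned. So I'll go with the officially recommended smartcn. Downloading and installing it works the same way as for other plugins, and I still recommend installing from the offline package. After installation, the ES service most likely needs to be restarted for the plugin to take effect. Now let's try again: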
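
If the request complains that the analyzer cannot be found, a quick check is whether the node actually loaded the plugin after the restart. The sketch below reads the cat plugins API and looks for the analysis-smartcn component. It assumes the node is at http://localhost:9200 and that your ES version supports format=json on the cat APIs; both are assumptions on my part, and the function names are hypothetical.

```python
import json
import urllib.request

# Assumption: address of the Elasticsearch node; adjust for your setup.
ES_URL = "http://localhost:9200"


def has_plugin(rows, component):
    """True if any row of the _cat/plugins output names this component."""
    return any(row.get("component") == component for row in rows)


def fetch_plugin_rows(base_url=ES_URL):
    """Fetch the plugin list as parsed JSON (needs a running node)."""
    url = base_url + "/_cat/plugins?format=json"
    with urllib.request.urlopen(url) as reply:
        return json.load(reply)
```

With a node running, has_plugin(fetch_plugin_rows(), "analysis-smartcn") should be True once the plugin is installed and the service has been restarted. After that, the request below should work.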

POST _analyze
{
  "analyzer": "smartcn",
  "text":"ElasticSearch是一个很不错的全文检索软件。"
}

The result:

{
  "tokens": [
    {
      "token": "elasticsearch",
      "start_offset": 0,
      "end_offset": 13,
      "type": "word",
      "position": 0
    },
    {
      "token": "是",
      "start_offset": 13,
      "end_offset": 14,
      "type": "word",
      "position": 1
    },
    {
      "token": "一个",
      "start_offset": 14,
      "end_offset": 16,
      "type": "word",
      "position": 2
    },
    {
      "token": "很",
      "start_offset": 16,
      "end_offset": 17,
      "type": "word",
      "position": 3
    },
    {
      "token": "不错",
      "start_offset": 17,
      "end_offset": 19,
      "type": "word",
      "position": 4
    },
    {
      "token": "的",
      "start_offset": 19,
      "end_offset": 20,
      "type": "word",
      "position": 5
    },
    {
      "token": "全文",
      "start_offset": 20,
      "end_offset": 22,
      "type": "word",
      "position": 6
    },
    {
      "token": "检索",
      "start_offset": 22,
      "end_offset": 24,
      "type": "word",
      "position": 7
    },
    {
      "token": "软件",
      "start_offset": 24,
      "end_offset": 26,
      "type": "word",
      "position": 8
    }
  ]
}

Now this looks much more harmonious. :)
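
One more detail in these responses is worth a look: start_offset and end_offset point back into the original text, which is what features like highlighting rely on. The Python sketch below checks that slicing the text with those offsets gives back each token. It assumes every character is in the Basic Multilingual Plane, so that Python string indices line up with the offsets ES reports; the function name is my own.

```python
def check_offsets(text, tokens):
    """Return the tokens whose offsets do not slice back to the token text."""
    mismatches = []
    for item in tokens:
        fragment = text[item["start_offset"]:item["end_offset"]]
        # The analyzers used here lowercase their output, so compare lowercased.
        if fragment.lower() != item["token"]:
            mismatches.append((item["token"], fragment))
    return mismatches


TEXT = "ElasticSearch是一个很不错的全文检索软件。"

# A few tokens copied from the smartcn response above.
SMARTCN_TOKENS = [
    {"token": "elasticsearch", "start_offset": 0, "end_offset": 13},
    {"token": "一个", "start_offset": 14, "end_offset": 16},
    {"token": "不错", "start_offset": 17, "end_offset": 19},
    {"token": "全文", "start_offset": 20, "end_offset": 22},
    {"token": "检索", "start_offset": 22, "end_offset": 24},
    {"token": "软件", "start_offset": 24, "end_offset": 26},
]

# An empty list means every offset lines up with the original text.
print(check_offsets(TEXT, SMARTCN_TOKENS))
```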