Elasticsearch Study Notes

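There is some online discussion about whether to use a NoSQL store such as MongoDB or ES as the storage layer; the author recommends https://blog.csdn.net/awdac/article/details/78117393 as a reference. In short, ES can be regarded as a smarter acceleration layer than Redis, but it should not be used as the primary data store. In this respect it resembles the caching mechanisms of many databases, such as Oracle's result cache, TimesTen, or MySQL's query cache, except that it targets different scenarios, for example ones that involve semantic search. Its write efficiency is therefore relatively low, and it is much heavier than Redis.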

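Wikipedia uses Elasticsearch for full-text search.
GitHub uses Elasticsearch to search code.
It is built on Lucene: if Elasticsearch is compared to SQL, Lucene plays the role of the RDBMS engine.
It is written in Java.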

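Start in the background (daemon mode): ./bin/elasticsearch -d
http://localhost:9200/?pretty shows the version and other basic information.
The configuration file is config/elasticsearch.yml.
ES is natively clustered, similar to RocketMQ and Kafka.
Nodes communicate with each other on port 9300.
Request format: curl -X<VERB> '<PROTOCOL>://<HOST>/<PATH>?<QUERY_STRING>' -d '<BODY>', where BODY is a JSON-encoded request body.
Elasticsearch uses JSON as its serialization format.
The correspondence between a relational database and ES is as follows: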
Relational DB ⇒ Databases ⇒ Tables ⇒ Rows ⇒ Columns
Elasticsearch ⇒ Indices ⇒ Types ⇒ Documents ⇒ Fields
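An ES cluster contains multiple indices. An index is a logical namespace that points to one or more shards, comparable to a segment in Oracle. A shard is a single Lucene instance and is the unit by which Elasticsearch distributes data across the cluster. Elasticsearch automatically migrates shards between nodes as the cluster grows or shrinks. A shard is either a primary or a replica. This is the same cluster management model as Couchbase. By default (in ES versions before 7.0), an index has 5 primary shards.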

Create an index:
PUT http://localhost:9200/blogs
{
"settings" : {
"number_of_shards" : 3,
"number_of_replicas" : 1
}
}
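As a rough illustration of how the index-creation request above might be issued from code, the following Python sketch builds the same PUT request with the standard library. The helper name build_create_index_request is hypothetical, and the request is only constructed, not sent, so the snippet runs without a live node; sending it assumes an ES node listening on localhost:9200.

```python
import json
import urllib.request


def build_create_index_request(base_url, index, shards, replicas):
    """Build (but do not send) a PUT request that creates an index.

    The helper name and parameters are illustrative, not part of any ES client.
    """
    body = {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        }
    }
    return urllib.request.Request(
        url=f"{base_url}/{index}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = build_create_index_request("http://localhost:9200", "blogs", 3, 1)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# To actually send it (assumes a running node at that address):
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```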
Create a document, for example with Postman: PUT http://localhost:9200/megacorp/employee/1 -d '{
"first_name" : "John",
"last_name" : "Smith",
"age" : 25,
"about" : "I love to go rock climbing",
"interests": [ "sports", "music" ]
}'
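Response: {"_index":"megacorp","_type":"employee","_id":"2","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"created":true}
If no ID is specified, ES generates one automatically. Writing again to an existing ID updates the document instead and returns, for example: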
{
"_index": "megacorp",
"_type": "employee",
"_id": "1",
"_version": 13,
"result": "updated",
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"created": false
}
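_version counts the number of times the document has been changed. As a rule, IDs should not be auto-generated.
The shard that stores a document is determined by the following formula: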
shard = hash(routing) % number_of_primary_shards
routing defaults to the document's _id.
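The routing formula can be tried out with a small Python sketch. This is only an illustration: CRC32 is used here as a stand-in hash (Elasticsearch uses its own hash function internally), and pick_shard is a hypothetical helper name.

```python
import zlib


def pick_shard(routing, number_of_primary_shards):
    """Apply shard = hash(routing) % number_of_primary_shards.

    CRC32 is a stand-in for the hash that Elasticsearch really uses.
    """
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


# routing defaults to the document's _id
for doc_id in ["1", "2", "3", "AV3Kp7BqVnBASvmzDScd"]:
    print(doc_id, "->", pick_shard(doc_id, 5))
```

Because the result depends on number_of_primary_shards, the same _id would generally map to a different shard if that number changed, which is one reason the primary shard count is fixed when the index is created.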
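By default replication=sync, and the number of replicas is 1.
The PUT request above automatically creates the index megacorp, with type employee and document ID 1.
Retrieve a document: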
GET http://localhost:9200/megacorp/employee/1
If the document exists:
{"_index":"megacorp","_type":"employee","_id":"1","_version":1,"found":true,"_source":{
"first_name" : "John",
"last_name" : "Smith",
"age" : 25,
"about" : "I love to go rock climbing",
"interests": [ "sports", "music" ]
}}
_source contains the original JSON document.
GET http://localhost:9200/megacorp/employee/111
If the document does not exist:
{"_index":"megacorp","_type":"employee","_id":"111","found":false}
The HTTP status code of the response is also 404.
Retrieve only specific fields:
GET http://localhost:9200/megacorp/employee/1?_source=first_name
{
"_index": "megacorp",
"_type": "employee",
"_id": "1",
"_version": 13,
"found": true,
"_source": {
"first_name": "John"
}
}

Delete a document:
DELETE http://localhost:9200/megacorp/employee/111
{"found":true,"_index":"megacorp","_type":"employee","_id":"1","_version":2,"result":"deleted","_shards":{"total":2,"successful":1,"failed":0}}

Exact lookups alone do not justify using ES; fuzzy (full-text) search is what really matters.
/_search
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 5,
"max_score": 1,
"hits": [
{
"_index": "megacorp",
"_type": "employee",
"_id": "AV3Kp7BqVnBASvmzDScd",
"_score": 1,
"_source": {
"first_name": "John",
"last_name": "Smith",
"age": 25,
"about": "I love to go rock climbing",
"interests": [
"sports",
"music"
]
}
},
{
"_index": "megacorp",
"_type": "employee",
"_id": "2",
"_score": 1,
"_source": {
"first_name": "Jane",
"last_name": "Smith",
"age": 32,
"about": "I like to collect rock albums",
"interests": [
"music"
]
}
},
{
"_index": "megacorp",
"_type": "employee",
"_id": "AV3Kp5hsVnBASvmzDScc",
"_score": 1,
"_source": {
"first_name": "John",
"last_name": "Smith",
"age": 25,
"about": "I love to go rock climbing",
"interests": [
"sports",
"music"
]
}
},
{
"_index": "megacorp",
"_type": "employee",
"_id": "1",
"_score": 1,
"_source": {
"first_name": "John",
"last_name": "Smith",
"age": 25,
"about": "I love to go rock climbing",
"interests": [
"sports",
"music"
]
}
},
{
"_index": "megacorp",
"_type": "employee",
"_id": "3",
"_score": 1,
"_source": {
"first_name": "Douglas",
"last_name": "Fir",
"age": 35,
"about": "I like to build cabinets",
"interests": [
"forestry"
]
}
}
]
}
}
By default, hits contains the first 10 matching documents, ordered by _score from high to low. To paginate, add the size and from parameters:
http://localhost:9200/megacorp/employee/_search?size=2&from=2
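The relationship between a page number and the size/from parameters can be sketched as follows. The helper search_url and its 1-based page argument are assumptions made for illustration; only size and from come from the notes.

```python
from urllib.parse import urlencode


def search_url(base, page, page_size):
    """Build a paginated _search URL.

    page is 1-based; from is the number of hits to skip before the page starts.
    """
    params = {"size": page_size, "from": (page - 1) * page_size}
    return f"{base}/_search?{urlencode(params)}"


# Page 2 with 2 hits per page skips the first 2 hits.
print(search_url("http://localhost:9200/megacorp/employee", 2, 2))
```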
Searching across all fields is true full-text search:
http://localhost:9200/megacorp/employee/_search?q=John Behind the scenes this queries every field, through an implicit _all field of type string.
For the full syntax, see https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#query-string-syntax

Structure of a type (mapping, i.e. schema definition):
GET http://localhost:9200/megacorp/_mapping/employee
{
"megacorp": {
"mappings": {
"employee": {
"properties": {
"about": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"age": {
"type": "long"
},
"first_name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"interests": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"fielddata": true
},
"last_name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
}
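ES automatically infers the most suitable type, such as text, long, or date. ES is in fact strongly typed: if a long is inappropriately mapped as a string, full-text search produces unexpected results. Beyond the defaults, the mapping of a field can be customized, most commonly through index (which controls whether the field supports exact matching, full-text matching, or no search at all) and analyzer (which names the analyzer to use). However, the mapping of an existing field cannot be changed; it can only be specified when the index is created or when a new field is added.
Lucene does not support storing null values.
In elasticsearch.yml under the config folder, append the following at the end of the file: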
http.cors.enabled: true
http.cors.allow-origin: "*"
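This allows access from web pages via AJAX.
Difference between query DSL and filter DSL: a query is used for full-text search and produces a _score, while a filter is used for exact matching.
text fields distinguish between exact matching and full-text search; long, date, and _id do not.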
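To make the query/filter distinction concrete, here is a small Python sketch that assembles a request body combining both. It assumes the bool query with must and filter clauses (available in ES 2.0 and later); build_search_body is a hypothetical helper, and the field names are borrowed from the megacorp example.

```python
import json


def build_search_body(text_field, text, exact_field, exact_value):
    """Combine a scored full-text clause with a non-scoring exact-match filter."""
    return {
        "query": {
            "bool": {
                # query context: analyzed full-text match, contributes to _score
                "must": {"match": {text_field: text}},
                # filter context: exact yes/no match, does not affect _score
                "filter": {"term": {exact_field: exact_value}},
            }
        }
    }


body = build_search_body("about", "rock climbing", "age", 25)
print(json.dumps(body, indent=2))
```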
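Elasticsearch builds an inverted index over every word of every text field.
Without analysis, matching distinguishes upper from lower case and singular from plural forms, whereas we usually want search to be insensitive to both; matching Chinese text is a further concern. This is what analyzers are for. The default is the standard analyzer, which splits text according to Unicode Text Segmentation. ES natively supports the language analyzers listed at https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html (Chinese is not among them), so by default every Chinese character becomes its own term. If a field should not use the default analyzer, it must be configured manually by declaring a mapping on that field (also called a schema definition, the equivalent of DDL).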
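The effect described above can be imitated with a toy tokenizer. This is a deliberately simplified sketch, not the real standard analyzer: it lowercases the text, keeps runs of ASCII letters and digits as terms, and emits each character in the basic CJK range as its own term. toy_analyze is a hypothetical name.

```python
import re


def toy_analyze(text):
    """Very rough imitation of default analysis.

    Lowercases, keeps runs of ASCII letters/digits as single terms, and
    turns every CJK ideograph into a separate term.
    """
    return re.findall(r"[a-z0-9]+|[\u4e00-\u9fff]", text.lower())


print(toy_analyze("I love Rock Climbing"))
print(toy_analyze("全文检索 Elasticsearch"))
```

Splitting Chinese into single characters is why a dedicated analyzer such as elasticsearch-analysis-ik is usually installed for Chinese text.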
Query conditions are expressed in the query DSL, which is written as JSON. Every query result carries a _score that indicates how well it matches.

Fielddata is disabled on text fields by default. Set fielddata=true on [interests] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead.
Solution: http://blog.csdn.net/u011403655/article/details/71107415
https://www.elastic.co/guide/en/elasticsearch/reference/current/fielddata.html
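In a cluster, one node is elected master. It is responsible for cluster-wide management, such as creating or deleting indices and adding or removing nodes, but it does not take part in document-level operations.
Check the ES cluster status: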
GET http://localhost:9200/_cluster/health
{
"cluster_name": "elasticsearch",
"status": "yellow",
"timed_out": false,
"number_of_nodes": 1,
"number_of_data_nodes": 1,
"active_primary_shards": 5,
"active_shards": 5,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 5,
"delayed_unassigned_shards": 0,
"number_of_pending_tasks": 0,
"number_of_in_flight_fetch": 0,
"task_max_waiting_in_queue_millis": 0,
"active_shards_percent_as_number": 50
}
The most important field is status.
Possible values:
green
All primary and replica shards are active.
yellow
All primary shards are active, but not all replica shards are active. (In a single-node environment, replica shards cannot be assigned, so they serve no purpose.)
red
Not all primary shards are active.
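The three status values follow directly from the shard counts, which the following sketch spells out. cluster_status is a hypothetical helper that only mirrors the definitions above; it is not how ES computes the value internally.

```python
def cluster_status(primaries_active, primaries_total, replicas_active, replicas_total):
    """Derive green/yellow/red from shard counts, following the definitions above."""
    if primaries_active < primaries_total:
        return "red"      # at least one primary shard is not active
    if replicas_active < replicas_total:
        return "yellow"   # all primaries active, some replicas are not
    return "green"        # every primary and replica shard is active


# Single node with 5 primaries active and 5 replicas unassigned,
# as in the sample health output above.
print(cluster_status(5, 5, 0, 5))
```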
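When a second node is started, it automatically joins the cluster that has the same cluster.name. After a node fails, Elasticsearch can automatically elect a new master node and promote replica shards to primaries, so that service resumes.
In Elasticsearch, every field of a document is indexed by default, and all of those indexes can be used within a single query.
Document metadata includes:
_index: must be lowercase, must not contain commas, and must not begin with an underscore
_type: each type has its own schema definition, called a mapping
_id: uniquely identifies a document within a type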

By default, ES sorts results by relevance. To sort by a field instead, specify it as follows:
GET /_search
{
"query" : {
"bool" : {
"filter" : { "term" : { "user_id" : 1 }}
}
},
"sort": { "date": { "order": "desc" }}
}
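When sorting is not based on relevance, _score is not calculated. Computing _score is expensive, so once sort is specified it is skipped by default; setting track_scores=true forces it to be calculated.
Multi-level sorting: first by date, then by relevance.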
GET /_search
{
"query" : {
"bool" : {
"must": { "match": { "tweet": "manage text search" }},
"filter" : { "term" : { "user_id" : 2 }}
}
},
"sort": [
{ "date": { "order": "desc" }},
{ "_score": { "order": "desc" }}
]
}
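The two-level ordering requested above (date descending, then _score descending) behaves like a sort on a composite key. The Python sketch below shows the idea on made-up hits; the ids, dates, and scores are hypothetical sample data, not output from ES.

```python
# Hypothetical hits; ISO-formatted dates sort correctly as plain strings.
hits = [
    {"id": "a", "date": "2014-09-22", "score": 0.4},
    {"id": "b", "date": "2014-09-24", "score": 0.1},
    {"id": "c", "date": "2014-09-22", "score": 0.9},
]

# Primary key: date (desc). Secondary key: score (desc), used only to break ties.
ordered = sorted(hits, key=lambda h: (h["date"], h["score"]), reverse=True)

print([h["id"] for h in ordered])
```

The newest document comes first even though its score is the lowest; the score only decides the order of the two documents that share a date.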

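Sorting on a full-text (analyzed) field is not meaningful; relevance is normally used instead.
ES keeps as much data in memory as possible to improve performance.
An ES query is a distributed search and runs in two phases, query and fetch. In the query phase, the request is broadcast to every shard, and each shard returns its top N matching documents according to the sort criteria.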
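The query phase can be pictured as a scatter-gather: each shard computes its own top N, and the node coordinating the request merges those partial lists into a global top N before fetching the documents. The following is a simplified sketch with hypothetical in-memory shards and scores; shard_top_n and coordinate are illustrative names.

```python
import heapq


def shard_top_n(docs, n):
    """What one shard does: return its local top n by score."""
    return heapq.nlargest(n, docs, key=lambda d: d["score"])


def coordinate(shards, n):
    """What the coordinating node does: merge per-shard results into a global top n."""
    candidates = []
    for docs in shards:
        candidates.extend(shard_top_n(docs, n))
    return heapq.nlargest(n, candidates, key=lambda d: d["score"])


# Hypothetical shard contents (only ids and scores are needed in the query phase).
shards = [
    [{"id": "a", "score": 0.9}, {"id": "b", "score": 0.2}],
    [{"id": "c", "score": 0.7}, {"id": "d", "score": 0.8}],
    [{"id": "e", "score": 0.1}],
]

top = coordinate(shards, 2)
print([d["id"] for d in top])
```

Every shard has to return N candidates because the coordinator cannot know in advance which shard holds the globally best hits.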
Check status at the index and shard level:
GET _cluster/health?level=indices
GET _cluster/health?level=shards
Node statistics:
http://localhost:9200/_nodes/stats
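Do not give any JVM more than 32 GB of heap; staying within 30 GB is best. Split the machine's memory roughly in half between Elasticsearch and Lucene: the former uses the JVM heap, the latter uses the OS filesystem cache. If this limit leads to running several nodes on one large machine, set cluster.routing.allocation.same_shard.host: true to preserve HA, so that a primary shard and its replica are never allocated to the same machine.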
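The sizing rule above amounts to a one-line calculation. The sketch below simply encodes "half of RAM, capped at about 30 GB"; suggested_heap_gb and its default cap are taken from the rule of thumb in these notes and are not an official formula.

```python
def suggested_heap_gb(total_ram_gb, cap_gb=30):
    """Half of the machine's RAM for the JVM heap, capped at cap_gb.

    The remaining memory is left to the OS filesystem cache for Lucene.
    """
    return min(total_ram_gb / 2, cap_gb)


for ram in (16, 64, 128):
    print(ram, "GB RAM ->", suggested_heap_gb(ram), "GB heap")
```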
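Aggregations are carried out through a data structure called fielddata. Fielddata is the largest consumer of memory in an Elasticsearch cluster, so it must be understood thoroughly.
Fielddata is somewhat like the data blocks of an RDBMS, except that it is keyed by row (document), and it is loaded into memory on demand. Fielddata exists because inverted indices are not a silver bullet: they are good at finding the documents that contain a given term, but they are of little help in the opposite direction, determining which terms a given document contains, and aggregations need exactly this second access pattern.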
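The two access patterns can be contrasted with plain dictionaries. This is a conceptual sketch only, not how Lucene lays out its data: the inverted index maps term to documents, and "uninverting" it yields a document-to-terms structure that an aggregation can walk. The interests values are borrowed from the megacorp example; uninvert and terms_aggregation are hypothetical names.

```python
from collections import defaultdict

# interests of employees 1, 2 and 3 from the megacorp example
docs = {
    1: ["sports", "music"],
    2: ["music"],
    3: ["forestry"],
}

# Inverted index: term -> ids of documents containing it (good for search).
inverted = defaultdict(set)
for doc_id, terms in docs.items():
    for term in terms:
        inverted[term].add(doc_id)


def uninvert(inverted_index):
    """Build the fielddata-like structure: doc id -> terms (good for aggregations)."""
    doc_to_terms = defaultdict(set)
    for term, doc_ids in inverted_index.items():
        for doc_id in doc_ids:
            doc_to_terms[doc_id].add(term)
    return doc_to_terms


fielddata = uninvert(inverted)


def terms_aggregation(matching_doc_ids):
    """Count how often each term occurs among the matching documents."""
    counts = defaultdict(int)
    for doc_id in matching_doc_ids:
        for term in fielddata[doc_id]:
            counts[term] += 1
    return dict(counts)


print(sorted(inverted["music"]))   # search: which documents contain "music"?
print(terms_aggregation([1, 2]))   # aggregation: which terms do documents 1 and 2 contain?
```

Holding the document-to-terms structure for a large text field in memory is what makes fielddata expensive, which is why the error message quoted earlier suggests using a keyword field instead.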

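Installing ES on Linux
vi elasticsearch.yml
network.host: 0.0.0.0   # otherwise ES is reachable only from the local machine
ES cannot be run as root, so create a dedicated group and user: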
groupadd es
useradd -g es es
[2016-12-20T22:37:28,552][ERROR][o.e.b.Bootstrap ] [elk-node1] node validation exception
bootstrap checks failed
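Solution: use CentOS 7, where this problem does not occur.
system call filters failed to install; check the logs and fix your configuration or disable system call filters at your own risk
Cause:
CentOS 6 does not support seccomp, while ES 5.2.0 defaults bootstrap.system_call_filter to true and checks for it. The check fails, and the failure prevents ES from starting.
Solution:
In elasticsearch.yml, set bootstrap.system_call_filter to false; note that it should be placed below the Memory section: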
bootstrap.memory_lock: false
bootstrap.system_call_filter: false

vi /etc/security/limits.conf
Add the following:
* soft nofile 65536
* hard nofile 131072
* soft nproc 2048
* hard nproc 4096
vi /etc/sysctl.conf
Add the following setting:
vm.max_map_count=655360
Then run:
sysctl -p
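Then restart Elasticsearch; it should now start successfully.
elasticsearch-analysis-ik: to install, copy it into the ES_HOME/plugins directory and name the folder ik.
elasticsearch-analysis-pinyin: to install, copy it into the ES_HOME/plugins directory and name the folder pinyin.
For installing elasticsearch-head, see http://mobz.github.io/elasticsearch-head/ . On RHEL 7 and Windows there are no problems. On RHEL 6 installation is troublesome, especially installing Node.js and npm: gcc has to be upgraded to 4.8, otherwise Node.js v6+ cannot be installed, and with 0.6.x npm causes all sorts of trouble. In practice it is not very useful anyway, since the CLI can retrieve all the necessary information.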