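A Quick Introduction to Elasticsearch

1. Introduction

Elasticsearch is a distributed, scalable, real-time search and data analytics engine built on top of Apache Lucene. Lucene is arguably the most advanced, high-performance, full-featured search engine library available today, open source or proprietary. Elasticsearch packages all of this functionality into a standalone service that your programs talk to through a simple RESTful API, so you can write the client in whatever programming language you prefer.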

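2. Use Cases

eBay runs hundreds of Elasticsearch clusters internally, with more than 4,000 data nodes in total. In eBay's production environment these clusters support services in many areas, including order search, product recommendation, centralized log management, risk control, IT operations, and security monitoring.

Example scenarios:

When you search on GitHub, Elasticsearch not only helps you find relevant repositories but also provides code-level search with highlighting.
When you shop online, Elasticsearch can recommend related products.
When you take a ride home after work, Elasticsearch can locate nearby passengers and drivers to help the platform optimize dispatching.
Wikipedia uses Elasticsearch to provide full-text search with highlighted snippets.

Beyond search, the Elastic Stack (Elasticsearch together with Kibana, Logstash, and Beats) is also widely used for near-real-time big data analytics, including log analysis, metrics monitoring, and information security. It helps you explore massive volumes of structured and unstructured data, build visual reports on demand, set alert thresholds on monitoring data, and even detect anomalies automatically using machine learning.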

3. Single-Instance Installation

Required packages:

elasticsearch-7.10.2-linux-x86_64.tar.gz
elasticsearch-analysis-ik-7.10.2.zip
elasticsearch-analysis-pinyin-7.10.2.zip
kibana-7.10.2-linux-x86_64.tar.gz

Host parameter settings (/etc/sysctl.conf):

# sysctl settings are defined through files in
# /usr/lib/sysctl.d/, /run/sysctl.d/, and /etc/sysctl.d/.
#
# Vendors settings live in /usr/lib/sysctl.d/.
# To override a whole file, create a new file with the same in
# /etc/sysctl.d/ and put new settings there. To override
# only specific settings, add a file with a lexically later
# name in /etc/sysctl.d/ and put new settings there.
#
# For more information, see sysctl.conf(5) and sysctl.d(5).
net.ipv4.tcp_tw_reuse = 0
net.ipv4.tcp_tw_recycle = 0
net.ipv4.tcp_fin_timeout = 5
net.ipv4.tcp_keepalive_time = 15
net.ipv4.ip_local_port_range = 21000 61000
fs.file-max = 6553600
kernel.sem = 250 32000 100 128
net.ipv4.conf.all.accept_redirects = 0
net.core.somaxconn = 32768
vm.max_map_count = 524288

Apply the settings: sysctl -p
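
Of the settings above, vm.max_map_count is the one Elasticsearch itself checks at startup: its bootstrap check (enforced when the node binds to a non-loopback address) requires at least 262144. As a rough illustration, the following Python sketch parses sysctl-style text and tests that one value. The function names parse_sysctl and max_map_count_ok are made up for this example, and the parser handles only simple key = value lines.

```python
# Minimum vm.max_map_count required by the Elasticsearch bootstrap check.
ES_MIN_MAX_MAP_COUNT = 262144


def parse_sysctl(text):
    """Parse 'key = value' lines, skipping blanks and comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def max_map_count_ok(text, minimum=ES_MIN_MAX_MAP_COUNT):
    """Return True if vm.max_map_count is present and large enough."""
    value = parse_sysctl(text).get("vm.max_map_count")
    return value is not None and int(value) >= minimum


# Hypothetical excerpt mirroring two of the settings listed above.
sample = """
# For more information, see sysctl.conf(5) and sysctl.d(5).
fs.file-max = 6553600
vm.max_map_count = 524288
"""
print(max_map_count_ok(sample))
```

To check a real host, the same function could be fed the contents of /etc/sysctl.conf, keeping in mind that files under /etc/sysctl.d/ may override it.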

Host parameter settings (/etc/security/limits.conf):

*  soft  nofile   1048576
*  hard  nofile   1048576
*  soft  nproc    65536
*  hard  nproc    65536
*  soft  memlock  unlimited
*  hard  memlock  unlimited
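
These limits only apply to new login sessions, so it is worth confirming what a process actually received. A minimal sketch using Python's standard resource module follows; the thresholds 65535 (open files) and 4096 (threads) are the minimums the Elasticsearch bootstrap checks are documented to require, and the helper names meets and check_limits are hypothetical.

```python
import resource

# Assumed minimums taken from the Elasticsearch bootstrap checks.
REQUIRED = {
    "nofile": (resource.RLIMIT_NOFILE, 65535),
    "nproc": (resource.RLIMIT_NPROC, 4096),
}


def meets(limit, required):
    """Return True if a soft limit is unlimited or at least `required`."""
    return limit == resource.RLIM_INFINITY or limit >= required


def check_limits():
    """Map each limit name to (soft_limit, ok) for the current process."""
    report = {}
    for name, (res, required) in REQUIRED.items():
        soft, _hard = resource.getrlimit(res)
        report[name] = (soft, meets(soft, required))
    return report


for name, (soft, ok) in check_limits().items():
    print(f"{name}: soft={soft} ok={ok}")
```

Run it as the es user after logging in again; a value that fails here would also be reported by Elasticsearch at startup.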

Directory layout:

.
|-- bin
|   |-- schema
|   |-- start-es.sh
|   |-- start-kibana.sh
|   |-- stop-es.sh
|   `-- sync
|-- data -> /data/es-data
|-- etc
|-- lib
|   |-- ojdbc8-19.8.0.0.jar
|   `-- orai18n-19.8.0.0.jar
|-- logs
|-- sbin
`-- support
    |-- elasticsearch-7.10.2
    |-- es -> elasticsearch-7.10.2
    |-- kibana -> kibana-7.10.2-linux-x86_64
    |-- kibana-7.10.2-linux-x86_64
    |-- logstash -> logstash-7.10.2
    `-- logstash-7.10.2

.bash_profile settings:

# .bash_profile

# Get the aliases and functions
if [ -f ~/.bashrc ]; then
    . ~/.bashrc
fi

# +-------------------------------------+
# |      AI'S PROFILE, DON'T MODIFY!    |
# +-------------------------------------+
alias grep='grep --colour=auto'
alias vi='vim'
alias ll='ls -l'
alias ls='ls --color=auto'
alias mv='mv -i'
alias rm='rm -i'
alias ups='ps -u `whoami` -f'

export ES_HOME=${HOME}/support/es
export JAVA_HOME=${ES_HOME}/jdk
export PS1="\[\033[01;32m\]\u@\h\[\033[01;34m\] \w \$\[\033[00m\] "
export TERM=linux
export EDITOR=vim
export PATH=${HOME}/bin:${HOME}/sbin:${JAVA_HOME}/bin:${ES_HOME}/bin:${HOME}/support/logstash/bin:$PATH
export LANG=zh_CN.utf8
export TMOUT=3000
export HISTSIZE=1000

Adjust the JVM heap size for your environment: ~/support/es/config/jvm.options

-Xms16g
-Xmx16g
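
The 16g value suits a host with roughly 32 GB of RAM. A common rule of thumb, also stated in the comments of elasticsearch.yml, is to give the heap about half of physical memory; Elastic additionally advises staying below the compressed-object-pointer threshold near 32 GB. The sketch below encodes that rule. The 31 GB cap and the function names are assumptions for illustration, not values taken from this setup.

```python
def recommended_heap_gb(total_ram_gb, cap_gb=31):
    """Half of physical RAM, capped below ~32 GB, and at least 1 GB."""
    return max(1, min(total_ram_gb // 2, cap_gb))


def jvm_heap_options(total_ram_gb):
    """Return matching -Xms/-Xmx lines (min and max heap kept equal)."""
    heap = recommended_heap_gb(total_ram_gb)
    return [f"-Xms{heap}g", f"-Xmx{heap}g"]


print(jvm_heap_options(32))  # ['-Xms16g', '-Xmx16g']
```

Keeping -Xms and -Xmx equal avoids heap resizing at runtime, which is why both lines above carry the same value.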

Adjust the basic configuration for your environment: ~/support/es/config/elasticsearch.yml

# ======================== Elasticsearch Configuration =========================
#
# NOTE: Elasticsearch comes with reasonable defaults for most settings.
#       Before you set out to tweak and tune the configuration, make sure you
#       understand what are you trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# Please consult the documentation for further information on configuration options:
# https://www.elastic.co/guide/en/elasticsearch/reference/index.html
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
#
cluster.name: crm
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
#
node.name: node-1
#
# Add custom attributes to the node:
#
node.attr.rack: r1
#
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
path.data: /home/es/data
#
# Path to log files:
#
path.logs: /home/es/logs
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
#bootstrap.memory_lock: true
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
network.host: 10.230.55.48
#
# Set a custom port for HTTP:
#
http.port: 9200
#
# For more information, consult the network module documentation.
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when this node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
discovery.seed_hosts: ["10.230.55.48"]
#
# Bootstrap the cluster using an initial set of master-eligible nodes:
#
cluster.initial_master_nodes: ["10.230.55.48"]
#
# For more information, consult the discovery and cluster formation module documentation.
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
#gateway.recover_after_nodes: 1
#
# For more information, consult the gateway module documentation.
#
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
#action.destructive_requires_name: true

# Security and authentication settings:
http.cors.enabled: true
http.cors.allow-origin: "*"
http.cors.allow-headers: Authorization
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true

Startup script (~/bin/start-es.sh):

#!/bin/sh

cd ~/support/es/bin
./elasticsearch -d

Set passwords:

~/support/es/bin/elasticsearch-setup-passwords interactive

The tool prompts for passwords for the elastic, apm_system, kibana, kibana_system, logstash_system, beats_system, and remote_monitoring_user users. Once they are all set, the setup is complete.

Verify:

es@centos01 ~/bin $ curl --user elastic:123456 -XGET 'http://10.230.55.48:9200?pretty=true'
{
  "name" : "node-1",
  "cluster_name" : "crm",
  "cluster_uuid" : "1SAd8U-zRyGKy8ztRWAQhQ",
  "version" : {
    "number" : "7.10.2",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "747e1cc71def077253878a59143c1f785afa92b9",
    "build_date" : "2021-01-13T00:42:12.435326Z",
    "build_snapshot" : false,
    "lucene_version" : "8.7.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
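
The same check can be scripted. The sketch below uses only Python's standard library to build the authenticated request that curl sends above. build_request and fetch_version are hypothetical helper names, and the host and credentials are the sample values from this walkthrough.

```python
import base64
import json
import urllib.request


def build_request(base_url, user, password, path="/"):
    """Build a GET request carrying an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    req = urllib.request.Request(base_url.rstrip("/") + path)
    req.add_header("Authorization", f"Basic {token}")
    return req


def fetch_version(base_url, user, password):
    """Call the root endpoint and return version.number (needs a live node)."""
    req = build_request(base_url, user, password)
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)["version"]["number"]


# Building the request does not touch the network.
req = build_request("http://10.230.55.48:9200", "elastic", "123456")
print(req.full_url)
# With the node running: fetch_version("http://10.230.55.48:9200", "elastic", "123456")
```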

Kibana installation:

Directory: ~/support/kibana

Edit the configuration: ~/support/kibana/config/kibana.yml

server.port: 5601
server.host: "10.230.55.48"
elasticsearch.hosts: ["http://10.230.55.48:9200"]
elasticsearch.username: "elastic"
elasticsearch.password: "123456"
i18n.locale: "en"

Dev Tools:

# Show Elasticsearch version information
GET /

# Check cluster health
GET _cluster/health

# List cluster nodes
GET _cat/nodes

# Show shard allocation
GET _cat/shards

# List indices
GET _cat/indices

# Count the documents in an index
GET sec_function/_count

4. Indices

List all indices in the cluster:

es@centos01 ~ $ curl --user elastic:123456 -XGET 'http://10.230.55.48:9200/_cat/indices?v'
health status index                             uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   sec_function                      90qf16nIQNqfd_l_deqWpA  10   0     8040             51      4.2mb          4.2mb
green  open   pm_offer_for_trans                GSehvq3EQZKgWnkztShSyw  10   0    1056308       172784    313.6mb        313.6mb
green  open   tf_r_address_tree                 F5xwcaRfTYmfReiPgOE3Fg  10   0   11490425            0        1gb            1gb

Create and delete an index:

es@centos01 ~ $ curl --user elastic:123456 -XPUT 'http://10.230.55.48:9200/weather'
{"acknowledged":true,"shards_acknowledged":true,"index":"weather"}

es@centos01 ~ $ curl --user elastic:123456 -XGET 'http://10.230.55.48:9200/_cat/indices?v'

health status index               uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   sec_function        90qf16nIQNqfd_l_deqWpA  10   0       8040           51      4.2mb          4.2mb
green  open   pm_offer_for_trans  GSehvq3EQZKgWnkztShSyw  10   0    1056308       172784    313.6mb        313.6mb
green  open   tf_r_address_tree   F5xwcaRfTYmfReiPgOE3Fg  10   0   11490425            0        1gb            1gb
green  open   weather             vIVMeX22SReCpKGD0Pk5uw  5    1          0            0      2.2kb          1.1kb

es@centos01 ~ $ curl --user elastic:123456 -XDELETE 'http://10.230.55.48:9200/weather'
{"acknowledged":true}
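
The ?v output is a whitespace-aligned table, which is easy to post-process. Below is a small sketch (parse_cat_table is a hypothetical helper) that turns it into a list of dictionaries. It assumes no column value contains spaces, which holds for the output shown here. For anything more robust, the _cat APIs can also return JSON via format=json.

```python
def parse_cat_table(text):
    """Parse `_cat/...?v` output: first non-blank line is the header row."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


# Two rows copied from the listing above.
sample = """
health status index               uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   sec_function        90qf16nIQNqfd_l_deqWpA  10   0       8040           51      4.2mb          4.2mb
green  open   pm_offer_for_trans  GSehvq3EQZKgWnkztShSyw  10   0    1056308       172784    313.6mb        313.6mb
"""

for row in parse_cat_table(sample):
    print(row["index"], row["docs.count"], row["store.size"])
```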

5. Chinese Word Segmentation

Unzip elasticsearch-analysis-ik-7.10.2.zip and elasticsearch-analysis-pinyin-7.10.2.zip into their own subdirectories under ~/support/es/plugins, then restart ES.

es@centos01 ~/support $ tree ~/support/es/plugins/
/home/es/support/es/plugins/
|-- ik
|   |-- commons-codec-1.9.jar
|   |-- commons-logging-1.2.jar
|   |-- config
|   |   |-- extra_main.dic
|   |   |-- extra_single_word.dic
|   |   |-- extra_single_word_full.dic
|   |   |-- extra_single_word_low_freq.dic
|   |   |-- extra_stopword.dic
|   |   |-- IKAnalyzer.cfg.xml
|   |   |-- main.dic
|   |   |-- preposition.dic
|   |   |-- quantifier.dic
|   |   |-- stopword.dic
|   |   |-- suffix.dic
|   |   `-- surname.dic
|   |-- elasticsearch-analysis-ik-7.10.2.jar
|   |-- httpclient-4.5.2.jar
|   |-- httpcore-4.4.4.jar
|   |-- plugin-descriptor.properties
|   `-- plugin-security.policy
`-- pinyin
    |-- elasticsearch-analysis-pinyin-7.10.2.jar
    |-- nlp-lang-1.7.jar
    `-- plugin-descriptor.properties

3 directories, 22 files
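
After the restart, one way to confirm the plugins loaded is to send sample text to the _analyze API. The sketch below only builds the request bodies: ik_smart and ik_max_word are the analyzers registered by the IK plugin and pinyin by the pinyin plugin, while build_analyze_body and the sample phrase are illustrative.

```python
import json


def build_analyze_body(text, analyzer="ik_max_word"):
    """Return the JSON body for POST /_analyze."""
    # ensure_ascii=False keeps Chinese text readable in the payload.
    return json.dumps({"analyzer": analyzer, "text": text}, ensure_ascii=False)


for analyzer in ("ik_smart", "ik_max_word", "pinyin"):
    print("POST /_analyze", build_analyze_body("中华人民共和国", analyzer))
```

Each body can be pasted into Kibana Dev Tools or sent with curl using the same Content-Type header as the other requests in this guide. If a plugin did not load, Elasticsearch rejects the request because the analyzer name is unknown.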

6. Data Operations

Prepare a new index:

curl --user elastic:123456 -XPUT 'http://10.230.55.48:9200/student' -H 'Content-Type: application/json' -d '
{
  "mappings" : {
    "properties" : {
      "name" : {
        "type" : "keyword"
      },
      "age" : {
        "type" : "integer"
      }
    }
  },
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 0
    }
  }
}'

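Add documents (using POST):

Example 1: (POST writes a document: if no document with that _id exists it is created, otherwise the existing one is replaced.)

# Request. When no _id is specified, Elasticsearch generates a random string as the _id.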
curl --user elastic:123456 -XPOST 'http://10.230.55.48:9200/student/_doc?pretty=true' -H 'Content-Type: application/json' -d '
{
  "name": "张三"
}'

# Response
{
  "_index" : "student",
  "_type" : "_doc",
  "_id" : "q6ek7XcBqu3Z6vLyxDD4",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}

Example 2: (specifying _id as 2)

# Request, specifying _id as 2
curl --user elastic:123456 -XPOST 'http://10.230.55.48:9200/student/_doc/2?pretty=true' -H 'Content-Type: application/json' -d '
{
  "name": "李四"
}'

# Response
{
  "_index" : "student",
  "_type" : "_doc",
  "_id" : "2",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 1,
  "_primary_term" : 1
}

An incorrect way to update data:

# Request
curl --user elastic:123456 -XGET 'http://10.230.55.48:9200/student/_doc/2?pretty=true'

# Response
{
  "_index" : "student",
  "_type" : "_doc",
  "_id" : "2",
  "_version" : 1,
  "_seq_no" : 1,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "name" : "李四"
  }
}

Note that the result has no age field. Suppose we try to add it by posting only age to the same _id:

# Request
curl --user elastic:123456 -XPOST 'http://10.230.55.48:9200/student/_doc/2?pretty=true' -H 'Content-Type: application/json' -d '
{
  "age": 10
}'

# Response
{
  "_index" : "student",
  "_type" : "_doc",
  "_id" : "2",
  "_version" : 2,
  "result" : "updated",
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 2,
  "_primary_term" : 1
}

# Request
curl --user elastic:123456 -XGET 'http://10.230.55.48:9200/student/_doc/2?pretty=true'

# Response
{
  "_index" : "student",
  "_type" : "_doc",
  "_id" : "2",
  "_version" : 2,
  "_seq_no" : 2,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "age" : 10
  }
}

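As a result, _version went from 1 to 2 and the name field disappeared. The reason is that POST student/_doc/2 overwrites the whole document: in effect, the original document is deleted and a new one is indexed.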
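
The overwrite behavior can be modeled in a few lines. This is a toy in-memory model, not how Elasticsearch stores data; TinyIndex and its methods are invented for the illustration.

```python
class TinyIndex:
    """In-memory stand-in for one index: _id -> (version, source)."""

    def __init__(self):
        self._docs = {}

    def index(self, doc_id, source):
        """Store `source` as the whole document, replacing any previous one."""
        version = self._docs[doc_id][0] + 1 if doc_id in self._docs else 1
        self._docs[doc_id] = (version, dict(source))
        return "created" if version == 1 else "updated"

    def get(self, doc_id):
        version, source = self._docs[doc_id]
        return {"_id": doc_id, "_version": version, "_source": dict(source)}


idx = TinyIndex()
print(idx.index("2", {"name": "李四"}))  # created
print(idx.index("2", {"age": 10}))       # updated
print(idx.get("2"))                      # _source holds only age
```

Because index() never looks at the old _source, the second call drops name, which mirrors the responses shown above.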

Updating a document with _update:

es@centos01 ~ $ curl --user elastic:123456 -XPOST 'http://10.230.55.48:9200/student/_doc/2?pretty=true' -H 'Content-Type: application/json' -d '
{
  "name": "李四"
}'

es@centos01 ~ $ curl --user elastic:123456 -XPOST 'http://10.230.55.48:9200/student/_update/2?pretty=true' -H 'Content-Type: application/json' -d '
{
  "doc": {
    "age": 10
  }
}'


# Request
es@centos01 ~ $ curl --user elastic:123456 -XGET 'http://10.230.55.48:9200/student/_doc/2?pretty=true'
{
  "_index" : "student",
  "_type" : "_doc",
  "_id" : "2",
  "_version" : 4,
  "_seq_no" : 4,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "name" : "李四",
    "age" : 10
  }
}

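When you use _update, ES does the following:

Builds a JSON document from the old document
Applies the changes to that JSON
Deletes the old document
Indexes a new document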
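
The four steps above can be sketched as follows. This is a simplified model: store is a plain dictionary standing in for the index, partial_update is a hypothetical name, and the merge is shallow, whereas Elasticsearch merges nested objects recursively.

```python
def partial_update(store, doc_id, partial):
    """Apply a partial document the way the steps above describe."""
    version, old_source = store[doc_id]
    merged = dict(old_source)               # 1. build JSON from the old document
    merged.update(partial)                  # 2. apply the changes to that JSON
    del store[doc_id]                       # 3. delete the old document
    store[doc_id] = (version + 1, merged)   # 4. index a new document
    return store[doc_id]


# State before the _update call in the walkthrough: version 3, name only.
store = {"2": (3, {"name": "李四"})}
print(partial_update(store, "2", {"age": 10}))
```

The old fields survive because the merge starts from the previous _source, and the version still increases by one because a new document is indexed.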