Elasticsearch

https://www.elastic.co/guide/en/elasticsearch/reference/6.0/getting-started.html

Elasticsearch is a highly scalable open-source full-text search and analytics engine. It allows you to store, search, and analyze big volumes of data quickly and in near real time. It is generally used as the underlying engine/technology that powers applications that have complex search features and requirements.

document operation api

https://www.elastic.co/guide/en/elasticsearch/reference/5.5/docs.html

This section starts with a short introduction to Elasticsearch’s data replication model, followed by a detailed description of the following CRUD APIs:

Single document APIs

Index API

Get API

Delete API

Update API

Multi-document APIs

Multi Get API

Bulk API

Delete By Query API

Update By Query API

Reindex API

full text query api

https://www.elastic.co/guide/en/elasticsearch/reference/current/full-text-queries.html

The full text queries enable you to search analyzed text fields such as the body of an email. The query string is processed using the same analyzer that was applied to the field during indexing.

The queries in this group are:

intervals query
A full text query that allows fine-grained control of the ordering and proximity of matching terms.
match query
The standard query for performing full text queries, including fuzzy matching and phrase or proximity queries.
match_bool_prefix query
Creates a bool query that matches each term as a term query, except for the last term, which is matched as a prefix query
match_phrase query
Like the match query but used for matching exact phrases or word proximity matches.
match_phrase_prefix query
Like the match_phrase query, but does a wildcard search on the final word.
multi_match query
The multi-field version of the match query.
common terms query
A more specialized query which gives more preference to uncommon words.
query_string query
Supports the compact Lucene query string syntax, allowing you to specify AND|OR|NOT conditions and multi-field search within a single query string. For expert users only.
simple_query_string query
A simpler, more robust version of the query_string syntax suitable for exposing directly to users.

Dev Tool

https://zhuanlan.zhihu.com/p/54384152

安装 Kibana

这是一个官方推出的把 Elasticsearch 数据可视化的工具，官网在这里：【传送门】，不过我们现在暂时还用不到那些数据分析的东西，不过里面有一个 Dev Tools 的工具可以方便的和 Elasticsearch 服务进行交互，去官网下载了最新版本的 Kibana（6.5.4）结果不知道为什么总是启动不起来，所以换一了一个低版本的（6.2.2）正常，给个下载外链：下载点这里，你们也可以去官网试试能不能把最新的跑起来：

等待一段时间后就可以看到提示信息，运行在 5601 端口，我们访问地址 http://localhost:5601/app/kibana#/dev_tools/console?_g=() 可以成功进入到 Dev-tools 界面：

点击【Get to work】，然后在控制台输入 GET /_cat/health?v 查看服务器状态，可以在右侧返回的结果中看到 green 即表示服务器状态目前是健康的：

安装

https://www.elastic.co/downloads/elasticsearch

https://www.elastic.co/guide/cn/kibana/current/targz.html

插件与分词工具

https://www.elastic.co/guide/en/elasticsearch/plugins/current/intro.html

Plugins are a way to enhance the core Elasticsearch functionality in a custom manner. They range from adding custom mapping types, custom analyzers, native scripts, custom discovery and more.

https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-smartcn.html

The Smart Chinese Analysis plugin integrates Lucene’s Smart Chinese analysis module into elasticsearch.

It provides an analyzer for Chinese or mixed Chinese-English text. This analyzer uses probabilistic knowledge to find the optimal word segmentation for Simplified Chinese text. The text is first broken into sentences, then each sentence is segmented into words.
sudo bin/elasticsearch-plugin install analysis-smartcn

https://www.elastic.co/guide/en/elasticsearch/plugins/current/ingest-attachment.html

The ingest attachment plugin lets Elasticsearch extract file attachments in common formats (such as PPT, XLS, and PDF) by using the Apache text extraction library Tika.

You can use the ingest attachment plugin as a replacement for the mapper attachment plugin.

The source field must be a base64 encoded binary. If you do not want to incur the overhead of converting back and forth between base64, you can use the CBOR format instead of JSON and specify the field as a bytes array instead of a string representation. The processor will skip the base64 decoding then.
sudo bin/elasticsearch-plugin install ingest-attachment

TF/IDF

https://www.cnblogs.com/gongxijun/p/8673241.html

##TF-IDF

TF（词频）: 假定存在一份有N个词的文件A，其中‘明星‘这个词出现的次数为T。那么 TF = T/N;

所以表示为：某一个词在某一个文件中出现的频率.

TF-IDF(词频-逆向文件频率)：表示的词频和逆向文件频率的乘积.

比如：假定存在一份有N个词的文件A，其中‘明星‘这个词出现的次数为T。那么 TF = T/N; 并且‘明星’这个词，在W份文件中出现，而总共有X份文件，那么

IDF = log(X/W) ;

而： TF-IDF = TF * IDF = T/N * log(X/W); 我们发现，‘明星’，这个出现在W份文件，W越小 TF-IDF越大，也就是这个词越有可能是该文档的关键字，而不是习惯词（类似于：‘的’，‘是’，‘不是’这些词），

而TF越大，说明这个词在文档中的信息量越大.

reference

http://www.ruanyifeng.com/blog/2017/08/elasticsearch.html

https://www.elastic.co/blog/a-practical-introduction-to-elasticsearch