Getting Started with Elasticsearch (Part 5): Data

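Preface: However we write our programs, the goal is always to organize data so that it serves us. In real applications, entities of the same type do not all look alike. Traditionally we store data in rows and columns in a relational database, much like a spreadsheet, and this fixed layout takes away the flexibility of our objects. Elasticsearch is a distributed document store in which, by default, the data in every field is indexed.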

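I. What Is a Document

Most entities or objects in a program can be serialized into a JSON object made of key-value pairs. A key is the name of a field or property, and a value can be a string, a number, a Boolean, another object, an array of values, or some other specialized type: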

{
    "name":         "John Smith",
    "age":          42,
    "confirmed":    true,
    "join_date":    "2014-06-01",
    "home": {
        "lat":      51.5,
        "lon":      0.1
    },
    "accounts": [
        {
            "type": "facebook",
            "id":   "johnsmith"
        },
        {
            "type": "twitter",
            "id":   "johnsmith"
        }
    ]
}

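We usually treat the terms object and document as interchangeable, but there is a difference. An object is just a JSON structure, similar to a hash, hashmap, dictionary, or associative array, and an object may contain other objects. In Elasticsearch, a document refers specifically to the top-level structure, or root object, serialized as JSON.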
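As a minimal sketch of the serialization described above, the following Python snippet turns an in-memory object into a JSON document using only the standard library. The User and Account classes are hypothetical and exist only for illustration.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class Account:
    type: str
    id: str


@dataclass
class User:
    name: str
    age: int
    confirmed: bool
    accounts: List[Account] = field(default_factory=list)


user = User(
    name="John Smith",
    age=42,
    confirmed=True,
    accounts=[Account("facebook", "johnsmith"), Account("twitter", "johnsmith")],
)

# asdict() converts nested dataclasses too, so the root object becomes one dict
document = json.dumps(asdict(user), indent=4)
print(document)
```

The root object (the user) is what would be sent to Elasticsearch as the document; the nested accounts are ordinary inner objects.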

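Document metadata:

A document consists of more than its data. It also has metadata, and the three required metadata elements are:

_index: where the document lives

_type: the class of object that the document represents

_id: the unique identifier of the document

①: _index

An index is similar to a database in a relational database; it is where we store and index related data. An index name must be all lowercase, must not begin with an underscore, and must not contain commas.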
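The three naming rules above can be sketched as a small check. This is only an illustration of the rules listed in the text: the function name is hypothetical, rejecting empty names is an extra assumption, and Elasticsearch itself may enforce further restrictions.

```python
def is_valid_index_name(name: str) -> bool:
    """Check only the rules stated in the text (plus non-empty, an assumption)."""
    return (
        bool(name)
        and name == name.lower()         # all lowercase
        and not name.startswith("_")     # no leading underscore
        and "," not in name              # no commas
    )


for candidate in ["website", "Website", "_website", "web,site"]:
    print(candidate, is_valid_index_name(candidate))
```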

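②: _type

In a relational database, we usually store objects of the same class in the same table because they share the same structure. In Elasticsearch, we use documents of the same type to represent the same kind of thing.

Every type has its own mapping, or schema definition, much like the columns of a table in a traditional database. Documents of all types are stored in the same index, but the mapping of a type tells Elasticsearch how documents of that type should be indexed.

A _type name can be uppercase or lowercase, but it must not begin with an underscore or contain commas.

③: _id

The _id is simply a string that, combined with the _index and _type, uniquely identifies a document. You can supply it yourself or let Elasticsearch generate it for you.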

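II. Indexing a Document

Documents are indexed (that is, stored and made searchable) through the index API. First, though, we need to decide where the document lives. A document is uniquely identified by its _index, _type, and _id; we can supply our own _id or let the index API generate one for us.

1. Using Our Own ID

If the document has a natural identifier, we can provide our own _id: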

PUT /{index}/{type}/{id}
{
  "field": "value",
  ...
}

For example, with an index named website, a type named blog, and an ID of 123:

PUT /website/blog/123
{
  "title": "My first blog entry",
  "text":  "Just trying this out...",
  "date":  "2014/01/01"
}

Elasticsearch responds with:

{
   "_index":    "website",
   "_type":     "blog",
   "_id":       "123",
   "_version":  1,
   "created":   true
}

The response indicates that the document was created successfully. It contains the _index, _type, and _id metadata, plus a new element: _version.
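For readers who prefer code to raw HTTP, here is a minimal sketch of how the request above could be built with Python's standard library. The helper name build_index_request is hypothetical, and the base URL assumes a node listening on localhost:9200; the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_index_request(index, doc_type, doc_id, document,
                        base_url="http://localhost:9200"):
    """Build (but do not send) a PUT /{index}/{type}/{id} request."""
    url = f"{base_url}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


request = build_index_request(
    "website", "blog", "123",
    {
        "title": "My first blog entry",
        "text": "Just trying this out...",
        "date": "2014/01/01",
    },
)
print(request.get_method(), request.full_url)
# Sending it requires a running node: urllib.request.urlopen(request)
```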

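2. Auto-generated IDs

If our data has no natural ID, we can let Elasticsearch generate one for us. The request uses POST instead of PUT, and the URL now contains only the _index and the _type: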

POST /website/blog/
{
  "title": "My second blog entry",
  "text":  "Still trying this out...",
  "date":  "2014/01/01"
}

The response is:

{
   "_index":    "website",
   "_type":     "blog",
   "_id":       "wM0OSFhDQXGZAWDf0-drSA",
   "_version":  1,
   "created":   true
}

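Auto-generated IDs are 22 characters long; they are URL-safe, Base64-encoded UUIDs.

III. Retrieving a Document

To get a document out of Elasticsearch, we use the same _index, _type, and _id, but the HTTP method changes to GET: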

GET /website/blog/123?pretty

The response includes a _source field, which contains the original document that we sent to Elasticsearch when we indexed it:

{
  "_index" :   "website",
  "_type" :    "blog",
  "_id" :      "123",
  "_version" : 1,
  "found" :    true,
  "_source" :  {
      "title": "My first blog entry",
      "text":  "Just trying this out...",
      "date":  "2014/01/01"
  }
}

pretty: pretty-prints the JSON output.

The response to the GET request includes found: true, which means that the document was found. If we request a document that does not exist, found becomes false and the HTTP status code becomes 404 Not Found. We can see the response headers by passing the -i flag to curl:

curl -i -XGET http://localhost:9200/website/blog/124?pretty

The response now has a 404 Not Found status and the following body:

{
  "_index" :   "website",
  "_type" :    "blog",
  "_id" :      "124",
  "found" :    false
}

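Retrieving part of a document:

By default, GET returns the whole document. If we need only part of it, we can use the _source parameter, separating multiple fields with commas: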

GET /website/blog/123?_source=title,text

{
  "_index" :   "website",
  "_type" :    "blog",
  "_id" :      "123",
  "_version" : 1,
  "found" :    true,
  "_source" :  {
      "title": "My first blog entry",
      "text":  "Just trying this out..."
  }
}

If we want only the _source field without any metadata, we can use the _source endpoint:

GET /website/blog/123/_source

{
   "title": "My first blog entry",
   "text":  "Just trying this out...",
   "date":  "2014/01/01"
}

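IV. Checking Whether a Document Exists

If all we want to do is check whether a document exists, we use the HEAD method instead of GET. A HEAD request does not return a response body, only HTTP headers.

curl -i -XHEAD http://localhost:9200/website/blog/123

If the document exists: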

HTTP/1.1 200 OK
Content-Type: text/plain; charset=UTF-8
Content-Length: 0

If it does not exist:

HTTP/1.1 404 Not Found
Content-Type: text/plain; charset=UTF-8
Content-Length: 0

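V. Updating a Whole Document

Documents in Elasticsearch are immutable; we cannot change them. To update an existing document, we reindex or replace it using the index API.

For example: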

PUT /website/blog/123
{
  "title": "My first blog entry",
  "text":  "I am starting to get the hang of this...",
  "date":  "2014/01/02"
}

Response:

{
  "_index" :   "website",
  "_type" :    "blog",
  "_id" :      "123",
  "_version" : 2,
  "created":   false <1>
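}

<1> created is false because a document with the same ID already exists in the same index and type. The response also shows that _version has been incremented.

Internally, Elasticsearch has marked the old document as deleted and added an entirely new document. The old document is not removed immediately, but it can no longer be accessed.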

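VI. Creating a New Document

When we index a document, how can we be sure that we are creating an entirely new document rather than overwriting an existing one?

The combination of _index, _type, and _id uniquely identifies a document. The simplest way to make sure that a document is new is to use the POST method and let Elasticsearch generate a unique _id: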

POST /website/blog

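If we want to use our own _id, we must tell Elasticsearch to accept the request only if a document with the same _index, _type, and _id does not already exist.

The first way is to use the op_type query parameter: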

PUT /website/blog/123?op_type=create

The second way is to append /_create to the URL as the endpoint:

PUT /website/blog/123/_create

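If the document is created successfully, Elasticsearch returns a 201 Created status code. If a document with the same _index, _type, and _id already exists, it returns 409 Conflict with an error like this: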

{
  "error" : "DocumentAlreadyExistsException[[website][4] [blog][123]:
             document already exists]",
  "status" : 409
}

VII. Deleting a Document

Syntax: DELETE /website/blog/123

If the document is found, Elasticsearch returns a 200 status code and the _version number is incremented:

{
  "found" :    true,
  "_index" :   "website",
  "_type" :    "blog",
  "_id" :      "123",
  "_version" : 3
}

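If the document is not found, Elasticsearch returns 404. The _version is still incremented, which ensures that operations arriving at different nodes are applied in the correct order: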

{
  "found" :    false,
  "_index" :   "website",
  "_type" :    "blog",
  "_id" :      "123",
  "_version" : 4
}

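VIII. Dealing with Conflicts

When we update a document with the index API, we read the original document, make our changes, and then reindex the whole document in one go. The most recent indexing request wins: if someone else changed the document at the same time, their changes are lost.

Consider two web processes, web_1 and web_2, that both update the same stock_count field. The change made by web_1 is lost because web_2 does not realize that its copy of stock_count is out of date. The more often changes are made, or the longer the gap between reading and updating, the more likely we are to lose changes.

In the database world, two approaches are commonly used to ensure that changes are not lost during concurrent updates:

Pessimistic concurrency control:

Widely used by relational databases, this approach assumes that conflicting changes are likely and blocks access to a resource in order to prevent them. A typical example is locking a row before reading it, so that only the thread holding the lock can modify that row.

Optimistic concurrency control:

Used by Elasticsearch, this approach assumes that conflicts are unlikely and does not block access. However, if the underlying data changes between the read and the write, the update fails, and it is up to the application to decide how to resolve the conflict.

Elasticsearch is distributed. When a document is created, updated, or deleted, the new version of the document has to be replicated to other nodes in the cluster.

Elasticsearch is also asynchronous and concurrent, meaning that these replication requests are sent in parallel and may arrive at their destination out of order. Elasticsearch therefore needs a way to ensure that an older version of a document never overwrites a newer one.

Every document has a _version number that is incremented whenever the document changes. Elasticsearch uses it to ensure that changes are applied in the correct order: if an older version of a document arrives after a newer one, it is simply ignored.

We can use _version ourselves to make sure that conflicting changes do not cause data loss: we specify the version of the document we wish to change, and if that version is no longer current, our request fails.

Example: create a new blog post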

PUT /website/blog/1/_create
{
  "title": "My first blog entry",
  "text":  "Just trying this out..."
}

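The response body tells us that this newly created document has _version 1. Suppose that we want to edit the document and save a new version.

First, retrieve the document: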

GET /website/blog/1

Response body:

{
  "_index" :   "website",
  "_type" :    "blog",
  "_id" :      "1",
  "_version" : 1,
  "found" :    true,
  "_source" :  {
      "title": "My first blog entry",
      "text":  "Just trying this out..."
  }
}

Then save the change, specifying the version it applies to:

PUT /website/blog/1?version=1 <1>
{
  "title": "My first blog entry",
  "text":  "Starting to get the hang of this..."
}

<1> We want this update to succeed only if the current _version is 1.

Response body when the request succeeds:

{
  "_index":   "website",
  "_type":    "blog",
  "_id":      "1",
  "_version": 2,
  "created":  false
}

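Response body when the request fails (for example, if we send the same request again while still specifying version=1):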

{
  "error" : "VersionConflictEngineException[[website][2] [blog][1]:
             version conflict, current [2], provided [1]]",
  "status" : 409
}
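The version check can be illustrated with a small in-memory model. This is a sketch of the idea only, not of how Elasticsearch implements it; VersionedStore and VersionConflict are hypothetical names.

```python
class VersionConflict(Exception):
    """Raised when the caller's expected version is not the current one."""


class VersionedStore:
    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def get(self, doc_id):
        return self._docs[doc_id]

    def index(self, doc_id, source, version=None):
        current_version = self._docs.get(doc_id, (0, None))[0]
        # With an explicit version, the write is accepted only if it matches
        if version is not None and version != current_version:
            raise VersionConflict(
                f"current [{current_version}], provided [{version}]"
            )
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, dict(source))
        return new_version


store = VersionedStore()
store.index("1", {"text": "Just trying this out..."})
store.index("1", {"text": "Starting to get the hang of this..."}, version=1)

try:
    # A second writer still believes the document is at version 1
    store.index("1", {"text": "A stale edit"}, version=1)
    conflict = False
except VersionConflict as error:
    conflict = True
    print("conflict:", error)
```

The stale write is rejected instead of silently overwriting the newer text, which is exactly the lost-update problem that optimistic concurrency control is meant to prevent.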

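Using versions from an external system:

A common setup is to use some other database as the primary data store and Elasticsearch for search. This means that every change to the primary database must be copied to Elasticsearch, and if multiple processes are responsible for this synchronization, we run into concurrency problems.

If the primary database already has a version number, or a value such as a timestamp that can be used for versioning, we can reuse it in Elasticsearch by adding version_type=external to the query string. With external versioning, a request succeeds only if the supplied version is greater than the version currently stored.

For example, create a new blog post with an external version number of 5: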

PUT /website/blog/2?version=5&version_type=external
{
  "title": "My first external blog entry",
  "text":  "Starting to get the hang of this..."
}

Response body:

{
  "_index":   "website",
  "_type":    "blog",
  "_id":      "2",
  "_version": 5,
  "created":  true
}

Now update the document, specifying a version of 10:

PUT /website/blog/2?version=10&version_type=external
{
  "title": "My first external blog entry",
  "text":  "This is a piece of cake..."
}

Response body:

{
  "_index":   "website",
  "_type":    "blog",
  "_id":      "2",
  "_version": 10,
  "created":  false
}

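Running the same request a second time returns a conflict error, because the supplied version is no longer greater than the stored one.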
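The acceptance rule for external versions is simple enough to sketch in a few lines. The function name is hypothetical, and None stands for "no document stored yet".

```python
def accepts_external_version(stored_version, supplied_version):
    """Accept a write only if the supplied version is strictly greater."""
    return stored_version is None or supplied_version > stored_version


print(accepts_external_version(None, 5))   # first write with version 5
print(accepts_external_version(5, 10))     # update with version 10
print(accepts_external_version(10, 10))    # repeating the update
```

The third call models rerunning the update above: the version is equal rather than greater, so the write would be rejected.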

IX. Partial Updates to a Document

Example: add a tags field and a views field to the blog post

POST /website/blog/1/_update
{
   "doc" : {
      "tags" : [ "testing" ],
      "views": 0
   }
}

A successful request returns:

{
   "_index" :   "website",
   "_id" :      "1",
   "_type" :    "blog",
   "_version" : 3
}

Retrieving the document shows the result:

{
   "_index":    "website",
   "_type":     "blog",
   "_id":       "1",
   "_version":  3,
   "found":     true,
   "_source": {
      "title":  "My first blog entry",
      "text":   "Starting to get the hang of this...",
      "tags": [ "testing" ], <1>
      "views":  0 <1>
   }
}

<1> The fields we added are now part of the _source field.
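Conceptually, a doc update merges the partial document into the existing _source: objects are merged, existing scalar fields are overwritten, and new fields are added. The sketch below models that merge in plain Python; merge_partial is a hypothetical helper, not an Elasticsearch API.

```python
def merge_partial(source, partial):
    """Return a copy of source with the partial document merged in."""
    merged = dict(source)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested objects are merged recursively
            merged[key] = merge_partial(merged[key], value)
        else:
            # Scalars and arrays are replaced; new fields are added
            merged[key] = value
    return merged


source = {
    "title": "My first blog entry",
    "text": "Starting to get the hang of this...",
}
updated = merge_partial(source, {"tags": ["testing"], "views": 0})
print(updated)
```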

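Using scripts for partial updates:

When the API does not meet your needs, Elasticsearch lets you write your own logic in a script. In the version covered here, the default scripting language is Groovy.

A script used with the update API can change the contents of the _source field, which is referred to as ctx._source inside the script. For example, we can use a script to increment the number of views of the blog post: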

POST /website/blog/1/_update
{
   "script" : "ctx._source.views+=1"
}

We can also add a new tag to the tags array, passing the tag as a parameter:

POST /website/blog/1/_update
{
   "script" : "ctx._source.tags+=new_tag",
   "params" : {
      "new_tag" : "search"
   }
}

Fetching the document shows the effect of the last two requests:

{
   "_index":    "website",
   "_type":     "blog",
   "_id":       "1",
   "_version":  5,
   "found":     true,
   "_source": {
      "title":  "My first blog entry",
      "text":   "Starting to get the hang of this...",
      "tags":  ["testing", "search"], <1>
      "views":  1 <2>
   }
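}

<1> The search tag has been added to the tags array.

<2> The views field has been incremented.

By setting ctx.op to delete, we can delete a document based on its contents: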

POST /website/blog/1/_update
{
   "script" : "ctx.op = ctx._source.views == count ? 'delete' : 'none'",
    "params" : {
        "count": 1
    }
}

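Updating a document that may not yet exist

Suppose that we keep a page view counter in Elasticsearch and increment it every time a user views a page. If the page is new, the update fails because the document does not exist yet. We can use the upsert parameter to specify the document that should be created in that case: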

POST /website/pageviews/1/_update
{
   "script" : "ctx._source.views+=1",
   "upsert": {
       "views": 1
   }
}
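The upsert behavior can be modeled with a dictionary: if the document is missing, the upsert document is stored as is; otherwise the script runs against the existing document. This is a sketch under that assumption, with a plain Python function standing in for the script; all names are hypothetical.

```python
def update_with_upsert(store, doc_id, script, upsert):
    """Create the document from upsert if missing, else apply the script."""
    if doc_id not in store:
        store[doc_id] = dict(upsert)   # script is not run on first creation
    else:
        script(store[doc_id])
    return store[doc_id]


def increment_views(source):
    source["views"] += 1


pageviews = {}
update_with_upsert(pageviews, "1", increment_views, {"views": 1})  # creates
update_with_upsert(pageviews, "1", increment_views, {"views": 1})  # increments
print(pageviews)
```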

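Updates and conflicts:

If a conflict occurs and we do not care about the order in which the updates are applied, as when incrementing a counter, we can simply set the number of times the update should be retried: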

POST /website/pageviews/1/_update?retry_on_conflict=5 <1>
{
   "script" : "ctx._source.views+=1",
   "upsert": {
       "views": 0
   }
}

<1> Retry this update up to five times before failing.
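The retry behavior amounts to a read-modify-write loop that starts over after each version conflict. The sketch below simulates it against an in-memory state in which another writer interferes twice; every name here is hypothetical and the conflict is artificial.

```python
class VersionConflict(Exception):
    pass


def update_with_retry(read, write, change, retry_on_conflict=5):
    """Retry the read-modify-write cycle after each version conflict."""
    attempts = retry_on_conflict + 1
    for attempt in range(attempts):
        version, source = read()
        try:
            return write(change(source), version)
        except VersionConflict:
            if attempt == attempts - 1:
                raise  # out of retries


state = {"version": 1, "source": {"views": 0}, "conflicts_left": 2}


def read():
    return state["version"], dict(state["source"])


def write(new_source, expected_version):
    if state["conflicts_left"] > 0:
        # Simulate another process updating the document in between
        state["conflicts_left"] -= 1
        state["version"] += 1
        state["source"]["views"] += 1
    if expected_version != state["version"]:
        raise VersionConflict()
    state["source"] = new_source
    state["version"] += 1
    return state["version"]


def add_view(source):
    source["views"] += 1
    return source


final_version = update_with_retry(read, write, add_view)
print(final_version, state["source"])
```

Because each retry rereads the document, no increment is lost: the two simulated concurrent updates and our own update are all reflected in the final count.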

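X. Retrieving Multiple Documents

As fast as Elasticsearch already is, retrieving multiple documents can be faster still: combining several requests into one avoids the network overhead of processing each request individually. For this we use the multi-get, or mget, API.

The mget API expects a docs array, each element of which specifies the _index, _type, and _id metadata of one document. If you want to retrieve only one or a few specific fields, you can also specify a _source parameter: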

POST /_mget
{
   "docs" : [
      { "_index" : "website", "_type" : "blog",      "_id" : 2 },
      { "_index" : "website", "_type" : "pageviews", "_id" : 1, "_source" : "views" }
   ]
}

The response body also contains a docs array, with one response per document, in the same order as in the request:

{
   "docs" : [
      {
         "_index" :   "website",
         "_id" :      "2",
         "_type" :    "blog",
         "found" :    true,
         "_source" : {
            "text" :  "This is a piece of cake...",
            "title" : "My first external blog entry"
         },
         "_version" : 10
      },
      {
         "_index" :   "website",
         "_id" :      "1",
         "_type" :    "pageviews",
         "found" :    true,
         "_version" : 2,
         "_source" : {
            "views" : 2
         }
      }
   ]
}

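If the documents you want to retrieve are all in the same _index (or even of the same _type), you can specify a default /_index or /_index/_type in the URL. Individual entries can still override these defaults: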

POST /website/blog/_mget
{
   "docs" : [
      { "_id" : 2 },
      { "_type" : "pageviews", "_id" :   1 }
   ]
}

If all the documents have the same _index and _type, the request can be shortened to an ids array:

POST /website/blog/_mget
{
   "ids" : [ "2", "1" ]
}

If one of the requested documents does not exist, the response reports this with found: false:

{
  "docs" : [
    {
      "_index" :   "website",
      "_type" :    "blog",
      "_id" :      "2",
      "_version" : 10,
      "found" :    true,
      "_source" : {
        "title":   "My first external blog entry",
        "text":    "This is a piece of cake..."
      }
    },
    {
      "_index" :   "website",
      "_type" :    "blog",
      "_id" :      "1",
      "found" :    false
    }
  ]
}
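Since a missing document does not make the whole mget request fail, client code has to inspect the found flag of every entry. A minimal sketch of that check follows; split_mget_response is a hypothetical helper, and the response dictionary is a hand-written copy of the one shown above.

```python
def split_mget_response(response):
    """Separate found documents from missing ones in an mget response."""
    found, missing = {}, []
    for doc in response["docs"]:
        key = (doc["_index"], doc["_type"], doc["_id"])
        if doc.get("found"):
            found[key] = doc["_source"]
        else:
            missing.append(key)
    return found, missing


response = {
    "docs": [
        {
            "_index": "website", "_type": "blog", "_id": "2",
            "_version": 10, "found": True,
            "_source": {
                "title": "My first external blog entry",
                "text": "This is a piece of cake...",
            },
        },
        {"_index": "website", "_type": "blog", "_id": "1", "found": False},
    ]
}

found, missing = split_mget_response(response)
print(sorted(found), missing)
```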

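XI. Bulk Operations

Just as mget lets us retrieve multiple documents at once, the bulk API lets us perform multiple create, index, update, or delete requests in a single request.

The bulk request body has the following format:

{ action: { metadata }}\n
{ request body        }\n
{ action: { metadata }}\n
{ request body        }\n

This format is like a stream of one-line JSON documents joined together by newline (\n) characters.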

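The action/metadata line specifies which action to perform on which document.

The action must be one of the following:

create: create a document only if it does not already exist

index: create a new document or replace an existing one

update: apply a partial update to a document

delete: delete a document

The metadata must specify the _index, _type, and _id of the document to be indexed, created, updated, or deleted.

For example, a delete request looks like this: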

{ "delete": { "_index": "website", "_type": "blog", "_id": "123" }}

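The request body line consists of the document _source itself, that is, the fields and values that the document contains.

A delete action needs no request body. If no _id is specified, an ID is generated automatically: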

{ "create":  { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "My first blog post" }

{ "index": { "_index": "website", "_type": "blog" }}
{ "title":    "My second blog post" }

Putting it all together:

POST /_bulk
{ "delete": { "_index": "website", "_type": "blog", "_id": "123" }} <1>
{ "create": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "My first blog post" }
{ "index":  { "_index": "website", "_type": "blog" }}
{ "title":    "My second blog post" }
{ "update": { "_index": "website", "_type": "blog", "_id": "123", "_retry_on_conflict" : 3} }
{ "doc" : {"title" : "My updated blog post"} } <2>

<1> Note that the delete action has no request body; it is followed directly by another action.

<2> Remember the final newline character.
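Because the format is line oriented, it is easy to get wrong by hand. The following sketch builds the body programmatically; build_bulk_body is a hypothetical helper that takes (action, metadata, body) tuples, writes each JSON object on one line, skips the body line when it is None, and always ends with a newline.

```python
import json


def build_bulk_body(actions):
    """Serialize (action, metadata, body) tuples into a bulk request body."""
    lines = []
    for action, metadata, body in actions:
        lines.append(json.dumps({action: metadata}))
        if body is not None:  # delete has no request body line
            lines.append(json.dumps(body))
    # Every line, including the last one, must end with a newline
    return "\n".join(lines) + "\n"


doc_123 = {"_index": "website", "_type": "blog", "_id": "123"}
bulk_body = build_bulk_body([
    ("delete", doc_123, None),
    ("create", doc_123, {"title": "My first blog post"}),
    ("index", {"_index": "website", "_type": "blog"},
     {"title": "My second blog post"}),
    ("update", dict(doc_123, _retry_on_conflict=3),
     {"doc": {"title": "My updated blog post"}}),
])
print(bulk_body, end="")
```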

The response contains an items array that lists the result of each request, in the same order as the requests:

{
   "took": 4,
   "errors": false, <1>
   "items": [
      {  "delete": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "_version": 2,
            "status":   200,
            "found":    true
      }},
      {  "create": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "_version": 3,
            "status":   201
      }},
      {  "create": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "EiwfApScQiiy7TIKFxRCTw",
            "_version": 1,
            "status":   201
      }},
      {  "update": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "_version": 4,
            "status":   200
      }}
   ]
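}

<1> All subrequests completed successfully.

Each subrequest is executed independently. If any of them fails, the top-level errors flag is set to true, and the error details are reported under the corresponding request.

For example: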

POST /_bulk
{ "create": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "Cannot create - it already exists" }
{ "index":  { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "But we can update it" }

{
   "took": 3,
   "errors": true, <1>
   "items": [
      {  "create": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "status":   409, <2>
            "error":    "DocumentAlreadyExistsException <3>
                        [[website][4] [blog][123]:
                        document already exists]"
      }},
      {  "index": {
            "_index":   "website",
            "_type":    "blog",
            "_id":      "123",
            "_version": 5,
            "status":   200 <4>
      }}
   ]
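}

<1> One or more requests failed.

<2> The HTTP status code for this request is 409.

<3> The error message explains why the request failed.

<4> This request succeeded.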
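Since one failing subrequest does not abort the others, a client should check the errors flag and then look at each item. A minimal sketch of that check is shown below; failed_items is a hypothetical helper, and the response is an abbreviated hand-written copy of the one above.

```python
def failed_items(bulk_response):
    """Return (action, _id, status) for every failed subrequest."""
    if not bulk_response.get("errors"):
        return []
    failures = []
    for item in bulk_response["items"]:
        for action, result in item.items():
            if "error" in result:
                failures.append((action, result["_id"], result["status"]))
    return failures


bulk_response = {
    "took": 3,
    "errors": True,
    "items": [
        {"create": {
            "_index": "website", "_type": "blog", "_id": "123",
            "status": 409,
            "error": "DocumentAlreadyExistsException[...]",
        }},
        {"index": {
            "_index": "website", "_type": "blog", "_id": "123",
            "_version": 5, "status": 200,
        }},
    ],
}

print(failed_items(bulk_response))
```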