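[Translation] Using the explain API to escape ElasticSearch cluster RED woes (repost)

"Beep... beep... beep": another PagerDuty alert. Perhaps a node went down, a server rack became unavailable, or the whole ElasticSearch cluster restarted. Whatever the cause, the cluster status is now RED: some shards cannot be assigned to any node, so part of the data is unavailable.

This always happens when you least expect it. What do you do?

In early versions of ElasticSearch, finding out why shards were unassigned took the analytical skills of a bomb-disposal expert. You had to work through the cluster state API, cat-shards API, cat-allocation API, cat-indices API, indices-recovery API, indices-shard-stores API and more to assess the cluster and trace the root cause.

Things are much better now: a single API, cluster-allocation-explain, lets you analyze the current shard allocation.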

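The cluster-allocation-explain API was introduced in ElasticSearch 5.0 and reworked in 5.2. It exists mainly to answer two questions:

For unassigned shards: explain why the shard cannot be assigned to a node.
For assigned shards: explain why the shard was assigned to its particular node.

Note that shard allocation problems should not occur often in a cluster. They usually stem from node or cluster configuration (for example, incorrect allocation filtering settings), from all the nodes that hold a copy of a shard being disconnected from the cluster, from disk problems, and the like. When such a problem arises, the cluster administrator needs the right tool to locate it and bring the cluster back to health, and that is exactly what the cluster allocation explain API provides.

This article walks through several concrete examples of using the explain API to track down shard allocation problems.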

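What is shard allocation

Shard allocation is the process of assigning a shard to a node in the cluster. To handle large volumes of documents and provide high availability, ElasticSearch splits the documents of an index into shards and distributes those shards across the nodes of the cluster.

When a primary shard fails to allocate, the result is data loss for the index and the inability to write new data to it.
When a replica shard fails to allocate, the cluster risks losing data if the corresponding primary is lost for good (for example, through disk failure).
When shards are allocated to slower nodes, high-traffic indices suffer because of those slower shards, and cluster performance degrades.

Allocating shards, and assigning them to the best possible node, is therefore a fundamental internal function of ElasticSearch.

The allocation process differs for newly created indices and existing indices. In both cases, ElasticSearch relies on two basic components: allocators and deciders. Allocators try to find the best node to hold a shard, and deciders judge whether allocating to that node is permitted.

For a newly created index, the allocator finds the nodes holding the fewest shards and sorts them by shard count in ascending order, so nodes with fewer shards are preferred. The allocator's goal for a new index is thus to spread its shards across the cluster as evenly as possible. The deciders then go through the nodes proposed by the allocator one by one and decide whether the shard may be allocated there. For example, if the allocation filtering rules forbid node A from holding any shard of index idx, the filter decider prevents shards of idx from being allocated to node A, even if A is the allocator's first choice from a balancing point of view. Note that allocators only consider the number of shards per node, not the size of each shard. Size is the deciders' concern: one of them prevents allocating a shard to a node where it would push disk usage past the configured threshold.
For an existing index, primaries and replicas are treated differently. For a primary shard, the allocator only allows allocation to a node that already holds a complete, up-to-date copy of the shard. Allocating the primary to a node without the latest data would mean data loss. For a replica shard, the allocator first checks whether any node already holds a copy of the shard's data (even a stale one). If so, it prefers one of those nodes. A replica must sync from the primary once it is allocated, so a node that already holds part of the shard's data only needs to copy the missing part from the primary shard. This can speed up replica recovery considerably.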
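
The allocator/decider split for a new index can be pictured with a small toy model. This is only a sketch of the idea described above, not ElasticSearch's actual implementation: the function names, the node names, and the single filter-style decider are all made up for illustration.

```python
# Toy model only: NOT Elasticsearch's real allocation code.
from typing import Callable, Dict, List, Optional, Set

# A decider takes (node_name, index_name) and returns True to allow allocation.
Decider = Callable[[str, str], bool]


def rank_nodes(shard_counts: Dict[str, int]) -> List[str]:
    """Allocator step: order nodes by shard count, fewest shards first."""
    return sorted(shard_counts, key=lambda node: (shard_counts[node], node))


def make_exclude_filter(excluded: Dict[str, Set[str]]) -> Decider:
    """Decider step: veto nodes excluded for an index (like exclude._name)."""
    def decide(node: str, index: str) -> bool:
        return node not in excluded.get(index, set())
    return decide


def allocate(index: str, shard_counts: Dict[str, int],
             deciders: List[Decider]) -> Optional[str]:
    """Return the first ranked node that every decider accepts, else None."""
    for node in rank_nodes(shard_counts):
        if all(decider(node, index) for decider in deciders):
            return node
    return None


# Hypothetical cluster: node name -> number of shards it already holds.
counts = {"A": 2, "B": 5, "C": 3}
filter_decider = make_exclude_filter({"idx": {"A"}})

print(allocate("idx", counts, [filter_decider]))    # A ranks first but is vetoed
print(allocate("other", counts, [filter_decider]))  # no veto applies to this index
```

The allocator only proposes an order. A node that ranks first can still be rejected by a decider, which is the situation the explain API reports in the examples that follow.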

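Diagnosing unassigned primary shards

An unassigned primary shard is about the worst thing that can happen in ElasticSearch. If it occurs in a newly created index, no data can be indexed into that shard; if it occurs in an existing index, not only can no data be indexed, but previously indexed data cannot be searched either.

We start by creating an index named test_idx in a cluster with two nodes, A and B, with 1 shard and no replicas. At creation time we also set an allocation filtering rule that keeps the index off both node A and node B. The index creation request is: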

PUT /test_idx?wait_for_active_shards=0
{
    "settings":
    {
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "index.routing.allocation.exclude._name": "A,B"
    }
}

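The index is created successfully, but because of the filtering rule none of its shards can be allocated to A or B, the only two nodes in the cluster. This example is contrived and may look unlikely in practice, but misconfigured allocation filtering settings really do leave shards unassigned.

The cluster is now RED. We can use the explain API to get information about the first unassigned shard it finds (in this example, the cluster has only one unassigned shard).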

GET /_cluster/allocation/explain

The output is:

{
  "index" : "test_idx",
  "shard" : 0,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "INDEX_CREATED", 
    "at" : "2017-01-16T18:12:39.401Z",
    "last_allocation_status" : "no"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",   
  "node_allocation_decisions" : [ 
    {
      "node_id" : "tn3qdPdnQWuumLxVVjJJYQ",
      "node_name" : "A", 
      "transport_address" : "127.0.0.1:9300",
      "node_decision" : "no",
      "weight_ranking" : 1,
      "deciders" : [
        {
          "decider" : "filter",
          "decision" : "NO",
          "explanation" : "node matches index setting [index.routing.allocation.exclude.] filters [_name:\"A OR B\"]"
        }
      ]
    },
    {
      "node_id" : "qNgMCvaCSPi3th0mTcyvKQ",
      "node_name" : "B", 
      "transport_address" : "127.0.0.1:9301",
      "node_decision" : "no",
      "weight_ranking" : 2,
      "deciders" : [
        {
          "decider" : "filter",
          "decision" : "NO",
          "explanation" : "node matches index setting [index.routing.allocation.exclude.] filters [_name:\"A OR B\"]"
        }
      ]
    }
  ]
}

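The explain API describes primary shard 0 of index test_idx: the shard is unassigned (current_state) because the index was just created (unassigned_info). It cannot be allocated (can_allocate) because no node is permitted to hold it (allocate_explanation). The per-node decisions (node_allocation_decisions) show that, because the index was created with a filter excluding nodes A and B, the filter decider (decider) returned NO for node A (node_decision; the decider's explanation says why), and the same holds for node B. The explanation also names the setting that has to change to resolve the situation.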
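
When a cluster has more than two nodes, reading the response by eye gets tedious. As a rough sketch, the helper below walks node_allocation_decisions and prints one line per NO decision. The function name is hypothetical, and the sample dict is a trimmed copy of the response above rather than live output.

```python
from typing import Any, Dict, List


def blocking_reasons(explain: Dict[str, Any]) -> List[str]:
    """Return one 'node: decider - explanation' line per NO decision."""
    lines = []
    for node in explain.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            if decider.get("decision") == "NO":
                lines.append("{}: {} - {}".format(
                    node["node_name"], decider["decider"], decider["explanation"]))
    return lines


# Trimmed copy of the explain response for the unassigned primary of test_idx.
EXCLUDE_MSG = ('node matches index setting [index.routing.allocation.exclude.] '
               'filters [_name:"A OR B"]')
response = {
    "index": "test_idx",
    "shard": 0,
    "primary": True,
    "current_state": "unassigned",
    "can_allocate": "no",
    "node_allocation_decisions": [
        {"node_name": "A", "node_decision": "no", "deciders": [
            {"decider": "filter", "decision": "NO", "explanation": EXCLUDE_MSG}]},
        {"node_name": "B", "node_decision": "no", "deciders": [
            {"decider": "filter", "decision": "NO", "explanation": EXCLUDE_MSG}]},
    ],
}

for line in blocking_reasons(response):
    print(line)
```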

Update the allocation filtering configuration with the _settings API:

PUT /test_idx/_settings
{
    "index.routing.allocation.exclude._name": null
}

Running the explain API again now returns:

unable to find any unassigned shards to explain

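In other words, no unassigned shards remain, because the only shard of test_idx has now been allocated. To run the explain API on the primary shard specifically, send the following (note that this is a GET request):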

GET /_cluster/allocation/explain
{
    "index": "test_idx",
    "shard": 0,
    "primary": true
}

This returns information about the node the shard is assigned to (output abbreviated):

{
    "index": "test_idx",
    "shard": 0,
    "primary": true,
    "current_state": "started",
    "current_node": {
        "id" : "tn3qdPdnQWuumLxVVjJJYQ",
        "name" : "A",
        "transport_address" : "127.0.0.1:9300",
        "weight_ranking" : 1
    }
}

The shard is in the started state and is assigned to node A.
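
Both forms of the request (no body for the first unassigned shard, or a body naming a specific shard) can be prepared from a script. The following is a minimal sketch using only the Python standard library. It builds the request object without sending it, and the helper name and the http://localhost:9200 address are assumptions for illustration.

```python
import json
import urllib.request


def build_explain_request(base_url, index=None, shard=None, primary=None):
    """Build (but do not send) a GET request for /_cluster/allocation/explain."""
    url = base_url.rstrip("/") + "/_cluster/allocation/explain"
    if index is None:
        # No body: ask about the first unassigned shard the cluster finds.
        return urllib.request.Request(url, method="GET")
    body = json.dumps({"index": index, "shard": shard, "primary": primary})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="GET",  # the request carries a body but is still a GET
    )


# Assumed address of a local test cluster.
req = build_explain_request("http://localhost:9200", "test_idx", 0, True)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# Sending it would be urllib.request.urlopen(req), which needs a reachable cluster.
```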

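Now let's index some data into test_idx so that the primary shard holds some documents. If we then stop node A, the primary shard disappears with it. Because the index was created without replicas, the cluster goes RED again. Run the explain API on the primary shard again: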

GET /_cluster/allocation/explain
{
    "index": "test_idx",
    "shard": 0,
    "primary": true
}

It returns:

{
  "index" : "test_idx",
  "shard" : 0,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {             
    "reason" : "NODE_LEFT",    
    "at" : "2017-01-16T17:24:21.157Z",
    "details" : "node_left[qU98BvbtQu2crqXF2ATFdA]",
    "last_allocation_status" : "no_valid_shard_copy"
  },
  "can_allocate" : "no_valid_shard_copy", 
  "allocate_explanation" : "cannot allocate because a previous copy of the primary shard existed but can no longer be found on the nodes in the cluster" 
}

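The output tells us that the primary shard is unassigned (current_state) because the node it was allocated to has left the cluster (unassigned_info). The shard cannot be allocated because the cluster holds no valid copy of it (can_allocate), and allocate_explanation gives more detail.

The explain API is telling us that no valid copy of the primary shard remains: every node that held a valid copy is gone from the cluster. The only thing to do at this point is wait for the node to come back and rejoin the cluster. In the more extreme case where the nodes are permanently gone, the only option is to accept the data loss and allocate an empty primary shard with reroute commands.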
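
For that last-resort case, the cluster reroute API has an allocate_empty_primary command, which must be sent with accept_data_loss set to true. The sketch below only builds the request body for POST /_cluster/reroute. The helper name and the choice of node B as the target are illustrative assumptions, and the reroute documentation for your version should be checked before running anything like this against a real cluster.

```python
import json


def empty_primary_reroute_body(index, shard, node):
    """Body for POST /_cluster/reroute that allocates an EMPTY primary shard.

    accept_data_loss is required: whatever the lost copy held is gone for good.
    """
    return {
        "commands": [
            {
                "allocate_empty_primary": {
                    "index": index,
                    "shard": shard,
                    "node": node,
                    "accept_data_loss": True,
                }
            }
        ]
    }


# Illustration only: recreate shard 0 of test_idx as an empty primary on node B.
print(json.dumps(empty_primary_reroute_body("test_idx", 0, "B"), indent=2))
```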

Diagnosing unassigned replica shards

Back to the index test_idx, we raise its number of replicas to 1:

PUT /test_idx/_settings
{
    "number_of_replicas": 1
}

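test_idx now has 2 shard copies: primary shard 0 and replica shard 0. Since node A already holds the primary, the replica should go to node B to keep the cluster balanced. Run the explain API on the replica shard (again a GET request):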

GET /_cluster/allocation/explain
{
    "index": "test_idx",
    "shard": 0,
    "primary": false
}

The output is:

{
  "index" : "test_idx",
  "shard" : 0,
  "primary" : false,
  "current_state" : "started",
  "current_node" : {
    "id" : "qNgMCvaCSPi3th0mTcyvKQ",
    "name" : "B",
    "transport_address" : "127.0.0.1:9301",
    "weight_ranking" : 1
  },
  …
}

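The result shows that the replica shard has been allocated to node B.

Next we set allocation filtering on the index again, this time only preventing shards from being allocated to node B: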

PUT /test_idx/_settings
{
    "index.routing.allocation.exclude._name": "B"
}

Restart node B, then run the explain API on the replica shard again. The result is now:

{
  "index" : "test_idx",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "NODE_LEFT",
    "at" : "2017-01-16T19:10:34.478Z",
    "details" : "node_left[qNgMCvaCSPi3th0mTcyvKQ]",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no", 
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "qNgMCvaCSPi3th0mTcyvKQ",
      "node_name" : "B",
      "transport_address" : "127.0.0.1:9301",
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "filter",
          "decision" : "NO",
          "explanation" : "node matches index setting [index.routing.allocation.exclude.] filters [_name:\"B\"]"
        }
      ]
    },
    {
      "node_id" : "tn3qdPdnQWuumLxVVjJJYQ",
      "node_name" : "A",
      "transport_address" : "127.0.0.1:9300",
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "same_shard",  
          "decision" : "NO",
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[test_idx][0], node[tn3qdPdnQWuumLxVVjJJYQ], [P], s[STARTED], a[id=JNODiTgYTrSp8N2s0Q7MrQ]]" 
        }
      ]
    }
  ]
}

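The result shows that the replica shard cannot be allocated (can_allocate): the allocation filtering rule forbids allocating it to node B (explanation), and because node A already holds the primary, no other copy of the same shard may be placed on A (explanation). Keeping two identical copies of the data on the same node serves no purpose, so ElasticSearch refuses to do it.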
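
Here two different deciders block two different nodes, and each calls for a different remedy (change the filter, or add a node). One way to see that at a glance is to group the rejected nodes by decider. This is a small sketch with a hypothetical helper name, run on a trimmed copy of the response above.

```python
from collections import defaultdict


def nodes_by_decider(explain):
    """Map each vetoing decider name to the sorted node names it rejected."""
    grouped = defaultdict(list)
    for node in explain.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            if decider.get("decision") == "NO":
                grouped[decider["decider"]].append(node["node_name"])
    return {name: sorted(nodes) for name, nodes in grouped.items()}


# Trimmed copy of the explain response for the unassigned replica of test_idx.
replica_response = {
    "index": "test_idx",
    "shard": 0,
    "primary": False,
    "current_state": "unassigned",
    "can_allocate": "no",
    "node_allocation_decisions": [
        {"node_name": "B", "node_decision": "no",
         "deciders": [{"decider": "filter", "decision": "NO"}]},
        {"node_name": "A", "node_decision": "no",
         "deciders": [{"decider": "same_shard", "decision": "NO"}]},
    ],
}

print(nodes_by_decider(replica_response))
```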

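Explaining assigned shards

If a shard is allocated normally, why look at its explain output? A common reason is that a shard (primary or replica) is allocated to one node, you have set allocation filtering to move it to another node (perhaps you are trying out a hot-warm architecture), and yet for some reason the shard stays where it is. This is another case where the explain API sheds light on the allocation process.

First we clear the allocation filtering settings of test_idx so that both the primary and the replica can be allocated normally: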

PUT /test_idx/_settings
{
    "index.routing.allocation.exclude._name": null
}

Now we set a new filtering rule to move the primary shard off its current node:

PUT /test_idx/_settings
{
    "index.routing.allocation.exclude._name": "A"
}

We expect this rule to move the primary shard from node A to another node, but that does not happen. Let's use the explain API to find out why:

GET /_cluster/allocation/explain
{
    "index": "test_idx",
    "shard": 0,
    "primary": true
}

The output is:

{
  "index" : "test_idx",
  "shard" : 0,
  "primary" : true,
  "current_state" : "started",
  "current_node" : {
    "id" : "tn3qdPdnQWuumLxVVjJJYQ",
    "name" : "A",  
    "transport_address" : "127.0.0.1:9300"
  },
  "can_remain_on_current_node" : "no", 
  "can_remain_decisions" : [   
    {
      "decider" : "filter",
      "decision" : "NO",
      "explanation" : "node matches index setting [index.routing.allocation.exclude.] filters [_name:\"A\"]"
    }
  ],
  "can_move_to_other_node" : "no", 
  "move_explanation" : "cannot move shard to another node, even though it is not allowed to remain on its current node",
  "node_allocation_decisions" : [
    {
      "node_id" : "qNgMCvaCSPi3th0mTcyvKQ",
      "node_name" : "B",
      "transport_address" : "127.0.0.1:9301",
      "node_decision" : "no",
      "weight_ranking" : 1,
      "deciders" : [
        {
          "decider" : "same_shard", 
          "decision" : "NO",
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[test_idx][0], node[qNgMCvaCSPi3th0mTcyvKQ], [R], s[STARTED], a[id=dNgHLTKwRH-Dp-rIX4Hkqg]]" 
        }
      ]
    }
  ]
}

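From the output we see that the primary shard is still on node A (current_node). The cluster states that the shard should no longer remain on its current node (can_remain_on_current_node), because that node matches the allocation filtering rule (can_remain_decisions). But the explain API also says the shard cannot move to another node (can_move_to_other_node): the cluster has only one other node, B, which already holds the replica, and the same shard cannot be allocated twice to one node. So the primary cannot move to B and therefore cannot leave A (node_allocation_decisions).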
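
The two top-level fields can_remain_on_current_node and can_move_to_other_node are enough for a first diagnosis of an assigned shard. A minimal sketch follows. The helper name and the wording of its messages are made up, and it only distinguishes "no" from everything else, so other values the API may return fall into the generic branches.

```python
def diagnose_assigned_shard(explain):
    """Turn the can_remain / can_move fields into a one-line diagnosis."""
    node = explain["current_node"]["name"]
    can_remain = explain.get("can_remain_on_current_node")
    can_move = explain.get("can_move_to_other_node")
    if can_remain == "no" and can_move == "no":
        return "stuck on {}: must leave, but no other node accepts it".format(node)
    if can_remain == "no":
        return "should leave {}: a move is possible or pending".format(node)
    return "may stay on {}".format(node)


# Trimmed copy of the explain response for the primary of test_idx.
assigned_response = {
    "index": "test_idx",
    "shard": 0,
    "primary": True,
    "current_state": "started",
    "current_node": {"name": "A"},
    "can_remain_on_current_node": "no",
    "can_move_to_other_node": "no",
}

print(diagnose_assigned_shard(assigned_response))
```

For the "stuck" case, the next step is the same as in the earlier examples: read the deciders under node_allocation_decisions to see what is blocking each candidate node.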

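Summary

In this article we walked through three scenarios to help ElasticSearch administrators use the explain API to understand shard allocation in their clusters. The explain API has other uses too, such as showing node weights to explain why a shard stays on its current node instead of being rebalanced to another. It is a valuable tool for diagnosing shard allocation in production clusters: it has already helped us a great deal and saved much time during ElasticSearch development, and many of our customers have benefited from it when diagnosing their clusters.

Original article: https://www.elastic.co/blog/red-elasticsearch-cluster-panic-no-longer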