Optimizing ELK Throughput

Reposted from: https://blog.csdn.net/ypc123ypc/article/details/69945031

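Problem 1

Recently we noticed that logs were reaching Kibana very slowly, and often could not be found at all. Because every log shipper sends to a single Logstash instance for collection and filtering, we suspected that Logstash throughput was the bottleneck. A quick check confirmed it: Logstash had hit its limit.

Optimization process

After going through the full Logstash configuration file, we found several parameters that needed tuning: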

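# Number of pipeline worker threads; the official recommendation is to match the number of CPU cores
pipeline.workers: 24
# Number of worker threads actually used for output
pipeline.output.workers: 24
# Number of events sent per batch
pipeline.batch.size: 3000
# Delay before a batch is sent (milliseconds)
pipeline.batch.delay: 5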

PS: Our ES cluster holds a large amount of data (>28 TB), so choose the actual values based on your own production environment.
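Before raising these values, it helps to estimate how many events the pipeline can hold in memory at once, since that drives how much heap Logstash needs. A minimal sketch, assuming the number of in-flight events is roughly the product of pipeline.workers and pipeline.batch.size (the function name is hypothetical, not part of Logstash):

```python
def max_inflight_events(workers: int, batch_size: int) -> int:
    """Rough upper bound on events held in memory by the pipeline at one time.

    Assumption: every worker holds one full batch at the same moment.
    """
    return workers * batch_size


# Values from the configuration above.
print(max_inflight_events(24, 3000))  # 72000
```

If the result multiplied by the average event size comes close to the Logstash heap size, a smaller batch size or a larger heap is probably the safer choice.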

Result

ES throughput rose from 9817/s to 41183/s.
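To put those two numbers in proportion, the gain can be worked out directly. A small sketch (the helper name is made up for illustration):

```python
def speedup(before: float, after: float) -> float:
    """Ratio of the new throughput to the old one."""
    return after / before


ratio = speedup(9817, 41183)
print(f"{ratio:.1f}x")  # 4.2x
```

So the tuning bought a bit more than a fourfold increase in throughput.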

Problem 2

While reading the Logstash logs, we saw a large number of errors like the following:

[2017-03-18T09:46:21,043][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 429 ({"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$6@6918cf2e on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@55337655[Running, pool size = 24, active threads = 24, queued tasks = 50, completed tasks = 1767887463]]"})
[2017-03-18T09:46:21,043][ERROR][logstash.outputs.elasticsearch] Retrying individual actions
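The rejection message itself already carries the state of the bulk thread pool. As a rough sketch, the counters can be pulled out with a regular expression; the function name and the choice of fields are assumptions made for illustration, and the sample string is an excerpt of the message above:

```python
import re


def parse_rejection(message: str) -> dict:
    """Extract the thread pool counters from an es_rejected_execution_exception message."""
    fields = {}
    for key in ("queue capacity", "pool size", "active threads", "queued tasks"):
        match = re.search(rf"{key} = (\d+)", message)
        if match:
            fields[key] = int(match.group(1))
    return fields


excerpt = (
    "EsThreadPoolExecutor[bulk, queue capacity = 50, "
    "org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@55337655"
    "[Running, pool size = 24, active threads = 24, queued tasks = 50, "
    "completed tasks = 1767887463]]"
)
info = parse_rejection(excerpt)
print(info)

# All threads busy and the queue at capacity: new bulk requests get a 429.
print(info["active threads"] == info["pool size"])
print(info["queued tasks"] == info["queue capacity"])
```

Here all 24 threads are active and the queue holds 50 tasks out of a capacity of 50, which is exactly the condition under which ES rejects further bulk requests.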

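Our first thought was to adjust the ES thread pool settings, but the official documentation says "Don't Touch These Settings!". So what can be done? The official advice is instead to change the Logstash parameter pipeline.batch.size.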

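Since ES 5.0, the thread pools for bulk, flush, get, index, search and so on are fully separated, so write load does not affect the performance of the other operations.
When ES shows performance problems, a good first step is to check the current thread pool usage across the cluster: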
GET _nodes/stats/thread_pool?pretty

{
  "_nodes": {
    "total": 6,
    "successful": 6,
    "failed": 0
  },
  "cluster_name": "dev-elasticstack5.0",
  "nodes": {
    "nnfCv8FrSh-p223gsbJVMA": {
      "timestamp": 1489804973926,
      "name": "node-3",
      "transport_address": "192.168.3.***:9301",
      "host": "192.168.3.***",
      "ip": "192.168.3.***:9301",
      "roles": [
        "master",
        "data",
        "ingest"
      ],
      "attributes": {
        "rack": "r1"
      },
      "thread_pool": {
        "bulk": {
          "threads": 24,
          "queue": 214,
          "active": 24,
          "rejected": 30804543,
          "largest": 24,
          "completed": 1047606679
        },

        ......

        "watcher": {
          "threads": 0,
          "queue": 0,
          "active": 0,
          "rejected": 0,
          "largest": 0,
          "completed": 0
        }
      }
    }
  }
}

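In this output, the "bulk" thread pool has 24 threads and 24 of them are active, which means every thread is busy. The queue holds 214 tasks and the rejected count is 30804543. That is the problem: all threads are busy, and once the queue is full any further write request is rejected. So far 30804543 requests have been rejected.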
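The same check can be automated across all nodes. A minimal sketch, assuming the response of GET _nodes/stats/thread_pool has already been loaded into a dict (for example with json.loads); the function name and the "all threads busy and rejected > 0" rule are illustrative choices, not an official diagnostic:

```python
def saturated_pools(stats: dict) -> list:
    """Return (node name, pool name, rejected count) for every pool whose
    threads are all active and which has rejected at least one task."""
    findings = []
    for node in stats.get("nodes", {}).values():
        for pool_name, pool in node.get("thread_pool", {}).items():
            threads = pool.get("threads", 0)
            all_busy = threads > 0 and pool.get("active", 0) == threads
            if all_busy and pool.get("rejected", 0) > 0:
                findings.append((node.get("name"), pool_name, pool["rejected"]))
    return findings


# Trimmed-down copy of the response shown above.
sample = {
    "nodes": {
        "nnfCv8FrSh-p223gsbJVMA": {
            "name": "node-3",
            "thread_pool": {
                "bulk": {"threads": 24, "queue": 214, "active": 24,
                         "rejected": 30804543, "largest": 24,
                         "completed": 1047606679},
                "watcher": {"threads": 0, "queue": 0, "active": 0,
                            "rejected": 0, "largest": 0, "completed": 0},
            },
        }
    }
}
print(saturated_pools(sample))  # [('node-3', 'bulk', 30804543)]
```

Since rejected is a cumulative counter, comparing two samples taken a few minutes apart shows whether rejections are still happening.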

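Solution

With the cause identified, how do we fix it? The official recommendation is to increase the number of events per batch and adjust the delay between sends. With a larger batch.size, ES has to handle fewer bulk requests for the same number of events, so writes go through more smoothly.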
```
vim /etc/logstash/logstash.yml
#
pipeline.workers: 24
pipeline.output.workers: 24
pipeline.batch.size: 10000
pipeline.batch.delay: 10
```
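We suggest setting pipeline.workers and pipeline.output.workers to the number of CPU cores, and raising batch.size and batch.delay step by step against your real data volume to find the best values.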
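The effect of batch.size on the bulk queue can be estimated with simple arithmetic. A rough sketch, under the simplifying assumptions that every batch is full and turns into exactly one bulk request (in practice batches can be flushed early by batch.delay, and the output may split large batches); the function name is hypothetical and the event rate is the figure measured earlier:

```python
def bulk_requests_per_second(events_per_second: float, batch_size: int) -> float:
    """Approximate number of bulk requests ES receives per second."""
    return events_per_second / batch_size


for size in (3000, 10000):
    rate = bulk_requests_per_second(41183, size)
    print(f"batch.size={size}: about {rate:.1f} bulk requests/s")
```

Going from 3000 to 10000 events per batch cuts the estimated request rate from about 13.7 to about 4.1 per second, which leaves the bulk queue far fewer tasks to hold.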

With all of this done, the world was quiet again.