A Prometheus pitfall

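Prometheus is a tool for monitoring the state of a k8s cluster. While setting it up on my host today I ran into a pitfall that took some investigation to resolve, so I am writing it down.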

First, according to a tutorial online, installing it with helm is convenient and takes only three commands (ref: https://itnext.io/kubernetes-monitoring-with-prometheus-in-15-minutes-8e54d1de2e13):

$ helm repo add coreos https://s3-eu-west-1.amazonaws.com/coreos-charts/stable/

$ helm install coreos/prometheus-operator --name prometheus-operator --namespace monitoring

$ helm install coreos/kube-prometheus --name kube-prometheus --set global.rbacEnable=true --namespace monitoring

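However, the monitoring system did not start correctly. Some investigation showed that two pods had crashed. Going into their containers, I further found that the failed containers had all logged the same message.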
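A rough sketch of how the failed pods and their logs could be inspected from outside the containers. This is not necessarily how it was done here: the helper names `list_pods` and `pod_logs` are made up for illustration, and the namespace is assumed to be the `monitoring` namespace from the helm commands above.

```shell
# Namespace used by the helm commands above (assumed; override if yours differs).
NAMESPACE="${NAMESPACE:-monitoring}"

# Show every pod in the namespace together with its status and restart count.
list_pods() {
    kubectl get pods --namespace "$NAMESPACE"
}

# Print the logs of all containers in one pod.
# $1 is the pod name as shown by list_pods.
pod_logs() {
    kubectl logs "$1" --namespace "$NAMESPACE" --all-containers
}
```

Run `list_pods` first to find the pods that are not healthy, then pass each of their names to `pod_logs` and compare the output.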

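After some more digging, I found the following passage in the prometheus-operator repository, in the README of the vendored fsnotify library: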

github.com/coreos/prometheus-operator/vendor/github.com/fsnotify/fsnotify/README.md

How many files can be watched at once?

There are OS-specific limits as to how many watches can be created:

Linux: /proc/sys/fs/inotify/max_user_watches contains the limit, reaching this limit results in a "no space left on device" error.
BSD / OSX: sysctl variables "kern.maxfiles" and "kern.maxfilesperproc", reaching these limits results in a "too many open files" error.

So the cause was that the system limit on the number of watched files had been reached. I raised the value in /proc/sys/fs/inotify/max_user_watches, deployed again, and it worked.
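A minimal sketch of checking and raising that limit on Linux. The helper names and the target value 524288 are assumptions for illustration; the value actually used here is not recorded, so pick one that suits your machine.

```shell
# Location of the limit on Linux, as given in the fsnotify README quoted above.
WATCH_FILE="${WATCH_FILE:-/proc/sys/fs/inotify/max_user_watches}"

# Print the current inotify watch limit.
current_watch_limit() {
    cat "$WATCH_FILE"
}

# Print the sysctl command that would raise the limit to $1,
# or print nothing when the current limit is already at least $1.
suggest_watch_limit() {
    wanted="$1"
    current="$(current_watch_limit)"
    if [ "$current" -lt "$wanted" ]; then
        echo "sudo sysctl -w fs.inotify.max_user_watches=$wanted"
    fi
}
```

For example, `suggest_watch_limit 524288` prints the command to run when the current limit is lower. A value set with `sysctl -w` is lost on reboot; to keep it, add a line `fs.inotify.max_user_watches=524288` to /etc/sysctl.conf and reload with `sudo sysctl -p`.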