prometheus从零开始

本次的想法是做服务监控并告警主要线路如下图所示

1、运行prometheus docker方式

docker run -itd 
-p 9090:9090 
-v /opt/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml 
prom/prometheus

2、prometheus.yml 初始配置文件如下：

global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.   全局默认值 15秒抓取一次数据

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:9090']

3、默认 prometheus 会有自己的指标接口http://192.168.246.2:9090/metrics 内容部分截取如下

# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 2.6636e-05
go_gc_duration_seconds{quantile="0.25"} 0.000123346
go_gc_duration_seconds{quantile="0.5"} 0.000159706
go_gc_duration_seconds{quantile="0.75"} 0.000190857
go_gc_duration_seconds{quantile="1"} 0.001369042

4、可以登录9090端口去看看prometheus主界面可以执行PromQL (Prometheus Query Language) 来excute得到结果

比如这个 prometheus_target_interval_length_seconds{quantile="0.99"}

具体PromQL语法示例请参考官网https://prometheus.io/docs/prometheus/latest/querying/basics/

5、上面的数据是prometheus自己的，下面我们自己生产数据给它有很多公共的exporter 可以用比如 node_exporter 他可以暴露机器一些基本的通用指标。

也可以执行python自定义编程取指标让自己成为一个exporter

安装node_exporter 官网例子但是不要使用127.0.0.1 因为我的prometheus是docker起的和宿主机的127.0.0.1是不通的它抓取不到数据的，请改成实际的主机地址

ps:其他exporter 可参考地址 https://prometheus.io/docs/instrumenting/exporters/

tar -xzvf node_exporter-*.*.tar.gz
cd node_exporter-*.*

# Start 3 example targets in separate terminals:
./node_exporter --web.listen-address 127.0.0.1:8080
./node_exporter --web.listen-address 127.0.0.1:8081
./node_exporter --web.listen-address 127.0.0.1:8082

6、需要修改prometheus.yml增加job 抓取exproter 修改后的如下增加了一个job 里面有三个exporter 标签是随便配的

global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['192.168.246.2:8080', '192.168.246.2:8081']
        labels:
          group: 'production'

      - targets: ['192.168.246.2:8082']
        labels:
          group: 'canary'

6、查看页面是否ok了

7、可以看看node_exporter 暴露的指标例子比如有如下的

node_cpu_seconds_total{cpu="0",mode="idle"} 2963.27
node_cpu_seconds_total{cpu="0",mode="iowait"} 0.38
node_cpu_seconds_total{cpu="0",mode="irq"} 0
node_cpu_seconds_total{cpu="0",mode="nice"} 0
node_cpu_seconds_total{cpu="0",mode="softirq"} 0.35
node_cpu_seconds_total{cpu="0",mode="steal"} 0
node_cpu_seconds_total{cpu="0",mode="system"} 19.19
node_cpu_seconds_total{cpu="0",mode="user"} 16.96
node_cpu_seconds_total{cpu="1",mode="idle"} 2965.47
node_cpu_seconds_total{cpu="1",mode="iowait"} 0.37
node_cpu_seconds_total{cpu="1",mode="irq"} 0
node_cpu_seconds_total{cpu="1",mode="nice"} 0.03
node_cpu_seconds_total{cpu="1",mode="softirq"} 0.28
node_cpu_seconds_total{cpu="1",mode="steal"} 0
node_cpu_seconds_total{cpu="1",mode="system"} 18.42
node_cpu_seconds_total{cpu="1",mode="user"} 17.95

8、如果我们想看近5分钟内每个实例的所有cpus的平均每秒CPU时间速率可以这样写

avg by (job, instance, mode) (rate(node_cpu_seconds_total[5m]))

图例结果

9、下面设置一个rules规则，写一个文件 prometheus.rules.yml

groups:
- name: cpu-node
  rules:
  - record: job_instance_mode:node_cpu_seconds:avg_rate5m
    expr: avg by (job, instance, mode) (rate(node_cpu_seconds_total[5m]))

10、现在发现配置文件太多了我们重新用另一种方式启动docker 优化一下把本地配置文件都放在 /opt/prometheus/，原来的docker可删除了。

--web.enable-lifecycle 参数支持热更新 接口是curl -X POST http://192.168.246.2:9090/-/reload

docker run -itd -p 9090:9090 -v /opt/prometheus/:/etc/prometheus/ prom/prometheus --config.file=/etc/prometheus/prometheus.yml --web.enable-lifecycle

11、查看rules

12、上面只是规则，并没有告警，我们假设 avg by (job, instance, mode) (rate(node_cpu_seconds_total[5m])) > 0.5 就触发 cpu警告这是假设的测试

我们需要rules文件如下：

groups:
- name: example
  rules:
  - alert: HighCpuLatency
    expr: avg by (job, instance, mode) (rate(node_cpu_seconds_total[5m])) > 0.5
    for: 10s
    labels:
      severity: page
    annotations:
      summary: High request latency

写到prometheus.rules.yml文件中，注意groups不能复制进去 key不能重复可以使用命令检查rules文件是否正确

[root@test prometheus]# ./prometheus-2.26.0.linux-amd64/promtool check rules prometheus.rules.yml
Checking prometheus.rules.yml
  FAILED:
prometheus.rules.yml: yaml: unmarshal errors:
  line 6: mapping key "groups" already defined at line 1
prometheus.rules.yml: yaml: unmarshal errors:
  line 6: mapping key "groups" already defined at line 1


[root@test prometheus]# ./prometheus-2.26.0.linux-amd64/promtool check rules prometheus.rules.yml
Checking prometheus.rules.yml
  SUCCESS: 2 rules found

[root@test prometheus]#

13、热更新一下

curl -X POST http://192.168.246.2:9090/-/reload

这次不用重启docker了

查看页面 rules会增加一个且alert会先有pending状态，等符合条件后就触发告警

14、下面启动alertmanager

启动之前要做两个配置，首先把 alertmanager 的IP和端口配置到prometheus.yml中

最会面增加了 alerting的配置这样 prometheus 就连上 alertmanager

global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

rule_files:
  - 'prometheus.rules.yml'
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['192.168.246.2:8080', '192.168.246.2:8081']
        labels:
          group: 'production'

      - targets: ['192.168.246.2:8082']
        labels:
          group: 'canary'
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["192.168.246.2:9093"]

第二个配置我们先测试邮件告警，写的 alertmanager 配置如下

注意修改的部分 qq如何申请授权码请百度一下

  smtp_smarthost: 'smtp.qq.com:465'
  smtp_from: '6171391@qq.com'
  smtp_auth_username: '6171391@qq.com'
  smtp_auth_password: 'qq授权码'
  smtp_require_tls: false
默认不做任何过滤选择的接收人

  - to: 'dfwl@163.com'

global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_from: '6171391@qq.com'
  smtp_auth_username: '6171391@qq.com'
  smtp_auth_password: 'qq授权码'
  smtp_require_tls: false
# The directory from which notification templates are read.
templates:
- '/etc/alertmanager/template/*.tmpl'

# The root route on which each incoming alert enters.
route:

  group_by: ['alertname', 'cluster', 'service']

  group_wait: 30s

  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 1m

  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  repeat_interval: 7h

  # A default receiver
  receiver: team-X-mails



  # The child route trees.
  routes:
  # This routes performs a regular expression match on alert labels to
  # catch alerts that are related to a list of services.
  - match_re:
      service: ^(foo1|foo2|baz)$
    receiver: team-X-mails
    # The service has a sub-route for critical alerts, any alerts
    # that do not match, i.e. severity != critical, fall-back to the
    # parent node and are sent to 'team-X-mails'
    routes:
    - match:
        severity: critical
      receiver: team-X-pager
  - match:
      service: files
    receiver: team-Y-mails

    routes:
    - match:
        severity: critical
      receiver: team-Y-pager

  # This route handles all alerts coming from a database service. If there's
  # no team to handle it, it defaults to the DB team.
  - match:
      service: database
    receiver: team-DB-pager
    # Also group alerts by affected database.
    group_by: [alertname, cluster, database]
    routes:
    - match:
        owner: team-X
      receiver: team-X-pager
      continue: true
    - match:
        owner: team-Y
      receiver: team-Y-pager


# Inhibition rules allow to mute a set of alerts given that another alert is
# firing.
# We use this to mute any warning-level notifications if the same alert is
# already critical.
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  # Apply inhibition if the alertname is the same.
  # CAUTION:
  #   If all label names listed in `equal` are missing
  #   from both the source and target alerts,
  #   the inhibition rule will apply!
  equal: ['alertname', 'cluster', 'service']


receivers:
- name: 'team-X-mails'
  email_configs:
  - to: 'dfwl@163.com'

- name: 'team-X-pager'
  email_configs:
  - to: 'team-X+alerts-critical@example.org'
  pagerduty_configs:
  - service_key: <team-X-key>

- name: 'team-Y-mails'
  email_configs:
  - to: 'team-Y+alerts@example.org'

- name: 'team-Y-pager'
  pagerduty_configs:
  - service_key: <team-Y-key>

- name: 'team-DB-pager'
  pagerduty_configs:
  - service_key: <team-DB-key>

15、手动启动测试一下生产环境可以docker或k8s等方式启动

./alertmanager --config.file=alertmanager.yml

16、alertmanager页面能同步到告警

过一会就会发邮件了

group_interval: 1m   意思 从第一次接受告警1m后还在就发