Nagios Cloud Monitoring

(Note: this document covers Nagios installation, Nagios configuration, and using Nagios to monitor Redis, MySQL, and ZooKeeper.)

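Nagios can monitor basic system parameters such as CPU, disk, and network, and it can also check common service types including SMTP, POP3, HTTP, and NNTP. In addition, by installing plugins and writing custom check scripts, users can monitor their own applications and deploy a hierarchical monitoring architecture across a large number of monitored hosts and objects.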

I. Nagios Installation

The Nagios master node needs the following installed:

nagios
nagios-plugin
nrpe
php
apache

Nagios slave nodes need the following installed:

nagios-plugin
nrpe

About NRPE:

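The NRPE addon monitors remote hosts by executing plugins on remote Linux/Unix machines. It is very useful when you want to monitor a remote host's disk usage, CPU load, and memory usage just as you would on the local host.
A word on the term "addon": Nagios has many addon packages available. Addons extend Nagios and integrate it with other software. Among other things, they can manage configuration files through a web interface, monitor remote hosts (*NIX, Windows, and so on), submit passive checks from remote hosts, and simplify or extend the notification logic.
NRPE is an addon that lets you execute plugins on remote Linux/Unix hosts. This is useful if you need to monitor local resources or attributes on a remote host, such as disk usage, CPU load, and memory usage. The end result is the same as with the check_by_ssh plugin, but NRPE places less CPU load on the monitoring host, which matters when you monitor a large number of hosts (see pic35.png).
As the figure shows, NRPE must be deployed on each monitored host, where it runs as a listening daemon. The monitoring host uses check_nrpe to connect to this daemon over SSL. The daemon then runs check_disk, check_load, and other scripts on the monitored side and returns the results to the monitoring host. These scripts can in turn check other hosts as well.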
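
The scripts that NRPE runs follow the standard Nagios plugin convention: one line of status text on standard output, and an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). As a rough sketch of what a custom check could look like, the function below compares a number against a warning and a critical limit. The name check_threshold and the message wording are made up for illustration and are not part of Nagios or NRPE.

```shell
#!/bin/sh
# Hypothetical minimal check: compare an integer against two limits and
# report the result the way Nagios plugins do (status line + exit code).
check_threshold() {
    # Reject missing or non-numeric arguments with the UNKNOWN state.
    for n in "$1" "$2" "$3"; do
        case $n in
            ''|*[!0-9]*)
                echo "UNKNOWN - expected three non-negative integers"
                return 3 ;;
        esac
    done
    if [ "$1" -ge "$3" ]; then
        echo "CRITICAL - value is $1 (limit $3)"
        return 2
    elif [ "$1" -ge "$2" ]; then
        echo "WARNING - value is $1 (limit $2)"
        return 1
    fi
    echo "OK - value is $1"
    return 0
}
```

Called as check_threshold 7 5 10, the function prints a WARNING line and returns 1. A real plugin would obtain the value itself (for example from a log file or an application status page) instead of taking it as an argument.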

Check the build environment on each host (all nodes)

# rpm -q gcc glibc glibc-common gd gd-devel xinetd openssl-devel

gcc-4.4.7-3.el6.x86_64

glibc-2.14.1-6.x86_64

glibc-common-2.14.1-6.x86_64

gd-2.0.35-11.el6.x86_64

package gd-devel is not installed

package xinetd is not installed

openssl-devel-1.0.0-27.el6.x86_64

If any package is missing, install it first. The packages can be downloaded from the following mirror sites:

http://rpm.pbone.net/
http://mirrors.163.com/centos/6.4/os/x86_64/Packages/
http://mirrors.sohu.com/centos/6.4/os/x86_64/Packages/
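
With many nodes it can help to extract just the missing package names from the rpm -q output shown above. The helper below is a small sketch that relies only on the "package X is not installed" wording seen in that output; the function name missing_packages is hypothetical.

```shell
#!/bin/sh
# Read `rpm -q` output on stdin and print the name of every package
# that rpm reported as not installed, one per line.
missing_packages() {
    sed -n 's/^package \(.*\) is not installed$/\1/p'
}

# Usage on a real host (requires rpm):
#   rpm -q gcc glibc glibc-common gd gd-devel xinetd openssl-devel | missing_packages
```

For the sample output above this would print gd-devel and xinetd.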

Create the nagios user

useradd nagios -d /usr/local/nagios

passwd nagios (choose your own password)

Master node installation

I. nagios (download: http://jaist.dl.sourceforge.net/project/nagios/nagios-4.x/nagios-4.0.2/nagios-4.0.2.tar.gz)

1. Install

tar -zxf nagios-4.0.2.tar.gz

cd nagios-4.0.2

./configure --prefix=/usr/local/nagios

make all

make install && make install-init && make install-commandmode && make install-config

2. Register nagios as a service

chkconfig --add nagios

chkconfig nagios off

chkconfig --level 35 nagios on

chkconfig --list nagios

nagios 0:off 1:off 2:off 3:on 4:off 5:on 6:off

II. Nagios plugins (download: https://www.nagios-plugins.org/download/nagios-plugins-1.5.tar.gz)

tar -zxf nagios-plugins-1.5.tar.gz

cd nagios-plugins-1.5

./configure --prefix=/usr/local/nagios --with-nagios-user=nagios --with-nagios-group=nagios

make && make install

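If you get MySQL-related compile errors, MySQL is installed somewhere other than its default path. Point --with-mysql at the actual location and run make again: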

./configure --prefix=/usr/local/nagios --with-mysql=/usr/local/mysql

make && make install

III. NRPE (download: http://jaist.dl.sourceforge.net/project/nagios/nrpe-2.x/nrpe-2.15/nrpe-2.15.tar.gz)

tar -zxf nrpe-2.15.tar.gz

cd nrpe-2.15

./configure --enable-command-args

make all

make install-plugin

On monitored nodes, also run: make install-daemon && make install-daemon-config && make install-xinetd

IV. Apache (download: http://archive.apache.org/dist/httpd/httpd-2.2.23.tar.gz)

tar -zxf httpd-2.2.23.tar.gz

cd httpd-2.2.23

./configure --prefix=/usr/local/apache2

make && make install

V. PHP (download: http://cn2.php.net/distributions/php-5.4.10.tar.gz)

cd /export/home/tools/soft/php

tar -zxf php-5.4.10.tar.gz

cd php-5.4.10

./configure --prefix=/usr/local/php --with-apxs2=/usr/local/apache2/bin/apxs

make && make install

Slave node installation

Slave nodes only need parts II and III above (the Nagios plugins and NRPE).

II. Nagios Configuration

I. Monitored node configuration (master/slave connectivity):

1. Edit /etc/xinetd.d/nrpe to allow connections from the Nagios master node

vi /etc/xinetd.d/nrpe

only_from = 127.0.0.1 <master node IP>

2. Append the following to the end of /etc/services:

nrpe 5666/tcp # NRPE

3. Enable support for command arguments

vi /usr/local/nagios/etc/nrpe.cfg

dont_blame_nrpe=1
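
If this setting has to be applied on many monitored nodes, the edit can be scripted instead of done in vi. The sketch below assumes nrpe.cfg uses plain key=value lines, as the dont_blame_nrpe=1 line above does. The helper name set_cfg_option is hypothetical; it rewrites an existing key or appends the key when it is absent.

```shell
#!/bin/sh
# Set KEY=VALUE in a key=value style config file.
# An existing "KEY=..." line is replaced; otherwise the line is appended.
set_cfg_option() {
    file=$1 key=$2 value=$3
    tmp=$(mktemp) || return 1
    awk -v key="$key" -v value="$value" '
        index($0, key "=") == 1 { print key "=" value; found = 1; next }
        { print }
        END { if (!found) print key "=" value }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    rc=$?
    rm -f "$tmp"
    return $rc
}

# Usage on a monitored node:
#   set_cfg_option /usr/local/nagios/etc/nrpe.cfg dont_blame_nrpe 1
```

Writing through a temporary file and then copying the content back keeps the original file's owner and permissions intact.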

4. Start xinetd

service xinetd restart

5. Verify that nrpe is listening

netstat -at | grep nrpe

6. Test that nrpe is running correctly

/usr/local/nagios/libexec/check_nrpe -H localhost

NRPE v2.15

7. Test from the master node

/usr/local/nagios/libexec/check_nrpe -H <slave node IP>

If the NRPE version is returned, the connection works.

II. Command configuration on monitored nodes:

1. Edit the configuration file

# su - nagios

$ vi /usr/local/nagios/etc/nrpe.cfg

Change the command definitions to:

command[check_users]=/usr/local/nagios/libexec/check_users -w $ARG1$ -c $ARG2$

command[check_load]=/usr/local/nagios/libexec/check_load -w $ARG1$ -c $ARG2$

command[check_disk]=/usr/local/nagios/libexec/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$

command[check_procs]=/usr/local/nagios/libexec/check_procs -w $ARG1$ -c $ARG2$ -s $ARG3$

command[check_procs_args]=/usr/local/nagios/libexec/check_procs $ARG1$

command[check_swap]=/usr/local/nagios/libexec/check_swap -w $ARG1$ -c $ARG2$

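check_users monitors the number of logged-in users
check_load monitors the CPU load
check_disk monitors disk usage
check_procs monitors the number of processes; the state flags include RSZDT
check_swap monitors swap usage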

2. Check that the command configuration works

service xinetd restart

/usr/local/nagios/libexec/check_nrpe -H localhost -c check_users  -a 5 10

/usr/local/nagios/libexec/check_nrpe -H localhost -c check_load   -a 15,10,5 30,25,20

/usr/local/nagios/libexec/check_nrpe -H localhost -c check_disk    -a 20% 10% /

/usr/local/nagios/libexec/check_nrpe -H localhost -c check_procs -a 200 400 RSZDT

/usr/local/nagios/libexec/check_nrpe -H localhost -c check_swap  -a 20% 10%
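
In the commands above, the values after -a are passed to the NRPE daemon, which substitutes them in order for $ARG1$, $ARG2$, and so on in the matching command[...] line of nrpe.cfg. The function below only imitates that substitution so the resulting command line can be previewed. It is a sketch, not NRPE's actual code, and the name expand_nrpe_command is hypothetical.

```shell
#!/bin/sh
# Preview the command line that results from filling $ARG1$, $ARG2$, ...
# in an nrpe.cfg command template with the given arguments.
expand_nrpe_command() {
    line=$1
    shift
    i=1
    for arg in "$@"; do
        # Literal (non-regex) replacement of every "$ARG<i>$" occurrence.
        line=$(printf '%s\n' "$line" | awk -v pat="\$ARG${i}\$" -v rep="$arg" '{
            out = ""
            while ((p = index($0, pat)) > 0) {
                out = out substr($0, 1, p - 1) rep
                $0 = substr($0, p + length(pat))
            }
            print out $0
        }')
        i=$((i + 1))
    done
    printf '%s\n' "$line"
}
```

For example, passing the check_disk template together with the arguments 20%, 10%, and / shows the check_disk call that the monitored node ends up running with -w 20% -c 10% -p /.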

III. Master node configuration (master/slave connectivity):

1. Define permissions

(as the nagios user)

vi /usr/local/nagios/etc/cgi.cfg

Edit the following settings to grant permissions to the admin user:

default_user_name=admin

authorized_for_system_information=nagiosadmin,admin

authorized_for_configuration_information=nagiosadmin,admin

authorized_for_system_commands=nagiosadmin,admin

authorized_for_all_services=nagiosadmin,admin

authorized_for_all_hosts=nagiosadmin,admin

authorized_for_all_service_commands=nagiosadmin,admin

authorized_for_all_host_commands=nagiosadmin,admin

2、nagios.cfg

vi /usr/local/nagios/etc/nagios.cfg

#cfg_file=/export/home/nagios/etc/objects/localhost.cfg (comment this line out)

cfg_dir=/export/home/nagios/etc/servers

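The main configuration file declares ./servers as the directory that holds the monitoring definitions. This directory does not exist by default and must be created manually.

Nagios reads every file in the servers directory whose name ends in .cfg as a configuration file.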

cd /usr/local/nagios/etc

mkdir servers

cd servers
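
To see which files Nagios will pick up from a cfg_dir directory, it can help to list them before restarting. The sketch below assumes the documented cfg_dir behaviour of reading *.cfg files in the directory and in its subdirectories; the function name list_cfg_files is hypothetical.

```shell
#!/bin/sh
# Print every *.cfg file under the given directory (recursively), sorted.
# Files with other extensions, such as backups named *.cfg.bak, are skipped.
list_cfg_files() {
    find "$1" -type f -name '*.cfg' | sort
}

# Usage on the master node:
#   list_cfg_files /usr/local/nagios/etc/servers
```

A file saved as host.cfg.bak or host.txt does not appear in the list, which is a quick way to explain why a definition is being ignored.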

3. Define the host group

Declare a host group for monitoring and add all three hosts from the environment to it.

vi /export/home/nagios/etc/servers/group.cfg

This is a new file with the following contents:

define hostgroup{

hostgroup_name name

alias name

members name1,name2,name3

}

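Explanation of the configuration above:

hostgroup_name: the name of the host group; any name can be used
alias: an alias for the host group; any name can be used
members: the members of the host group; separate multiple host names with commas. Each name must match the host_name in the corresponding define host block.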

4. Define the monitored hosts

First define the local host, host-1

vi /export/home/nagios/etc/servers/host-1.cfg

define host{

use linux-server

host_name host-1

alias host-1

address 192.168.56.10

}

define service{

use local-service

host_name host-1

service_description Host Alive

check_command check-host-alive

}

define service{

use local-service

host_name host-1

service_description Users

check_command check_local_users!20!50

}

Because this host is also the one running the monitoring master, the check_local_* commands can be used to monitor it.

This file already includes the commonly used checks.

Next define the remote hosts host-2 and host-3

Before defining checks for the remote hosts, define the check_nrpe command

vi /usr/local/nagios/etc/objects/commands.cfg

Append the following to the end of the file:

# 'check_nrpe' command definition

define command{

command_name check_nrpe

command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -t 30 -c $ARG1$

}

define command{

command_name check_nrpe_args

command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -t 30 -c $ARG1$ -a $ARG2$

}

The configuration files for the remote hosts are defined in the same way as above.
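
As an illustration of a remote host file, the sketch below prints a host definition plus one NRPE-based service that uses the check_nrpe_args command. The function name make_remote_host_cfg, the address 192.168.56.11, and the use of the generic-service template from the stock templates.cfg are assumptions. The check_users thresholds 5 and 10 are the ones used in the NRPE test commands earlier.

```shell
#!/bin/sh
# Print a Nagios object file for one remote host monitored through NRPE.
# $1 = host name, $2 = IP address (both supplied by the caller).
make_remote_host_cfg() {
    name=$1 address=$2
    cat <<EOF
define host{
    use                 linux-server
    host_name           $name
    alias               $name
    address             $address
}

define service{
    use                 generic-service
    host_name           $name
    service_description Users
    check_command       check_nrpe_args!check_users!5 10
}
EOF
}

# Usage (the address is a placeholder):
#   make_remote_host_cfg host-2 192.168.56.11 > /usr/local/nagios/etc/servers/host-2.cfg
```

Here check_users becomes $ARG1$ and "5 10" becomes $ARG2$ of check_nrpe_args, so the monitored node receives the two thresholds through -a.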

5. Define the email recipients

Set the email address of the monitoring contact

vi /usr/local/nagios/etc/objects/contacts.cfg

define contact{

contact_name nagiosadmin ; Short name of user

use                             generic-contact         ; Inherit default values from generic-contact template (defined above)

alias Nagios Admin ; Full name of user

email yourname@domain.com ; <<***** CHANGE THIS TO YOUR EMAIL ADDRESS ******

}

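Besides configuring the recipient of monitoring emails, also make sure that:

this host can reach the mail server
Sendmail on this host can send mail through an external SMTP service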

III. Monitoring Redis

First install the prerequisites:

yum info perl5

yum install perl-Time-HiRes

1. Download the check_redis.pl plugin and place it in libexec

2. Add the following to etc/objects/commands.cfg:

# check redis

define command {

command_name check_redis

command_line $USER1$/check_redis.pl -H $HOSTADDRESS$ -p $ARG1$ -a $ARG2$ -w $ARG3$ -c $ARG4$ -f

}

3. Add the following to the monitoring configuration file

define service {

use local-service

service_description <description>

check_command <command, see below>

host_name <host name or IP>

}

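check_redis!<port>!'<variables to monitor, comma-separated>'!<warning thresholds>!<critical thresholds>

The variables that can be monitored are listed below: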

--total_connections_received=WARN:threshold,CRIT:threshold,<other specifiers>

Total Connections Received

--total_connections_received_rate=WARN:threshold,CRIT:threshold,<other specifiers>

Rate of Change of Total Connections Received

--total_expires=WARN:threshold,CRIT:threshold,<other specifiers>

Number of Expired Keys for All DBs

--used_memory_rss=WARN:threshold,CRIT:threshold,<other specifiers>

Resident Set Size, Used Memory in Bytes

--used_cpu_sys=WARN:threshold,CRIT:threshold,<other specifiers>

Main Process Used System CPU

--redis_git_dirty=WARN:threshold,CRIT:threshold,<other specifiers>

Git Dirty Set Bit

--connected_clients=WARN:threshold,CRIT:threshold,<other specifiers>

Total Number of Connected Clients

--uptime_in_days=WARN:threshold,CRIT:threshold,<other specifiers>

Total Uptime in Days

--uptime_in_days_rate=WARN:threshold,CRIT:threshold,<other specifiers>

Rate of Change of Total Uptime in Days

--keyspace_hits=WARN:threshold,CRIT:threshold,<other specifiers>

Total Keyspace Hits

--keyspace_hits_rate=WARN:threshold,CRIT:threshold,<other specifiers>

Rate of Change of Total Keyspace Hits

--pubsub_channels=WARN:threshold,CRIT:threshold,<other specifiers>

Number of Pubsub Channels

--used_cpu_user_children=WARN:threshold,CRIT:threshold,<other specifiers>

Child Processes Used User CPU

--keyspace_misses=WARN:threshold,CRIT:threshold,<other specifiers>

Keyspace Misses

--keyspace_misses_rate=WARN:threshold,CRIT:threshold,<other specifiers>

Rate of Change of Keyspace Misses

--used_cpu_user=WARN:threshold,CRIT:threshold,<other specifiers>

Main Process Used User CPU

--total_commands_processed=WARN:threshold,CRIT:threshold,<other specifiers>

Total Number of Commands Processed from Start

--total_commands_processed_rate=WARN:threshold,CRIT:threshold,<other specifiers>

Rate of Change of Total Number of Commands Processed from Start

--mem_fragmentation_ratio=WARN:threshold,CRIT:threshold,<other specifiers>

Memory Fragmentation Ratio

--blocked_clients=WARN:threshold,CRIT:threshold,<other specifiers>

Number of Currently Blocked Clients

--evicted_keys=WARN:threshold,CRIT:threshold,<other specifiers>

Total Number of Evicted Keys

--evicted_keys_rate=WARN:threshold,CRIT:threshold,<other specifiers>

Rate of Change of Total Number of Evicted Keys

--total_keys=WARN:threshold,CRIT:threshold,<other specifiers>

Total Number of Keys on the Server

--expired_keys=WARN:threshold,CRIT:threshold,<other specifiers>

Total Number of Expired Keys

--expired_keys_rate=WARN:threshold,CRIT:threshold,<other specifiers>

Rate of Change of Total Number of Expired Keys

--connected_slaves=WARN:threshold,CRIT:threshold,<other specifiers>

Number of Connected Slaves

--used_cpu_sys_children=WARN:threshold,CRIT:threshold,<other specifiers>

Child Processed Used System CPU
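
Putting the pieces together, a check_command value for the check_redis command defined earlier has four arguments separated by "!". The helper below is a sketch that assembles such a value. It assumes, based on the -a, -w, and -c options in the command definition, that the warning and critical thresholds are comma-separated lists with one entry per monitored variable, and it refuses to build a string where the counts differ. The function names and the threshold numbers in the usage line are illustrative only.

```shell
#!/bin/sh
# Count the comma-separated fields in a string.
count_fields() {
    printf '%s\n' "$1" | awk -F, '{ print NF }'
}

# Build "check_redis!<port>!'<vars>'!<warn>!<crit>" for a service definition.
# Fails when the threshold lists do not have one entry per variable
# (an assumption about how the plugin pairs thresholds with variables).
redis_check_command() {
    port=$1 vars=$2 warn=$3 crit=$4
    n=$(count_fields "$vars")
    if [ "$(count_fields "$warn")" -ne "$n" ] || [ "$(count_fields "$crit")" -ne "$n" ]; then
        echo "threshold lists must have $n entries, one per variable" >&2
        return 1
    fi
    printf "check_redis!%s!'%s'!%s!%s\n" "$port" "$vars" "$warn" "$crit"
}

# Usage:
#   redis_check_command 6379 connected_clients,blocked_clients 100,10 200,20
```

The printed string goes on the check_command line of the service definition from step 3.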

IV. Monitoring MySQL

There are three candidate plugins: check_mysql, check_mysqld.pl, and check_mysql_health. check_mysql_health is the most complete, so it is the one used here.

Using check_mysql_health:

Download: https://labs.consol.de/nagios/check_mysql_health/

Prerequisite: yum -y install perl-DBD-MySQL

1. Download check_mysql_health-2.1.tar.gz

2. Extract: tar -zxvf check_mysql_health-2.1.tar.gz

3. Install

# ./configure --prefix=/usr/local/nagios --with-nagios-user=nagios --with-nagios-group=nagios --with-perl=/usr/bin/perl
# make && make install

4. Test the command:

./check_mysql_health --hostname 192.168.0.1 --port 3306 --username myname --password mypassword --mode threads-connected --warning 700 --critical 1000

5. Add the following to etc/objects/commands.cfg:

# check mysql health

define command {

command_name check_mysql_health

command_line $USER1$/check_mysql_health --hostname $ARG1$ --port $ARG2$ --username $ARG3$ --password $ARG4$ --mode $ARG5$ --warning $ARG6$ --critical $ARG7$

}

6. Configure the service in the monitoring configuration file (same as above)

Available monitoring modes:

       connection-time          (Time to connect to the server)
       uptime                   (Time the server is running)
       threads-connected        (Number of currently open connections)
       threadcache-hitrate      (Hit rate of the thread-cache)
       slave-lag                (Seconds behind master)
       slave-io-running         (Slave io running: Yes)
       slave-sql-running        (Slave sql running: Yes)
       qcache-hitrate           (Query cache hitrate)
       qcache-lowmem-prunes     (Query cache entries pruned because of low memory)
       keycache-hitrate         (MyISAM key cache hitrate)
       bufferpool-hitrate       (InnoDB buffer pool hitrate)
       bufferpool-wait-free     (InnoDB buffer pool waits for clean page available)
       log-waits                (InnoDB log waits because of a too small log buffer)
       tablecache-hitrate       (Table cache hitrate)
       table-lock-contention    (Table lock contention)
       index-usage              (Usage of indices)
       tmp-disk-tables          (Percent of temp tables created on disk)
       slow-queries             (Slow queries)
       long-running-procs       (long running processes)
       cluster-ndbd-running     (ndbd nodes are up and running)
       sql                      (any sql command returning a single number)
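
For the service definition itself, the seven "!"-separated arguments must follow the order of $ARG1$ to $ARG7$ in the check_mysql_health command defined in commands.cfg: hostname, port, username, password, mode, warning, critical. The sketch below prints one such service block. The function name mysql_health_service is hypothetical, and the values in the usage line are the placeholders from the command-line test in step 4.

```shell
#!/bin/sh
# Print a Nagios service definition that runs one check_mysql_health mode.
# $1 = Nagios host_name, $2 = MySQL host, $3 = port, $4 = user,
# $5 = password, $6 = mode, $7 = warning, $8 = critical
mysql_health_service() {
    nagios_host=$1 db_host=$2 port=$3 user=$4 pass=$5 mode=$6 warn=$7 crit=$8
    cat <<EOF
define service{
    use                 local-service
    host_name           $nagios_host
    service_description MySQL $mode
    check_command       check_mysql_health!$db_host!$port!$user!$pass!$mode!$warn!$crit
}
EOF
}

# Usage:
#   mysql_health_service host-1 192.168.0.1 3306 myname mypassword threads-connected 700 1000
```

Because "!" separates the arguments, a password that itself contains "!" would be split in the wrong place, so choose the monitoring account's password with that in mind.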

7. Restart Nagios with /etc/init.d/nagios restart. If it reports that the process is locked, delete /var/lock/subsys/nagios.
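
Before restarting, it is worth validating the configuration with the nagios binary's -v option, so that a typo in a new .cfg file does not take monitoring down. The wrapper below is a sketch. The default paths assume the --prefix=/usr/local/nagios install used in this document, and the function name and the NAGIOS_BIN, NAGIOS_CFG, and NAGIOS_INIT override variables are hypothetical.

```shell
#!/bin/sh
# Restart Nagios only if the configuration passes the built-in check.
# The three paths can be overridden through environment variables.
safe_restart() {
    nagios_bin=${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}
    nagios_cfg=${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}
    init_script=${NAGIOS_INIT:-/etc/init.d/nagios}
    if "$nagios_bin" -v "$nagios_cfg" >/dev/null 2>&1; then
        "$init_script" restart
    else
        echo "configuration check failed; not restarting" >&2
        return 1
    fi
}
```

When the check fails, running the same nagios -v command by hand shows which file and line caused the error.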

V. Monitoring ZooKeeper

I. Install the plugins

git clone https://github.com/harisekhon/nagios-plugins
cd nagios-plugins
make

II. Plugin usage

1. Add the following to etc/objects/commands.cfg:

# check zk

define command {

command_name check_zk

command_line /export/home/nagios/nagios_plugins/check_zookeeper.pl -H $ARG1$

}

2. Configure the check in a service definition

Note: if you get a permission-denied error, make the plugin file executable.
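
A small helper can apply that permission fix only when it is needed. This is a sketch; the function name ensure_executable is hypothetical, and the path in the usage line is the one from the check_zk command definition.

```shell
#!/bin/sh
# Make a plugin file executable if it is not already.
# Returns non-zero when the file does not exist or chmod fails.
ensure_executable() {
    [ -f "$1" ] || { echo "no such file: $1" >&2; return 1; }
    [ -x "$1" ] || chmod +x "$1"
}

# Usage:
#   ensure_executable /export/home/nagios/nagios_plugins/check_zookeeper.pl
```

The check runs as the nagios user, so that user also needs read and execute access to the directories leading to the plugin.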