ELK Log Processing

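How ELK works:
Elasticsearch uses multicast to discover the nodes that belong to the same cluster and combines their responses to form the cluster. The master node reads the state of every node and restores data when necessary. It also monitors each node's state, decides where every shard is placed, and detects failed nodes with ping requests. (In Elasticsearch 2.x, multicast discovery is a separate plugin and unicast is the default.)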

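ELK architecture:

Elasticsearch: stores and indexes the logs.
Logstash: a tool that collects, processes, and forwards events or log messages.
Kibana: a web interface for searching and visualizing the logs.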

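Advantages of ELK:
a. Flexible processing: Elasticsearch provides real-time full-text indexing.
b. Simple configuration that is easy to pick up.
c. Efficient retrieval: although every query is computed in real time, a well-designed deployment can generally answer queries over a full day of data within seconds.
d. Linear cluster scaling: both Elasticsearch and Logstash clusters scale linearly.
e. Polished front end: in Kibana, searching and aggregating take only mouse clicks, and the results can be assembled into attractive dashboards.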

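0. Preparation before installing:
Elasticsearch and Logstash need a Java environment; install JDK 1.7 or later.
a. Download the JDK rpm package
b. Install it
c. java -version : check the installed JDK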
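
The check in step c can be scripted. The following is a minimal sketch, assuming the version appears in double quotes on the first line that java -version prints to stderr (for example java version "1.7.0_79"). The helper name java_version_ok is hypothetical.

```shell
# Hypothetical helper: succeed when a Java version string is 1.7 or newer.
java_version_ok() {
    ver=$1
    major=${ver%%.*}     # component before the first dot
    rest=${ver#*.}       # everything after the first dot
    minor=${rest%%.*}    # second component
    # Legacy versions look like 1.7.0_79; later releases look like 11.0.2.
    [ "$major" -gt 1 ] && return 0
    [ "$major" -eq 1 ] && [ "$minor" -ge 7 ]
}

# java -version writes to stderr, so redirect it before parsing.
if command -v java >/dev/null 2>&1; then
    installed=$(java -version 2>&1 | awk -F'"' 'NR == 1 { print $2 }')
    if java_version_ok "$installed"; then
        echo "JDK $installed is new enough"
    else
        echo "JDK $installed is older than 1.7"
    fi
else
    echo "java not found on PATH"
fi
```

Pre-release version strings (with suffixes such as -ea) are not handled by this sketch.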

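Elasticsearch:

Concepts:
1. Index: data is stored in one or more indices. An index can be thought of as a database, and the basic unit stored in it is the document. Elasticsearch splits an index into shards so that it can scale horizontally. Shards can have replicas; reads are faster when spread over several shards, and a replica shard is promoted to primary automatically when the primary fails (this gives both horizontal scaling and redundancy).
2. Document type: as in Redis, where every key has a type, every document has a type.
3. Node: one running Elasticsearch instance is one node.
4. Cluster: a set of nodes forms a cluster. As with ZooKeeper, a master node is elected. Clients do not need to care which node is the master; they can connect to any node, and data is synchronized automatically.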
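
The shard and replica counts described in item 1 are set per index. As a small sketch, the hypothetical helper below only builds the JSON settings body. number_of_shards and number_of_replicas are the standard index settings, while the values 5 and 1 and the index name in the comment are illustrative assumptions.

```shell
# Hypothetical helper: print an index-creation body with the given
# number of primary shards ($1) and replicas per primary ($2).
index_settings_body() {
    printf '{"settings":{"number_of_shards":%d,"number_of_replicas":%d}}\n' "$1" "$2"
}

index_settings_body 5 1

# To create an index with these settings (assumes a node on localhost:9200
# and a made-up index name):
#   curl -XPUT 'http://localhost:9200/example-index' -d "$(index_settings_body 5 1)"
```

With one replica, each primary shard has a copy that can be promoted if the primary is lost, which is the redundancy described in item 1.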

Install Elasticsearch
a. wget https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/rpm/elasticsearch/2.3.5/elasticsearch-2.3.5.rpm
b. rpm -ivh elasticsearch-2.3.5.rpm
c. mkdir /opt/develop/elasticsearch/data -p
mkdir /opt/develop/elasticsearch/log -p
d. # vi /usr/share/elasticsearch/config/elasticsearch.yml
cluster.name: my-application   # cluster name; nodes with the same name form one cluster
node.name: node-1   # name of this node within the cluster; must differ on every node
path.data: /opt/develop/elasticsearch/data
path.logs: /opt/develop/elasticsearch/log
network.host: xxx.xxx.xx.xx
http.port: 9200   # port that clients connect to
node.max_local_storage_nodes: 1
e. Elasticsearch must be started by a non-root user
groupadd ela
useradd ela -g ela -p xxx
su - ela
Run the elasticsearch executable under the installation path to start the service
f. curl -X GET http://localhost:9200/   # shows the Elasticsearch installation info, which means startup succeeded

g. chkconfig --add elasticsearch
Elasticsearch cluster:
1. HTTP-based RESTful API: query results are returned as JSON:
$ curl -XGET 'http://10.26.44.42:9200/_count?pretty' -d '

{
"query":{
"match_all":{}
}
}
'
{
"count" : 308590265,
"_shards" : {
"total" : 4180,
"successful" : 4180,
"failed" : 0
}
}

Install Logstash
a. wget https://download.elastic.co/logstash/logstash/packages/centos/logstash-2.3.4-1.noarch.rpm
b. Install it
c. Start the service
d. Test: cd /opt/logstash/bin
./logstash -e 'input { stdin {} } output { stdout {} }'

e. Use the rubydebug codec for more detailed output:
./logstash -e 'input { stdin {}} output { stdout{codec => rubydebug}}'
Settings: Default pipeline workers: 8
Pipeline main started
asd
{
"message" => "asd",
"@version" => "1",
"@timestamp" => "2017-02-13T08:39:56.079Z",
"host" => "ali-hk-ops-elk1"
}

f. Send Logstash output to Elasticsearch:
./logstash -e 'input { stdin{} } output { elasticsearch { hosts => ["ali-hk-ops-elk1:9200"] } }'
g. Configuration file format:
input {
file {
path => "/var/log/messages"
type => "syslog"
}

file {
path => "/var/log/apache/access.log"
type => "apache"

}
}

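Logstash input syntax:
1. input: directories are not read recursively by default, so files in subdirectories are not picked up directly, but a glob pattern such as /var/log/*/*.log can be used to match them
2. exclude: files to skip
exclude => "*.gz"
3. sincedb_path: the file that records the current read position; by default it is a hidden file
4. sincedb_write_interval: how often the sincedb_path file is written; the default is 15 seconds
5. start_position: where in the file to start reading; the default is end and it can be changed to beginning
6. stat_interval: how often the file is checked for updates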
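
Put together, the options above might look like the following. This is a sketch only: the log path, the sincedb location, and the temporary file are assumptions, and the numeric values are the default from the list (15 seconds) plus a stat_interval of 1 second.

```shell
# Write an example pipeline to a temporary file (path chosen by mktemp).
conf=$(mktemp)
cat > "$conf" <<'EOF'
input {
  file {
    path => "/var/log/*/*.log"           # glob instead of recursion
    exclude => "*.gz"                    # skip compressed files
    sincedb_path => "/var/lib/logstash/.sincedb_example"
    sincedb_write_interval => 15         # seconds between position writes
    start_position => "beginning"        # read existing content too
    stat_interval => 1                   # seconds between update checks
  }
}
output { stdout { codec => rubydebug } }
EOF
echo "wrote $conf"
```

The file can then be checked with /opt/logstash/bin/logstash --configtest -f "$conf". Note that start_position only affects files Logstash has not seen before; once a position is recorded in the sincedb file, reading resumes from there.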

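Logstash output usage and plugins:
1. Output can go to a file, Redis, and so on
2. gzip: whether to compress the output; the default is false, and compression is applied incrementally as the data streams through
3. message_format: the format of the message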

Logstash-->file-->elasticsearch:
Logstash writes to a file, and the file is then fed into Elasticsearch.
1. Startup script:
vim /etc/init.d/logstash
#!/bin/sh
# Init script for logstash
# Maintained by Elasticsearch
# Generated by pleaserun.
# Implemented based on LSB Core 3.1:
# * Sections: 20.2, 20.3
#
### BEGIN INIT INFO
# Provides: logstash
# Required-Start: $remote_fs $syslog
# Required-Stop: $remote_fs $syslog
# Default-Start: 2 3 4 5
# Default-Stop: 0 1 6
# Short-Description:
# Description: Starts Logstash as a daemon.
### END INIT INFO

PATH=/sbin:/usr/sbin:/bin:/usr/bin
export PATH

if [ "$(id -u)" -ne 0 ]; then
echo "You need root privileges to run this script"
exit 1
fi

name=logstash
pidfile="/var/run/$name.pid"

LS_USER=logstash
LS_GROUP=logstash
LS_HOME=/var/lib/logstash
LS_HEAP_SIZE="4g"
LS_LOG_DIR=/var/log/logstash
LS_LOG_FILE="${LS_LOG_DIR}/$name.log"
LS_CONF_DIR=/etc/logstash/conf.d
LS_OPEN_FILES=16384
LS_NICE=19
KILL_ON_STOP_TIMEOUT=${KILL_ON_STOP_TIMEOUT-0} #default value is zero to this variable but could be updated by user request
LS_OPTS=""

[ -r /etc/default/$name ] && . /etc/default/$name
[ -r /etc/sysconfig/$name ] && . /etc/sysconfig/$name

program=/opt/logstash/bin/logstash
args="agent -f ${LS_CONF_DIR} -l ${LS_LOG_FILE} ${LS_OPTS}"

quiet() {
"$@" > /dev/null 2>&1
return $?
}

start() {

LS_JAVA_OPTS="${LS_JAVA_OPTS} -Djava.io.tmpdir=${LS_HOME}"
HOME=${LS_HOME}
export PATH HOME LS_HEAP_SIZE LS_JAVA_OPTS LS_USE_GC_LOGGING LS_GC_LOG_FILE

# chown doesn't grab the supplemental groups when setting the user:group - so we have to do it for it.
# Boy, I hope we're root here.
SGROUPS=$(id -Gn "$LS_USER" | tr " " "," | sed 's/,$//'; echo '')

if [ ! -z $SGROUPS ]
then
EXTRA_GROUPS="--groups $SGROUPS"
fi

# set ulimit as (root, presumably) first, before we drop privileges
ulimit -n ${LS_OPEN_FILES}

# Run the program!
nice -n ${LS_NICE} chroot --userspec $LS_USER:$LS_GROUP $EXTRA_GROUPS / sh -c "
cd $LS_HOME
ulimit -n ${LS_OPEN_FILES}
exec \"$program\" $args
" > "${LS_LOG_DIR}/$name.stdout" 2> "${LS_LOG_DIR}/$name.err" &

# Generate the pidfile from here. If we instead made the forked process
# generate it there will be a race condition between the pidfile writing
# and a process possibly asking for status.
echo $! > $pidfile

echo "$name started."
return 0
}

stop() {
# Try a few times to kill TERM the program
if status ; then
pid=$(cat "$pidfile")
echo "Killing $name (pid $pid) with SIGTERM"
kill -TERM $pid
# Wait for it to exit.
for i in 1 2 3 4 5 6 7 8 9 ; do
echo "Waiting $name (pid $pid) to die..."
status || break
sleep 1
done
if status ; then
if [ $KILL_ON_STOP_TIMEOUT -eq 1 ] ; then
echo "Timeout reached. Killing $name (pid $pid) with SIGKILL. This may result in data loss."
kill -KILL $pid
echo "$name killed with SIGKILL."
else
echo "$name stop failed; still running."
return 1 # stop timed out and not forced
fi
else
echo "$name stopped."
fi
fi
}

status() {
if [ -f "$pidfile" ] ; then
pid=$(cat "$pidfile")
if kill -0 $pid > /dev/null 2> /dev/null ; then
# process by this pid is running.
# It may not be our pid, but that's what you get with just pidfiles.
# TODO(sissel): Check if this process seems to be the same as the one we
# expect. It'd be nice to use flock here, but flock uses fork, not exec,
# so it makes it quite awkward to use in this case.
return 0
else
return 2 # program is dead but pid file exists
fi
else
return 3 # program is not running
fi
}

reload() {
if status ; then
kill -HUP $(cat "$pidfile")
fi
}

force_stop() {
if status ; then
stop
status && kill -KILL $(cat "$pidfile")
fi
}

configtest() {
# Check if a config file exists
if [ ! "$(ls -A ${LS_CONF_DIR}/* 2> /dev/null)" ]; then
echo "There aren't any configuration files in ${LS_CONF_DIR}"
return 1
fi

HOME=${LS_HOME}
export PATH HOME

test_args="--configtest -f ${LS_CONF_DIR} ${LS_OPTS}"
$program ${test_args}
[ $? -eq 0 ] && return 0
# Program not configured
return 6
}

case "$1" in
start)
status
code=$?
if [ $code -eq 0 ]; then
echo "$name is already running"
else
start
code=$?
fi
exit $code
;;
stop) stop ;;
force-stop) force_stop ;;
status)
status
code=$?
if [ $code -eq 0 ] ; then
echo "$name is running"
else
echo "$name is not running"
fi
exit $code
;;
reload) reload ;;
restart)

quiet configtest
RET=$?
if [ ${RET} -ne 0 ]; then
  echo "Configuration error. Not restarting. Re-run with configtest parameter for details"
  exit ${RET}
fi
stop && start
;;

configtest)
configtest
exit $?
;;
*)
echo "Usage: $0 {start|stop|force-stop|status|reload|restart|configtest}" >&2
exit 3
;;
esac

exit $?

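Types of logs to analyze:
1. System logs: everything under /var/log; look up (for example on Google) what each file contains
2. Analyzing a particular access record through Elasticsearch
3. Error logs, collected and reported back to the developers
4. System runtime logs
5. Other types of logs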

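Splitting logs into fields:
1. grok module: uses regular expressions; it is fairly complex and consumes CPU when the data volume is large
2. JSON: simple and easy to use
3. Set the nginx log format to JSON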

Install Kibana
a. wget https://download.elastic.co/kibana/kibana/kibana-4.5.4-1.x86_64.rpm
b. Install it
c. vi /opt/kibana/config/kibana.yml
server.port: 5601
server.host: "0.0.0.0"
elasticsearch.url: "http://xxx.xxx.xx.xx:9200"
d. service kibana start
e. chkconfig --add kibana
f. Open the web page: http://localhost:5601

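Common modules:
1. System log collection ---> syslog: configure syslog so that its output is written to Elasticsearch, specifying port 514; the host is the IP address of the server whose logs are collected
2. Access logs: convert the nginx log to JSON format
3. Error logs: use the multiline codec plugin:
input {
stdin {
codec => multiline {
pattern => "^\s"
negate => false
what => "previous"
}
}
}
pattern: the regular expression matched against each line.
negate: defaults to false; when set to true, the lines that do not match pattern are the ones handled according to what.
what: either previous or next; whether the matched lines are merged into the previous line or the next one.
4. Runtime logs: codec => json; if the log is not JSON, use grok to match it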
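
To make the pattern / negate / what semantics in item 3 concrete, here is a rough shell sketch (not Logstash itself) that emulates pattern => "^\s", negate => false, what => "previous" with awk: a line that starts with a space or tab is appended to the event before it. The function name and the sample stack trace are hypothetical, and merged lines are joined with a literal \n so that each event prints on one line.

```shell
# Emulates: multiline { pattern => "^\s" negate => false what => "previous" }
merge_multiline() {
    awk '
        /^[ \t]/ && started { event = event "\\n" $0; next }  # continuation line
        started             { print event }                   # flush previous event
                            { event = $0; started = 1 }       # begin a new event
        END                 { if (started) print event }
    '
}

# Hypothetical sample: a Java stack trace followed by an unrelated line.
printf '%s\n' \
    'java.lang.NullPointerException' \
    '    at com.example.Main.run(Main.java:10)' \
    '    at com.example.Main.main(Main.java:5)' \
    'INFO next event' | merge_multiline
```

The four input lines come out as two events: the exception with its two indented frames, and the INFO line on its own.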

Showing per-IP access counts on a map:
1. Download Filebeat into the user's home directory on the Elasticsearch server:
2. Load the template:
$ curl -XPUT 'http://10.26.44.42:9200/_template/filebeat?pretty' -d@/etc/filebeat/filebeat.template.json
$ curl -XPUT 'http://10.26.44.42:9200/_template/filebeat?pretty' -d@/etc/filebeat/filebeat.template-es2x.json
$ curl -XPUT 'http://10.26.44.42:9200/_template/filebeat?pretty' -d@/root/filebeat.template.json
3. Download the GeoIP database file:
$ cd /opt/logstash
$ curl -O "http://geolite.maxmind.com/download/geoip/database/GeoLiteCity.dat.gz"
$ gunzip GeoLiteCity.dat.gz
4. Configure Logstash to use GeoIP:
input {
redis {
data_type => "list"
key => "mobile-tomcat-access-log"
host => "192.168.0.251"
port => "6379"
db => "0"
codec => "json"
}
}

# The input section reads from Redis the access logs that the client-side Logstash has parsed and submitted

filter {
if [type] == "mobile-tomcat" {
geoip {
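# client is the key under which the client-side Logstash stores the public IP when it collects the log; it must match the actual field name, because the IP address is looked up through it
source => "client"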
target => "geoip"
database => "/etc/logstash/GeoLiteCity.dat"
add_field => [ "[geoip][coordinates]", "%{[geoip][longitude]}" ]
add_field => [ "[geoip][coordinates]", "%{[geoip][latitude]}" ]
}
mutate {
convert => [ "[geoip][coordinates]", "float"]
}
}
}

output {
if [type] == "mobile-tomcat" {
elasticsearch {
hosts => ["192.168.0.251"]
manage_template => true
# the index name must start with logstash; otherwise the map reports errors such as the geoip type not being found
index => "logstash-mobile-tomcat-access-log-%{+YYYY.MM.dd}"
flush_size => 2000
idle_flush_time => 10
}
}
}
5. Add the new index in the Kibana UI:
Visualize ----> Tile map ----> From a new search ----> Select an index pattern ----> choose the index created earlier ----> Geo coordinates

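Problems:
1. After restarting Elasticsearch, Kibana reports: Elasticsearch is still initializing the kibana index.
Fix: curl -XDELETE http://localhost:9200/.kibana
--- This loses all Kibana configuration (index patterns, visualizations, dashboards). If the problem is only with the index's shards, the following can be used instead: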
curl -s http://localhost:9200/.kibana/_recovery?pretty
curl -XPUT 'localhost:9200/.kibana/_settings' -d '
{
"index" : {
"number_of_replicas" : 0
}
}'
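If the error persists after the change, restart Kibana.

Note: I forgot to restart Elasticsearch, so the index was missing from the page and no data was shown.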

Add the index template:
$ curl -XPUT 'http://10.26.44.42:9200/_template/filebeat?pretty' -d@/root/filebeat.template.json

Template file:
vim /root/filebeat.template.json
{
"mappings": {
"_default_": {
"_all": {
"enabled": true,
"norms": {
"enabled": false
}
},
"dynamic_templates": [
{
"template1": {
"mapping": {
"doc_values": true,
"ignore_above": 1024,
"index": "not_analyzed",
"type": "{dynamic_type}"
},
"match": "*"
}
}
],
"properties": {
"geoip": {
"properties" : {
"location": {
"type": "geo_point"
},
"ip": { "type": "ip" },
"coordinates": { "type": "geo_point" }
}},
"@timestamp": {
"type": "date"
},
"message": {
"type": "string",
"index": "analyzed"
},
"offset": {
"type": "long",
"doc_values": "true"
}
}
}
},
"settings": {
"index.refresh_interval": "5s"
},
"template": "filebeat-*"
}

Check the cluster status:
$ curl -XGET 'http://10.26.44.42:9200/_cluster/health?pretty=true'
{
"cluster_name" : "elks",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 3,
"number_of_data_nodes" : 3,
"active_primary_shards" : 5269,
"active_shards" : 6812,
"relocating_shards" : 0,
"initializing_shards" : 6,
"unassigned_shards" : 4151,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 5136,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 4711822,
"active_shards_percent_as_number" : 62.10228826693409
}
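
For monitoring, the status field is usually the part of this response that matters: green means every shard is allocated, yellow means all primaries are allocated but some replicas are not, and red means at least one primary shard is unallocated. Below is a small sketch that pulls the field out with sed, assuming the response layout shown above. The function name is hypothetical.

```shell
# Hypothetical helper: read a _cluster/health response on stdin
# and print only the value of its "status" field.
cluster_status() {
    sed -n 's/.*"status" *: *"\([a-z]*\)".*/\1/p'
}

# Sample input shaped like the response above.
printf '%s\n' '{' '  "cluster_name" : "elks",' '  "status" : "red",' '}' | cluster_status

# Against a live cluster:
#   curl -s 'http://10.26.44.42:9200/_cluster/health?pretty=true' | cluster_status
```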

List the indices that have unassigned shards:
$curl -s 'http://10.26.44.42:9200/_cat/shards' | grep UNASSIGNED | awk '{print $1}' | sort | uniq
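
The command above lists only the index names. A variation that also counts the unassigned shards per index can help decide which index to look at first. This is a sketch that assumes the default _cat/shards column order (index, shard, prirep, state, ...); the function name and the sample rows are hypothetical.

```shell
# Hypothetical helper: read _cat/shards output on stdin and print
# "<count> <index>" for every index with unassigned shards, largest first.
unassigned_per_index() {
    awk '$4 == "UNASSIGNED" { count[$1]++ }
         END { for (idx in count) print count[idx], idx }' | sort -rn
}

# Made-up sample rows in the _cat/shards layout.
printf '%s\n' \
    'logstash-2017.02.13 0 p STARTED 1200 1.1mb 10.26.44.42 node-1' \
    'logstash-2017.02.13 0 r UNASSIGNED' \
    'logstash-2017.02.12 1 r UNASSIGNED' \
    'logstash-2017.02.13 1 r UNASSIGNED' | unassigned_per_index

# Against a live cluster:
#   curl -s 'http://10.26.44.42:9200/_cat/shards' | unassigned_per_index
```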

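Problem with the ELK cluster: indices had been deleted on a single node.
After the unassigned shards were removed and Elasticsearch was restarted, the service status returned to normal.

Clear the unassigned shards (this sets the delayed-allocation timeout to 0, so shards left unassigned by a departed node are reallocated without waiting):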
curl -XPUT 'localhost:9200/_all/_settings?pretty' -H 'Content-Type: application/json' -d'
{
"settings": {
"index.unassigned.node_left.delayed_timeout": "0"
}
}
'