Flume初步学习

一、Flume基础部分：

Flume -- 日志收集框架

产生背景：

日志分散到各个机器上，又想用大数据平台进行统计分析

从其他server把日志移动收集到集群上，并能够监控，需要有时效性、容错性、负载均衡

Flume 一般通过配置configuration file，来实现各种数据的收集

概述：

flume.apache.org

分布式、高可靠、高可用、高效、高扩展性

收集、聚合、移动大量的日志数据

webserver ==> flume ==> HDFS

基于流式数据的简洁框架，有容错机构

支持在线应用

只需要管理Agent的配置就行

同类框架对比：

Scribe：FaceBook C语言不再维护

Chukwa：Yahoo Java 不再维护

以上的负载特性都不好

Fluentd：Ruby开发

Logstash：ELK的其中一个组件，也用得比较多

Flume：由Cloudera/Apache开发 Java，用的多

一般用1.5版本以后的Flume NG

Flume的架构和组件：

Source收集、Channel聚集、Sink输出

官网上的Document里面有具体介绍

Channel：缓存池

多个写到一个：

一个写到多个：

二、Flume实战部分：

配置Flume：

conf目录下：

拷贝flume-env.sh.template

设置JAVA_HOME路径

启动bin目录下的flume-ng

使用Flume的关键是写agent配置文件

实战一：

从指定的网络端口采集数据输出到控制台

netcat source + memory channel + logger sink

配置agent文件：

# example.conf: A single-node Flume configuration

# Name the components on this agent

a1.sources = r1

a1.sinks = k1

a1.channels = c1

# Describe/configure the source

a1.sources.r1.type = netcat

a1.sources.r1.bind = localhost

a1.sources.r1.port = 44444

# Describe the sink

a1.sinks.k1.type = logger

# Use a channel which buffers events in memory

a1.channels.c1.type = memory

#a1.channels.c1.capacity = 1000

#a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

source可以对应channel，而sink只能对应一个channel

shell命令：

bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console

—conf 全局配置文件

--conf-file 单个agent配置文件

—name agent的名称

console 显示到控制台

开始以上脚本后，在另一个控制台：

使用telnet连接：

telnet localhost（主机名） 44444（端口号）

[INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 6B 69 6E 67 68 65 79 0D kinghey. }

以上是接受到数据的控制台信息，event是flume传输的基本单元

Mac中control + C退出flume

实战二：

监控一个文件实时采集新增的数据输出到控制台

agent的选型：

exec source + memory channel + logger sink

按照官方文档改agent文件就行了

# Name the components on this agent

a1.sources = r1

a1.sinks = k1

a1.channels = c1

# Describe/configure the source

a1.sources.r1.type = exec

a1.sources.r1.command = tail -F /usr/local/mycode/data/data.log //监控的文件名

a1.sources.r1.shell = /bin/sh -c

# Describe the sink

a1.sinks.k1.type = logger

# Use a channel which buffers events in memory

a1.channels.c1.type = memory

# Bind the source and sink to the channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

实战三：

将A服务器上的日志实时采集到B服务器

机器A： exec source + memory channel + avro sink

机器B： avro source + memory channel + logger sink

当需要两台机器进行通信时，一般用avro进行数据传输：

此实例就是把机器A的log文件的新增数据，通过avro方式传到机器B，并显示在控制台上

设置时，关键是把两边的主机名（hostname、bind）和端口（port）对应好

启动时，先启动接收端（机器B），才能开启端口，后启动机器A

接收时，只有机器B的控制台有显示，机器A只是放在了avro sink