Flume Advanced

1. Flume Transactions

On the data input side, sources fall into two categories: pull-based sources that actively read data themselves (Spooldir Source, Taildir Source) and push-based sources that passively receive data pushed to them (Exec Source, Netcat Source).
Put transaction:
The source collects a batch of events and then runs a put transaction: the batch is written into the putList and doCommit is attempted.
If doCommit succeeds, the events are committed to the channel and the putList is cleared.
If the batch is larger than the space left in the channel, doCommit fails and a rollback is executed, which also clears the putList. For a pull-based source this is harmless, because the same data can simply be pulled again and nothing is lost.
For a push-based source the data cannot be re-requested and is lost, so the channel's capacity has to be sized sensibly.
Take transaction:
The sink takes events from the channel into the takeList, while the events also remain in the channel until the transaction completes. If doCommit succeeds, the events are removed from both the takeList and the channel.
If doCommit fails, a rollback is executed: the takeList is cleared, the events are still in the channel, and the sink simply takes them again.
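As a rough, hedged illustration of how the relevant sizes fit together (the values below are illustrative, not from the original): the source's batch size should not exceed the channel's transactionCapacity, which in turn cannot exceed the channel's capacity.

# minimal sketch: batchSize <= transactionCapacity <= capacity
a1.sources.r1.type = TAILDIR
a1.sources.r1.batchSize = 100
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100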

2. Flume Agent Internals

1)ChannelSelector
  The ChannelSelector decides which Channel(s) an Event will be sent to. There are two types: Replicating and Multiplexing.
  The ReplicatingSelector sends every Event to all of the source's Channels, while the Multiplexing selector routes different Events to different Channels according to a configured rule (a minimal sketch of both follows after this list; complete examples appear in 3.2 and 3.5).
 2)SinkProcessor
   There are three types of SinkProcessor: DefaultSinkProcessor, LoadBalancingSinkProcessor and FailoverSinkProcessor.
   DefaultSinkProcessor serves a single Sink; LoadBalancingSinkProcessor and FailoverSinkProcessor serve a Sink Group. LoadBalancingSinkProcessor provides load balancing across the sinks in the group,
   and FailoverSinkProcessor provides failover (error recovery).
   Note that a sink can only consume from a single channel, while one channel can feed multiple sinks (many sinks to one channel).
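As a quick, hedged reference (the header and channel names here are illustrative), the selector is configured on the source:

# Replicating (the default): every event is copied to all of the source's channels
a1.sources.r1.selector.type = replicating

# Multiplexing: route events by the value of an event header
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2
a1.sources.r1.selector.default = c3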

3. Flume Topologies

3.1 Simple Chain

   In this mode several Flume agents are connected in sequence, from the first source all the way to the storage system that the final sink writes to. It is not advisable to chain too many agents: too many hops hurt throughput, and if any agent in the chain goes down the whole pipeline is affected.
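A minimal sketch of a single hop in such a chain (host name and port are illustrative assumptions, following the same conventions as the examples below): the upstream agent's Avro sink points at the downstream agent's Avro source.

# upstream agent: forward events to the next hop over Avro
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop103
a1.sinks.k1.port = 4545

# downstream agent: Avro source listening for the previous hop
a2.sources.r1.type = avro
a2.sources.r1.bind = 0.0.0.0
a2.sources.r1.port = 4545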

3.2 Replicating and Multiplexing

  Flume supports sending an event flow to one or more destinations. In this mode the same data can be replicated into multiple channels, or different data can be routed to different channels, and each sink can then deliver to a different destination.
   1)Case requirement
     Flume-1 monitors a file for changes and passes the new content to Flume-2, which stores it in HDFS. At the same time Flume-1 passes the content to Flume-3, which writes it to the local file system.


2)Create a group1 directory under /opt/module/flume/job and create a1.conf, a2.conf and a3.conf in it with the following contents.
a1.conf

# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

a1.sources.r1.selector.type = replicating
# Describe/configure the source
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /opt/module/flume/taildir_position.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/module/hive/logs/hive.log
# Describe the sink avro
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop103
a1.sinks.k1.port = 4545

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop104
a1.sinks.k2.port = 4545

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

     a2.conf  (make sure HDFS is already running)
# example.conf: A single-node Flume configuration

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source; bind 0.0.0.0 to listen on all local interfaces

a2.sources.r1.type = avro
a2.sources.r1.bind = 0.0.0.0
a2.sources.r1.port = 4545
# Describe the sink

a2.sinks.k1.type = hdfs

a2.sinks.k1.hdfs.path = /flume/group1/%y-%m-%d/%H%M
a2.sinks.k1.hdfs.filePrefix = events-%[localhost]
a2.sinks.k1.hdfs.rollInterval = 60
a2.sinks.k1.hdfs.rollSize = 134217728
a2.sinks.k1.hdfs.rollCount = 0
a2.sinks.k1.hdfs.fileType = DataStream
a2.sinks.k1.hdfs.round = true
a2.sinks.k1.hdfs.roundValue = 10
a2.sinks.k1.hdfs.roundUnit = minute
a2.sinks.k1.hdfs.useLocalTimeStamp = true
# Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100


# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

       a3.conf

# example.conf: A single-node Flume configuration

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1
 
# Describe/configure the source; bind 0.0.0.0 to listen on all local interfaces

a3.sources.r1.type = avro
a3.sources.r1.bind = 0.0.0.0
a3.sources.r1.port = 4545
# Describe the sink
a3.sinks.k1.type = file_roll
a3.sinks.k1.sink.directory = /opt/module/flume/a3
# Use a channel which buffers events in memory
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100


# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1

    3)Start the Flume agents, downstream agents first and the upstream agent last:
      bin/flume-ng agent -n a2 -c conf -f job/group1/a2.conf
      bin/flume-ng agent -n a3 -c conf -f job/group1/a3.conf
      bin/flume-ng agent -n a1 -c conf -f job/group1/a1.conf
    4)Observe the data on HDFS and in the local directory /opt/module/flume/a3.
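To generate some test data and check both destinations, you can append to the monitored log file (a hedged sketch; the echoed line is just an example):
echo "hello flume" >> /opt/module/hive/logs/hive.log
hadoop fs -ls /flume/group1
ls /opt/module/flume/a3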

3.3 Load Balancing and Failover

   You can group several sinks into a sink group; the sink group processor can then either load-balance events across the sinks in the group or, when one sink fails, fail over and send events to another sink in the group (a load-balancing sketch follows below; the case after it demonstrates failover).
    Remember that the sink group itself must be configured (a1.sinkgroups).
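For the load-balancing variant, which the case below does not cover, a minimal hedged sketch of the sink group configuration looks like this (sink names follow the k1/k2 convention used elsewhere in this post):

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = round_robin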

   1)Case requirement
     Flume1 monitors a port; the sinks in its sink group connect to Flume2 and Flume3 respectively. Use a FailoverSinkProcessor to implement failover.

   2)Create a group2 directory under /opt/module/flume/job and create the Flume agent configuration files a1.conf, a2.conf and a3.conf in it.
   a1.conf:
# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000

# Describe/configure the source; 0.0.0.0 listens on all local interfaces (binding a specific hostname only accepts connections addressed to that host)
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0      
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop103
a1.sinks.k1.port = 4545

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop104
a1.sinks.k2.port = 4545
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1

     a2.conf:
# example.conf: A single-node Flume configuration

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source; bind 0.0.0.0 to listen on all local interfaces

a2.sources.r1.type = avro
a2.sources.r1.bind = 0.0.0.0
a2.sources.r1.port = 4545
# Describe the sink

a2.sinks.k1.type = logger

# Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100


# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

    a3.conf:
# example.conf: A single-node Flume configuration

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1

# Describe/configure the source; bind 0.0.0.0 to listen on all local interfaces

a3.sources.r1.type = avro
a3.sources.r1.bind = 0.0.0.0
a3.sources.r1.port = 4545
# Describe the sink

a3.sinks.k1.type = logger

# Use a channel which buffers events in memory
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100


# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1

       3)After writing the configuration files, distribute them to the machines in the cluster.
       4)Start the Flume agents, downstream agents first and the upstream agent last:
        bin/flume-ng agent -n a2 -c conf -f job/group2/a2.conf
        bin/flume-ng agent -n a3 -c conf -f job/group2/a3.conf
        bin/flume-ng agent -n a1 -c conf -f job/group2/a1.conf
       5)Run nc localhost 44444 and send some data.
       6)Observe the data arriving on hadoop103 and hadoop104.
         Because a1.sinkgroups.g1.processor.priority.k1 = 5
            and a1.sinkgroups.g1.processor.priority.k2 = 10,
         k2 has the higher priority: as long as the Flume agent behind k2 is up, all data is sent through k2; once that agent goes down, the data is sent to the Flume agent behind k1.
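To simulate the failure, stop the higher-priority agent on hadoop104 (the one behind k2) and watch the events switch over to hadoop103 (a hedged sketch; the process id is whatever jps reports for that agent):
# on hadoop104: find the Flume agent process and stop it
jps -l | grep flume
kill <pid-of-the-a3-agent>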

3.4 Aggregation


This is the most common and most practical topology. Everyday web applications are usually spread across hundreds, sometimes thousands or tens of thousands of servers, and the logs they produce are painful to handle one machine at a time. This Flume combination solves the problem nicely: each server runs one Flume agent that collects its logs and sends them to a central collecting agent, which then uploads them to HDFS, Hive, HBase, etc. for log analysis.
1)Case requirement
Flume-1 on hadoop102 monitors the file /opt/module/hive/logs/hive.log.
Flume-2 on hadoop103 monitors the data stream on a port.
Flume-1 and Flume-2 both send their data to Flume-3 on hadoop104, which prints the final data to the console.
2)Create a group3 directory under /opt/module/flume/job and create the Flume agent configuration files a1.conf, a2.conf and a3.conf in it.
a1.conf

# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /opt/module/flume/taildir_position.json
a1.sources.r1.filegroups = f1 
a1.sources.r1.filegroups.f1 = /opt/module/hive/logs/hive.log

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop104
a1.sinks.k1.port = 4545
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100


# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

      a2.conf
# example.conf: A single-node Flume configuration

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source; 0.0.0.0 listens on all local interfaces (a specific hostname only accepts connections addressed to that host)
a2.sources.r1.type = netcat	
a2.sources.r1.bind = 0.0.0.0      
a2.sources.r1.port = 44444

# Describe the sink
a2.sinks.k1.type = avro
a2.sinks.k1.hostname = hadoop104
a2.sinks.k1.port = 4545

# Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

      a3.conf
# example.conf: A single-node Flume configuration

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1

# Describe/configure the source; bind 0.0.0.0 to listen on all local interfaces

a3.sources.r1.type = avro
a3.sources.r1.bind = 0.0.0.0
a3.sources.r1.port = 4545

# Describe the sink

a3.sinks.k1.type = logger
# Use a channel which buffers events in memory
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100


# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1

      3)After writing the configuration files, distribute them to the machines in the cluster (do not forget this, and redistribute whenever you change them later).
       4)Start the Flume agents, downstream agents first and the upstream agents last:
        bin/flume-ng agent -n a3 -c conf -f job/group3/a3.conf
        bin/flume-ng agent -n a2 -c conf -f job/group3/a2.conf
        bin/flume-ng agent -n a1 -c conf -f job/group3/a1.conf
       5)Run nc localhost 44444 on hadoop103 and send some data (new lines appended to hive.log on hadoop102 are picked up as well).
       6)Observe the aggregated output on the hadoop104 console.

3.5 Custom Interceptor

   1)Requirement: we use port data to simulate logs, with single digits and single letters standing in for different log types. We need a custom interceptor that tells digits and letters apart and sends them to different analysis systems (Channels).
   2)Implementation steps
    (1)Create a Maven project and add the following dependency.
      <dependency>
            <groupId>org.apache.flume</groupId>
            <artifactId>flume-ng-core</artifactId>
            <version>1.9.0</version>
      </dependency>
    (2)Define a MyInterceptor class that implements the Interceptor interface, then package it and put the jar under $FLUME_HOME/lib (a packaging sketch follows after the class).
package com.atguigu.flume;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;


import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class MyInterceptor implements Interceptor {
    public void initialize() {

    }

    public Event intercept(Event event) {
        byte[] body = event.getBody();
        Map<String, String> headers = event.getHeaders();
        //decode the body as a string
        String line = new String(body, StandardCharsets.UTF_8);
        //look at the first character
        char c = line.charAt(0);


        if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) {
            //starts with a letter
            headers.put("leixing", "zimu");
        } else if (c >= '0' && c <= '9') {
            //starts with a digit
            headers.put("leixing", "shuzi");
        } else {
            //anything else is discarded
            return null;
        }
        return event;
    }

    public List<Event> intercept(List<Event> list) {
        //keep only the events that intercept(Event) did not discard
        List<Event> results = new ArrayList<>();
        for (Event event : list) {
            Event intercepted = intercept(event);
            if (intercepted != null) {
                results.add(intercepted);
            }
        }
        return results;
    }

    public void close() {

    }
    public static class Builder implements Interceptor.Builder{

        public Interceptor build() {
            return new MyInterceptor();
        }

        public void configure(Context context) {

        }
    }
}
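One way to build and deploy the jar (a hedged sketch; the artifact name depends on your pom.xml):
mvn clean package
cp target/<your-artifact>.jar /opt/module/flume/lib/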

    (3)Edit the Flume configuration files a1.conf, a2.conf and a3.conf under /opt/module/flume/job/group4.
       a1.conf   (note that a1.sources.r1.interceptors.i1.type = com.atguigu.flume.MyInterceptor$Builder ends with $Builder, i.e. the inner Builder class)
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.atguigu.flume.MyInterceptor$Builder

a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = leixing
a1.sources.r1.selector.mapping.zimu = c1
a1.sources.r1.selector.mapping.shuzi = c2
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop103
a1.sinks.k1.port = 4141

a1.sinks.k2.type=avro
a1.sinks.k2.hostname = hadoop104
a1.sinks.k2.port = 4242

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Use a channel which buffers events in memory
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100


# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

      a2.conf
a2.sources = r1
a2.sinks = k1
a2.channels = c1

a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop103
a2.sources.r1.port = 4141

a2.sinks.k1.type = logger

a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

a2.sinks.k1.channel = c1
a2.sources.r1.channels = c1

      a3.conf
a3.sources = r1
a3.sinks = k1
a3.channels = c1

a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop104
a3.sources.r1.port = 4242

a3.sinks.k1.type = logger

a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100

a3.sinks.k1.channel = c1
a3.sources.r1.channels = c1

       (4)Distribute the files to the machines, then start the Flume agents:
            bin/flume-ng agent -n a3 -c conf -f job/group4/a3.conf
            bin/flume-ng agent -n a2 -c conf -f job/group4/a2.conf
            bin/flume-ng agent -n a1 -c conf -f job/group4/a1.conf
        (5)Observe where the data comes out.
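To test the routing, send a few lines through the netcat source (the inputs below are just examples): a line starting with a letter should appear in the logger on hadoop103, a line starting with a digit in the logger on hadoop104, and anything else should be dropped.
nc localhost 44444
hello
123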
4. Kafka Sink (sending data to multiple topics)
To distribute events containing different data to different topics, we rely directly on the Kafka Sink's behavior of honoring the topic event header: a1.sinks.k1.kafka.topic = other sets the default topic, so events without a topic header go to other, while for the rest the interceptor below only needs to put the target topic into the header.
 
a1.sources = r1
a1.channels = c1
a1.sinks = k1 

a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 6666

a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.atguigu.flume.EventHeaderInterceptor$Builder


a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092,hadoop104:9092
a1.sinks.k1.kafka.topic = other
a1.sinks.k1.kafka.producer.acks = -1 
a1.sinks.k1.useFlumeEventFormat = false

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

flume-ng agent -c $FLUME_HOME/conf -f $FLUME_HOME/jobs/kafkasink-topics.conf -n a1 -Dflume.root.logger=INFO,console

package com.atguigu.flume;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;

public class EventHeaderInterceptor implements Interceptor {
    public void initialize() {

    }

    public Event intercept(Event event) {
        byte[] body = event.getBody();
        Map<String, String> headers = event.getHeaders();
        //decode the body as a string
        String line = new String(body, StandardCharsets.UTF_8);
        //check whether the body contains "atguigu" or "shangguigu" and set the topic header accordingly
        if(line.contains("atguigu")){
            headers.put("topic","atguigu");
        }else if(line.contains("shangguigu")){
            headers.put("topic","shangguigu");
        }
        return event;
    }

    public List<Event> intercept(List<Event> list) {
        for (Event event : list) {
            intercept(event);
        }
        return list;
    }

    public void close() {

    }
    public static class Builder implements Interceptor.Builder{

        public Interceptor build() {
            return new EventHeaderInterceptor();
        }

        public void configure(Context context) {

        }
    }
}
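To verify the routing, you can start console consumers on the three topics and then send a few lines through the netcat source (a hedged sketch, run from the Kafka installation directory; it assumes the topics atguigu, shangguigu and other exist or that topic auto-creation is enabled):
bin/kafka-console-consumer.sh --bootstrap-server hadoop102:9092 --topic atguigu
bin/kafka-console-consumer.sh --bootstrap-server hadoop102:9092 --topic shangguigu
bin/kafka-console-consumer.sh --bootstrap-server hadoop102:9092 --topic other

nc localhost 6666
hello atguigu
hello shangguigu
something else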

Original source: https://www.cnblogs.com/xiao-bu/p/14337517.html