Getting Started with Kafka Streams (4)

Background

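The previous post demonstrated the filter operator. Today we show how to split one message stream, in real time, into several new streams according to different predicates (Predicate), an operation known as a stream split. This is very useful when different types of messages in a stream need different processing logic.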

What the Demo Does

We again use a message type that represents a movie, in the following format:

{"name": "Meryl Streep", "title": "The Iron Lady", "genre": "drama"}

{"name": "Will Smith", "title": "Men in Black", "genre": "comedy"}

{"name": "Matt Damon", "title": "The Martian", "genre": "drama"}

{"name": "Judy Garland", "title": "The Wizard of Oz", "genre": "fantasy"}

name is the lead actor, title is the film title, and genre is the film genre. Today we use Kafka Streams to split movies of different genres into separate message streams.

Note that, as before, we use Protocol Buffers to serialize and deserialize the messages.

Initializing the Project

The first step is to initialize the project. Start by creating the project directory:

$ mkdir split-stream

$ cd split-stream

Configuring the Project

Next, create the Gradle build file build.gradle:

buildscript {
    repositories {
        jcenter()
    }
    dependencies {
        classpath 'com.github.jengelman.gradle.plugins:shadow:4.0.2'
    }
}

plugins {
    id 'java'
    id "com.google.protobuf" version "0.8.10"
}
apply plugin: 'com.github.johnrengelman.shadow'


repositories {
    mavenCentral()
    jcenter()

    maven {
        url 'http://packages.confluent.io/maven'
    }
}

group 'huxihx.kafkastreams'

sourceCompatibility = 1.8
targetCompatibility = '1.8'
version = '0.0.1'

dependencies {
    implementation 'org.slf4j:slf4j-simple:1.7.26'
    implementation 'org.apache.kafka:kafka-streams:2.3.0'
    implementation 'com.google.protobuf:protobuf-java:3.9.1'

    testCompile group: 'junit', name: 'junit', version: '4.12'
}

protobuf {
    generatedFilesBaseDir = "$projectDir/src/"
    protoc {
        artifact = 'com.google.protobuf:protoc:3.0.0'
    }
}

jar {
    manifest {
        attributes(
                'Class-Path': configurations.compile.collect { it.getName() }.join(' '),
                'Main-Class': 'huxihx.kafkastreams.SplitMovieStreamApp'
        )
    }
}

shadowJar {
    archiveName = "kstreams-transform-standalone-${version}.${extension}"
}

Save the file above, then run the following command to download the Gradle wrapper:

$ gradle wrapper

Then, in the split-stream directory, create a folder named configuration to hold the parameter file dev.properties:

$ mkdir configuration

$ vi configuration/dev.properties

application.id=splitting-app
bootstrap.servers=localhost:9092

input.topic.name=acting-events
input.topic.partitions=1
input.topic.replication.factor=1

output.drama.topic.name=drama-acting-events
output.drama.topic.partitions=1
output.drama.topic.replication.factor=1

output.fantasy.topic.name=fantasy-acting-events
output.fantasy.topic.partitions=1
output.fantasy.topic.replication.factor=1

output.other.topic.name=other-acting-events
output.other.topic.partitions=1
output.other.topic.replication.factor=1

This file specifies the Kafka cluster to connect to, along with the details of the input and output topics.
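The topic settings in dev.properties share a prefix pattern: <prefix>.topic.name, <prefix>.topic.partitions and <prefix>.topic.replication.factor. As a minimal sketch of how such settings can be read with plain java.util.Properties, the helper below parses properties text and resolves the three values for one prefix. The TopicSettings class, its parse and forPrefix methods, and the fallback value of 1 are assumptions made for illustration; they are not part of the application in this article.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, for illustration only: reads the "<prefix>.topic.*" keys used in dev.properties.
final class TopicSettings {
    final String name;
    final int partitions;
    final short replicationFactor;

    private TopicSettings(String name, int partitions, short replicationFactor) {
        this.name = name;
        this.partitions = partitions;
        this.replicationFactor = replicationFactor;
    }

    // Parses text in the same syntax as dev.properties into a Properties object.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Resolves the settings for a prefix such as "input" or "output.drama".
    // Assumption: partitions and replication factor fall back to 1 when absent.
    static TopicSettings forPrefix(Properties props, String prefix) {
        String name = props.getProperty(prefix + ".topic.name");
        if (name == null) {
            throw new IllegalArgumentException("Missing key: " + prefix + ".topic.name");
        }
        int partitions = Integer.parseInt(props.getProperty(prefix + ".topic.partitions", "1"));
        short replication = Short.parseShort(props.getProperty(prefix + ".topic.replication.factor", "1"));
        return new TopicSettings(name, partitions, replication);
    }

    public static void main(String[] args) {
        Properties props = parse(
                "output.drama.topic.name=drama-acting-events\n"
              + "output.drama.topic.partitions=1\n"
              + "output.drama.topic.replication.factor=1\n");
        TopicSettings drama = forPrefix(props, "output.drama");
        System.out.println(drama.name + " " + drama.partitions + " " + drama.replicationFactor);
    }
}
```

Failing fast on a missing topic name surfaces a misconfigured file at startup rather than later, when the topic is first used.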

Creating the Message Schema

Next, create the schema for the topics we use. Run the following command in split-stream to create the folder that holds the schema:

$ mkdir -p src/main/proto

Then create a file named acting.proto in the proto folder with the following content:

syntax = "proto3";

package huxihx.kafkastreams.proto;

message Acting {
    string name = 1;
    string title = 2;
    string genre = 3;
}

After saving it, run the Gradle build in split-stream:

$ ./gradlew build

At this point you should see the generated Java class, ActingOuterClass, under src/main/java/huxihx/kafkastreams/proto.
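To make it more concrete what ends up in Kafka, here is a hand-rolled sketch of the protobuf wire encoding for the three string fields of Acting. Each non-empty string field is written as a tag ((field number << 3) | 2, where 2 is the length-delimited wire type), a varint length, and then the UTF-8 bytes; proto3 skips fields that hold the default empty string. ActingWireSketch is a hypothetical class written only for illustration. Real code should rely on the generated ActingOuterClass.Acting and its toByteArray() method.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the wire format; not a replacement for the generated class.
final class ActingWireSketch {
    private static final int WIRE_TYPE_LENGTH_DELIMITED = 2;

    // Writes a non-negative int as a base-128 varint: 7 bits per byte, low-order group first,
    // with the high bit set on every byte except the last.
    private static void writeVarint(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Writes one string field as tag, length, and UTF-8 bytes; empty strings are omitted.
    private static void writeString(ByteArrayOutputStream out, int fieldNumber, String value) {
        if (value.isEmpty()) {
            return;
        }
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        writeVarint(out, (fieldNumber << 3) | WIRE_TYPE_LENGTH_DELIMITED);
        writeVarint(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // Field numbers follow acting.proto: name = 1, title = 2, genre = 3.
    static byte[] encode(String name, String title, String genre) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, 1, name);
        writeString(out, 2, title);
        writeString(out, 3, genre);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = encode("Will Smith", "Men in Black", "comedy");
        System.out.println(bytes.length + " bytes");
    }
}
```

Because every field costs only a tag and a length on top of its payload, the encoded record is noticeably smaller than the equivalent JSON text shown earlier.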

Creating the Serdes

In this step we create the Serdes for the topic messages we need. First run the following command to create the corresponding directory:

$ mkdir -p src/main/java/huxihx/kafkastreams/serdes

Then create ProtobufSerializer.java in the new serdes folder:

package huxihx.kafkastreams.serdes;

import com.google.protobuf.MessageLite;
import org.apache.kafka.common.serialization.Serializer;

public class ProtobufSerializer<T extends MessageLite> implements Serializer<T> {
    @Override
    public byte[] serialize(String topic, T data) {
        return data == null ? new byte[0] : data.toByteArray();
    }
}

Next comes ProtobufDeserializer.java:

package huxihx.kafkastreams.serdes;

import com.google.protobuf.InvalidProtocolBufferException;
import com.google.protobuf.MessageLite;
import com.google.protobuf.Parser;
import org.apache.kafka.common.errors.SerializationException;
import org.apache.kafka.common.serialization.Deserializer;

import java.util.Map;

public class ProtobufDeserializer<T extends MessageLite> implements Deserializer<T> {

    private Parser<T> parser;

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        parser = (Parser<T>) configs.get("parser");
    }

    @Override
    public T deserialize(String topic, byte[] data) {
        try {
            return parser.parseFrom(data);
        } catch (InvalidProtocolBufferException e) {
            throw new SerializationException("Failed to deserialize from a protobuf byte array.", e);
        }
    }
}

And finally ProtobufSerdes.java:

package huxihx.kafkastreams.serdes;

import com.google.protobuf.MessageLite;
import com.google.protobuf.Parser;
import org.apache.kafka.common.serialization.Deserializer;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serializer;

import java.util.HashMap;
import java.util.Map;

public class ProtobufSerdes<T extends MessageLite> implements Serde<T> {

    private final Serializer<T> serializer;
    private final Deserializer<T> deserializer;

    public ProtobufSerdes(Parser<T> parser) {
        serializer = new ProtobufSerializer<>();
        deserializer = new ProtobufDeserializer<>();
        Map<String, Parser<T>> config = new HashMap<>();
        config.put("parser", parser);
        deserializer.configure(config, false);
    }

    @Override
    public Serializer<T> serializer() {
        return serializer;
    }

    @Override
    public Deserializer<T> deserializer() {
        return deserializer;
    }
}

Developing the Main Flow

Create the file SplitMovieStreamApp.java under src/main/java/huxihx/kafkastreams:

package huxihx.kafkastreams;

import huxihx.kafkastreams.proto.ActingOuterClass;
import huxihx.kafkastreams.serdes.ProtobufSerdes;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.admin.TopicListing;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;

import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.CountDownLatch;
import java.util.stream.Collectors;

public class SplitMovieStreamApp {

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            throw new IllegalArgumentException("Config file path must be specified.");
        }

        SplitMovieStreamApp app = new SplitMovieStreamApp();
        Properties envProps = app.loadEnvProperties(args[0]);
        Properties streamProps = app.createStreamsProperties(envProps);
        Topology topology = app.buildTopology(envProps);

        app.preCreateTopics(envProps);

        final KafkaStreams streams = new KafkaStreams(topology, streamProps);
        final CountDownLatch latch = new CountDownLatch(1);

        Runtime.getRuntime().addShutdownHook(new Thread("streams-shutdown-hook") {
            @Override
            public void run() {
                streams.close();
                latch.countDown();
            }
        });

        try {
            streams.start();
            latch.await();
        } catch (Exception e) {
            System.exit(1);
        }
        System.exit(0);
    }

    private Topology buildTopology(Properties envProps) {
        final StreamsBuilder builder = new StreamsBuilder();
        final String inputTopic = envProps.getProperty("input.topic.name");

        KStream<String, ActingOuterClass.Acting>[] branches = builder
                .stream(inputTopic, Consumed.with(Serdes.String(), actingProtobufSerdes()))
                .branch((key, value) -> "drama".equalsIgnoreCase(value.getGenre()),
                        (key, value) -> "fantasy".equalsIgnoreCase(value.getGenre()),
                        (key, value) -> true);
        branches[0].to(envProps.getProperty("output.drama.topic.name"));
        branches[1].to(envProps.getProperty("output.fantasy.topic.name"));
        branches[2].to(envProps.getProperty("output.other.topic.name"));

        return builder.build();
    }

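    /**
     * Builds the Properties instance required by the Kafka Streams application.
     *
     * @param envProps properties loaded from the configuration file
     * @return the Kafka Streams configuration
     */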
    private Properties createStreamsProperties(Properties envProps) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, envProps.getProperty("application.id"));
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, envProps.getProperty("bootstrap.servers"));
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        return props;
    }

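    /**
     * Pre-creates the input/output topics; topics that already exist are skipped.
     *
     * @param envProps properties loaded from the configuration file
     * @throws Exception if listing or creating topics fails
     */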
    private void preCreateTopics(Properties envProps) throws Exception {
        Map<String, Object> config = new HashMap<>();
        config.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, envProps.getProperty("bootstrap.servers"));
        String inputTopic = envProps.getProperty("input.topic.name");
        String outputTopic1 = envProps.getProperty("output.drama.topic.name");
        String outputTopic2 = envProps.getProperty("output.fantasy.topic.name");
        String outputTopic3 = envProps.getProperty("output.other.topic.name");
        try (AdminClient client = AdminClient.create(config)) {
            Collection<TopicListing> existingTopics = client.listTopics().listings().get();

            List<NewTopic> topics = new ArrayList<>();
            List<String> topicNames = existingTopics.stream().map(TopicListing::name).collect(Collectors.toList());
            if (!topicNames.contains(inputTopic))
                topics.add(new NewTopic(
                        envProps.getProperty("input.topic.name"),
                        Integer.parseInt(envProps.getProperty("input.topic.partitions")),
                        Short.parseShort(envProps.getProperty("input.topic.replication.factor"))));

            if (!topicNames.contains(outputTopic1))
                topics.add(new NewTopic(
                        envProps.getProperty("output.drama.topic.name"),
                        Integer.parseInt(envProps.getProperty("output.drama.topic.partitions")),
                        Short.parseShort(envProps.getProperty("output.drama.topic.replication.factor"))));

            if (!topicNames.contains(outputTopic2))
                topics.add(new NewTopic(
                        envProps.getProperty("output.fantasy.topic.name"),
                        Integer.parseInt(envProps.getProperty("output.fantasy.topic.partitions")),
                        Short.parseShort(envProps.getProperty("output.fantasy.topic.replication.factor"))));

            if (!topicNames.contains(outputTopic3))
                topics.add(new NewTopic(
                        envProps.getProperty("output.other.topic.name"),
                        Integer.parseInt(envProps.getProperty("output.other.topic.partitions")),
                        Short.parseShort(envProps.getProperty("output.other.topic.replication.factor"))));

            if (!topics.isEmpty())
                client.createTopics(topics).all().get();
        }
    }

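    /**
     * Loads the configuration file from the configuration directory.
     *
     * @param fileName path of the properties file
     * @return the loaded properties
     * @throws IOException if the file cannot be read
     */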
    private Properties loadEnvProperties(String fileName) throws IOException {
        Properties envProps = new Properties();
        try (FileInputStream input = new FileInputStream(fileName)) {
            envProps.load(input);
        }
        return envProps;
    }

    /**
     * Builds the Serdes required for the topics.
     *
     * @return a Serde for Acting messages
     */
    private static ProtobufSerdes<ActingOuterClass.Acting> actingProtobufSerdes() {
        return new ProtobufSerdes<>(ActingOuterClass.Acting.parser());
    }
}
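The four if blocks in preCreateTopics all perform the same check: is a configured topic absent from the cluster's topic list? As a sketch, that check can be written once as a set difference over topic names. MissingTopics and its missing method are hypothetical names used for illustration; in the application the existing names would come from listTopics() and the result would feed the NewTopic list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, for illustration only; not part of SplitMovieStreamApp.
final class MissingTopics {
    // Returns the wanted topic names that the cluster does not have yet,
    // keeping the order of "wanted" and dropping duplicates.
    static List<String> missing(Collection<String> existing, List<String> wanted) {
        Set<String> result = new LinkedHashSet<>(wanted);
        result.removeAll(existing);
        return new ArrayList<>(result);
    }

    public static void main(String[] args) {
        List<String> existing = Arrays.asList("acting-events");
        List<String> wanted = Arrays.asList(
                "acting-events", "drama-acting-events", "fantasy-acting-events", "other-acting-events");
        System.out.println(missing(existing, wanted));
    }
}
```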

The main logic lives in the buildTopology method, where we call KStream's branch method to split the input stream into three sub-streams by genre.
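branch evaluates its predicates in order and routes each record to the first branch whose predicate matches. A record that matches no predicate is dropped, which is why the last predicate in buildTopology is a catch-all. The sketch below reproduces this first-match-wins behavior with plain java.util.function.Predicate over genre strings, without Kafka. BranchSketch and route are hypothetical names used for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration of first-match-wins routing; not the Kafka Streams implementation.
final class BranchSketch {
    // Returns one list per predicate. Each item goes to the first predicate it satisfies.
    @SafeVarargs
    static <T> List<List<T>> route(List<T> items, Predicate<T>... predicates) {
        List<List<T>> branches = new ArrayList<>();
        for (int i = 0; i < predicates.length; i++) {
            branches.add(new ArrayList<>());
        }
        for (T item : items) {
            for (int i = 0; i < predicates.length; i++) {
                if (predicates[i].test(item)) {
                    branches.get(i).add(item);
                    break; // first match wins; later predicates are not consulted
                }
            }
            // An item that satisfies no predicate is dropped.
        }
        return branches;
    }

    public static void main(String[] args) {
        List<String> genres = Arrays.asList("drama", "comedy", "drama", "fantasy", "crime");
        List<List<String>> out = route(genres,
                g -> "drama".equalsIgnoreCase(g),
                g -> "fantasy".equalsIgnoreCase(g),
                g -> true); // catch-all, mirrors the third predicate in buildTopology
        System.out.println(out);
    }
}
```

Predicate order matters: if the catch-all came first, it would capture every record and the drama and fantasy branches would stay empty.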

Writing the Test Producer and Consumer

As in earlier posts in this series, we write TestProducer and TestConsumer classes. Create TestProducer.java and TestConsumer.java under src/main/java/huxihx/kafkastreams/tests with the following contents:

TestProducer.java:

package huxihx.kafkastreams.tests;

import huxihx.kafkastreams.proto.ActingOuterClass;
import huxihx.kafkastreams.serdes.ProtobufSerializer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Arrays;
import java.util.List;
import java.util.Properties;

public class TestProducer {
    // Test input messages
    private static final List<ActingOuterClass.Acting> TEST_ACTING_EVENTS = Arrays.asList(
            ActingOuterClass.Acting.newBuilder().setName("Meryl Streep").setTitle("The Iron Lady").setGenre("drama").build(),
            ActingOuterClass.Acting.newBuilder().setName("Will Smith").setTitle("Men in Black").setGenre("comedy").build(),
            ActingOuterClass.Acting.newBuilder().setName("Matt Damon").setTitle("The Martian").setGenre("drama").build(),
            ActingOuterClass.Acting.newBuilder().setName("Judy Garlandp").setTitle("The Wizard of Oz").setGenre("fantasy").build(),
            ActingOuterClass.Acting.newBuilder().setName("Jennifer Aniston").setTitle("Office Space").setGenre("comedy").build(),
            ActingOuterClass.Acting.newBuilder().setName("Bill Murray").setTitle("Ghostbusters").setGenre("fantasy").build(),
            ActingOuterClass.Acting.newBuilder().setName("Christian Bale").setTitle("The Dark Knight").setGenre("crime").build(),
            ActingOuterClass.Acting.newBuilder().setName("Laura Dern").setTitle("Jurassic Park").setGenre("fantasy").build(),
            ActingOuterClass.Acting.newBuilder().setName("Keanu Reeves").setTitle("The Matrix").setGenre("fantasy").build(),
            ActingOuterClass.Acting.newBuilder().setName("Russell Crowe").setTitle("Gladiator").setGenre("drama").build(),
            ActingOuterClass.Acting.newBuilder().setName("Diane Keaton").setTitle("The Godfather: Part II").setGenre("crime").build()
    );

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, new ProtobufSerializer<ActingOuterClass.Acting>().getClass());

        try (final Producer<String, ActingOuterClass.Acting> producer = new KafkaProducer<>(props)) {
            TEST_ACTING_EVENTS.stream().map(acting -> new ProducerRecord<String, ActingOuterClass.Acting>("acting-events", acting))
                    .forEach(producer::send);
        }

    }
}

TestConsumer.java:

package huxihx.kafkastreams.tests;

import com.google.protobuf.Parser;
import huxihx.kafkastreams.proto.ActingOuterClass;
import huxihx.kafkastreams.serdes.ProtobufDeserializer;
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.Deserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class TestConsumer {

    public static void main(String[] args) {
        if (args.length < 1) {
            throw new IllegalStateException("Must specify an output topic name.");
        }

        Deserializer<ActingOuterClass.Acting> deserializer = new ProtobufDeserializer<>();
        Map<String, Parser<ActingOuterClass.Acting>> config = new HashMap<>();
        config.put("parser", ActingOuterClass.Acting.parser());
        deserializer.configure(config, false);

        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "test-group");
        props.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, "1000");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (final Consumer<String, ActingOuterClass.Acting> consumer = new KafkaConsumer<>(props, new StringDeserializer(), deserializer)) {
            consumer.subscribe(Arrays.asList(args[0]));
            while (true) {
                ConsumerRecords<String, ActingOuterClass.Acting> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, ActingOuterClass.Acting> record : records) {
                    System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());
                }
            }
        }
    }
}

Testing

First, run the following command to build the project:

$ ./gradlew shadowJar

Then start the Kafka cluster and run the Kafka Streams application:

$ java -jar build/libs/kstreams-transform-standalone-0.0.1.jar configuration/dev.properties

Now start TestProducer to send the test events:

$ java -cp build/libs/kstreams-transform-standalone-0.0.1.jar huxihx.kafkastreams.tests.TestProducer

Finally, start TestConsumer to verify that Kafka Streams split the input stream into three sub-streams:

$ java -cp build/libs/kstreams-transform-standalone-0.0.1.jar huxihx.kafkastreams.tests.TestConsumer drama-acting-events
......
offset = 0, key = null, value = name: "Meryl Streep"
title: "The Iron Lady"
genre: "drama"

offset = 1, key = null, value = name: "Matt Damon"
title: "The Martian"
genre: "drama"

offset = 2, key = null, value = name: "Russell Crowe"
title: "Gladiator"
genre: "drama"

$ java -cp build/libs/kstreams-transform-standalone-0.0.1.jar huxihx.kafkastreams.tests.TestConsumer fantasy-acting-events
......
offset = 0, key = null, value = name: "Judy Garlandp"
title: "The Wizard of Oz"
genre: "fantasy"

offset = 1, key = null, value = name: "Bill Murray"
title: "Ghostbusters"
genre: "fantasy"

offset = 2, key = null, value = name: "Laura Dern"
title: "Jurassic Park"
genre: "fantasy"

offset = 3, key = null, value = name: "Keanu Reeves"
title: "The Matrix"
genre: "fantasy"

$ java -cp build/libs/kstreams-transform-standalone-0.0.1.jar huxihx.kafkastreams.tests.TestConsumer other-acting-events
......
offset = 0, key = null, value = name: "Will Smith"
title: "Men in Black"
genre: "comedy"

offset = 1, key = null, value = name: "Jennifer Aniston"
title: "Office Space"
genre: "comedy"

offset = 2, key = null, value = name: "Christian Bale"
title: "The Dark Knight"
genre: "crime"

offset = 3, key = null, value = name: "Diane Keaton"
title: "The Godfather: Part II"
genre: "crime"

Summary

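This post showed how to use the branch method of KStream to split an input message stream into multiple streams. Splitting lets you apply different computation logic to each resulting stream later on. The next post demonstrates the operator that is the counterpart of branch: merging message streams (merge).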