1. Model level
### 1. DataStream API
Using a Data Source:
environment.fromSource(
        Source<OUT, ?, ?> source,
        WatermarkStrategy<OUT> timestampsAndWatermarks,
        String sourceName)
Legacy variant: StreamExecutionEnvironment.addSource(sourceFunction)
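The two entry points above can be sketched as follows (a minimal sketch, assuming Flink is on the classpath; NumberSequenceSource is a bounded example source shipped in flink-core in recent Flink versions):

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.connector.source.lib.NumberSequenceSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FromSourceSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // New-style source (Data Source API): a Source, a WatermarkStrategy, and a name.
        env.fromSource(
                new NumberSequenceSource(1, 100),   // bounded example source from flink-core
                WatermarkStrategy.noWatermarks(),   // no event-time timestamps needed here
                "number-source")
           .print();

        env.execute("fromSource sketch");
    }
}
```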
### 2. DataSet API
DataSet Transformations
### 3. Table API & SQL
Dependencies for Java development:
flink-table-common
flink-table-api-java-bridge
flink-table-planner-blink
flink-table-runtime-blink
Imports:
org.apache.flink.table.api.Table;
org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
org.apache.flink.table.api.bridge.java.BatchTableEnvironment;
Flink 1.11 introduced new Table Source and Sink interfaces (DynamicTableSource and DynamicTableSink):
org.apache.flink.table.connector.source
org.apache.flink.table.connector.sink
These interfaces unify batch and streaming jobs.
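A minimal sketch of wiring up a StreamTableEnvironment (assuming the bridge and planner dependencies above are on the classpath; the query is illustrative):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class TableApiSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Build an ad-hoc inline table and query it with SQL
        // (f0 is the default column name for atomic values).
        Table numbers = tEnv.fromValues(1, 2, 3);
        tEnv.createTemporaryView("numbers", numbers);
        Table doubled = tEnv.sqlQuery("SELECT f0 * 2 FROM numbers");

        doubled.execute().print();  // collect results to stdout
    }
}
```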
2. Data Types
Supported Data Types
Type handling
Creating a TypeInformation or TypeSerializer
Data Types in the Table API
The Table API uses org.apache.flink.table.types.DataType, including when defining connectors, catalogs, or user-defined functions.
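The two type systems can be contrasted in a short sketch (assumes flink-core and flink-table-common on the classpath): TypeInformation drives DataStream/DataSet serialization, while DataType describes logical types for the Table API.

```java
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.types.DataType;

public class TypesSketch {
    public static void main(String[] args) {
        // Capture a generic type for the DataStream/DataSet runtime.
        TypeInformation<Tuple2<String, Integer>> info =
                TypeInformation.of(new TypeHint<Tuple2<String, Integer>>() {});

        // Declare a logical row type for the Table API
        // (used when defining connectors, catalogs, or UDFs).
        DataType row = DataTypes.ROW(
                DataTypes.FIELD("name", DataTypes.STRING()),
                DataTypes.FIELD("count", DataTypes.INT()));
    }
}
```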
3. Connectors
In terms of data, there are three kinds of connectors:
DataStream Connectors
DataSet Connectors
Table & SQL Connectors
Roles:
01. DataStream Connectors
Predefined Sources and Sinks
Bundled Connectors
Connectors in Apache Bahir
Other Ways to Connect to Flink
Data Enrichment via Async I/O
Queryable State
02. DataSet Connectors
file systems
other systems using Input/OutputFormat wrappers for Hadoop
03. Table & SQL Connectors: register table sources and table sinks
Flink’s table connectors.
User-defined Sources & Sinks: developing a custom, user-defined connector.
Phases: Metadata, Planning, Runtime
Implementation:
Dynamic Table Source
Dynamic Table Sink
Dynamic Table Factories
Encoding / Decoding Formats
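A skeleton of a custom scan source gives the shape of the interface (a sketch assuming Flink's table connector API on the classpath; the class name and the inline SourceFunction body are illustrative placeholders):

```java
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.table.connector.ChangelogMode;
import org.apache.flink.table.connector.source.DynamicTableSource;
import org.apache.flink.table.connector.source.ScanTableSource;
import org.apache.flink.table.connector.source.SourceFunctionProvider;
import org.apache.flink.table.data.RowData;

public class MyScanSource implements ScanTableSource {

    @Override
    public ChangelogMode getChangelogMode() {
        return ChangelogMode.insertOnly();  // this source only emits inserts
    }

    @Override
    public ScanRuntimeProvider getScanRuntimeProvider(ScanContext ctx) {
        // Bridge planning to a runtime implementation; the function body is a stub.
        SourceFunction<RowData> fn = new SourceFunction<RowData>() {
            @Override public void run(SourceContext<RowData> out) { /* emit rows here */ }
            @Override public void cancel() { }
        };
        return SourceFunctionProvider.of(fn, /* bounded = */ true);
    }

    @Override
    public DynamicTableSource copy() {
        return new MyScanSource();
    }

    @Override
    public String asSummaryString() {
        return "MyScanSource";
    }
}
```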
Predefined Sources and Sinks
1. Pre-defined source connectors
Custom Sources (SourceOperator):
flink-core
org.apache.flink.api.connector.source.SourceSplit
org.apache.flink.api.connector.source.SourceReader
org.apache.flink.api.connector.source.SplitEnumerator
org.apache.flink.api.connector.source.event.NoMoreSplitsEvent
Useful when implementing a new data source or when trying to understand how Flink's data sources work.
Sources and sinks are often summarized under the term connector.
4. Refactored Source Interface
1. Data Source API
The Source abstraction Flink provides: the Data Source API
01. A Data Source has three core components:
Splits, the SplitEnumerator, and the SourceReader.
In the bounded/batch case,
the enumerator generates a fixed set of splits, and each split is necessarily finite.
Once reading completes, NoMoreSplits is returned: a finite set of splits, each of which is bounded.
In the unbounded streaming case,
at least one of the two does not hold (splits are not finite, or the enumerator keeps generating new splits).
For example:
Bounded File Source
Unbounded Streaming File Source
the SplitEnumerator never responds with NoMoreSplits and periodically checks for new content
02. The Source API is a factory-style interface for creating the following components:
Split Serializer
Split Enumerator
Enumerator Checkpoint Serializer
Source Reader (consumes records from the splits)
03. The SplitReader is the high-level API for simple synchronous reading/polling-based source implementations, used together with:
SourceReaderBase
SplitFetcherManager
Event Time and Watermarks for sources: do not use the legacy assigner functions, since the new sources already assign timestamps and watermarks.
2. Data Source Function
01. Pre-defined Sources and Sinks
(built into Flink and usable directly, typically for debugging and verification; no external dependencies needed)
Pre-implemented source functions:
File-based
Socket-based
Collection-based
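The three families above can be shown in one short sketch (assumes Flink on the classpath; the file path and host/port are illustrative):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class PredefinedSourcesSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("a", "b", "c").print();            // Collection-based
        env.readTextFile("file:///tmp/input.txt").print();  // File-based (path illustrative)
        env.socketTextStream("localhost", 9999).print();    // Socket-based (host/port illustrative)

        env.execute("predefined sources sketch");
    }
}
```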
02. Connectors provide code for interfacing with various third-party systems.
001. Flink already ships some bundled connectors (the relevant connector classes must be packaged into the job jar):
public abstract class KafkaDynamicSinkBase implements DynamicTableSink
public interface ScanTableSource extends DynamicTableSource
org.apache.flink.table.connector.sink.DynamicTableSink
org.apache.flink.table.connector.source.DynamicTableSource
002. Connectors in Apache Bahir
03. Flink provides an Async I/O API, typically used to access external databases.
Async I/O handles multiple requests concurrently, increasing throughput and reducing latency.
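The underlying idea (which Flink's AsyncDataStream/AsyncFunction wrap) can be illustrated with plain JDK futures; all names below are illustrative, not Flink API:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class AsyncLookupSketch {
    // Simulated slow external lookup (e.g. a database query); illustrative only.
    static CompletableFuture<String> lookup(int key) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(50); // pretend network/database latency
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "value-" + key;
        });
    }

    public static void main(String[] args) {
        // Issue all lookups concurrently instead of one at a time, so total
        // latency is roughly one round trip rather than ten.
        List<CompletableFuture<String>> inFlight = IntStream.range(0, 10)
                .mapToObj(AsyncLookupSketch::lookup)
                .collect(Collectors.toList());

        List<String> results = inFlight.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());

        System.out.println(results);
    }
}
```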
04. Queryable State