Structured Streaming notes

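Based on micro-batch processing; since Spark 2.3 it also supports continuous processing (experimental)
Built on the Spark SQL engine
A streaming computation is expressed the same way as a standard batch query on a static table; Spark executes it as an incremental query on an unbounded input table.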
unbounded input table
- Every incoming record appears as a new row appended to the table
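
To make the "incremental query on an unbounded table" idea concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark code: the class `IncrementalWordCount` and the function `batch_word_count` are hypothetical names made up for this illustration. It shows that folding each micro-batch into running state gives the same answer as a batch query over the whole table.

```python
from collections import Counter


class IncrementalWordCount:
    """Toy incremental query: keeps running state instead of rescanning the table."""

    def __init__(self):
        self.input_table = []    # the unbounded input table: rows are only appended
        self.counts = Counter()  # state carried from one micro-batch to the next

    def process_batch(self, rows):
        # Every incoming record becomes a new row of the input table.
        self.input_table.extend(rows)
        # Only the new rows are folded into the state; old rows are not rescanned.
        self.counts.update(rows)
        return dict(self.counts)


def batch_word_count(table):
    """The equivalent batch query, run over the table as if it were static."""
    return dict(Counter(table))


query = IncrementalWordCount()
query.process_batch(["cat", "dog"])
result = query.process_batch(["dog", "owl"])
```

After both batches, `result` matches `batch_word_count(query.input_table)`, which is the equivalence the model relies on.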
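result table
- Updated after each new batch of input is processed. Three output modes:
  - complete mode: the entire updated result table is written out every time
  - append mode: only rows newly added to the result table are output; applies to queries where rows that were already output never change
  - update mode: only the rows changed by the new batch's results are output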
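
The three modes can be compared with a small plain-Python simulation. This is a sketch under assumptions, not Spark's implementation: `complete_and_update` and `append_only` are hypothetical helpers, the aggregation is a running word count, and append mode is shown on a stateless filter because its output rows never change once emitted. In Spark itself the mode is selected with `outputMode(...)` on the stream writer.

```python
def complete_and_update(batches):
    """Running word count; returns what each trigger would emit in both modes."""
    state = {}
    complete_out, update_out = [], []
    for batch in batches:
        changed = set()
        for word in batch:
            state[word] = state.get(word, 0) + 1
            changed.add(word)
        # complete mode: the whole result table, every trigger
        complete_out.append(dict(state))
        # update mode: only the rows touched by this batch
        update_out.append({word: state[word] for word in changed})
    return complete_out, update_out


def append_only(batches, predicate):
    """Stateless filter query: each trigger emits only its new rows."""
    return [[row for row in batch if predicate(row)] for batch in batches]


batches = [["cat", "dog"], ["dog", "owl"]]
complete_out, update_out = complete_and_update(batches)
append_out = append_only(batches, lambda word: word != "cat")
```

On the second trigger, complete mode emits all three words, update mode emits only `dog` and `owl`, and append mode emits just the rows that arrived in that batch.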
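event time
- The time at which the event occurred at its source, carried in the data itself (for example as a timestamp column). Distinct from the time at which Spark receives the data.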
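
A short plain-Python sketch of why the distinction matters. All values are made up for illustration: `WINDOW`, `window_start`, and the `records` list are hypothetical, and each record is an `(event_time, arrival_time)` pair in seconds. Counting per 10-second window gives different answers depending on which timestamp is used, because one record arrives late.

```python
from collections import Counter

WINDOW = 10  # window length in seconds (hypothetical value)


def window_start(ts, size=WINDOW):
    """Start of the tumbling window that contains timestamp ts."""
    return ts - ts % size


# (event_time, arrival_time) pairs
records = [
    (1, 2),
    (12, 13),
    (8, 14),  # late record: happened in window [0, 10), arrived in [10, 20)
]

by_event_time = Counter(window_start(event) for event, _ in records)
by_arrival_time = Counter(window_start(arrival) for _, arrival in records)
```

Grouping by event time puts the late record in the window where it actually happened; grouping by arrival time attributes it to the wrong window.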
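fault-tolerance semantics
- end-to-end exactly-once
  - failures are caught and processing is retried
  - based on checkpointing and a write-ahead log (WAL), so a restarted query resumes from where it left off
- as opposed to:
  - at-most-once
    - each record is written at most once; a weak guarantee, since records can be lost on failure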
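
The recovery idea can be sketched as a toy model in plain Python. This is an assumption-laden illustration, not how Spark is implemented: `ReplayableSource`, `IdempotentSink`, `run_query`, `offset_log`, and `commit_log` are all hypothetical names. The offset range of each batch is logged before processing (the write-ahead step), a batch is marked committed only after the sink write, and on restart an uncommitted batch is replayed with the same offsets. Because the sink keys its writes by batch id, the replay overwrites instead of duplicating.

```python
class Crash(Exception):
    """Simulated process failure."""


class ReplayableSource:
    def __init__(self, records):
        self.records = list(records)

    def read(self, start, end):
        # The same offset range always returns the same records.
        return self.records[start:end]


class IdempotentSink:
    def __init__(self):
        self.batches = {}  # batch_id -> rows

    def write(self, batch_id, rows):
        # Writing the same batch twice overwrites, so replays cause no duplicates.
        self.batches[batch_id] = list(rows)

    def rows(self):
        return [row for b in sorted(self.batches) for row in self.batches[b]]


def run_query(source, sink, offset_log, commit_log, batch_size, crash_in_batch=None):
    """offset_log and commit_log stand in for state that survives a restart."""
    while True:
        if offset_log and offset_log[-1][0] not in commit_log:
            # Recovery: the last logged batch never committed, so replay it.
            batch_id, start, end = offset_log[-1]
        else:
            start = offset_log[-1][2] if offset_log else 0
            end = min(start + batch_size, len(source.records))
            if start == end:
                return  # no more input
            batch_id = len(offset_log)
            offset_log.append((batch_id, start, end))  # log before processing
        sink.write(batch_id, source.read(start, end))
        if batch_id == crash_in_batch:
            raise Crash(f"failed during batch {batch_id}")  # after write, before commit
        commit_log.add(batch_id)


source = ReplayableSource(["a", "b", "c", "d", "e"])
sink = IdempotentSink()
offset_log, commit_log = [], set()
try:
    run_query(source, sink, offset_log, commit_log, batch_size=2, crash_in_batch=1)
except Crash:
    # Restart with the same logs: batch 1 is replayed, then processing continues.
    run_query(source, sink, offset_log, commit_log, batch_size=2)
```

Even though batch 1 is written twice, every record ends up in the sink exactly once. An at-most-once design would instead skip the uncommitted batch on restart and risk losing its records if the write had not gone through.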
API based on Dataset and DataFrame