Apache NiFi

What?

https://www.tutorialspoint.com/apache_nifi/index.htm

一个开源的数据萃取平台。

Apache NiFi is an open source data ingestion platform. It was developed by NSA and is now being maintained and further development is supported by Apache foundation. It is based on Java, and runs in Jetty server. It is licensed under the Apache license version 2.0. In this tutorial, we will be explaining the basics of Apache NiFi and its features.

http://nifi.apache.org/docs.html

Put simply, NiFi was built to automate the flow of data between systems. While the term 'dataflow' is used in a variety of contexts, we use it here to mean the automated and managed flow of information between systems. This problem space has been around ever since enterprises had more than one system, where some of the systems created data and some of the systems consumed data. The problems and solution patterns that emerged have been discussed and articulated extensively. A comprehensive and readily consumed form is found in the Enterprise Integration Patterns [eip].

The core concepts of NiFi

http://nifi.apache.org/docs.html

NiFi’s fundamental design concepts closely relate to the main ideas of Flow Based Programming [fbp]. Here are some of the main NiFi concepts and how they map to FBP:

NiFi Term FBP Term Description

FlowFile

Information Packet

A FlowFile represents each object moving through the system and for each one, NiFi keeps track of a map of key/value pair attribute strings and its associated content of zero or more bytes.

FlowFile Processor

Black Box

Processors actually perform the work. In [eip] terms a processor is doing some combination of data routing, transformation, or mediation between systems. Processors have access to attributes of a given FlowFile and its content stream. Processors can operate on zero or more FlowFiles in a given unit of work and either commit that work or rollback.

Connection

Bounded Buffer

Connections provide the actual linkage between processors. These act as queues and allow various processes to interact at differing rates. These queues can be prioritized dynamically and can have upper bounds on load, which enable back pressure.

Flow Controller

Scheduler

The Flow Controller maintains the knowledge of how processes connect and manages the threads and allocations thereof which all processes use. The Flow Controller acts as the broker facilitating the exchange of FlowFiles between processors.

Process Group

subnet

A Process Group is a specific set of processes and their connections, which can receive data via input ports and send data out via output ports. In this manner, process groups allow creation of entirely new components simply by composition of other components.

NiFi Term	FBP Term	Description
FlowFile	Information Packet	A FlowFile represents each object moving through the system and for each one, NiFi keeps track of a map of key/value pair attribute strings and its associated content of zero or more bytes.
FlowFile Processor	Black Box	Processors actually perform the work. In [eip] terms a processor is doing some combination of data routing, transformation, or mediation between systems. Processors have access to attributes of a given FlowFile and its content stream. Processors can operate on zero or more FlowFiles in a given unit of work and either commit that work or rollback.
Connection	Bounded Buffer	Connections provide the actual linkage between processors. These act as queues and allow various processes to interact at differing rates. These queues can be prioritized dynamically and can have upper bounds on load, which enable back pressure.
Flow Controller	Scheduler	The Flow Controller maintains the knowledge of how processes connect and manages the threads and allocations thereof which all processes use. The Flow Controller acts as the broker facilitating the exchange of FlowFiles between processors.
Process Group	subnet	A Process Group is a specific set of processes and their connections, which can receive data via input ports and send data out via output ports. In this manner, process groups allow creation of entirely new components simply by composition of other components.

架构 - 支持集群

NiFi Architecture

NiFi executes within a JVM on a host operating system. The primary components of NiFi on the JVM are as follows:

Web Server

The purpose of the web server is to host NiFi’s HTTP-based command and control API.

Flow Controller

The flow controller is the brains of the operation. It provides threads for extensions to run on, and manages the schedule of when extensions receive resources to execute.

Extensions

There are various types of NiFi extensions which are described in other documents. The key point here is that extensions operate and execute within the JVM.

FlowFile Repository

The FlowFile Repository is where NiFi keeps track of the state of what it knows about a given FlowFile that is presently active in the flow. The implementation of the repository is pluggable. The default approach is a persistent Write-Ahead Log located on a specified disk partition.

Content Repository

The Content Repository is where the actual content bytes of a given FlowFile live. The implementation of the repository is pluggable. The default approach is a fairly simple mechanism, which stores blocks of data in the file system. More than one file system storage location can be specified so as to get different physical partitions engaged to reduce contention on any single volume.

Provenance Repository

The Provenance Repository is where all provenance event data is stored. The repository construct is pluggable with the default implementation being to use one or more physical disk volumes. Within each location event data is indexed and searchable.

NiFi is also able to operate within a cluster.

Starting with the NiFi 1.0 release, a Zero-Master Clustering paradigm is employed. Each node in a N

Flow配置示例

https://github.com/xmlking/nifi-examples

csv-to-json

This flow shows how to convert a CSV entry to a JSON document using ExtractText and ReplaceText.

decompression

This flow demonstrates taking an archive that is created with several levels of compression and then continuously decompressing it using a loop until the archived file is extracted out.

Getting Started with Apache NiFi

https://nifi.apache.org/docs/nifi-docs/html/getting-started.html#downloading-and-installing-nifi

Apache NiFi User Guide

https://nifi.apache.org/docs/nifi-docs/html/user-guide.html

Tutorial

读取文件上传到mongo

https://dzone.com/articles/gentle-introduction-to-apache-nifi-for-dataflow-an

处理器

https://nifichina.github.io/general/GettingStarted.html#%E6%9C%89%E5%93%AA%E4%BA%9B%E7%B1%BB%E5%88%AB%E7%9A%84%E5%A4%84%E7%90%86%E5%99%A8

简单实战

https://www.cnblogs.com/h--d/p/10079418.html