Airflow notes

Official site: http://airflow.incubator.apache.org/project.html

Here we pass a string that defines the dag_id, which serves as a unique identifier for your DAG.
The first argument task_id acts as a unique identifier for the task.

The precedence rules for a task are as follows:
Explicitly passed arguments
Values that exist in the default_args dictionary
The operator’s default value, if one exists
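
The lookup order above can be modeled in a few lines of plain Python. This is only a sketch of the rule, not Airflow's actual implementation; the helper name resolve_argument and the sample values are hypothetical.

```python
def resolve_argument(name, explicit, default_args, operator_defaults):
    """Return the value for `name` following the three precedence levels."""
    if name in explicit:             # 1. explicitly passed argument
        return explicit[name]
    if name in default_args:         # 2. value from the default_args dictionary
        return default_args[name]
    if name in operator_defaults:    # 3. the operator's own default, if any
        return operator_defaults[name]
    raise KeyError(name)

# Hypothetical sample values, chosen only for illustration.
default_args = {"owner": "airflow", "retries": 1}
operator_defaults = {"retries": 0, "queue": "default"}

retries = resolve_argument("retries", {"retries": 3}, default_args, operator_defaults)
owner = resolve_argument("owner", {}, default_args, operator_defaults)
queue = resolve_argument("queue", {}, default_args, operator_defaults)
```

Here retries comes from the explicit argument, owner from default_args, and queue from the operator default.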

A task must include or inherit the arguments task_id and owner

Let’s assume we’re saving the code from the previous step in tutorial.py in the DAGs folder referenced in your airflow.cfg.
The default location for your DAGs is ~/airflow/dags.

Note that if you use depends_on_past=True, individual task instances will depend on the success of the preceding task instance, except for the start_date specified itself, for which this dependency is disregarded.
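
The depends_on_past rule can be sketched as a small predicate. This is a simplified model, not Airflow's scheduler logic; the function name may_run is hypothetical.

```python
from datetime import datetime

def may_run(execution_date, start_date, previous_state):
    """Model depends_on_past=True for a single task instance."""
    if execution_date == start_date:
        # The first run has no predecessor, so the dependency is disregarded.
        return True
    # Otherwise the preceding task instance must have succeeded.
    return previous_state == "success"

start = datetime(2015, 6, 1)
first_run = may_run(datetime(2015, 6, 1), start, None)
after_failure = may_run(datetime(2015, 6, 2), start, "failed")
after_success = may_run(datetime(2015, 6, 2), start, "success")
```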

You can also set options with environment variables by using this format: $AIRFLOW__{SECTION}__{KEY}
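
As a minimal sketch, the variable name can be built from a section and key like this. The helper airflow_env_name and the SQLite path are hypothetical; only the naming format comes from the note above.

```python
import os

def airflow_env_name(section, key):
    """Build the environment variable name for an airflow.cfg option."""
    return "AIRFLOW__{}__{}".format(section.upper(), key.upper())

name = airflow_env_name("core", "sql_alchemy_conn")

# Setting the variable in this process; the value is only an example.
os.environ[name] = "sqlite:////tmp/airflow.db"
```

Note the double underscores on both sides of the section name.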
================================
# print the list of active DAGs
airflow list_dags

# print the list of tasks in the "tutorial" DAG
airflow list_tasks tutorial

airflow backfill: runs the DAG over a date range, e.g. airflow backfill tutorial -s 2015-06-01 -e 2015-06-07
airflow test: runs a single task instance for testing.
airflow webserver: starts the web server.

===============

1 LocalExecutor: an executor that can parallelize task instances locally.

2 Configuration file path: $AIRFLOW_HOME/airflow.cfg. The sql_alchemy_conn setting in this file points to the address of the metadata database.

3 Default value of AIRFLOW_HOME: ~/airflow

4 Admin -> Connection: the pipeline code you author references the conn_id of the Connection objects.

5 A value set in an environment variable takes precedence over the corresponding value in the configuration file.

6 An environment variable that defines a connection must have the prefix AIRFLOW_CONN_ and must be all uppercase: if the conn_id is named postgres_master, the environment variable should be named AIRFLOW_CONN_POSTGRES_MASTER.

The value of a connection environment variable should be in URI format, e.g. postgres://user:password@localhost:5432/master or s3://accesskey:secretkey@S3
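
A small sketch of both rules, using only the standard library. The helper conn_env_name is hypothetical, and the field names in the dictionary are chosen for illustration rather than taken from Airflow's own URI parsing.

```python
from urllib.parse import urlsplit

def conn_env_name(conn_id):
    """Build the environment variable name for a connection id."""
    return "AIRFLOW_CONN_" + conn_id.upper()

# Split the example URI from the note into its parts.
uri = "postgres://user:password@localhost:5432/master"
parts = urlsplit(uri)
fields = {
    "conn_type": parts.scheme,
    "login": parts.username,
    "password": parts.password,
    "host": parts.hostname,
    "port": parts.port,
    "schema": parts.path.lstrip("/"),
}
```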

7 Users can specify a logs folder in airflow.cfg. By default, it is in the AIRFLOW_HOME directory. 

  Logs are stored in the log folder as {dag_id}/{task_id}/{execution_date}/{try_number}.log.
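
A minimal sketch of how such a path is assembled. The helper task_log_path and the task name print_date are hypothetical, and formatting execution_date as an ISO timestamp is an assumption.

```python
from datetime import datetime

def task_log_path(dag_id, task_id, execution_date, try_number):
    """Build a log path relative to the configured logs folder."""
    return "{}/{}/{}/{}.log".format(
        dag_id, task_id, execution_date.isoformat(), try_number
    )

path = task_log_path("tutorial", "print_date", datetime(2015, 6, 1), 1)
```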

8 Operators: the airflow/contrib/ directory contains more operators built by the community.

9 a) SubDAG operators should contain a factory method that returns a DAG object.

 b) SubDAGs must have a schedule and be enabled.

 c) Refrain from using depends_on_past=True in tasks within the SubDAG, as this can be confusing.

 d) It is common to use the SequentialExecutor if you want to run the SubDAG in-process and effectively limit its parallelism to one. Using LocalExecutor can be problematic.

10

 

11 If you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59, that is, once the period it covers has ended.
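
The timing can be checked with plain datetime arithmetic. This is a sketch of the rule only; the function name earliest_trigger_time is hypothetical.

```python
from datetime import datetime, timedelta

def earliest_trigger_time(execution_date, schedule_interval):
    """A run is triggered only after the period it covers has ended."""
    return execution_date + schedule_interval

trigger = earliest_trigger_time(datetime(2016, 1, 1), timedelta(days=1))
```

So the run stamped 2016-01-01 starts at the beginning of 2016-01-02 at the earliest.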

12 The scheduler starts an instance of the executor specified in your airflow.cfg. If it happens to be the LocalExecutor, tasks will be executed as subprocesses; in the case of CeleryExecutor and MesosExecutor, tasks are executed remotely.

 13 Airflow lets you assign any task to an abstract Pool, and each Pool has a configured number of slots. Every task occupies one slot when it starts; once all slots are taken, the remaining tasks wait.
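
The slot behavior can be modeled with a toy class. This is a simplified sketch, not Airflow's Pool model: the class name SlotPool is hypothetical, and waiting tasks are released in first-in, first-out order here, which is an assumption.

```python
class SlotPool:
    """Toy pool: at most `slots` tasks run at once, the rest wait."""

    def __init__(self, slots):
        self.slots = slots
        self.running = []
        self.waiting = []

    def submit(self, task_id):
        # Take a slot if one is free, otherwise wait.
        if len(self.running) < self.slots:
            self.running.append(task_id)
        else:
            self.waiting.append(task_id)

    def finish(self, task_id):
        # Free the slot and hand it to the oldest waiting task, if any.
        self.running.remove(task_id)
        if self.waiting:
            self.running.append(self.waiting.pop(0))

pool = SlotPool(slots=2)
for task_id in ["a", "b", "c"]:
    pool.submit(task_id)
waiting_before = list(pool.waiting)  # "c" has no slot yet
pool.finish("a")                     # "c" takes the freed slot
```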

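14 Processing a DAG in one scheduler loop may take so long that it is still unfinished when the next loop begins. In that case Airflow does not create a new process for that DAG in the new loop, so processing of the remaining DAGs is not blocked.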

 15 A task must include or inherit the arguments task_id and owner, otherwise Airflow will raise an exception. For the full list of arguments a task supports, see the constructor of BaseOperator.

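16 retry_exponential_backoff makes the interval between retries grow progressively longer.
wait_for_downstream makes a task instance wait until the tasks immediately downstream of its previous instance have finished successfully.
weight_rule determines how each task's priority weight is computed.
execution_timeout sets the maximum time a task is allowed to run.
trigger_rule controls the condition under which a task is triggered.
task_concurrency limits how many instances of the same task can run in parallel.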
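
A sketch of how these arguments might be collected in a dictionary, for example to pass as default_args. The argument names are BaseOperator arguments in Airflow 1.x; every value below is a made-up example, and retries and retry_delay are included only because backoff needs them.

```python
from datetime import timedelta

task_args = {
    "retries": 3,                                 # example value
    "retry_delay": timedelta(minutes=1),          # base delay between retries
    "retry_exponential_backoff": True,            # lengthen the delay on each retry
    "wait_for_downstream": False,
    "weight_rule": "downstream",                  # how priority weight is computed
    "execution_timeout": timedelta(minutes=30),   # maximum run time of the task
    "trigger_rule": "all_success",                # run when all upstream tasks succeed
    "task_concurrency": 2,                        # max parallel runs of this task
}
```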

17 airflow test does not check a task's dependencies; it runs the task directly.

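18 Diagnosing Airflow's occasional excessive memory usage: old tasks had hung and never finished, so they stayed in the running state. Over time, the number of tasks in the running and queued states exceeded concurrency, and every task instance created afterwards remained in the scheduled state. Each time the scheduler schedules tasks, it fetches all task instances in the scheduled state and sorts and otherwise processes them; because there were so many of them, this consumed a large amount of memory.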

19 Template parameters:

{
            'dag': task.dag,
            'ds': ds,
            'ds_nodash': ds_nodash,
            'ts': ts,
            'ts_nodash': ts_nodash,
            'yesterday_ds': yesterday_ds,
            'yesterday_ds_nodash': yesterday_ds_nodash,
            'tomorrow_ds': tomorrow_ds,
            'tomorrow_ds_nodash': tomorrow_ds_nodash,
            'END_DATE': ds,
            'end_date': ds,
            'dag_run': dag_run,
            'run_id': run_id,
            'execution_date': self.execution_date,
            'prev_execution_date': prev_execution_date,
            'next_execution_date': next_execution_date,
            'latest_date': ds,
            'macros': macros,
            'params': params,
            'tables': tables,
            'task': task,
            'task_instance': self,
            'ti': self,
            'task_instance_key_str': ti_key_str,
            'conf': configuration,
            'test_mode': self.test_mode,
            'var': {
                'value': VariableAccessor(),
                'json': VariableJsonAccessor()
            }
        }
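
As a rough sketch, the date-related entries can be derived from the execution date as follows. The helper date_template_values is hypothetical, and the exact formats (especially for ts) may differ between Airflow versions.

```python
from datetime import datetime, timedelta

def date_template_values(execution_date):
    """Derive the date strings exposed to templates from one execution date."""
    ds = execution_date.strftime("%Y-%m-%d")
    yesterday_ds = (execution_date - timedelta(days=1)).strftime("%Y-%m-%d")
    tomorrow_ds = (execution_date + timedelta(days=1)).strftime("%Y-%m-%d")
    return {
        "ds": ds,
        "ds_nodash": ds.replace("-", ""),
        "ts": execution_date.isoformat(),  # assumed ISO format
        "yesterday_ds": yesterday_ds,
        "yesterday_ds_nodash": yesterday_ds.replace("-", ""),
        "tomorrow_ds": tomorrow_ds,
        "tomorrow_ds_nodash": tomorrow_ds.replace("-", ""),
    }

values = date_template_values(datetime(2016, 1, 1))
```

In a templated field these would be referenced as {{ ds }}, {{ yesterday_ds }} and so on.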

20

Original post: https://www.cnblogs.com/testzcy/p/8480036.html