Purpose
To define a data flow or execution flow between tasks and/or subjobs.
Format
/FLOW task_or_subjob1 task_or_subjob2 [data_flow]
where
data_flow = [attributes…]
attributes = {dataflow_optimization}
dataflow_optimization = {DIRECT [VERIFY] | NOTDIRECT}
Arguments
task_or_subjob1    the pathname or alias of the first task or subjob in the execution flow.
task_or_subjob2    the pathname or alias of the second task or subjob in the execution flow.
Location
This option may appear anywhere in the job definition. When using direct data flows, however, specify the most data-intensive flows first to avoid ambiguity.
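For example, if the flow from task1 to task2 carries far more data than the flow from task2 to task3 (task names hypothetical), list it first:
/FLOW task1 task2
/FLOW task2 task3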
Notes
Data flows are established automatically at run-time between connected tasks. This is done by matching the fully qualified names of sources and targets and adding a dataflow link from the target of the first task to the matching source of the second task.
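For example (file path hypothetical): if task1 writes the target /data/stage.dat and task2 reads a source with the same fully qualified name, the following statement links task1's target to task2's matching source at run time:
/FLOW task1 task2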
If either of the two tasks or subjobs specified with the /FLOW option is a subjob, no data flow is established.
Data flows are always established automatically between two tasks in a /FLOW statement.
When direct data flows are enabled globally via the /DEFAULTFLOW option, individual data flows between two tasks can be specified as not direct with the NOTDIRECT attribute. Conversely, when direct data flows are disabled globally, individual data flows can be specified as direct with the DIRECT attribute.
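For example, with direct data flows enabled globally through /DEFAULTFLOW, the following statement (task names hypothetical) exempts a single flow from the optimization:
/FLOW task1 task2 NOTDIRECT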
Direct data-flows bypass writing the intermediate file to disk for better performance. This attribute can only be specified for data flows connecting a single file target to a single file source.
When the VERIFY keyword is specified and direct data flow optimization is enabled, Connect ETL generates a warning that identifies the data flows connecting a single file target to a single file source that cannot be treated as direct data flows and provides the reasons.
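For example, the following statement requests a direct data flow and, through VERIFY, a warning with the reasons if the flow cannot be treated as direct (task names hypothetical):
/FLOW task1 task2 DIRECT VERIFY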
Data-flows cannot be optimized into direct data-flows in the following cases:
- The source or target of the data flow has more than one connection. If a task's single target file has data-flow connections to two other tasks, for example, neither of those flows can be optimized into a direct data flow (see the sketch after this list).
- The source of the data-flow contains a header layout.
- The file name of the data flow target file contains a wildcard pattern.
- The DTL job is customized with a third-party language.
- Optimizing the data-flow into a direct data-flow creates a job cycle, which causes an infinite loop in the run sequence.
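As a sketch of the first case above (task names hypothetical), the following pair of statements connects the single target of task1 to two tasks, so neither flow can be optimized into a direct data flow:
/FLOW task1 task2
/FLOW task1 task3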
Examples
/FLOW task1 task2
A data flow between two tasks.
/FLOW task1 task2 DIRECT
A direct data flow between two tasks.