A dataflow is a series of operations that takes data from some source, processes that data, then writes the output to some destination. The processing of the data can be anything from simple sorting to more complex data quality and enrichment actions. The concept of a dataflow is simple, but you can design very complex dataflows with branching paths, multiple sources of input, and multiple output destinations.
There are four types of dataflows: jobs, services, subflows, and process flows.
Job
A job is a dataflow that performs batch processing. A job reads data from one or more files or databases, processes that data, and writes the output to one or more files or databases. Jobs are run manually through the UI or from the command line using the Job Executor.
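As a minimal sketch of a command-line launch, a script could invoke the Job Executor tool; the jar name, flags, credentials, and job name below are assumptions and may differ in your installation, so check your Spectrum documentation for the exact invocation:

```python
import subprocess

# Illustrative sketch only: the Job Executor jar name, host, credentials, and
# flag names below are assumptions -- confirm them against your Spectrum
# Technology Platform installation before use.
result = subprocess.run(
    [
        "java", "-jar", "jobexecutor.jar",
        "-h", "spectrum-server",    # server host (assumed flag)
        "-u", "admin",              # user name (assumed flag)
        "-p", "password",           # password (assumed flag)
        "-j", "StandardizeNames",   # name of the exposed job (hypothetical)
        "-w",                       # wait for the job to finish (assumed flag)
    ],
    capture_output=True,
    text=True,
)
print(result.stdout)
```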
This dataflow is a job. Note that it uses the Read from File stage for input and two Write to File stages for output.
Service
A service is a dataflow that you can access as a web service or through the Spectrum Technology Platform API. You pass a record to the service and optionally specify the options to use when processing the record. The service processes the data and returns the processed record.
Some services become available when you install a Spectrum product. For example, when you install Spectrum Universal Addressing, the ValidateAddress service becomes available on your system. In other cases, you must create a service in Spectrum Enterprise Designer and then expose that service on your system as a user-defined service. For example, Spectrum Spatial services are unavailable until you create a service using a Spectrum Spatial stage.
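As a minimal sketch of calling a service as a web service, assuming a typical REST exposure of ValidateAddress (the endpoint path, port, credentials, and Data.* field names here are assumptions that vary by installation and version), a request might look like this:

```python
import requests

# Illustrative sketch only: the URL pattern, port, and field names are
# assumptions -- consult your Spectrum Technology Platform API documentation
# for the exact endpoint and parameters.
url = "http://spectrum-server:8080/rest/ValidateAddress/results.json"
params = {
    "Data.AddressLine1": "100 Main St",
    "Data.City": "Anytown",
    "Data.StateProvince": "NY",
    "Data.PostalCode": "12345",
}
response = requests.get(url, params=params, auth=("admin", "password"))
response.raise_for_status()
print(response.json())  # the processed record returned by the service
```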
You can also design your own custom services in Spectrum Enterprise Designer. For example, the following dataflow determines if an address is at risk for flooding:
Subflow
A subflow is a dataflow that can be reused within other dataflows. Subflows are useful when you want to create a reusable process that you can easily incorporate into multiple dataflows. For example, you might want to create a subflow that performs deduplication using certain settings in each stage so that you can use the same deduplication process in multiple dataflows. To do this, you could create a subflow like this:
You could then use this subflow in a dataflow. For example, you could use the deduplication subflow within a dataflow that performs geocoding so that the data is deduplicated before the geocoding operation:
In this example, data would be read from a database and then passed to the deduplication subflow, where it would be processed through Match Key Generator, then Intraflow Match, then Best of Breed, and finally sent out of the subflow to the next stage in the parent dataflow, in this case Global Geocode. Subflows are represented by a puzzle piece icon in the dataflow, as shown above.
Subflows that are saved and exposed are displayed in the User Defined Stages folder.
Process Flow
A process flow runs a series of activities such as jobs and external applications. Each activity in the process flow runs after the previous activity finishes. Process flows are useful if you want to run multiple dataflows in sequence or if you want to run an external program. For example, a process flow could run a job to standardize names, then a job to validate addresses, and then invoke an external application to sort the records into the proper sequence to claim postal discounts. Such a process flow would look like this:
In this example, Standardize Names and Validate Addresses are jobs exposed on the Spectrum Technology Platform server. The Run Program activity invokes an external application, and the Success activity indicates the end of the process flow.
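As a rough sketch of the sequencing a process flow automates, the equivalent behavior scripted by hand might look like the following; the job names, jar name, flags, and the external sort command are all assumptions used for illustration, and in practice the process flow performs this sequencing for you on the server:

```python
import subprocess

# Illustrative sketch only: jar name, flags, job names, and the external
# sort command are assumptions -- a process flow handles this sequencing
# on the Spectrum Technology Platform server.
def run_job(job_name):
    # Run an exposed job and stop the sequence if it fails (hypothetical invocation).
    subprocess.run(
        ["java", "-jar", "jobexecutor.jar",
         "-h", "spectrum-server", "-u", "admin", "-p", "password",
         "-j", job_name, "-w"],
        check=True,
    )

run_job("Standardize Names")    # first activity
run_job("Validate Addresses")   # runs only after the previous activity finishes
subprocess.run(["postal_sort", "output.csv"], check=True)  # external program (hypothetical)
print("Success")                # end of the sequence
```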