When an analysis is executed, it performs the enhancements and analyses specified in its design and pushes manipulated data to data store outputs. These data store outputs can then be used by other data stages.
Run an analysis
An analysis must be run at least once before its data store outputs can be used by other data stages. When you run an analysis for the first time, manipulated data is pushed to one or more data store outputs.
If the data store input receives new data from its source, you will need to re-run the analysis to manipulate the newly received data. Complete the following steps to run an analysis:
- Navigate to the pipeline and path that contains the analysis.
- Click the menu button to the right of the analysis that you want to run and select Execute > Run.
The Execution Parameters dialog displays any available parameters for the data load range of all data stores that are included in the analysis. The Name and Data Type of these parameters cannot be changed, and you cannot delete them.
- Add or delete other parameters as required, using the buttons provided.
The Data Type of these parameters is always String, but you can edit the Name and Value as required; see the conversion sketch after these steps.
- Click Run.
The Run/Rebuild Started dialog displays, and you can choose whether to stay on the current screen or view the execution in the execution status screen. For more details about the information and actions available from the execution status screen, see Executing data stages.
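Because every user-defined execution parameter is typed as String, any logic that consumes a parameter must convert its value itself. The following is a minimal sketch of that conversion step in Python; the parameter names (load_date, batch_size) and the parameters dictionary are illustrative, not part of the product.

```python
from datetime import date

# Hypothetical execution parameters as an analysis might receive them.
# All values arrive as strings, regardless of what they represent.
parameters = {
    "load_date": "2024-06-01",   # illustrative parameter name
    "batch_size": "5000",        # illustrative parameter name
}

# Convert the string values to the types the downstream logic needs.
load_date = date.fromisoformat(parameters["load_date"])
batch_size = int(parameters["batch_size"])

print(load_date, batch_size)  # 2024-06-01 5000
```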
Rebuild an analysis
Rebuilding an analysis overrides any data load range settings on its data store inputs and loads all available data into them. This option can be useful when you are developing a new analysis. Complete the following steps to rebuild an analysis:
- Navigate to the pipeline and path that contains the analysis.
- Click the menu button to the right of the analysis that you want to rebuild and select Execute > Rebuild.
The Confirm Rebuild dialog is displayed.
- Click Yes to rebuild, or No to cancel.
- If you are rebuilding a streaming analysis, the Execution Parameters dialog is displayed. For every streaming data store in the analysis you can select a value for Starting Offsets. For information about the possible values you can enter, see Streaming execution parameters.
- Click OK to start the rebuild.
Execution record counts and caching
If global caching is turned off, you will notice record counts at nodes that are much greater (typically two or three times greater) than the number of known records in your data store input. This is a side effect of the recomputation that occurs when caching is disabled. For example, you might know that a data store has only 2,000,000 records; however, when viewing its execution history, you find that 4,000,000 records were processed at this data store's input node.
In this case, the 4,000,000 count reflects the fact that the analysis' flow of execution visited the data store input node twice, once for each branch that the flow splits into. As a result, 4,000,000 records were technically processed, even though the data store contains only 2,000,000.
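The arithmetic behind the inflated counts is straightforward: with caching disabled, the input node is recomputed once per downstream branch. A quick sketch, using the numbers from the example above:

```python
# With global caching off, an input node is recomputed once for each
# downstream branch that the execution flow splits into.
records_in_store = 2_000_000
downstream_branches = 2  # the flow splits into two nodes

records_processed_at_input = records_in_store * downstream_branches
print(records_processed_at_input)  # 4000000
```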
Field mapping in data store outputs when an analysis changes
If an analysis has run and you want to modify it in a way that adds, removes, or renames fields in your data store outputs, you first need to:
- Modify the data store outputs outside of the analysis.
- Make the corresponding changes within the analysis.
Adding a field
The following example describes how to add a field to an analysis after the analysis has already been run. In this case, the analysis pushes data to four fields in a data store output node, and you want to add a fifth field.
- Select the Pipelines menu at the top of the page.
- Click the menu button to the right of the data store, then select Edit > Edit Stage.
- Click the Fields tab and add the new field to the data store, then save your changes.
- Return to the Pipelines view, and open your analysis.
- Create the new field as a column within the analysis and map the new field in the data store output node.
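Conceptually, the output node's field mapping pairs each analysis column with a field in the data store. The structure below is purely illustrative (it is not the product's storage format, and all names are hypothetical); it models the fifth field being mapped after it has been added to the data store and created as a column:

```python
# Illustrative only: an output node's mapping, modeled as a dict of
# analysis column -> data store field. All names are hypothetical.
field_map = {
    "customer_id": "CustomerId",
    "order_date": "OrderDate",
    "amount": "Amount",
    "currency": "Currency",
}

# After adding the fifth field to the data store and creating the
# matching column in the analysis, map the new pair:
field_map["discount_rate"] = "DiscountRate"
```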
Renaming a field
If an analysis has already run and you want to change the name of a field in one of its data store outputs:
- Select the Pipelines menu at the top of the page.
- Click the menu button to the right of the data store, then select Edit > Edit Stage.
- Click the Fields tab and rename the field, then save your changes.
- Return to the Pipelines view, and open your analysis.
- Within the analysis, map the renamed field in the data store output node.
Removing a field
If an analysis has already run, you can remove a field from its data store output as follows:
- Select the Pipelines menu at the top of the page.
- Click the menu button to the right of the data store, then select Edit > Edit Stage.
- Click the Fields tab and remove the field, then save your changes.
- Return to the Pipelines view, and open your analysis.
- Within the analysis, clear the name of the field in the data store output node to remove it from the mapping.
Field mapping when data store outputs are used by other data stages
If you have run an analysis whose data store output is used by other data stages, remapping can quickly become complicated because you must make your modifications in every data stage that uses the affected data store.
To identify the affected data stages:
- Select the Pipelines menu at the top of the page.
- Click the menu button to the right of the data store and select Find Usages.
If the analysis data store outputs are not yet being used by other data stages, it can often be easier to delete the Data Store Output node within your analysis and then push to a new Data Store Output instead.
Schedule executions
Users with administer permissions on an analysis can schedule its executions. How frequently you schedule executions depends primarily on how often the analysis' data store inputs receive new data.
For example, if you have a data store that receives new data every day, you could schedule an analysis that uses that data store to execute daily, after the new data has been received. When it executes, the analysis performs the manipulations specified in its design on all data, including the newly received data, and pushes the resulting output data set to its data store output when execution finishes.
Scheduling analysis executions to keep pace with newly arrived data is an effective way to automate data transformation and machine learning in near real time.
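One practical detail when choosing a schedule is leaving a buffer after the data normally arrives, so a late-running load is not missed. A small planning sketch, assuming a nightly 02:00 data arrival (both the arrival time and the buffer are illustrative; the schedule itself is configured in the UI):

```python
from datetime import date, datetime, time, timedelta

# Hypothetical planning helper: choose a daily run time that leaves a
# safety margin after the data store's nightly load finishes.
data_arrival = time(2, 0)          # new data normally lands at 02:00
buffer = timedelta(minutes=30)     # margin for late-running loads

run_at = (datetime.combine(date.today(), data_arrival) + buffer).time()
print(run_at)  # 02:30:00 -- schedule the analysis for this time
```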
Execution of streaming analysis
You can execute an analysis that contains streaming nodes, with the following differences from batch analysis execution:
- A streaming analysis runs until it is terminated or fails.
- Running and rebuilding behave differently for streaming analyses than for batch analyses. There is no concept of ref time (cycles) for a streaming analysis; instead, rebuilding a streaming analysis cleans out the checkpoint directory before starting a run.
- Rerun is not supported for streaming analysis.
- Rollback is supported for streaming analysis, but in most cases it is not applicable, because rerun is not supported.
Streaming execution parameters
When you rebuild a streaming analysis you can provide an optional Starting Offsets value for every streaming data store in the analysis. Select or enter one of the following values:
- earliest
- latest
- A JSON string specifying a starting offset for each topic specified by the streaming data store. You can specify earliest or latest for a topic in the JSON string by using the value -2 for earliest and -1 for latest. For example: {"topicA":{"0":23,"1":-1},"topicB":{"0":-1}}
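If you build the Starting Offsets value programmatically, it is simply the JSON serialization of a topic-to-partition-offset map, with -2 and -1 as the sentinels for earliest and latest. A minimal Python sketch (the topic names and partition numbers are illustrative):

```python
import json

EARLIEST = -2  # sentinel meaning "start from the earliest offset"
LATEST = -1    # sentinel meaning "start from the latest offset"

# Partition numbers are JSON object keys, so they appear as strings.
starting_offsets = {
    "topicA": {"0": 23, "1": LATEST},
    "topicB": {"0": LATEST},
}

print(json.dumps(starting_offsets, separators=(",", ":")))
# {"topicA":{"0":23,"1":-1},"topicB":{"0":-1}}
```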