You can use an analysis to manipulate data. The general process flow of any analysis is as follows:
(1) In the first phase of design, data store inputs are selected.
(2) In the second phase, fields from these data stores are manipulated as they move through some combination of Enhance, Combine, Shape, Check, System, and Analytics nodes.
(3) In the third phase, manipulated fields are pushed to and stored in one or more data store outputs, which are then saved in a pipeline to be used by other data stages.
Before you begin
Before you can create an analysis, you will need:
- Create Analysis and Write permissions to a pipeline.
- Access to at least one data store.
If you have prior knowledge of a data set, you may be able to approach design with a specific output in mind. On the other hand, the Analysis Designer can also work as a tool for experimentation and exploration, with no prior knowledge required.
Configure analysis designer nodes
- Select the Pipelines menu at the top of the page.
- Click the menu button to the right of the path in which you want to create the analysis.
- Select New > Analysis.
- Drag and drop nodes from the left of the screen onto the canvas.
- Connect the nodes to build your analysis.
- Select each node in turn and configure the properties on the right of the screen.
- Select the Current and Downstream Nodes button to choose how much interactive execution will occur on the current node.
- Select Apply Changes to save the changes.
Copy and paste nodes
- Select the appropriate node or nodes.
- Click Copy selection to the clipboard.
- Click Paste from clipboard.
- Drag the copied node or selection to the appropriate place on the canvas.
- Update the properties of the selection, as appropriate.
The copied node or selection is independent of the original. The two start with the same properties, but a change to one is not reflected in the other.
For more information about specific nodes, see Analysis Designer Nodes.
Configure analysis settings
- Select the Pipelines menu at the top of the page.
- From the Pipelines browser on the left of the screen, select your analysis.
- From the Analysis screen, click the Settings button to open the Analysis Settings dialog.
- Configure the properties, as required.
Details properties
- Description - Optionally, enter a description for the analysis. If you enter a description, this will be displayed in a tooltip when you hover over the analysis in the Pipelines view.
- Log Limit - Set the maximum number of errors that will be displayed in the Analysis Designer's output log when testing. This is also the maximum number of errors that the executor will allow before it exits the Analysis run and marks the run as Failed.
- Caching and Record Count - Specify whether caching will occur for all nodes in your Analysis, and whether to collect accurate record counts for all nodes. Caching the output of a node can increase the speed of an analysis when recomputes are required at points where the analysis splits. Caching the output of all nodes is recommended on smaller data sets, to save time. With larger data sets, however, global caching can cause a significant decrease in performance.
Cache output and collect accurate record counts for all nodes is turned on by default.
When collecting accurate record counts, the system's Execution History will track record counts at each node within an analysis.
Tip: For larger data sets, you can use the Cache Data node to only cache the output of specific nodes. See Cache Data Node.
- Sampling Data for Testing - Set the sample size of data used for testing an Analysis. Note that this setting will interact with the sampling settings made in Data Store Inputs and Sample nodes. Analysis-level settings are applied before settings made at the node level.
Runtime properties
You can create runtime properties by selecting a Data Store and mapping a field containing names to a field containing values. You can then reference the property via the name field throughout the Analysis, using the RUNTIME() function.
For example, consider the following sample data set:
id | value |
---|---|
001 | 100 |
002 | 200 |
003 | 300 |
Runtime data set
If you were to select id as the Property Name Field and value as the Property Name Value, you could use RUNTIME(id) in any node within your analysis to return the values for each unique name in your property.
In this case, RUNTIME(id) would return the following:
New Column |
---|
100 |
200 |
300 |
Note that values in the Property Name Value should be unique per name. If a name has more than one value associated with it, only the first value found will be returned for all records.
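The lookup behavior described above can be pictured with a short Python sketch. It only illustrates the name-to-value mapping and the "first value found" rule, and is not the product's implementation. The function name `build_runtime_properties` and the duplicate `001` row are hypothetical additions for the example.

```python
def build_runtime_properties(rows, name_field, value_field):
    """Map each property name to the first value found for it."""
    properties = {}
    for row in rows:
        # setdefault keeps the first value seen for a name and
        # leaves it unchanged when the same name appears again.
        properties.setdefault(row[name_field], row[value_field])
    return properties


# Sample runtime data set, plus one hypothetical duplicate name.
runtime_rows = [
    {"id": "001", "value": "100"},
    {"id": "002", "value": "200"},
    {"id": "003", "value": "300"},
    {"id": "001", "value": "999"},  # duplicate name: not used
]

properties = build_runtime_properties(runtime_rows, "id", "value")

# Look up the value for each name, as a per-record lookup might.
new_column = [properties[name] for name in ("001", "002", "003")]
print(new_column)  # ['100', '200', '300']
```

In this sketch the second `001` row is ignored, so every record that refers to the name `001` receives `100`.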
Also note that to use Runtime Properties in interactive mode - that is, within the sheets of nodes while building your Analysis - you need to use the Test button to simulate a run of the entire Analysis (rather than using Test Sheet).
Execution profile properties
You can override the default execution settings that are used when the Analysis is executed by using one or more of these properties:
- Environment Execution Profile - Select a predefined execution profile to inherit the execution settings from a profile that has been created on your environment by an administrator.
- Overriding Execution Sizing - Check this box to edit the execution settings for the selected Analysis. The options that are available, and the default values, will vary according to the selected Cluster Type.
- Execution Property - Check this box, then click the Add button to create new execution properties.
You can use a combination of these settings. For example, an Environment Execution Profile may include execution sizing configuration that you want to inherit on your Analysis, but you may want to also add a new execution property for this specific Analysis. Or, you could override the execution sizing settings of the Environment Execution Profile, while inheriting the execution properties defined on the profile.
Execution parameters
Define named parameters that you want to be provided for execution of the analysis. Parameters defined here will cause the execution parameters dialog to show those named parameters when you resample data, test or execute the analysis. Similarly, when you accept configuration changes for a node, the system will prompt you for values of these properties so that the interactive node sampling can use the values.
Use the Add From Analysis button to have the system look for named parameter references in the nodes of your analysis. Alternatively, add, modify or remove the named parameters using the Add, Edit and Delete buttons. Some parameters, such as those for passwords, can be set to be encrypted properties so that the clear text values aren’t shown on the executions page. Parameters used in the Execute Query in DB and Data Store Input nodes may have values that look up the actual value in the AWS Secrets Manager service configured for the product. These values take the form sec(secret_name optional_fieldname), where secret_name is the secret ARN or secret name, and optional_fieldname identifies the part to use when the secret contains multiple name/value parts.
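As a rough illustration of the sec(...) form, the following Python sketch splits a parameter value into its two parts. It assumes that the secret name and the optional field name are separated by whitespace, as the form above suggests. The function name, the `SecretReference` type, and the sample secret names are hypothetical, and the sketch does not contact AWS Secrets Manager.

```python
import re
from typing import NamedTuple, Optional


class SecretReference(NamedTuple):
    secret_name: str            # secret ARN or secret name
    field_name: Optional[str]   # None when no field name is given


# Assumed syntax: sec(<secret_name>) or sec(<secret_name> <field_name>)
_SEC_PATTERN = re.compile(r"^sec\(\s*(\S+)(?:\s+(\S+))?\s*\)$")


def parse_secret_reference(text: str) -> Optional[SecretReference]:
    """Return the parts of a sec(...) value, or None for any other value."""
    match = _SEC_PATTERN.match(text.strip())
    if match is None:
        return None
    return SecretReference(match.group(1), match.group(2))


# Hypothetical parameter values.
with_field = parse_secret_reference("sec(db_credentials password)")
without_field = parse_secret_reference("sec(db_credentials)")
plain_value = parse_secret_reference("not_a_secret_reference")

print(with_field)
print(without_field)
print(plain_value)
```

A value that does not match the assumed form is returned as `None`, so a caller could treat it as an ordinary literal parameter value.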
Showing and working with sheets
Sheets are displayed in the grid at the bottom of the Analysis Designer screen.
When building a new analysis, it is recommended that you keep the grid icon selected to display the test data sheet.
Every time you add a new node to your analysis, a new sheet will be generated automatically. Each sheet will then display what is happening to your fields at that point in the analysis.
Adding new columns
When working with sheets, you can add new columns (that is, fields) that are composed of functions that manipulate other fields.
To add a new column, click the New Column button on the sheet toolbar.
Once columns are added, they can be treated like any other field and pushed to new Analysis Designer nodes.
Unmasking secure fields
If you have permission to unmask a secure field, you may unmask records in bulk using the Unmask All button. Note that only secure fields may be unmasked. Fields that are encrypted but not secure will be shown as masked in analysis sheets, but you cannot use the Unmask All button to unmask these values.