Create an analysis - Data360_DQ+ - Latest

Data360 DQ+ Help

Product type: Software
Portfolio: Verify
Product family: Data360
Product: Data360 DQ+
Version: Latest
Language: English
Product name: Data360 DQ+
Title: Data360 DQ+ Help
Copyright: 2024
First publish date: 2016
Last updated: 2024-10-09
Published on: 2024-10-09T14:37:51.625264

You can use an analysis to manipulate data. The general process flow of any analysis is as follows:

(1) In the first phase of design, data store inputs are selected.

(2) In the second phase, fields from these data stores are manipulated, as they move through some combination of Enhance, Combine, Shape, Check, System, and Analytics nodes.

(3) In the third phase, manipulated fields are pushed to and stored in one or more data store outputs, which are then saved in a pipeline to be used by other data stages.
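The three phases above can be sketched as a minimal dataflow in Python. This is an illustrative sketch only; the function and store names are hypothetical and are not part of the product's API:

```python
def run_analysis(input_stores, node_chain, output_stores):
    """Illustrative three-phase analysis flow (hypothetical names).

    Phase 1: gather rows from the selected data store inputs.
    Phase 2: pass the rows through a chain of transformation nodes
             (e.g. Enhance, Combine, Shape, Check).
    Phase 3: push the manipulated rows to each data store output.
    """
    rows = [row for store in input_stores for row in store]   # phase 1
    for node in node_chain:                                   # phase 2
        rows = [node(row) for row in rows]
    for store in output_stores:                               # phase 3
        store.extend(rows)
    return output_stores
```

For example, two input stores feeding one transform node and one output store would flow as `run_analysis([[1, 2], [3]], [lambda r: r * 10], [[]])`.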

Before you begin

Before you can create an analysis, you will need:

  • Create Analysis and Write permissions to a pipeline.
  • Access to at least one data store.

If you have prior knowledge of a data set, you may be able to approach design with a specific output in mind. On the other hand, the Analysis Designer can also work as a tool for experimentation and exploration, with no prior knowledge required.

Configure analysis designer nodes

  1. Select the Pipelines menu at the top of the page.
  2. Click the menu button to the right of the path in which you want to create the analysis.
  3. Select New > Analysis.
  4. Drag and drop nodes from the left of the screen onto the canvas.
  5. Connect the nodes to build your analysis.
  6. Select each node in turn and configure the properties on the right of the screen.
  7. Select the Current and Downstream Nodes button to choose how much interactive execution will occur on the current node.
  8. Select Apply Changes to save the changes.

Copy and paste nodes

Tip: If you need to create similar nodes to existing ones, you can copy them from the canvas and paste them elsewhere, using the Copy and Paste controls in the Analysis toolbar.
  1. Select the appropriate node or nodes.
  2. Click Copy selection to the clipboard.
  3. Click Paste from clipboard.
  4. Drag the copied node or selection to the appropriate place on the canvas.
  5. Update the properties of the selection, as appropriate.

There is no relationship between the original node or selection and the copied one, other than having similar properties. As a result, a change in one will not be reflected in a change in the other.

Note: You can only copy and paste nodes within the current browser tab.

For more information about specific nodes, see Analysis Designer Nodes.

Configure analysis settings

  1. Select the Pipelines menu at the top of the page.
  2. From the Pipelines browser on the left of the screen, select your analysis.
  3. From the Analysis screen, click the Settings button to open the Analysis Settings dialog.
  4. Configure the properties, as required.

Details properties

  • Description - Optionally, enter a description for the analysis. If you enter a description, this will be displayed in a tooltip when you hover over the analysis in the Pipelines view.
  • Log Limit - Set the maximum number of errors that will be displayed in the Analysis Designer's output log when testing. This value also acts as the maximum number of errors that the executor will allow before exiting the Analysis run and marking it as Failed.
  • Caching and Record Count - Specify whether caching will occur for all nodes in your Analysis, and whether to collect accurate record counts for all nodes. Caching the output of a node can increase the speed of an analysis when recomputes are required at points where the analysis splits. Caching the output of all nodes is recommended on smaller data sets, to save time. With larger data sets, however, global caching can cause a significant decrease in performance.

    Cache output and collect accurate record counts for all nodes is turned on by default.

    Tip: For larger data sets, you can use the Cache Data node to only cache the output of specific nodes. See Cache Data Node.
  • Sampling Data for Testing - Set the sample size of data used for testing an Analysis. Note that this setting will interact with the sampling settings made in Data Store Inputs and Sample nodes. Analysis-level settings are applied before settings made at the node level.
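The Log Limit behavior described above amounts to an error threshold: once the error count reaches the limit, the run stops and is marked as Failed. A hypothetical Python sketch of that pattern (the actual executor logic is internal to the product):

```python
def run_with_log_limit(records, process, log_limit):
    """Collect errors while processing records; once the error count
    reaches the limit, stop and report the run as Failed. Illustrative
    sketch of the documented Log Limit behavior, not product code."""
    errors, output = [], []
    for record in records:
        try:
            output.append(process(record))
        except ValueError as exc:
            errors.append(str(exc))
            if len(errors) >= log_limit:
                return "Failed", output, errors
    return "Succeeded", output, errors
```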

When collecting accurate record counts, the system's Execution History will track record counts at each node within an analysis.
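Caching at a split point can be illustrated with simple memoization: two downstream branches consume the same upstream output, but the upstream computation runs only once. This is a general-purpose Python sketch of the idea, not the product's caching engine:

```python
from functools import lru_cache

# Counter to show how many times the upstream computation actually runs
# (illustration only; the node name below is hypothetical).
CALLS = {"count": 0}

@lru_cache(maxsize=None)
def upstream_node(value):
    """Stand-in for an expensive upstream computation whose output is cached."""
    CALLS["count"] += 1
    return value * 2

# The analysis splits here: both branches reuse the cached output,
# so the upstream computation executes only once.
branch_a = upstream_node(21) + 1
branch_b = upstream_node(21) - 1
```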

Runtime properties

You can create runtime properties by selecting a Data Store and mapping a field containing names to a field containing values. You can then reference the property via the name field throughout the Analysis, using the RUNTIME() function.

For example, consider the following sample data set:

  id     value
  001    100
  002    200
  003    300

Runtime data set

If you were to select id as the Property Name Field and value as the Property Name Value, you could use RUNTIME(id) in any node within your analysis to return the values for each unique name in your property.

In this case, RUNTIME(id) would return the following:

  New Column
  100
  200
  300

Results from call to RUNTIME(id)

Note that each property name should have a unique value. If a name has more than one value associated with it, only the first value found will be returned for all records.
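The lookup behavior above, including the first-value-wins rule for duplicate names, can be mimicked in a few lines of Python. This is an illustrative sketch; RUNTIME() itself is the product's function, and the helper names here are hypothetical:

```python
def build_runtime_properties(rows, name_field, value_field):
    """Build a name-to-value map from a data set, keeping only the
    first value seen for each name (mirroring the documented behavior
    when a name has more than one value)."""
    properties = {}
    for row in rows:
        # setdefault keeps the first value found and ignores later ones
        properties.setdefault(row[name_field], row[value_field])
    return properties

# The sample data set from above, with a duplicate name added to show
# that only the first value for "001" is kept.
rows = [
    {"id": "001", "value": "100"},
    {"id": "002", "value": "200"},
    {"id": "003", "value": "300"},
    {"id": "001", "value": "999"},  # ignored: "001" already has a value
]
runtime_properties = build_runtime_properties(rows, "id", "value")
```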

Also note that to use Runtime Properties in interactive mode - that is, within the sheets of nodes while building your Analysis - you need to use the Test button to simulate a run of the entire Analysis (rather than using Test Sheet).

Execution profile properties

You can override the default execution settings that are used when the Analysis is executed by using one or more of these properties:

  • Environment Execution Profile - Select a predefined execution profile to inherit the execution settings from a profile that has been created on your environment by an administrator.
  • Overriding Execution Sizing - Check this box to edit the execution settings for the selected Analysis. The options that are available, and the default values, will vary according to the selected Cluster Type.
  • Execution Property - Check this box, then click the Add button to create new execution properties.

You can use a combination of these settings. For example, an Environment Execution Profile may include execution sizing configuration that you want to inherit on your Analysis, but you may want to also add a new execution property for this specific Analysis. Or, you could override the execution sizing settings of the Environment Execution Profile, while inheriting the execution properties defined on the profile.

Note: If you promote or import an Analysis that references an Environment Execution Profile, the system will automatically create an empty profile with the same name on the target Environment, if it does not already exist. An administrator will then need to configure the details of the Environment Execution Profile on the target environment to enable the successful execution of the associated Analysis.

Execution parameters

Define named parameters whose values must be provided when the analysis is executed. Parameters defined here are shown in the execution parameters dialog when you resample data, test, or execute the analysis. Similarly, when you accept configuration changes for a node, the system prompts you for the values of these parameters so that interactive node sampling can use them.

Use the Add From Analysis button to have the system look for named parameter references in the nodes of your analysis. Alternatively, add, modify, or remove named parameters using the Add, Edit, and Delete buttons. Some parameters, such as those for passwords, can be set as encrypted properties so that their clear text values are not shown on the executions page. Parameters used in the Execute Query in DB and Data Store Input nodes may have values that look up the actual value in the AWS Secrets Manager service configured for the product. These values take the form sec(secret_name optional_fieldname), where secret_name is the secret ARN or secret name, and optional_fieldname is used when the secret contains multiple name/value parts.
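The sec(...) value format described above can be recognized with a small parser. This is a hypothetical helper for illustration only; the product itself resolves these references via AWS Secrets Manager at execution time:

```python
import re

def parse_secret_reference(value):
    """Parse a parameter value of the form sec(secret_name) or
    sec(secret_name fieldname). Returns (secret_name, fieldname),
    with fieldname set to None when absent, or None when the value
    is not a secret reference. Illustrative only; not a product API."""
    match = re.fullmatch(r"sec\((\S+)(?:\s+(\S+))?\)", value.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)
```

For example, `parse_secret_reference("sec(db_credentials password)")` splits the reference into the secret name and the field name, while a plain value yields `None`.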

Showing and working with sheets

Sheets are displayed in the grid at the bottom of the Analysis Designer screen.

When building a new analysis, it is recommended that you keep the grid icon selected to display the test data sheet.

Every time you add a new node to your analysis, a new sheet will be generated automatically. Each sheet will then display what is happening to your fields at that point in the analysis.

Adding new columns

When working with sheets, you can add new columns (that is, fields) that are built from functions that manipulate other fields.

To add a new column, click the New Column button on the sheet toolbar.

Once columns are added, they can be treated like any other field and pushed to new Analysis Designer nodes.

Unmasking secure fields

If you have permission to unmask a secure field, you can unmask records in bulk using the Unmask All button. Note that only secure fields can be unmasked: fields that are encrypted but not secure are shown as masked in analysis sheets, but the Unmask All button cannot be used to unmask these values.