Data Store Input nodes are typically the starting point of an Analysis. They are used to select which data stores you would like to manipulate.
Selecting a Data Store
To select the data store to use as an input, use the Data Store drop-down in the Properties panel. Available stores are those that you have been granted access to in Pipelines.
Selecting a Data Load Range
When selecting a Data Store Input, it is important to specify a Data Load Range. This setting defines which data is read by the Analysis, and ultimately which data is placed in Data Store Outputs when the Analysis executes.
Data Load Range Options
| Option | Effect | Useful when |
| --- | --- | --- |
| All | The Analysis will read all data from the Data Store Input every time the Analysis runs. | Running an Analysis for the first time. Note: This option should not be used with Data Stores that have retention processing turned on. |
| New Data Since Last Load | The Analysis will only read data that is new since the last time the Data Store Input loaded. If no new data is found, no changes will be made to the Analysis' outputs. | The transformations applied to the input data have not changed between Analysis Runs, but there is new data to apply them to. |
| Based on Date Parameters | The Analysis will only read data from a specified date range when the Analysis executes. | Working with a subset of your data, which can be altered per input or Analysis Run to create outputs parsed by date. |
| Based on Work ID Parameter | The Analysis will only read data from a specified Work ID. | Working with a subset of your data, which can be altered per input or Analysis Run to create outputs parsed by Work ID. |
| Based on File Path Pattern Parameter | The Analysis will only read data from files with a file path that matches the pattern parameter. | Working with a subset of your data, sourced only from file paths that match the pattern supplied as a parameter. |
| Based on File Path Parameter | The Analysis will only read data from the specified file, and will fail if the file is not found. The path parameter value must be a sub-path under the Data Store's root folder. For example, if the Data Store root is /bucket/folder and the file is test.csv in that folder, then the parameter value should be just test.csv. The system adds the root path automatically and does not allow an alternative root path to be used. | Working with a specific file, avoiding a file scan and failing if the indicated file does not exist. |
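As a rough mental model only, and not the product's actual implementation, the first two options behave like the following SQL filters over a hypothetical view of the store's load history (input_records, load_time, and last_load_time are illustrative names, not Data360 DQ+ objects):

-- Conceptual sketch only; all names are hypothetical.
-- All: read everything currently in the Data Store Input.
SELECT * FROM input_records;

-- New Data Since Last Load: read only data that is new since the last time the Data Store Input loaded.
SELECT * FROM input_records
WHERE load_time > :last_load_time;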
Preventing an Analysis from reading data that is past retention period
When a Data Store Input is configured with a load range of All, New Data Since Last Load with no previous execution, New Data Since Last Load on a rebuild, or Based on Date Parameters with a null start time, the system checks any data being read to make sure that it is within the Data Store's retention period at the time the Analysis starts execution.
Data Retention Process Delay
The system retention process deletes data that is past the retention period only after at least 24 hours have passed since the data was eligible for deletion. This allows any processes that are still reading that data to complete execution.
Parameterized Data Loads
As mentioned above, Date-based or Work ID-based parameters may be used to control how an Analysis loads its data. As shown in the following examples, these parameters must be created both in the Data Store Input of the Analysis and in the node preceding the Execute Stage Task that will execute the Analysis in the Process Model.
Using Date Parameters
When the Based on Date Parameters Data Load Range Option is used, both a From Date Parameter and To Date Parameter need to be specified. Additionally, these same parameter names need to be used in a Process Model to actually enforce the date range.
Date Parameter Example
Suppose, as illustrated in the top half of the diagram above, that you had an Analysis with a Data Store Input. Imagine that this Data Store Input contained years' worth of Data, but that you only wanted to pull Data that was loaded into it during a specific date range.
To do so, you would first need to create Date Parameters within the Data Store Input of the Analysis; in the illustration, this is the Analysis at Point b. After that, you'd need to use those parameters in a Process Model.
Within the Process Model, you could create the parameters as User Variables in the node preceding the Execute Stage Task that runs the Analysis. In the bottom half of the diagram, this is represented by Point a. These User Variables would need to have the same names as the parameters created in the Analysis at Point b - fromDate and toDate - and they could be Literal variables of the Datetime type set to literal dates.
For example, you could set fromDate to 10/24/2016 and toDate to 11/24/2016.
With such a configuration, the Execute Stage Task at Point c would cause the Analysis at Point b to only load data that was loaded into the Analysis' Data Store Input between 10/24/2016 and 11/24/2016.
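Conceptually, and only as an illustration rather than the product's internal query, this date-parameterized load restricts the data in the same way as a SQL filter on the time each record was loaded (data_store_input and loaded_at are hypothetical names):

-- Illustrative sketch only; loaded_at stands in for the time each record was loaded.
SELECT *
FROM data_store_input
WHERE loaded_at BETWEEN :fromDate AND :toDate; -- fromDate = 10/24/2016, toDate = 11/24/2016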
Using Work ID Parameters
Like Date Parameters, Work ID Parameters must also be used in a Process Model to be properly applied.
Work ID Parameter Example
In the diagram above, we have a configuration where two Analyses output to the same Data Store (c). Suppose we have a third Analysis that uses that Data Store (c) as an input, but in this Analysis we only want to load Data that was pushed to Data Store (c) by Analysis b. To accomplish this, we can use a Work ID Parameter. To do so, you could set Analysis d's Data Store Input node (c) to have a Data Load Range that is Based on Work ID Parameter and create a Work ID Parameter. Here we will call the Work ID Parameter parameterName.
After creating the Work ID Parameter in Analysis d, you would then need to create a Process Model (lower half of diagram above). The Process Model would need to include a Start Stage Event that triggers the Process Model when Data Store a finishes loading (a). It would then need an Execute Stage Task that executes Analysis b.
After the Execute Stage Task, you could then create an Exclusive Gateway. Within the Exclusive Gateway, you would need to create a property reference variable with a name matching the Work ID Parameter you created in Analysis d and a value equal to the work id of the preceding Analysis, Analysis b.
That is, parameterName = {result}.workId
This variable would then control how Analysis d loads data in the Execute Stage Task for Analysis d. With this setup, Analysis d would only pull data associated with the latest work id of Analysis b, even if Analysis b1 has also pushed data to Data Store c.
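Purely as an illustration, the effect of the Work ID Parameter is comparable to the following SQL filter, where data_store_c and work_id are hypothetical names and work_id stands in for the work id recorded when data was pushed into Data Store c:

-- Illustrative sketch only; all names are hypothetical.
SELECT *
FROM data_store_c
WHERE work_id = :parameterName; -- set to {result}.workId of Analysis b by the Process Model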
Using a File Path Pattern Parameter Name
In the diagram above, we have a configuration where a Data Store points to an external file system. When using this Data Store as a Data Store Input node in an Analysis (a), you can select a Data Load Range of Based on File Path Pattern Parameter to only pull in files with a file path that matches a specific pattern. After creating the Analysis, you will need to create a Process Model. Within the Process Model, you will then need to create a variable (b) with the same name as the parameter created in the Analysis, within a node that serves as an input to the Execute Stage Task (c) that runs the Analysis. When the variable is created in part b, you can set the file path pattern that you want matched as the variable's value.
After creating such a Process Model, whenever it is used to execute the Analysis, the Analysis referenced by the Execute Stage Task will only pull files into the Data Store Input node from part a that match the file path pattern set in part b.
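Again purely as an illustration, and noting that the product's pattern syntax may differ from SQL's LIKE wildcards, the effect is comparable to filtering on each record's source file path (data_store_input, file_path, and the pattern value are all hypothetical):

-- Illustrative sketch only; all names and the pattern are hypothetical.
SELECT *
FROM data_store_input
WHERE file_path LIKE 'daily/%.csv'; -- the pattern value supplied by the Process Model variable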
Filter Records Tab
The Filter Records tab allows you to create a filter expression for your JSON, ORC, or Parquet file-based Data Store at the Data Store Input node level. Filter expressions can be created using either SQL or the general expression language used throughout Data360 DQ+.
SQL expressions are applied after the data has been "flattened", or in other words, after the edit transformations have been applied to the data. General expressions are applied before the data has been flattened/transformed, and must produce a string value that can in turn be used as an interpreted expression by the Spark engine. You should prefer SQL unless there is a compelling need to filter before flattening/transformation. In the example that follows, the expression shown works as an SQL expression. If the field called age were a top-level field in the source, the equivalent General expression would be 'age > 19'. If you were to filter on the name field, a General expression might look like 'name = \'Andy\''.
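For quick reference, the same filters written in the two forms described above:

-- SQL expression, applied after the data has been flattened:
age > 19

-- General expressions, which must evaluate to a string that the Spark engine interprets:
'age > 19'
'name = \'Andy\''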
As an example, consider the following data set.
| name | age |
| --- | --- |
| Justin | 19 |
| Michael | 5 |
| Andy | 30 |
To return the single record belonging to Andy, you could write the following expression, using either the product's general expression language or SQL:
age > 19
Doing so would produce the following data set:
| name | age |
| --- | --- |
| Andy | 30 |
Filter Fields Tab
The Filter Fields tab allows you to filter a Data Store Input of the JSON, ORC, or Parquet format at the field level. If the tab's functionality is turned on by checking its check box, then only the fields that you add will be passed through the filter. If the tab's functionality is turned off, all fields will be passed through the filter.
As an example, consider the following data set.
| name | age |
| --- | --- |
| Justin | 19 |
| Michael | 5 |
| Andy | 30 |
Turning on the Filter Fields functionality and selecting the age field would produce the following result.
| age |
| --- |
| 19 |
| 5 |
| 30 |
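Conceptually, field-level filtering is equivalent to projecting only the selected columns, as in this illustrative SQL (input_data is a hypothetical name):

SELECT age FROM input_data;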
Overflow Field Name
For data files that contain more fields than are present in the Data Store definition, an Overflow Field Name parameter becomes available in the Data Store Input node's Properties panel when lenient parsing is set. Content from the extra fields will then be pushed into this Overflow Field in JSON format when the Analysis runs.
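For example, with hypothetical field names and values: if the Data Store definition contained only name and age, but a record in the file also carried a city field, the extra content might land in the Overflow Field as something like the following.

| name | age | overflow |
| --- | --- | --- |
| Justin | 19 | {"city": "Boston"} |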