Data Retention
This property sets a time limit on how long a Data View will retain incoming data. It is useful when data becomes irrelevant after a certain amount of time has passed.
Retention Period defines the length of time for which a Data View will store data. Data Retention applied at the Data View level will override Retention Period settings at the Environment level made by your Administrator. Time is always calculated from the time data is loaded into a Data Stage.
Period Precision defines the precision at which a Retention Period is applied.
For example, if a Data View's Retention Period is set to 1 year and its Period Precision is set to Month, the retention window spans one year back from the current day, extended by the remaining days to reach the beginning of that month.
Retention Example 1
For instance, suppose the current day is 6/5/2016.
If Data View Retention Period was set to 1 year and Period Precision was set to Month, the system would first look back one year to 6/5/2015. Then, since Precision is set to Month, the system would go to the beginning of the month, to 6/1/2015. At this point, any data that was loaded into the Data View before 6/1/2015 would be removed.
Retention Example 2
As another example, suppose the current day is 6/5/2016.
If Environment Retention Period was set to 1 year and Period Precision was set to Quarter, the system would first look back one year to 6/5/2015. Then, since Precision is set to Quarter, the system would go to the beginning of the quarter, to 4/1/2015. At this point, any data that was loaded into the Data View before 4/1/2015 would be removed.
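The cutoff-date logic in the two examples above can be sketched as follows. This is an illustrative reconstruction of the described behavior, not product code; the function name and precision labels are assumptions.

```python
from datetime import date

def retention_cutoff(today, years_back, precision):
    """Sketch of the retention cutoff: look back the Retention Period,
    then snap to the start of the Period Precision unit."""
    # Step 1: look back the Retention Period (in whole years here).
    looked_back = date(today.year - years_back, today.month, today.day)
    # Step 2: snap to the beginning of the precision period.
    if precision == "Month":
        return looked_back.replace(day=1)
    if precision == "Quarter":
        quarter_start_month = 3 * ((looked_back.month - 1) // 3) + 1
        return looked_back.replace(month=quarter_start_month, day=1)
    return looked_back  # no snapping for finer precision

# Example 1: 1 year back from 6/5/2016, Month precision -> 6/1/2015
print(retention_cutoff(date(2016, 6, 5), 1, "Month"))    # 2015-06-01
# Example 2: Quarter precision -> 4/1/2015
print(retention_cutoff(date(2016, 6, 5), 1, "Quarter"))  # 2015-04-01
```

Data loaded before the computed cutoff date is removed.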
Retention Period Over Time
Data Retention is applied once per day, and is calculated from the end of the most recent successful Run or Rebuild. If a Run or Rebuild fails, Data Retention will be calculated from the failed Run/Rebuild's start time.
Retention Period Data Flush
Over time, data will be flushed from Data Views based on the Data Retention settings specified. For instance, suppose a load occurred on 4/29/2015 under a 1-year Retention Period with Month precision. That data is flushed on 5/1/2016, because on that day the 4/29/2015 load falls outside the calculated retention window (which spans from 5/1/2015 to 5/1/2016).
Interaction with Data Store Retention
When setting Data View Retention, you also need to consider any Retention settings made to the Data Stores that feed the Data View because these settings may interact.
Data Partition
This property is used to index data into partitions. Partitioning can optimize load time when a Data View is queried during data exploration.
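The idea behind partitioning can be illustrated with a small sketch. The hashing scheme below is purely hypothetical (the product's actual partitioning algorithm is not documented here); it only shows why routing records with the same partition-field value to the same partition lets a query skip non-matching partitions.

```python
import hashlib

def partition_for(value, num_partitions):
    """Hypothetical sketch: records with the same partition-field value
    always land in the same partition, so a query filtering on that
    field only needs to scan the matching partition."""
    digest = hashlib.md5(str(value).encode()).hexdigest()
    return int(digest, 16) % num_partitions

# Route incoming records into partitions by an assumed "region" field.
records = [{"id": 1, "region": "east"}, {"id": 2, "region": "west"},
           {"id": 3, "region": "east"}]
partitions = {}
for rec in records:
    partitions.setdefault(partition_for(rec["region"], 8), []).append(rec)
```

A query such as "all records where region = east" would then read only the single partition that `partition_for("east", 8)` identifies.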
Identity Fields
When a Data View contains Secure Fields, Identity Fields are required to ensure the ability to uniquely identify each record in your dataset.
This protects against cases where a Secure Field was assumed to be the Identity Field: once a Secure Field is encrypted to a random value, it can no longer be used to uniquely identify records. This matters chiefly for auditing, so that the system can accurately track which users have viewed the contents of a Secure Field.
As such, in addition to being required when creating Secure Fields, your Identity Field(s) must also be different fields from your Secure Fields.
Data Load
This setting allows you to specify which data stores to pull from during execution and which parts of those data stores should be loaded.
Default Behavior: New Data Since Last Load, for all Data Stores used in the Data View.
Data Load Range Options
All
This setting will pull all data currently residing in the Data View's Data Stores, whenever the Data View runs. For this reason, All works best for static Data Stores.
Because the All setting pulls all data currently residing in a Data View's Data Stores, it can cause unwanted results if you happen to Run the Data View multiple times.
As a simple example, consider a situation where you create a Data Store and load it with 1000 records. If you were to build a Data View using this Data Store, set Data Load to All, and then run the Data View once, it would load 1000 records. If you were to then run the Data View a second time, however, it would load the same 1000 records again, on top of the original 1000 from the first run.
When using the All setting, you therefore may need to use the Delete All Data feature to empty your Data View between Runs, to prevent record duplication. Alternatively, you could set Data Load to New Data Since Last Load.
New Data Since Last Load
This setting will only pull data from the Data View's Data Stores that was not present during the Data View's previous run. For this reason, New Data Since Last Load works best for dynamic Data Stores.
As another simple example, consider a situation where you create a Data Store and load it with 1000 records.
If you were to build a Data View using this Data Store, set Data Load to New Data Since Last Load, and then run the Data View once, it would load 1000 records. If you were to then run the Data View a second time - without loading any new data into the Data Store - it wouldn't load anything; however, the original 1000 records would remain.
Alternatively, if you loaded 458 new records into the Data Store and then ran the Data View, it would load those 458 records, on top of the original 1000.
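The difference between the two Data Load Range settings above can be shown with a toy simulation. This is not product code; it just replays the 1000-record and 458-record scenarios described in the text.

```python
class DataViewLoad:
    """Toy simulation of the two Data Load Range settings."""
    def __init__(self, mode):
        self.mode = mode         # "All" or "NewSinceLastLoad"
        self.view = []           # records loaded into the Data View
        self.last_loaded = 0     # high-water mark into the Data Store

    def run(self, data_store):
        if self.mode == "All":
            self.view.extend(data_store)   # re-pulls everything each run
        else:
            self.view.extend(data_store[self.last_loaded:])
            self.last_loaded = len(data_store)

store = list(range(1000))

all_view = DataViewLoad("All")
all_view.run(store)
all_view.run(store)
print(len(all_view.view))        # 2000: the same 1000 records, duplicated

new_view = DataViewLoad("NewSinceLastLoad")
new_view.run(store)              # loads the original 1000 records
new_view.run(store)              # nothing new -> nothing loaded
store.extend(range(458))         # 458 new records arrive in the Data Store
new_view.run(store)              # loads only the 458 new records
print(len(new_view.view))        # 1458
```

The `All` view ends up with duplicates after the second run, which is why Delete All Data may be needed between runs under that setting.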
Based on File Path Pattern Parameter
Consider a configuration where a Data Store points to an external file system. When using this Data Store in a Data View, a) you can select a Data Load Range of Based on File Path Pattern Parameter in the Data View, so that only files with a file path matching a specific pattern are pulled in. After creating the Data View, you will need to create a Process Model. Within the Process Model, you will then need to b) create a variable with the same name as the parameter created in the Data View, within a node that serves as an input to c) an Execute Stage Task that runs the Data View. When creating the variable in part b, set the pattern that you want matched as the variable's value.
After creating such a Process Model, whenever it is used to execute the Data View, the Data View referenced by the Execute Stage Task will only pull in files from its Data Stores that match the file path pattern set in part b.
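The filtering effect of the pattern parameter can be sketched with glob-style matching. The pattern syntax the product actually supports is not documented here, so `fnmatch` and the example paths below are assumptions used only for illustration.

```python
import fnmatch

# Assumed variable value set in the Process Model (step b above).
pattern = "sales/2016/*.csv"

# Hypothetical files residing in the Data Store's file system.
files = ["sales/2016/jan.csv", "sales/2016/feb.csv",
         "sales/2015/dec.csv", "inventory/2016/jan.csv"]

# Only files whose paths match the pattern are pulled into the Data View.
matched = [f for f in files if fnmatch.fnmatch(f, pattern)]
print(matched)   # ['sales/2016/jan.csv', 'sales/2016/feb.csv']
```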
Based on File Path Parameter
This setting allows you to work with a specific file. When this option is selected, the Execute Stage Task will only pull in the specified file from its Data Stores. The parameter value must be a sub path under the Data Store's root folder. For example, if the Data Store root is /bucket/folder and the file is test.csv in that folder, the parameter value should be just test.csv. The system adds the root path automatically and does not allow an alternative root path to be used. If the specified file is not found, processing fails, rather than processing zero records as the Based on File Path Pattern Parameter option does.
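The path rules above (root prepended automatically, alternative roots rejected) can be sketched as a small resolver. This is an illustrative reconstruction, not the product's implementation; the function name and error handling are assumptions.

```python
from pathlib import PurePosixPath

def resolve_file(root, param_value):
    """Sketch of the Based on File Path Parameter rules: the parameter
    must be a sub path under the Data Store root; the root is prepended
    automatically, and an alternative root is rejected."""
    sub = PurePosixPath(param_value)
    if sub.is_absolute() or ".." in sub.parts:
        raise ValueError("parameter must be a sub path under the Data Store root")
    return str(PurePosixPath(root) / sub)

print(resolve_file("/bucket/folder", "test.csv"))  # /bucket/folder/test.csv
```

A value like `/other/test.csv` or `../test.csv` is rejected, mirroring the rule that the root path cannot be overridden.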
Historical Data Handling
Historical data handling can be used when an identity field is associated with many records that accumulate or change day after day - for example, a claim that is updated with new data each day.
With such a claim, historical data handling could be used to specify which record to look at, based on date - for example the most recent date available.
Historical Data Handling Example
Suppose you have a Data View with the following records:
Contents of Historical Data Handling Data View
id | date | measure
---|---|---
001 | 8/16/2016 | 1000
001 | 8/17/2016 | 2000
Without Historical Data Handling, this Data View would display 2 records for id 001: one record from 8/16/2016 and one from 8/17/2016. If, however, you were to use Historical Data Handling, setting Identity Field to id and Date Time Field to date, the Data View would display only 1 record, from 8/17/2016. This is because Historical Data Handling displays the most recent record for each Identity Field.
As your Data View executes over time, the most recent record for a chosen Identity Field may change. This setting can allow you to only display that record in Dashboards.
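The "most recent record per Identity Field" behavior from the example above can be sketched as follows. This is an illustrative reconstruction, not product code, using the two records from the table.

```python
from datetime import date

records = [
    {"id": "001", "date": date(2016, 8, 16), "measure": 1000},
    {"id": "001", "date": date(2016, 8, 17), "measure": 2000},
]

# Keep only the most recent record for each Identity Field value.
latest = {}
for rec in records:
    if rec["id"] not in latest or rec["date"] > latest[rec["id"]]["date"]:
        latest[rec["id"]] = rec

print(list(latest.values()))
# one record for id 001: the 8/17/2016 row with measure 2000
```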
Use History Uniquing for Queries
Check this box to enable historical data handling for every query performed on the Data View. If the box is not checked, historical data handling is not performed.
Delete Duplicate Data After Load
This setting allows you to delete any duplicate records in a Data View after the Data View loads. To utilize this feature, you first need to set an Identity Field for the Data View, so that each record within the Data View can be uniquely identified. Within the Delete Duplicate Data After Load field set, you will then need to specify a field to sort by and whether sorting should be performed in ascending or descending order.
When performing duplicate data deletion, the system will group all records by the chosen Identity Field(s), sort by the specified sort field and direction, and then retain the first record in each group.
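The group-sort-retain sequence above can be sketched in a few lines. The field names (`id`, `loaded_at`) are illustrative assumptions, not fields the product defines.

```python
# Sketch of Delete Duplicate Data After Load: group records by the
# Identity Field, sort each group by the chosen sort field/direction,
# then retain the first record of each group.
records = [
    {"id": "001", "loaded_at": 2, "measure": 2000},
    {"id": "001", "loaded_at": 1, "measure": 1000},
    {"id": "002", "loaded_at": 1, "measure": 500},
]

# Group by the Identity Field.
groups = {}
for rec in records:
    groups.setdefault(rec["id"], []).append(rec)

# Sort each group (descending here) and keep the first record.
kept = []
for group in groups.values():
    group.sort(key=lambda r: r["loaded_at"], reverse=True)
    kept.append(group[0])

print(kept)   # one record per id: the highest loaded_at in each group
```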