Data Store Output nodes are the end points of an Analysis. They are used to create or choose a data store that holds the fields manipulated in the Analysis Designer.
Selecting a Data Store
Before designing an Analysis, you can create a data store ahead of time to serve as your output data store. If you have done so, use the data store drop-down to select it from the Pipeline where it was saved.
Creating a New Data Store
If you haven't created a Data Store to serve as your output store ahead of time, you can create a new data store from inside the Analysis Designer.
To do so, check the "Create New Data Store" check box; you can then create the store within a Pipeline on the fly.
Maximum Output Partitions Setting
This option sets the maximum number of files that the newly created Data Store Output can consist of. These are the files that contain the Data Store's data, which are available within the Data Store's View Content Screen.
For performance, Maximum Output Partitions is set to 512 by default, but you can set it to any number you'd like. Note, however, that using a small number of output partitions on a large dataset can make your Analysis take much longer to run.
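To make the trade-off concrete, here is a minimal, illustrative Python sketch that spreads records round-robin across at most N partition files; the file naming and the round-robin scheme are assumptions for illustration, not the product's actual writer. With a small N, each file holds more data and the write parallelizes less well, which is why a low setting can slow a large Analysis.

```python
import csv
import itertools

def write_partitions(records, fieldnames, max_partitions):
    # Illustrative only: distribute records round-robin across at most
    # max_partitions CSV files. Fewer partitions means larger files and
    # less work that can proceed in parallel.
    files = [open(f"part-{i:05d}.csv", "w", newline="")
             for i in range(max_partitions)]
    writers = []
    for f in files:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writers.append(writer)
    for record, writer in zip(records, itertools.cycle(writers)):
        writer.writerow(record)
    for f in files:
        f.close()

write_partitions([{"id": i, "value": i * 10} for i in range(1000)],
                 fieldnames=["id", "value"], max_partitions=4)
```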
Decrypt Secure Values: Admins Only
If you are a system admin and you have secure, encrypted fields within an Analysis, you can opt to decrypt them in your Data Store Outputs by checking the Decrypt Secure Values check box. This option is available for all External Store Types.
Note that encrypted values will not be immediately decrypted in the Data Store Output node's sheet.
Field Mapping
As data passes from Data Store Inputs, through Enhance and Analytics nodes, and into a Data Store Output, the selected fields may change in structure and in name many times.
Field Mapping allows you to map fields from your final Enhance and/or Analytics nodes to your Data Store Output node.
Fields may be mapped from an Enhance/Analytics node to pre-existing fields in a Data Store Output node, or new fields can be created in the Output node on the fly.
Update Tab
If you would like to update records within a Data Store based on a unique set of key fields that identify each record, you can do so by checking the Update Data in Data Store check box available in the Field Mapping Tab.
Whenever a Data Store Output node with Update configured receives new data, it will check to see whether the values in a record's Key Fields already exist in a record within the Data Store. If they do, the Data Store will update all other fields for that record, rather than create a new record.
For example, suppose Update were configured with id set as the Key Field, and the following record existed in the Data Store:
| id | value |
|---|---|
| 001 | 123 |
Since Update is configured, the next time this Data Store loads, it will check to see whether newly loaded records have an id = 001.
Suppose there are records that do:
| id | value |
|---|---|
| 001 | 456 |
| 001 | 789 |
Rather than create more records for id 001, the Data Store will update the pre-existing record using the last new record with a matching Key Field value that it finds. This will assign the pre-existing record where id = 001 a value of 789.
Insert when Record to update doesn't exist
If Update Data in Data Store has been turned on, you may also enable record insertion when a record to update does not exist, by checking Insert when Record to update doesn't exist. With this option on, if the Data Store is set to update and it encounters a record whose key field values are not present in the Data Store, the new record will be inserted into the Data Store, as shown in the sketch below.
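As a minimal Python sketch of these Update semantics (assuming records are plain dictionaries; the function and variable names here are illustrative, not part of the product):

```python
def apply_update(store, new_records, key_fields, insert_missing=False):
    # Illustrative sketch: if a stored record shares a new record's Key
    # Field values, overwrite its other fields (the last matching new
    # record wins); otherwise insert the record only when insert_missing
    # is on, mirroring "Insert when Record to update doesn't exist".
    index = {tuple(rec[k] for k in key_fields): rec for rec in store}
    for rec in new_records:
        key = tuple(rec[k] for k in key_fields)
        if key in index:
            index[key].update(rec)   # update in place, no duplicate record
        elif insert_missing:
            store.append(rec)
            index[key] = rec
    return store

# The worked example from above: id 001 ends up with value 789.
store = [{"id": "001", "value": "123"}]
new = [{"id": "001", "value": "456"}, {"id": "001", "value": "789"}]
print(apply_update(store, new, key_fields=["id"]))
# [{'id': '001', 'value': '789'}]
```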
Options Tab
When outputting to an External Data Store of the S3/HDFS/file system layout type, the Options tab allows you to control how files are partitioned in the external file system and whether the files are compressed.
Output Writer
Determines whether partitioning control is turned on or off. The Enhanced option supports partitioning control; the Basic option does not, and uses a deprecated form of partitioning that Analyses used before the Options tab was added.
Partition By
Allows you to select a set of fields to partition the Data Store's contents by. Importantly, the field(s) that you select to partition by will not be included in the data file contents; rather, a folder will be created for each unique combination of values within the partitioning field set. As a simple example, consider the following data set.
| name | amount | grade |
|---|---|---|
| bill | 100 | A |
| ann | 200 | A |
| carl | 300 | A |
| leslie | 400 | B |
| tom | 500 | B |
| stacey | 600 | B |
| ben | 700 | C |
| jill | 800 | C |
| tony | 900 | C |
| quinn | 1000 | D |
Were you to select grade as the Partition By field, you would end up with four folders: A, B, C, and D. Each folder would contain the records that had the folder's grade, with only the name and amount fields present. For example, the A folder would look something like this (the part-file name and delimiter are illustrative):
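```
A/
└── part-00000.csv
      bill,100
      ann,200
      carl,300
```

Note that the grade field does not appear inside the files; it is encoded by the folder itself.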
Compression
Controls whether the files output to the data store are compressed and, if they are, which type of compression is used.
Using an external data store as analysis output
You can output the results of an analysis to an external file-based data store. You can either create the data store and then point to it using a Data Store Output node in the analysis, or you can create it on the fly within the analysis.
Outputting executions to unique folders
When using external file-based data stores as analysis outputs, you can configure the store so that a new folder to contain the output files is created upon each execution. Folder names can be generated using the following set of variables in the Folder or Path field of the external file data store, using the ${variable} syntax:
- workId (the work ID of the current execution)
- refStartTime (in yyyy/MM/dd HH:mm:ss.SSS format)
- refStartTimeYear
- refStartTimeMonth
- refStartTimeDate
- refEndTime (in yyyy/MM/dd HH:mm:ss.SSS format)
- refEndTimeYear
- refEndTimeMonth
- refEndTimeDate
- startTime
- startTimeYear
- startTimeMonth
- startTimeDate
The following variables have been deprecated; you should use the startTime, startTimeYear, startTimeMonth, and startTimeDate variables instead:
- now (in yyyy/MM/dd HH:mm:ss.SSS format)
- year
- month
- date
For example, ${workId} is replaced with the work ID of the current execution.
For rollback to find files and delete them, the ${workId} variable must be included in the Folder field on AWS deployments, or in the Path field on Azure deployments, Enterprise deployments using Filesystem, and GCP deployments using Google Storage.
For example: /folder1/folder2/year=${startTimeYear}/month=${startTimeMonth}/date=${startTimeDate}/work-id=${workId}
Note that on Azure, the syntax must include work-id= as in the above example.
Each time the analysis that output to this data store ran, it would create a new folder within the external file system, named using the execution start time and work ID and containing the files from the output of that run.
Setting up analysis output files in this fashion can be a good way to organize and keep a history of the files that were created during specific analysis executions.
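As a rough sketch of how such a template expands, Python's string.Template handles the same ${variable} syntax; the sample values below are hypothetical:

```python
from string import Template

# Hypothetical values for one execution.
values = {
    "workId": "12345",
    "startTimeYear": "2024",
    "startTimeMonth": "05",
    "startTimeDate": "01",
}
path = Template(
    "/folder1/folder2/year=${startTimeYear}/month=${startTimeMonth}"
    "/date=${startTimeDate}/work-id=${workId}"
).substitute(values)
print(path)
# /folder1/folder2/year=2024/month=05/date=01/work-id=12345
```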
Custom delimiters on outputs
When using an external data store as the output of an analysis, you can set the data store's delimiters to any characters you'd like. After running an analysis that pushes to this data store, you can download the files that comprise the data store's contents, and they will be delimited as you've specified.
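For instance, with the delimiter set to ~ (a hypothetical choice), a downloaded file might look something like this:

```
name~amount~grade
bill~100~A
ann~200~A
```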