Cache Data - Data360_DQ+ - Latest

Data360 DQ+ Help

Product type: Software
Portfolio: Verify
Product family: Data360
Product: Data360 DQ+
Version: Latest
Language: English
Product name: Data360 DQ+
Title: Data360 DQ+ Help
Copyright: 2024
First publish date: 2016
ft:lastEdition: 2024-07-09
ft:lastPublication: 2024-07-09T15:09:58.774265

The Cache Data node can save time on Analysis runs that process a large number of records when the global Cache Output of Nodes Analysis setting is turned off.

Caching is most useful at points in an Analysis where the flow of execution is split. For example, consider a situation where a Data Store is sorted by two separate criteria.

 

Here, all records in the Data Store need to be recomputed for each Sort node. With large Data Stores, this can drastically increase execution time.

With a Cache Data node at the split, the data is cached once and neither Sort node needs to recompute it. This can drastically reduce execution time, particularly for large data sets.
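The effect of caching at a split can be sketched outside the product in a few lines of Python. This is only an illustration of the idea, not how Data360 DQ+ is implemented: `CountingSource`, `sort_without_cache`, `sort_with_cache`, and the sample records are all hypothetical names and data introduced for this example.

```python
class CountingSource:
    """Hypothetical stand-in for an upstream Data Store read that is costly to recompute."""

    def __init__(self, records):
        self._records = list(records)
        self.records_processed = 0  # total records produced across all reads

    def read(self):
        # Each call recomputes the full data set, as an uncached upstream node would.
        self.records_processed += len(self._records)
        return list(self._records)


def sort_without_cache(source):
    # Each sort pulls from the source independently, so the data is computed twice.
    by_name = sorted(source.read(), key=lambda r: r["name"])
    by_amount = sorted(source.read(), key=lambda r: r["amount"])
    return by_name, by_amount


def sort_with_cache(source):
    # Materialise the data once at the split (the role a Cache Data node plays),
    # then feed both sorts from the cached copy.
    cached = source.read()
    by_name = sorted(cached, key=lambda r: r["name"])
    by_amount = sorted(cached, key=lambda r: r["amount"])
    return by_name, by_amount


# Hypothetical sample data.
records = [
    {"name": "Cara", "amount": 10},
    {"name": "Abel", "amount": 30},
    {"name": "Bree", "amount": 20},
]

uncached_source = CountingSource(records)
uncached_result = sort_without_cache(uncached_source)

cached_source = CountingSource(records)
cached_result = sort_with_cache(cached_source)

print(uncached_source.records_processed)  # 6: the three records were read twice
print(cached_source.records_processed)    # 3: the three records were read once
```

Both variants return the same sorted output. Only the amount of upstream work differs, and that difference is what grows with the size of the Data Store.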

 

The Cache Output of Nodes setting

By default, the Analysis Designer caches the output of all nodes automatically; that is, the "Cache Output of Nodes" check box in Analysis Settings is selected. In effect, Analyses behave by default as if there were a Cache Data node after every node (with the exception of Data Store Outputs). It also means that if you want to use a Cache Data node at a specific point, you should first turn off the global Cache Output of Nodes setting.

In general, caching the output of all nodes is recommended on smaller data sets, to save time. For example, caching could be used in the following Analysis, which processes 28,000 records.

 

 

With caching turned off, however, this same Analysis takes about twice as long to run. This is because some operations have to go back to the previous node and recompute the incoming data set, which takes time. In the following diagram, this is visible wherever the number of records processed at a node is a multiple of 28,000 (the number of records in the data set) that is greater than 28,000.
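To see why per-node record counts become multiples of the data set size when caching is off, consider the following rough Python model of a lazy node graph. It is a sketch under stated assumptions, not the Analysis Designer's actual execution engine: the `Node` class, its `cache_output` flag, and the four-node graph built by `build_and_run` are hypothetical.

```python
class Node:
    """Hypothetical lazy node: recomputes from its inputs whenever its output is requested."""

    def __init__(self, name, func, inputs=(), cache_output=False):
        self.name = name
        self.func = func
        self.inputs = list(inputs)
        self.cache_output = cache_output
        self.records_processed = 0
        self._cached = None

    def output(self):
        # Serve from the cache if this node has already run and caching is on.
        if self.cache_output and self._cached is not None:
            return self._cached
        upstream = [node.output() for node in self.inputs]
        result = self.func(*upstream)
        self.records_processed += len(result)
        if self.cache_output:
            self._cached = result
        return result


def build_and_run(cache_output, n_records=28_000):
    """Build source -> transform -> two sorts, run both sorts, return per-node counts."""
    source = Node("source", lambda: list(range(n_records)), cache_output=cache_output)
    transform = Node("transform", lambda rows: [r * 2 for r in rows],
                     [source], cache_output)
    sort_asc = Node("sort_asc", lambda rows: sorted(rows), [transform], cache_output)
    sort_desc = Node("sort_desc", lambda rows: sorted(rows, reverse=True),
                     [transform], cache_output)
    sort_asc.output()
    sort_desc.output()
    nodes = (source, transform, sort_asc, sort_desc)
    return {node.name: node.records_processed for node in nodes}


counts_cache_off = build_and_run(cache_output=False)
counts_cache_on = build_and_run(cache_output=True)

# With caching off, source and transform each process 56,000 records (2 x 28,000),
# because each sort triggers a full recomputation of everything upstream.
print(counts_cache_off)
# With caching on, every node processes 28,000 records exactly once.
print(counts_cache_on)
```

In this model the multiplier at a node equals the number of times its output is requested downstream, which matches the pattern of record counts described above.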

 

Note: Global caching can reduce performance on large data sets.

With large data sets, caching all nodes can actually reduce performance, because creating the memory cache takes longer than recomputing the data.

For example, experimentation has found that repeatedly sorting 2,000,000 records (using different sort criteria for each sort) without any caching takes approximately 63% as long as it would when global caching is turned on.
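The 63% figure can be read in two equivalent ways, which the short calculation below spells out. The only input is the ratio quoted above; the variable names are chosen for this example, and no absolute run times are implied.

```python
# Ratio quoted in the text: uncached run time / globally cached run time.
uncached_over_cached = 0.63

# Reading 1: turning global caching off saves this fraction of the run time.
time_saved_fraction = 1 - uncached_over_cached

# Reading 2: leaving global caching on makes the run this many times longer.
cached_slowdown_factor = 1 / uncached_over_cached

print(round(time_saved_fraction, 2))     # 0.37
print(round(cached_slowdown_factor, 2))  # 1.59
```

In other words, for this particular workload, global caching made the run roughly 1.6 times as long as running with no caching at all.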