Large entities containing millions of rows and attribute values can yield a large number of complex data-related issues and make systematic investigation of the data difficult. An entity's size also creates problems when you drill down to details: the analysis results are extensive and profiling performance suffers.
If your database includes millions of rows of data, consider the following options for profiling your data:
- Option 1: Preview and profile your data as a dynamic entity
BENEFITS: Lets you create an entity without loading the data into a repository.
You link directly to the external data source and profile active data in your database.
If the external data source changes, the data displayed in the Control Center changes with it.
HOW IT WORKS: You open the Control Center and create a dynamic entity that links directly to your data source.
From the Control Center, you can then examine data rows and distinct values, and drill down to details for further investigation.
Access and review your data to determine which rows and attributes you might want to load into a repository for full analysis (a conceptual sketch follows this option).
RESTRICTIONS: Dynamic entities do not provide metadata analysis or features such as keys, joins, and dependency analysis.
Precisely does not recommend this option if your business requires complete profiling and discovery of the data in your database.
Note: If your company requires detailed reports and analytical statistics, consider segmenting, splitting, or filtering your database into several smaller entities (see Options 2 through 5).
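To make the idea concrete, the following is a minimal sketch, outside the product, of what previewing live data involves: sampling rows and examining distinct values directly against the source, with nothing copied into a repository. The sqlite3 source, table, and column names are assumptions for illustration only.

```python
import sqlite3

# Hypothetical stand-in for a live source; table and column names are assumed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (cust_id INTEGER, account_status TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "O"), (2, "D"), (3, "X"), (4, "C")])

# Preview a handful of rows without loading anything into a repository.
for row in conn.execute("SELECT * FROM customers LIMIT 10"):
    print(row)

# Examine distinct values for one attribute, directly against the source.
for row in conn.execute(
        "SELECT account_status, COUNT(*) FROM customers GROUP BY account_status"):
    print(row)

conn.close()
```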
- Option 2: Segment large databases into more than one entity
If the entities you plan to import from your data source contain a large number of attributes and/or rows, consider importing the data as segmented data groups and analyzing them separately as individual entities.
BENEFITS: Gives you entities of reasonable size that are easier to analyze.
HOW IT WORKS: You preview your data as a dynamic entity to determine logical data segments.
Create a Load Connection to your data source.
Open the Control Center and import each segment sequentially, creating a separate entity for each (see the sketch at the end of this option).
RESTRICTIONS: Segmenting data into multiple entities spreads the results over multiple distinct entity objects and may require additional manual reconciliation.
For example, after you run analytics on each entity, you may need to manually group the results and any findings to get a complete data profile.
As a best practice, segment any entity that exceeds roughly 10 million rows or 25 attributes (or a comparable combination) before profiling.
Note: While it is possible to load an entity of this size in a single pass, dividing the data into smaller, manageable chunks and importing the segments sequentially yields more thorough profiling without the long delays associated with large data loads.
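The following is a minimal sketch of segmented loading, assuming the source exposes a numeric key to split on; the table, key name, and segment size are illustrative, not the product's actual load mechanism.

```python
import sqlite3

# Tiny stand-in source; in practice this would be a table of millions of rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, i * 10.0) for i in range(1, 101)])

SEGMENT_SIZE = 25  # sized so each resulting entity stays manageable

# Import each key-range segment sequentially, one "entity" per segment.
lo = 1
while True:
    rows = conn.execute(
        "SELECT * FROM orders WHERE order_id >= ? AND order_id < ?",
        (lo, lo + SEGMENT_SIZE)).fetchall()
    if not rows:
        break
    print(f"segment starting at {lo}: {len(rows)} rows")  # load/profile here
    lo += SEGMENT_SIZE

conn.close()
```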
- Option 3: Filter data prior to import to reduce entity size
Consider filtering your data around known conditions or values prior to importing it.
BENEFITS: Filtering reduces the number of rows imported into a repository.
Because you filter around known conditions or values, you can import only data that has known anomalies or inconsistencies.
For example, if a particular attribute has a set of valid values, you might filter the data to isolate only the values that do not conform, narrowing the rows for investigation.
Importing fewer rows improves performance during the import and enables faster data analysis.
It also improves response time for drill-downs and other interactive data processes after the data is imported and you begin to work in the Control Center.
HOW IT WORKS: Filter out data rows before loading them into a repository.
To construct a filter, use the Expression Builder, an application that lets you select the elements of a logical condition, rule, or filter statement and save it to apply as needed.
RESTRICTIONS: You must be very familiar with your data to create well-targeted filters.
You may need to test your filters and then go through several cycles of tuning before you get the results you want.
- Examples
Data can be filtered in various ways. Here are a few examples written as English-like query statements (a runnable sketch of the same filters follows the list):
- Filter data for those values in the attribute account_status that do not match O, D, C, or P.
- Filter data for all account balances that are less than 2500 OR greater than 250000.
- Filter data for any credit card number that is not 16 digits in length.
- Filter data for those values in GENDER that do not match M or F.
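Here is how the four example filters might look as SQL predicates, a minimal sketch using Python's standard sqlite3 module; the accounts table and its column names are assumptions, and in the product you would build the equivalent conditions in the Expression Builder.

```python
import sqlite3

# Hypothetical accounts table populated with a few deliberately bad rows.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE accounts
                (account_status TEXT, balance REAL, card_number TEXT, gender TEXT)""")
conn.executemany("INSERT INTO accounts VALUES (?, ?, ?, ?)", [
    ("O", 5000.0,   "4111111111111111", "M"),
    ("X", 900.0,    "12345",            "F"),
    ("D", 300000.0, "4111111111111111", "U"),
])

# Each predicate mirrors one of the English-like filter statements above.
filters = {
    "invalid status":  "account_status NOT IN ('O', 'D', 'C', 'P')",
    "balance range":   "balance < 2500 OR balance > 250000",
    "bad card length": "LENGTH(card_number) <> 16",
    "invalid gender":  "gender NOT IN ('M', 'F')",
}

for name, predicate in filters.items():
    rows = conn.execute(f"SELECT * FROM accounts WHERE {predicate}").fetchall()
    print(name, "->", rows)

conn.close()
```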
- Option 4: Split data horizontally
When you split data horizontally, all attributes are loaded, but only for a subset of rows. For example, you might split the data in a single entity horizontally on a particular key range or date range.
BENEFITS: Lets you create a filter that imports all attributes for a subset of rows, so you can divide a data profiling effort by splitting the data for a single entity. This is especially useful for phased data integration projects.
Enterprise data integration projects are typically phased by geography, business function, product range, or historical date boundaries, so only the data that pertains to the phase being integrated requires profiling.
If the data for each phase resides in a single entity, splitting the data horizontally ensures that the correct data is profiled during each phase.
HOW IT WORKS: Filter out data rows before loading them into a repository, using the Expression Builder to construct the filter (see the sketch at the end of this option).
RESTRICTIONS: Precisely recommends splitting data horizontally whenever possible, preserving all the attributes for a particular set of rows in a single import operation; splitting data vertically is less advisable.
Be sure to verify your profiling results. Metadata such as min, max, average, sum, frequency, pattern, metaphone, and dependencies reflects only the range of rows you selected, not all rows in the source entity.
If you require an in-depth profile of all rows, and have split this data across several entities, you must combine the results manually.
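Below is a minimal sketch of a horizontal split for a date-phased project: every attribute is kept, but only rows in one date range are imported per phase. The txns table, its columns, and the phase boundaries are assumptions.

```python
import sqlite3

# Hypothetical transactions table; column names and phase dates are assumed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txns (txn_id INTEGER, region TEXT, txn_date TEXT)")
conn.executemany("INSERT INTO txns VALUES (?, ?, ?)", [
    (1, "EMEA", "2023-03-14"),
    (2, "APAC", "2024-07-02"),
    (3, "EMEA", "2024-11-30"),
])

# Phase 1 of a date-phased project: 2024 rows only, but every attribute kept.
phase1 = conn.execute(
    "SELECT * FROM txns WHERE txn_date BETWEEN '2024-01-01' AND '2024-12-31'"
).fetchall()
print(phase1)  # profiling stats here describe 2024 rows only, not the source

conn.close()
```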
- Option 5: Split data vertically
When you split data vertically, you import only a subset of attributes, but for all rows. Although it is advisable to split data horizontally whenever possible, in those cases where you must profile all rows and there are no logical segments along which to split the data, you can split the data vertically.
BENEFITS: Allows you to import all rows in a single entity for profiling.
HOW IT WORKS: You choose which attributes to load into the repository, using the Expression Builder to construct the filter.
The following is the recommended process for splitting data vertically (a condensed sketch follows this option's restrictions):
Step 1: Load a 10% sample
Load a 10% data sample of the entity with ALL attributes. This lets you see what problems exist and gives a good indication of what to expect across the entire row set.
Step 2: Run dependency analysis
Using the sample, run the dependency analysis to identify key attributes and show the correlation between the key attributes and dependent attributes.
Step 3: Load ALL rows with keys and major attributes
Based on the results of the sample, import all rows with keys (primary and major indexed attributes) and main attributes. Also consider loading any major descriptive attributes associated with the primary key field, for example, customer name and address attributes associated with a customer ID key field.
Step 4: Load all rows with additional attributes
You can now import other attributes, again selecting the primary key field as the row identifier. This allows you to compare the results across the data segment you import and to group results. This step may require several import operations depending on how many attributes are defined.
Step 5: Validate results
Revisit the dependency analysis for each of the data import groups. While attribute (column/field) metadata such as min, max, and frequency are complete, dependency checks may be incomplete.
Dependency checks are performed only for those attributes included in a single load (import) operation. If a dependency exists between two attributes, but the attributes are imported in different load operations, the analysis will not show the dependency.
Also, drill-down functions work only within a single entity. If you split data and create several entities for a single source entity, values may require regrouping.
RESTRICTIONS: Imports only a subset of attributes. The analysis statistics for the imported attributes provide an incomplete profile. Follow the recommended process to get the best profiling results.
Precisely recommends splitting your data horizontally whenever possible, preserving all the attributes for a particular set of rows in a single import operation. It is less advisable to split data vertically.
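To tie the five steps together, here is a condensed sketch of a vertical split, assuming cust_id is the primary key; the table, columns, and sampling method are illustrative only.

```python
import sqlite3

# Hypothetical customers table with cust_id as the primary key.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE customers
                (cust_id INTEGER, name TEXT, city TEXT, segment TEXT)""")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)",
                 [(i, f"name{i}", "Austin", "retail") for i in range(1, 51)])

# Step 1: a ~10% sample with ALL attributes, to scout for likely problems.
sample = conn.execute(
    "SELECT * FROM customers WHERE cust_id % 10 = 0").fetchall()

# Steps 3-4: all rows, split vertically into key-plus-attribute groups,
# always carrying the primary key as the row identifier.
load_a = conn.execute("SELECT cust_id, name, city FROM customers").fetchall()
load_b = conn.execute("SELECT cust_id, segment FROM customers").fetchall()

# Step 5: the shared key lets you rejoin and regroup results; note that a
# dependency between, say, city and segment would NOT be detected because
# they were imported in different load operations.
print(len(sample), len(load_a), len(load_b))

conn.close()
```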