Creating a Subset of Data - trillium_discovery - trillium_quality

Creating a subset of data is likely to have the biggest impact on performance because it reduces the work done in both profiling (data load and analysis) and subsequent Quality processing. We recommend sampling by creating a subset of data whenever possible.

You can create a data subset by customizing the data and/or sampling the number of rows when you create an entity. For best results, ensure that the data subset is a consistent sampling of data across all the entities you plan to load into the repository.

Note: If the data sample is inconsistent, the resulting data analysis will not be representative of the data in the data source.
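
As an illustration only, one way to read "consistent" is that related entities are sampled on the same key values, so that rows in one entity still match rows in another. The sketch below is not a TSS feature. It assumes the subset is prepared outside TSS before loading, and the entity names (customers, orders), the shared key customer_id, and the helper in_sample are all hypothetical.

```python
import hashlib

def in_sample(key, percent):
    """Deterministically decide whether a key belongs to the sample.

    Hashing the key (instead of drawing a random number per row) means the
    same key is always either in or out, whichever entity it appears in.
    """
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent

# Hypothetical entities that share the key "customer_id".
customers = [{"customer_id": i} for i in range(1, 1001)]
orders = [{"order_id": n, "customer_id": (n % 1000) + 1} for n in range(1, 5001)]

# Apply the same rule to both entities (roughly a 10% sample of customers).
customer_sample = [row for row in customers if in_sample(row["customer_id"], 10)]
order_sample = [row for row in orders if in_sample(row["customer_id"], 10)]

# Every sampled order refers to a sampled customer.
sampled_ids = {row["customer_id"] for row in customer_sample}
```

Sampling each entity independently would instead leave orders whose customers were not loaded, which is one way an analysis can stop being representative of the source.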

To create a subset of data by customizing the data

1. Open the Create Entity Wizard to create an entity.
2. In the wizard, select the data file and the schema file and click Preview.
3. Use the Data Rows window to customize the data that will be loaded.
   - Right-click a column header and select Hide to remove the column from the data load.
   - Right-click anywhere in the column header and select Choose Columns. Select the attributes you want to hide, or change the order of columns by dragging attribute names to the correct location.
   - Right-click anywhere and select Filter. Build an expression that defines the criteria you want applied to the rows of data.
     You can find more information in the TSS Help on how to build expressions (Discover > Data Compliance Using Business Rules > Expression Builder).

   Note: If your source is a relational database, you can filter the data by applying an SQL filter. You can find information in the TSS Help on how to apply an SQL filter (Filtering Relational Database Data).
4. Close the Data Rows window.

Changes you made to the data in Preview mode are preserved. When you load the data from the source into your entity, hidden columns will not be loaded, only selected rows will be loaded, and the columns will be arranged as you configured them in the Preview mode.
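
The combined effect of Hide, Choose Columns, and Filter on the loaded data can be pictured with a short sketch. This is not TSS code or the Expression Builder syntax. It is a plain Python illustration, and the function customize_rows, the column names, and the filter condition are hypothetical.

```python
def customize_rows(rows, hidden, column_order, row_filter):
    """Drop hidden columns, arrange the rest in column_order, keep matching rows.

    The filter is evaluated against the full source row, so it may refer to a
    column that is hidden from the load.
    """
    kept = [col for col in column_order if col not in hidden]
    return [{col: row[col] for col in kept} for row in rows if row_filter(row)]

# Hypothetical source rows.
rows = [
    {"id": 1, "name": "Ann", "country": "US", "notes": "x"},
    {"id": 2, "name": "Bob", "country": "CA", "notes": "y"},
    {"id": 3, "name": "Cy", "country": "US", "notes": "z"},
]

subset = customize_rows(
    rows,
    hidden={"notes"},                                  # Hide
    column_order=["country", "id", "name", "notes"],   # Choose Columns (reorder)
    row_filter=lambda row: row["country"] == "US",     # Filter
)
```

Here subset contains two rows, each with the columns country, id, and name in that order.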

To create a subset of data by sampling rows

1. In the Create Entity Wizard, go to the Load Parameters settings and select one of the following options:
   - First [number] rows. Load a selected number of rows from the beginning of the entity (for example, the first 1000 rows).
   - Random [percent]% sample. Randomly sample a percentage of rows from the entity.
   - Skip first [number] rows. Specify a starting row for the data load. For example, if your entity has 300 rows and you select All Rows and Skip first 99 rows, TSS will load 201 rows, starting with the 100th row.
2. Click Next.

When you load the data from the source into your entity, only selected rows will be loaded.
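
The row arithmetic behind these options can be checked with a minimal sketch. It models the options on an in-memory list and is not how TSS implements them. The function names are hypothetical, and the random option assumes an exact-size sample that keeps source order, which TSS may or may not do.

```python
import random

def first_rows(rows, count):
    """First [number] rows: take rows from the beginning."""
    return rows[:count]

def skip_first(rows, skip):
    """All Rows with Skip first [number] rows: start after the skipped rows."""
    return rows[skip:]

def random_sample(rows, percent, seed=None):
    """Random [percent]% sample (assumed here: exact size, source order kept)."""
    rng = random.Random(seed)
    k = round(len(rows) * percent / 100)
    indices = sorted(rng.sample(range(len(rows)), k))
    return [rows[i] for i in indices]

# An entity with 300 rows, numbered 1 to 300.
rows = list(range(1, 301))

loaded = skip_first(rows, 99)        # 300 - 99 = 201 rows, starting at row 100
head = first_rows(rows, 1000)        # fewer than 1000 rows exist, so all 300 load
sample = random_sample(rows, 10, seed=0)
```

With 300 rows, skipping the first 99 leaves 201 rows beginning at row 100, which matches the example in the Skip first option.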

Processing a small subset through the Control Center may not give you the profiling results you need. If you want to profile the final output, consider adding the Analysis process to the final output of the project.

You can find more information in the TSS Help about the Analysis process (Develop > TS Quality Processes > Profiling Processes > Using Analysis Processes).

Creating a Subset of Data - trillium_discovery - trillium_quality - 17.1

Trillium Beyond The Basics Guide