Tutorial - Creating internal data quality scores - Data360_Govern - Preview

Data360 Govern Help

Copyright: 2024
First publish date: 2014

This topic guides you through the steps that are needed to prepare and configure internal data quality scores.

Before you configure a score

Before you start to configure your data quality score calculation, it's important that you know how you want your score to be determined. The data quality score of any asset is made up of data quality rule results, for an effective date. You need to understand the data quality rules that are in place, the assets they apply to, and the importance of each. Here's a list of general questions that will assist you in formulating your configuration:

  • What assets do you want to score?

    For example, all assets of the Database > Schema > Table > Column type.

  • Do your data quality rules all belong to one rule type, or is there more than one type?

  • Are some data quality rules more important than others and, if so, should they contribute more to the score?

    For example, is a rule with a quality dimension of "Conformity" more important than a rule for "Duplication"?

  • Is the importance of the data quality result dependent on what kind of asset it is?

    For example, if a column is marked as a "Critical Data Element", should the "Accuracy" rule hold more weight than the "Null Count" rule?

These are the types of questions that you should answer and understand before you start, as each has a significant impact on how you configure the scoring measures.

Step 1: Establish relationship types and relationships

For a data quality rule result to be taken into account in the score of an asset, the asset must be related to a rule, either directly or indirectly, through the "Evaluation" predicate functional type.

Because data quality rules are run at the lowest level of a technical asset (such as a column or a data element), rules should be directly related to assets at that level. For example, Rules evaluate Columns.

Indirectly relating to the rule means that, at a higher level, there is a relationship path from the asset being scored to an asset being evaluated by rules. As long as a direct relationship exists, higher level assets can be scored by selecting the relationship path that ends in an evaluation predicate and rule type.

You must establish the direct relationship from an asset to a rule. Govern will then determine the indirect relationship paths for higher level assets, based on existing relationships.

For example:

In order to score a business term that maps to columns, you must first establish the relationship between columns and rules:

  1. Create a rule type and individual rules.

  2. Create a relationship type "Rules evaluate Columns".

    When creating a score definition for Business Terms, the rule results selection will include the relationship path "Business Terms maps to columns, which are evaluated by rules".

    For more information, see Structure of a data quality measure.
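The idea of an indirect relationship path can be sketched in Python. This is only an illustration, not how Govern is implemented: the relationship types, the "is evaluated by" predicate wording, and the function name paths_to_rules are assumptions made for the example.

```python
from collections import deque

# Hypothetical relationship types: (source asset type, predicate, target type).
# "is evaluated by" stands in for the "Evaluation" predicate functional type.
RELATIONSHIP_TYPES = [
    ("Business Term", "maps to", "Column"),
    ("Table", "contains", "Column"),
    ("Column", "is evaluated by", "Rule"),
]
EVALUATION_PREDICATES = {"is evaluated by"}


def paths_to_rules(start_type):
    """Return every relationship path from start_type that ends in an evaluation predicate."""
    found = []
    queue = deque([(start_type, [])])
    while queue:
        asset_type, path = queue.popleft()
        for step in RELATIONSHIP_TYPES:
            source, predicate, target = step
            if source != asset_type or step in path:
                continue  # not reachable from here, or already used (avoids cycles)
            new_path = path + [step]
            if predicate in EVALUATION_PREDICATES:
                found.append(new_path)  # the path ends at a rule type
            else:
                queue.append((target, new_path))
    return found


for path in paths_to_rules("Business Term"):
    print(" > ".join(f"{s} {p} {t}" for s, p, t in path))
```

In this sketch, removing the direct "Column is evaluated by Rule" entry leaves no path for Business Term, which mirrors the requirement that the direct relationship must exist first.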

Creating Relationships

Besides creating the relationship type, you must establish the actual relationships before rule results are posted and used in any score calculation. Determine which columns are evaluated by which rules, and establish the appropriate relationships.

If rule results are posted for an asset, but no relationship between the asset and the rule has been established, those rule results will not be considered in the calculation of the data quality score.

It is vital that you:

  1. Determine which assets are evaluated by which rules.
  2. Create the relationship with the evaluation predicate.

For example:

The "Account Number" column has 3 data quality rules, which run against it every week:

  • Rule 1 has a dimension of "Conformity".
  • Rule 2 has a dimension of "Completeness".
  • Rule 3 has a dimension of "Accuracy".

For results from all three rules to be considered, you must relate the "Account Number" column to "Rule 1", "Rule 2", and "Rule 3".

Tip: If, for whatever reason, data quality rules are run and the results are posted for an asset, but you do not want those to ever be used in the score calculation, don't establish a relationship between that rule and the asset. Results can still be posted, but they will never be used in the data quality score.
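As a rough sketch of this behavior, the following Python snippet keeps only the posted results whose rule is related to the asset. The data structures, the pass fractions, and the extra "Rule 4" are hypothetical and exist only to show the effect of a missing relationship.

```python
# Relationships established with the evaluation predicate: (asset, rule).
relationships = {
    ("Account Number", "Rule 1"),
    ("Account Number", "Rule 2"),
    ("Account Number", "Rule 3"),
}

# Posted rule results. "Rule 4" is a hypothetical rule with no relationship
# to the column, so its result is posted but never scored.
posted_results = [
    {"asset": "Account Number", "rule": "Rule 1", "pass_fraction": 0.92},
    {"asset": "Account Number", "rule": "Rule 2", "pass_fraction": 0.93},
    {"asset": "Account Number", "rule": "Rule 3", "pass_fraction": 0.94},
    {"asset": "Account Number", "rule": "Rule 4", "pass_fraction": 0.50},
]


def considered_results(results, related):
    """Keep only results whose (asset, rule) pair has an established relationship."""
    return [r for r in results if (r["asset"], r["rule"]) in related]


scored = considered_results(posted_results, relationships)
print([r["rule"] for r in scored])
```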

Step 2: Build the data quality scoring definition and measures

Once a relationship type is established for a rule type to evaluate an asset type, you are ready to configure the data quality measures.

The main difference between a governance measure and a data quality measure is that the result of a data quality measure is a number, rather than true or false. When a governance measure is evaluated, the end result is either:

  • The criteria in the pass test were met (true).

    or:

  • The criteria in the pass test were not met (false).

The weight or adjusted weight is then used as the measure's contribution to the score.

When it comes to a data quality measure, the measure result is a number, which is then used in conjunction with the weight, to determine the contribution to the score.

  • Measure result: The result of the measure, based on the configuration.
  • Rule result: The result of running a data quality rule, which includes the number of records passed and failed.

To build the score, complete the following tasks:

  1. Create the Scoring definitions.

  2. Create the measures.

Structure of a data quality measure

Measures

1) Enter the basic measure information.

2) Select the rule results to use in the calculation and the operator.

3) Define when and how to apply the measure to which assets, and how to weight the measure result.

4) Default weight of the measure.

Basic measure information (1)

The basic information for a data quality measure is the same as for a governance measure. The weight entered is the default weight for the measure, which is only overridden if there are condition groups established within the measure.

For more information, see Scoring definitions, Asset conditions and weighting.

Rule results section (2)

The Rule Results section is key. It is used to define the rule results that are used in the calculation of the measure's result. While you must have a relationship established between the rule and the asset being scored, for the result to even be considered, the rule result section is where you further refine the results that are used for a particular measure.

Rule results selection

Under Rule Results Selection, you can find all the potential direct and indirect paths from the asset being scored, to the asset being evaluated by the rules. This tells the system how to get to the rule results, through a relationship path from one asset to another.

The higher the level of the asset, the longer the path is, so make sure you understand the relationships, as well as how you want to get down to the rules.

Results operation

Once the result selection has been made, you choose a Results Operation. This is the operation that is performed on the rule results pass fraction, for the relevant measure on the effective date.
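A rule result includes the number of records passed and failed, so a pass fraction can be derived from those two counts. The sketch below assumes the pass fraction is the passed count divided by the total count; the function name and the handling of an empty result are illustrative choices, not documented product behavior.

```python
def pass_fraction(records_passed, records_failed):
    """Fraction of evaluated records that passed (assumed definition)."""
    total = records_passed + records_failed
    if total == 0:
        # Assumption: a result with no evaluated records has no meaningful fraction.
        raise ValueError("rule result contains no evaluated records")
    return records_passed / total


print(pass_fraction(92, 8))
```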

Example 1:

A column is evaluated by three rules, and on April 1, 2021, results were posted that gave the following pass fractions:

  • Rule 1 (Conformity) pass fraction = 0.92
  • Rule 2 (Completeness) pass fraction = 0.93
  • Rule 3 (Accuracy) pass fraction = 0.94

If the result operation = Average, then the measure result would equal 0.93.

If the result operation = Minimum, then the measure result would equal 0.92.

If the result operation = Maximum, then the measure result would equal 0.94.
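The arithmetic in Example 1 can be reproduced with a few lines of Python. The dictionary of operations and the function name measure_result are illustrative; rounding to four decimal places is added only to keep floating-point noise out of the output.

```python
from statistics import fmean

# Pass fractions from Example 1.
pass_fractions = {"Rule 1": 0.92, "Rule 2": 0.93, "Rule 3": 0.94}

# The three result operations named in this topic.
OPERATIONS = {"Average": fmean, "Minimum": min, "Maximum": max}


def measure_result(fractions, operation):
    """Apply a result operation to a collection of pass fractions."""
    return round(OPERATIONS[operation](list(fractions)), 4)


for name in OPERATIONS:
    print(name, measure_result(pass_fractions.values(), name))
```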

Rule Result Filters

While the Rule Results section gives the path to rule results and the operation to perform on the pass fraction, the Rule Result Filters allow you to further refine the rule results that are to be used in the measure. The filters are mainly applied when you want to weight rules differently, depending on a property of the rule itself, such as "Dimension".

For higher level assets, or assets that have more than one relationship in the path, filters can be used to weight the results differently, based on the properties of an asset in the middle of the relationship path.

Example 2:

The rule results from Example 1 are used again, but this time the rule result filter Dimension = Conformity is applied. Only the result of Rule 1 is then used to determine the result of the measure for that column.

Note: If filtering the rule results leaves only one rule result, then that pass fraction is used as the measure result. The Results Operation does not come into play if there is only one rule result.
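A minimal sketch of this filter, using the Example 1 pass fractions: the record layout and function names are assumptions made for illustration. With a single remaining result, every result operation returns the same value.

```python
from statistics import fmean

# Rule results for one column, with the dimension of each rule (from Example 1).
rule_results = [
    {"rule": "Rule 1", "dimension": "Conformity", "pass_fraction": 0.92},
    {"rule": "Rule 2", "dimension": "Completeness", "pass_fraction": 0.93},
    {"rule": "Rule 3", "dimension": "Accuracy", "pass_fraction": 0.94},
]


def filter_by_dimension(results, dimension):
    """Keep only the results of rules with the given dimension."""
    return [r for r in results if r["dimension"] == dimension]


filtered = filter_by_dimension(rule_results, "Conformity")
fractions = [r["pass_fraction"] for r in filtered]

# Only Rule 1 remains, so Average, Minimum, and Maximum all give 0.92.
print(fmean(fractions), min(fractions), max(fractions))
```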

The following diagram shows the rule results that would be used to score Column 1 and Column 2, when the rule result filter is Dimension = Conformity. One result is found for Column 1, and one for Column 2.

Dimension = Conformity

Example 3:

If you are scoring a business term that maps to columns, which in turn are evaluated by rules, the Rule Result Filters can be used to weight the results, based on rule or column properties.

If you apply the same Dimension = Conformity result filter to the Business Term A measure, then the rule would again deliver one result for Column 1, and one for Column 2:

Business Term A

The result operations average, minimum and maximum are performed on the two results delivered by Dimension = Conformity.

Example 4:

If you apply the "Column: Critical Data Element: Yes" rule result filter, these results are then used to calculate the result of the Business Term A measure:

Critical Data Element

The result operations average, minimum and maximum are performed on the three results delivered for Column 1.

The Rule Results section of the measure determines the rule results to use, and what operation to perform on the results found.
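Examples 3 and 4 can be sketched as filters applied along the path Business Term A > Columns > Rules. All data here is hypothetical: the original diagrams are not reproduced, so the rules on Column 2, the pass fractions, and the property names are assumptions chosen to match the result counts described above.

```python
# Hypothetical column properties for the columns mapped to Business Term A.
columns = {
    "Column 1": {"critical_data_element": True},
    "Column 2": {"critical_data_element": False},
}

# Hypothetical rule results reached through the relationship path.
rule_results = [
    {"column": "Column 1", "rule": "Rule 1", "dimension": "Conformity", "pass_fraction": 0.92},
    {"column": "Column 1", "rule": "Rule 2", "dimension": "Completeness", "pass_fraction": 0.93},
    {"column": "Column 1", "rule": "Rule 3", "dimension": "Accuracy", "pass_fraction": 0.94},
    {"column": "Column 2", "rule": "Rule 4", "dimension": "Conformity", "pass_fraction": 0.80},
    {"column": "Column 2", "rule": "Rule 5", "dimension": "Completeness", "pass_fraction": 0.70},
]


def filter_results(results, rule_filter=None, column_filter=None):
    """Filter on rule properties, on properties of the column in the path, or both."""
    selected = []
    for r in results:
        if rule_filter and any(r[k] != v for k, v in rule_filter.items()):
            continue
        props = columns[r["column"]]
        if column_filter and any(props[k] != v for k, v in column_filter.items()):
            continue
        selected.append(r)
    return selected


# Example 3: Dimension = Conformity gives one result per column.
conformity = filter_results(rule_results, rule_filter={"dimension": "Conformity"})
# Example 4: Column: Critical Data Element: Yes gives the three results for Column 1.
critical = filter_results(rule_results, column_filter={"critical_data_element": True})

print([r["rule"] for r in conformity])
print([r["rule"] for r in critical])
```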

Asset conditions and weighting (3)

Here, you determine when to apply the measure, and whether certain assets being scored should use a different weight.

Asset conditions and weighting work the same in data quality measures as they do with governance measures. They determine if the measure should apply at all, and if so, whether the weight of the measure differs according to different asset properties. For more information, see Scoring definitions, Asset conditions and weighting.

A simple example of using conditions in the data quality score, is when you only want to score columns that are critical data elements. In this case, you put a condition on the measure stating "Critical Data Element = True".

For more examples of internally calculated data quality score definitions, see Internally calculated data quality score examples.

Step 3: Post rule results

Once you've created the relationships and configured the measure, it's time to get a score.

The data quality score calculation is triggered in the following scenarios:

  • When a rule result is posted, updated, or deleted.
  • When a scoring measure is created and rule results exist for the measure's effective date.

Posting the results

Wherever your data quality rules are run, you post the results through the /api/v2/metrics/quality/results API. If you're using Swagger, you can find it under the Metrics section.
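The following Python sketch builds, but does not send, a POST request for that endpoint. Only the endpoint path comes from this topic. The host name, the assumption that the body is JSON, and every payload field name are hypothetical placeholders, and authentication is omitted; check the Swagger definition for the actual request schema.

```python
import json
from urllib.request import Request

RESULTS_PATH = "/api/v2/metrics/quality/results"  # endpoint named in this topic


def build_post_request(base_url, payload):
    """Build a JSON POST request for the results endpoint (authentication not shown)."""
    body = json.dumps(payload).encode("utf-8")
    return Request(
        url=base_url.rstrip("/") + RESULTS_PATH,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical payload: the field names are placeholders, not the real schema.
payload = {
    "rule": "Rule 1",
    "asset": "Account Number",
    "effectiveDate": "2021-04-01",
    "recordsPassed": 92,
    "recordsFailed": 8,
}

request = build_post_request("https://govern.example.com", payload)
print(request.get_method(), request.full_url)
```

To send the request, it could be passed to urllib.request.urlopen once the correct host, credentials, and schema are in place.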

Existing rule results

If you have existing rule results and want to use those to calculate the data quality score, there are a few things to consider for historical dates.

  • The rule result must have an asset tied to it.

    Prior to the data quality score being made available, you could post rule results for a rule, but you didn't need to show the asset that the result was for. You can use the PUT operation in the request API, to update any existing rule results with the appropriate asset.

  • The effective date of the result must be on or after the effective date of the measure.

    If you have rule results dating back to January 1, 2020, and you set up a measure with an effective date of February 2, 2021, then the system will look for rule results with effective dates on or after February 2, 2021.

  • Both the relationship type and the relationships from the asset to the rules must be established, prior to setting up the measure.
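The first two considerations can be expressed as a simple filter. This is a sketch with hypothetical record fields and sample results; it does not model the relationship requirement in the third point.

```python
from datetime import date

# Hypothetical existing rule results. The second one has no asset tied to it.
existing_results = [
    {"rule": "Rule 1", "asset": "Account Number", "effective_date": date(2020, 1, 1)},
    {"rule": "Rule 1", "asset": None, "effective_date": date(2021, 3, 1)},
    {"rule": "Rule 1", "asset": "Account Number", "effective_date": date(2021, 2, 2)},
    {"rule": "Rule 1", "asset": "Account Number", "effective_date": date(2021, 4, 1)},
]


def usable_results(results, measure_effective_date):
    """Keep results that have an asset and are dated on or after the measure's effective date."""
    return [
        r for r in results
        if r["asset"] is not None and r["effective_date"] >= measure_effective_date
    ]


usable = usable_results(existing_results, date(2021, 2, 2))
print([r["effective_date"].isoformat() for r in usable])
```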

Tips and suggestions

When do I create a new measure?

Part of configuring a data quality score is understanding when you need to create a different measure. Here are some scenarios that help determine when it may be necessary.

  • You can create a data quality score with only one measure, if that meets your criteria for calculating the score.

    That measure takes the average of all rule results for an asset, on the effective date. This scenario assumes you have your data quality rules under one rule type.

  • If your data quality rules are under different rule types, you need a different measure for each rule type.

    This is because you will have a different relationship type to the asset being scored, for each rule type.

  • If you want different rule dimensions to be weighted differently, set up a measure for each dimension, using the rule result filters.
  • If you want the same rule dimension to be weighted differently for different assets, you can achieve this with asset weighting and conditions.

Understanding the rule results used in a measure result

The Calculation sub-tab, on the scoring tab for a data quality score, has a "Show Rule Results" option. This displays a list of the rule results that were used in the calculation of the measure result.

When a rule result is posted, a new score is calculated with the effective date of the rule result

There is an assumption in the data quality score that data quality rules are run for an asset on a specific day. This means that at the column level, you can expect rule results to come in for an effective date, which will be the effective date of the score.

In the case where a higher level asset is being scored, which in turn maps to several different lower level assets such as columns, all the rules may not be run on the same day. You may get results that are used in the calculation of a score throughout the week. In that case, the system will bring forward the previous rule results, if available. As a result, you may see different effective dates, when you click "Show Rule Results".

For example:

Business Term A maps to Table 1 > Column 1 and Table 2 > Column 2. Data quality rules for Column 1 are run on April 1, while they are run on April 5 for Column 2.

  • When rule results are posted for Column 1 on April 1, a score will be calculated for April 1. It will count Column 2 results as zero (0), because the score is expecting rule results for Column 1 and Column 2.
  • When rule results are posted for Column 2 on April 5, a score will be calculated for April 5. It will use the rule results from April 1 for Column 1, and April 5 for Column 2. When you click "Show Rule Results" for April 5, you will see the two different effective dates used.
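The carry-forward behavior in this example can be sketched as follows. The pass fractions are hypothetical, and combining the column results with a plain average is an assumption made to keep the sketch short; the real combination depends on how the measure is configured.

```python
from datetime import date
from statistics import fmean

# Hypothetical pass fractions; the dates follow the example above.
posted = {
    "Column 1": {date(2021, 4, 1): 0.90},
    "Column 2": {date(2021, 4, 5): 0.80},
}


def inputs_for(score_date):
    """For each column, pick the latest result on or before score_date.

    A column with no result yet is counted as zero, as in the April 1 case.
    """
    inputs = {}
    for column, by_date in posted.items():
        available = [d for d in by_date if d <= score_date]
        if available:
            latest = max(available)
            inputs[column] = (latest, by_date[latest])
        else:
            inputs[column] = (None, 0.0)
    return inputs


def score(score_date):
    """Average the selected results (assumed operation), rounded for display."""
    values = [value for _, value in inputs_for(score_date).values()]
    return round(fmean(values), 4)


print(score(date(2021, 4, 1)), inputs_for(date(2021, 4, 1)))
print(score(date(2021, 4, 5)), inputs_for(date(2021, 4, 5)))
```

For April 5, the selected inputs carry two different effective dates, which corresponds to what "Show Rule Results" displays in the example.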