Data Quality

The data quality module allows the definition and implementation of quality rules, integrating with the glossary of concepts at the definition level and the data catalog at the implementation level.

Quality rules list

On entering the quality module, a list of quality rules is displayed. For each rule, the list shows the concept it refers to (if any), the name, quality objective, threshold, last result and date of last execution. These columns are displayed by default but can be customized in each installation.

You can search the list or filter it by active / inactive rules, the concept to which they belong, the domain to which the concept belongs, and execution result. Additionally, dynamic filters are available for those filterable fields that have been added to the quality rules template.

Quality rules

A quality rule is a definition from a business point of view. Here we must explain why the rule is needed, how the quality of this data affects the business, how the rule should be implemented, and any other information considered relevant from a business / functional point of view. Quality rules can measure the degree of compliance as a percentage or as an absolute number of errors.

Quality rules can be defined from two different points:

A quality rule consists of fixed fields for any installation and fields that may be customized in each installation using the template management feature:

Mandatory Fields

  • Name: Name that identifies the validation and that will be displayed when this rule appears in a list. The objective of this field is to be able to quickly identify the validation in question.

  • Description: Detailed description of how this validation should be performed and what we want to obtain as a result. A rich text field will be available to enter this description.

  • Domain: Domain in which the rule will be stored. This determines who has permission to alter the rule, create implementations and execute implementations.

  • Concept: Optionally you will be able to select a concept to which the rule is applied.

  • Type of result: We will define whether with this quality rule we want to measure compliance based on a quality percentage or an absolute number of errors.

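  • Threshold value: Limit value for the result of the rule execution. Beyond this value (below it for a quality percentage, above it for a deviation or a number of errors) we will consider that a quality error has occurred. The meaning of the value depends on the type of result: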

    • Quality percentage: The measure will be the % of records that match the quality criteria. Values will be between 0% and 100%, with 100% being the maximum quality of our data.

    • Deviation: Used to compare two counts/amounts that should be similar. It will give the % of difference between the data to be checked and the data it is checked against. Values will be between 0% and 100%, with 0% being the maximum quality of our data.

    • Errors number: Used to check the absolute number of errors in our data, regardless of the volume of the data. Very useful in cases with a high volume of records and a small margin of error. Values will be non-negative integers, with 0 errors being the best possible quality.

  • Objective value: Value to be reached for the defined rule. Between the threshold value and the target value, we will consider that a quality alarm occurs.

  • Active: Defines whether this rule must currently be executed or whether we want it to be disabled.
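
As an illustration of how the threshold and objective values split results into three bands, here is a minimal sketch. The names ResultType and classify are hypothetical and not part of the product. The handling of results that land exactly on a limit, and the inverted comparison for deviation and errors number (where lower is better), are assumptions.

```python
from enum import Enum


class ResultType(Enum):
    PERCENTAGE = "percentage"        # higher is better (0..100)
    DEVIATION = "deviation"          # lower is better (0..100)
    ERRORS_NUMBER = "errors_number"  # lower is better (0, 1, 2, ...)


def classify(result_type, result, threshold, goal):
    """Return 'ok', 'warning' or 'error' for one execution result.

    Assumption: a result exactly on the goal counts as 'ok' and a result
    exactly on the threshold counts as 'warning'.
    """
    if result_type is ResultType.PERCENTAGE:
        # Higher is better, so the goal is at or above the threshold.
        if result >= goal:
            return "ok"
        return "warning" if result >= threshold else "error"
    # Deviation and errors number: lower is better (assumed inversion).
    if result <= goal:
        return "ok"
    return "warning" if result <= threshold else "error"


# A rule with an 80% threshold and a 90% objective.
print(classify(ResultType.PERCENTAGE, 85, threshold=80, goal=90))  # warning
```

With this reading, a result between the threshold and the objective raises a quality alarm ("warning") rather than an error.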

Customized fields

In each installation, the required fields can be configured through template management.

Quality implementations

The quality rule defines which validation is to be performed; in an implementation, we specify how this definition should be applied to our data. Keep in mind that the same rule can have several implementations, since we may want to apply it to different data within our systems.

You will be able to access implementations in two different ways:

  • List of all implementations, through the sidebar menu, with capabilities to download data, execute quality, search and filter.

  • List of implementations related to a quality rule, as a tab inside the quality rule.

In order to create a new implementation you will need to go to the quality rule details since all implementations need to be linked to a quality rule.

Within the creation of new implementations, depending on the permissions of each user, two types of implementations can be registered.

Quality Implementations setup

The implementation will be created in four simple steps:

Information

Enter the information associated with the quality implementation. You may enter:

  • Executable: By default all implementations are available to be executed by truedat connectors. If you don't want a specific implementation to be executed by truedat's quality engine, just uncheck this option. This is useful if you have integrated an external quality engine and don't want truedat to try to execute something that is already being done by a third party.

  • Implementation key: Defines a unique identifier. You may not use an existing identifier. If you do not enter any value, an identifier will be autogenerated.

  • Dynamic information: Fill out the information defined by your Quality Implementation template in case that you have defined one.

Data set

In this step we define the data set on which we are going to act. To do this we must select one or more structures from the data catalog. This will be our initial set of data.

It is possible to join information from several tables. When more than one structure is selected, it is necessary to specify which fields of both structures are used to make the join.

Population

A series of filters can be defined to limit the data that we want to validate within the initial set defined. To define these filters, use the operators available in quality implementations. It is not mandatory to include filters, so validation can be performed on the complete set of data selected in the previous step.

Validation

Using the available operators, the conditions to be applied in the validation must be defined. It is mandatory to define at least one validation to perform. The measure obtained when this implementation is executed will be defined by the number of records that meet the validations specified here.

Once these four steps are completed we will have created an implementation. For the implementation to be executed in an automated way, a quality engine for the system on which we want to run it must be integrated into our installation.
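
To make the relationship between the population and validation steps concrete, here is a minimal in-memory sketch. It is not how a quality engine runs an implementation: the function name, the record layout and the use of plain Python predicates are all assumptions made for illustration.

```python
def quality_percentage(records, population, validation):
    """Percentage of the filtered population that meets the validation.

    `population` and `validation` are hypothetical predicates standing in
    for the conditions built with operators in the two steps.
    """
    selected = [r for r in records if population(r)]
    if not selected:
        return None  # nothing to validate
    passed = sum(1 for r in selected if validation(r))
    return 100.0 * passed / len(selected)


# Hypothetical initial data set (the result of the "Data set" step).
customers = [
    {"country": "ES", "email": "a@example.com"},
    {"country": "ES", "email": ""},
    {"country": "ES", "email": "b@example.com"},
    {"country": "ES", "email": "c@example.com"},
    {"country": "FR", "email": ""},
]

result = quality_percentage(
    customers,
    population=lambda r: r["country"] == "ES",  # Population step
    validation=lambda r: r["email"] != "",      # Validation step
)
print(result)  # 75.0
```

The record with country "FR" is excluded by the population filter, so it does not count against the result even though its email is empty.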

Operators

For both the population step and the validation step, we will use operators defined in the application. The product comes with some default operators, but in each installation the operators to be used can be customized. It is important to emphasize that any change in the operators will imply changes in the quality engine if one is integrated into our installation.

An operator will always be applied to a field in the selected data set and may or may not have, depending on the operator, additional parameters.

Operators have the following characteristics:

  • Data types: The data type will be displayed when selecting a field, both in the population step and in the validation step. Depending on the selected type, we will have the operators that are valid for that type available. Examples:

    • Number: For a numeric data type we will have operators such as greater than, less than or equal to, etc.

    • Text: For a text data type we will have operators that compare the length of the text.

    • Date: For a date data type we will have an operator that allows us to know if it is the last day of the month.

  • Scope: Some operators are available both in the population step and in the validation step, while others only make sense in one of the two steps and are only available there. For example: The operator to check the format of a value is available in the validation step but not in the population filter.

  • Groupings: For a better understanding in the selection of the operator, some of them are shown grouped.

  • Parameters: Depending on the operator, a series of parameters may be defined, which may be values entered by the user or other fields of the selected data set. Examples:

    • No parameters: The "Is empty" operator does not need parameters.

    • One parameter: The "Is greater than" operator will require the user to enter the minimum value to be checked.

    • Two parameters: The "Between" operator will require the user to enter the minimum and maximum values to be checked.

    • Value of a given list: The operator "Has a format of" will show us a drop-down list with the available formats to check: date, number, DNI, etc.

    • Another field of the selected data set: The "equals field" operator will require the user to select a field from the data set. To do this, a drop-down will be displayed where you can search and select the indicated field.

    • Another field in the data catalog: The "Referenced in" operator will allow you to select any field in the data catalog to perform a referential integrity test.
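
The characteristics above can be pictured as a small operator catalog. The following is only a sketch: the Operator class, the operator identifiers and the data types and scopes assigned to each one are hypothetical examples, not the product's actual configuration.

```python
from dataclasses import dataclass

BOTH = frozenset({"population", "validation"})


@dataclass(frozen=True)
class Operator:
    name: str
    data_types: frozenset  # field data types the operator accepts
    scopes: frozenset      # steps in which the operator is offered
    arity: int             # number of parameters the user must supply


# Hypothetical catalog loosely based on the examples in the text.
OPERATORS = [
    Operator("is_empty", frozenset({"number", "text", "date"}), BOTH, 0),
    Operator("greater_than", frozenset({"number"}), BOTH, 1),
    Operator("between", frozenset({"number", "date"}), BOTH, 2),
    Operator("has_format_of", frozenset({"text"}), frozenset({"validation"}), 1),
    Operator("last_day_of_month", frozenset({"date"}), BOTH, 0),
]


def available_operators(data_type, scope):
    """Names of the operators valid for a field type in a given step."""
    return [
        op.name
        for op in OPERATORS
        if data_type in op.data_types and scope in op.scopes
    ]


print(available_operators("text", "population"))  # ['is_empty']
print(available_operators("text", "validation"))  # ['is_empty', 'has_format_of']
```

Here the format check is offered for a text field only in the validation step, which mirrors the scope example given above.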

Native implementations

Implementations can also be registered as free-form code. The person who registers the implementation must know on which system it is going to be executed, in order to write correct syntax for the target system. In this type of implementation, the following values must be filled in.

  • Implementation Key: Defines a unique identifier. You may not use an existing identifier. Identifiers do not allow spaces or restricted characters. If you do not enter any value, an identifier will be autogenerated.

  • Dynamic Information: As defined by your implementation template if any is defined

  • Data Source: On which the validation is going to be executed.

  • Database: To be filled in if the data source needs a database to be selected.

  • Dataset: Defines the data set on which you want to perform the validation. In an SQL system, this field holds the FROM section of the query, where joins and table aliases can be entered.

  • Population: The filter to be applied to the data defined in the previous point. This is used to define the subset of the data on which you want to perform the validation. In an SQL system, this field must contain syntax that is valid in the WHERE clause of a query.

  • Validation: The validation you want to perform on the data selected in the Dataset field and filtered in the Population field. In an SQL system, this field must contain syntax that is valid in the WHERE clause of a query.
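
As an illustration of how the Dataset, Population and Validation fields could fit together on an SQL system, the sketch below assembles two COUNT queries and runs them against an in-memory SQLite database. How a real quality engine combines the fields is not specified here: the build_queries function, the customers table and the way the fragments are concatenated are assumptions.

```python
import sqlite3


def build_queries(dataset, population, validation):
    """Assemble two COUNT queries from the three native fields.

    Assumption: Dataset goes after FROM, and Population and Validation
    are combined with AND inside the WHERE clause.
    """
    where = f"({population})" if population else "1=1"
    total_sql = f"SELECT COUNT(*) FROM {dataset} WHERE {where}"
    valid_sql = f"{total_sql} AND ({validation})"
    return total_sql, valid_sql


# Hypothetical sample data standing in for a real data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, country TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1, "ES", "a@example.com"),
        (2, "ES", None),
        (3, "FR", None),
        (4, "ES", "b@example.com"),
    ],
)

total_sql, valid_sql = build_queries(
    dataset="customers c",              # Dataset: FROM section, with alias
    population="c.country = 'ES'",      # Population: WHERE fragment
    validation="c.email IS NOT NULL",   # Validation: WHERE fragment
)
total = conn.execute(total_sql).fetchone()[0]
valid = conn.execute(valid_sql).fetchone()[0]
print(total, valid)  # 3 2
```

The fragments are interpolated as plain text, which is only reasonable because they are written by a trusted user who knows the target system, as noted above.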

Modifying implementations

Modifications can be made to implementations that do not have quality results loaded. Implementations that have quality results cannot be modified, as this would generate an inconsistency between the definition of the implementation and the results obtained. To modify an implementation, open the implementation detail and click on the corresponding option.

Create duplicates of existing quality implementations

There is the option to clone existing quality implementations, which will allow us to define a new variant of an existing quality implementation in a simple way:

  • The new quality implementation will inherit all parameters from the previous implementation except the implementation key, as a new key will have to be defined for this new implementation.

  • The new implementation will be linked to the same rule as the implementation it came from.

Implementations deprecation

You will be able to deactivate your quality implementations. This will prevent them from being executed through the scheduled process while maintaining the information for your dashboards. If you remove the implementation instead, all historic information will be removed.

Once an implementation is deprecated, you will be able to restore it if needed.

Execution of quality rules

The application has mechanisms to enable a user with permissions to request the execution of quality implementations. In order to run these implementations, the data source needs to be correctly set up with data access.

Users with permission to run implementations will see a selector on the implementations screen. Select the implementations to run and press the corresponding action.

Once the execution has been requested, the user will be redirected to a screen where the progress of the execution can be monitored. Refresh the screen to see the status.

Quality execution results

The data quality module is prepared to receive and store quality execution results, being able to display them to the user on the screen. In case of receiving results, the result of each of the implementations of the rule will be displayed as well as an aggregated result for the rule.

Additionally, by clicking on an implementation we will be able to see its details including the history of all its executions.

Admin users will have the ability to delete quality rule results. This shouldn't happen often, but this option can help you eliminate erroneous uploads.

Quality execution errors

If the latest execution of an implementation has finished with an error, this will be displayed both in the implementations list and in the implementation detail, to allow the user to identify the problem and fix it.

Navigation

The full path of those data structures selected when registering the implementation will be shown. Additionally, from the implementation detail, it is possible to navigate to the corresponding structures of the data catalog.

Notifications

Users will be allowed to subscribe to a quality rule, choosing what type of events they want to watch. When a chosen event occurs for the given quality rule, the user will receive a notification with the chosen periodicity.

Notifications periodicity:

  • Real Time: The user will receive the notification when the result is received in truedat.

  • Hourly: Each hour the user will receive an email with all the results received in truedat matching the configuration.

  • Daily: The user will receive a digest email with all results that match the given criteria.

Notify results: User will select what type of results are to be sent.

  • Goal: Results that are above the goal

  • < Goal: Results that are between threshold and goal.

  • < Threshold: Results that are below the threshold.
