Controls

Introduction

A set of Data Quality Controls is available:

Rows count

The Rows count control ensures that the number of records in your dataset matches the expected value. This control is crucial for validating data completeness and detecting any discrepancies in data extraction or loading processes. For example, if a dataset is expected to contain 10,000 rows but only 9,800 rows are found, this control will flag the issue for further investigation.

Missing value

The Missing value control identifies any records with missing or null values in critical fields. Missing data can lead to inaccurate analysis and poor decision-making. This control helps in pinpointing the exact fields and records where data is missing, allowing for prompt corrective actions.

Invalid format

The Invalid format control checks whether the data adheres to the expected format. This includes verifying the format of dates, phone numbers, email addresses, and other structured data. If a field expects a date in the format YYYY-MM-DD but finds DD/MM/YYYY, this control will flag the discrepancy.

Invalid text length

The Invalid text length control ensures that text fields contain the correct number of characters. This is important for fields with defined length constraints, such as IDs, codes, or other standardized text entries. Any deviations from the expected length are flagged for correction.

Invalid categorial value

The Invalid categorical value control checks whether data entries fall within the predefined set of allowed values. This is particularly important for categorical fields, such as country codes, product categories, or status indicators, ensuring that only valid entries are recorded.

Invalid number value range

The Invalid number value range control verifies that numerical data falls within acceptable bounds. This control helps identify out-of-range values that could indicate data entry errors or issues in data processing. For example, a field expecting an age between 0 and 120 will flag any values outside this range.

Invalid date value range

The Invalid date value range control ensures that date values are within a logical and acceptable range. This prevents errors such as future dates in historical records or dates that don’t align with the context of the data. Any date falling outside the expected range is flagged for review.

Unicity

The Unicity control ensures that specific fields, which are supposed to be unique (like IDs, email addresses, etc.), contain no duplicates. This is vital for maintaining the integrity of primary keys and ensuring that records are distinct and correctly referenced.

Outlier

The Outlier control identifies data points that deviate significantly from the norm, which may indicate errors or exceptional cases. Outliers are flagged for further investigation to determine whether they are legitimate data points or anomalies that need correction.

SQL (User defined) check

The SQL (User defined) check allows users to create custom SQL queries to validate data according to specific business rules or requirements. This control provides flexibility to enforce complex conditions that standard controls might not cover, enabling tailored data quality checks.