Data Quality Checks (DQC) in Data Engineering

Nitish Kaushik
2 min read · Jan 25, 2023

There are several types of data quality checks that can be performed in an ETL pipeline, including:

  1. Syntax checks: Ensure that data is in the correct format and conforms to the schema of the target system.
  2. Domain checks: Validate that data falls within a specified range or set of acceptable values.
  3. Integrity checks: Verify that data relationships and constraints are maintained, such as foreign key relationships in a database.
  4. Completeness checks: Ensure that all required fields are present and contain non-null values.
  5. Consistency checks: Compare data across multiple sources to identify and resolve discrepancies.
  6. Accuracy checks: Verify that data is correct and free of errors, such as incorrect dates or invalid zip codes.
  7. Format checks: Validate that data is in the correct format, such as checking that phone numbers are in a specific format.
  8. Uniqueness checks: Ensure that there are no duplicate records in the data.
  9. Timeliness checks: Verify that data is current and up-to-date.

These are some of the most common checks; depending on the requirements and the data, additional checks can be added to the pipeline. A minimal sketch of a few of them appears below.
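The sketch below is a hedged illustration of how completeness, uniqueness, domain, and format checks might look in Python with pandas. The column names (customer_id, signup_date, country, zip_code) and the accepted value sets are purely illustrative assumptions, not part of any specific pipeline.

```python
import pandas as pd


def run_quality_checks(df: pd.DataFrame) -> dict:
    """Apply a handful of the checks described above and report pass/fail."""
    results = {}

    # Completeness check: required fields must exist and contain no nulls.
    # (Hypothetical required columns for this example.)
    required = ["customer_id", "signup_date"]
    results["completeness"] = all(
        col in df.columns and df[col].notna().all() for col in required
    )

    # Uniqueness check: no duplicate records on the primary key.
    results["uniqueness"] = not df["customer_id"].duplicated().any()

    # Domain check: values must fall within an accepted set.
    allowed_countries = {"US", "CA", "GB"}
    results["domain"] = df["country"].isin(allowed_countries).all()

    # Format check: zip codes must match a simple 5-digit pattern.
    results["format"] = df["zip_code"].astype(str).str.match(r"^\d{5}$").all()

    return results


if __name__ == "__main__":
    sample = pd.DataFrame(
        {
            "customer_id": [1, 2, 3],
            "signup_date": ["2023-01-01", "2023-01-05", "2023-01-10"],
            "country": ["US", "CA", "GB"],
            "zip_code": ["10001", "94105", "30301"],
        }
    )
    print(run_quality_checks(sample))
```

In a real pipeline, results like these would typically feed an alerting or quarantine step rather than simply being returned.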

Data Quality Checks (DQC) Architecture

The architecture for data quality checks in an ETL pipeline typically includes the following components:

  1. Data sources: These are the systems or databases from which the data originates.
  2. Data extraction: This component extracts the data from the sources and prepares it for loading into the target system.
  3. Data validation: This component performs various checks on the data to ensure that it is valid and conforms to the desired quality standards.
  4. Data transformation: This component modifies the data as needed to fit the schema of the target system.
  5. Data loading: This component loads the data into the target system, such as a data warehouse or data lake.
  6. Data monitoring: This component monitors the data in the target system to detect and alert on any quality issues.
  7. Data governance: This component defines the policies, procedures, and standards to ensure data quality, security, and compliance.
  8. Data cleansing: This component cleans the data coming from the sources (for example, correcting or removing invalid records) before it is loaded into the target system.
  9. Metadata management: This component captures and maintains metadata (data about the data), including data quality rules, data lineage, and the data catalog.

These components work together to ensure that the data is of high quality and ready for analysis and reporting. The architecture is flexible and can be customized to meet the specific needs of the organization.
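As a rough illustration of how these components can fit together, the sketch below wires hypothetical extract, validate, transform, and load steps in Python, with validation acting as a gate before anything reaches the target system. All function names, column names, and checks here are assumptions made for the example, not a prescribed implementation.

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("dqc_pipeline")


def extract() -> pd.DataFrame:
    # Stand-in for reading from a real source system.
    return pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 25.5, 7.25]})


def validate(df: pd.DataFrame) -> bool:
    # Data validation component: run checks and log each result.
    checks = {
        "non_empty": len(df) > 0,
        "no_null_keys": df["order_id"].notna().all(),
        "unique_keys": not df["order_id"].duplicated().any(),
        "positive_amounts": (df["amount"] > 0).all(),
    }
    for name, passed in checks.items():
        log.info("check %-16s %s", name, "PASS" if passed else "FAIL")
    return all(checks.values())


def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Example transformation to match the target schema.
    return df.rename(columns={"amount": "order_amount_usd"})


def load(df: pd.DataFrame) -> None:
    # Stand-in for writing to a data warehouse or data lake.
    log.info("loading %d rows into the target system", len(df))


def run_pipeline() -> None:
    df = extract()
    if not validate(df):
        # Quarantine the batch or alert instead of loading bad data.
        raise ValueError("data quality checks failed; aborting load")
    load(transform(df))


if __name__ == "__main__":
    run_pipeline()
```

Keeping validation as its own step makes it easy to extend the checks, route failures to monitoring and alerting, and record the outcomes as metadata without touching the extraction or loading logic.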
