Getting Started with Building a Quality Data Pipeline | by Thomas Cardenas | Ancestry Product and Technology | Sep 2022

Summary: This article is for teams that are new to data pipelines and want to improve confidence in their data outputs. Below, I share our thoughts on the tools we used and what to check before using dependency data. The three examples we cover are external task sensors, checking for nonexistent S3 locations, and using Amazon Deequ error-level checks that cause the pipeline to fail when constraints are violated.

I recently started building data pipelines that support machine learning inference. My team and I were new to data lakes, data pipelines, Apache Spark, Apache Airflow, and data engineering in general. We were tasked with creating a data pipeline for training and inference for an index recommendation model.

For context, Ancestry manages tens of billions of clues for customers. The new data pipeline needed to efficiently extract a subset of tens of millions of them. Ancestry’s clues are continually changing, which created the added challenge of keeping track of accepted, rejected, and deleted clues across the subsets we were using.

In creating this new pipeline, we wanted to focus on improving data quality by using targeted tools to troubleshoot issues. In past integrations we had seen pipelines fail frequently, data go missing or be duplicated, and data sources be misunderstood. This challenge was an exciting opportunity for my team to build a pipeline with better data quality in mind.

Terms used in this article:

  • Apache Airflow – pipeline scheduling and orchestration tool
  • Directed acyclic graph (DAG) – represents a series of orchestrated tasks
  • Apache Spark – general-purpose data processing framework; loading data and performing transformations is a common use case
  • Amazon EMR (Elastic MapReduce) – provides the compute resources that run big data applications such as Apache Spark
Figure 1: Pipeline visualization of using external task sensors to connect separate data pipelines (Source: Author)

The team we depended on had an Airflow DAG for each data acquisition, and those DAGs lived on the same Airflow server as my team’s DAGs. When two DAGs have a dependency relationship, Airflow’s ExternalTaskSensor can be used to check whether the upstream DAG has completed, or to alert if it has failed. Figure 1 above shows one sensor task per data source, each checking that the corresponding data pipeline is complete before continuing to the inference ETL.
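As a rough, self-contained sketch of that wiring (assuming Airflow 2.3+; the DAG and task names here are hypothetical, since the real acquisition DAGs are not named in this article):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id="inference_etl",                      # hypothetical name for our DAG
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Wait for the upstream acquisition DAG to finish before running our ETL
    wait_for_clue_acquisition = ExternalTaskSensor(
        task_id="wait_for_clue_acquisition",
        external_dag_id="clue_acquisition",      # hypothetical upstream DAG id
        external_task_id=None,                   # None waits on the whole upstream DAG run
        allowed_states=["success"],
        failed_states=["failed"],                # surface an upstream failure in this DAG's view
        execution_delta=timedelta(hours=0),      # adjust if the two DAGs run on different schedules
        mode="reschedule",                       # release the worker slot between checks
        timeout=6 * 60 * 60,
    )

    inference_etl = EmptyOperator(task_id="inference_etl")  # placeholder for the real ETL tasks

    wait_for_clue_acquisition >> inference_etl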

Adding these sensors gave my team a quick look at why a run failed without having to open the logs. It also allows Airflow to build a dependency graph between the DAGs. Cross-DAG sensors aren’t always an option, but they’re nice to have when the DAGs share an Airflow server.

Figure 2: Visual of the task sensor extension pipeline with the S3 list operator (Source: Author)

At times, the external task sensors indicated that the data acquisition DAG was successful, yet our tasks still failed. This happened because no files existed at the expected S3 location: the dependency had not pulled any rows.

The Amazon provider package for Airflow includes S3ListOperator, which can quickly be wrapped in a custom operator to confirm that the files exist. The number of files and their names don’t matter, only that at least one exists.
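A minimal sketch of such a wrapper (assuming the Airflow Amazon provider is installed; the bucket and prefix are hypothetical, and the task would be defined inside the same DAG as the sensors above):

from airflow.exceptions import AirflowFailException
from airflow.providers.amazon.aws.operators.s3 import S3ListOperator


class S3KeysExistOperator(S3ListOperator):
    """List the keys under an S3 prefix and fail fast if none exist."""

    def execute(self, context):
        keys = super().execute(context)  # S3ListOperator returns the matching keys
        if not keys:
            raise AirflowFailException(
                f"No files found under s3://{self.bucket}/{self.prefix}"
            )
        return keys


check_clue_files_exist = S3KeysExistOperator(
    task_id="check_clue_files_exist",
    bucket="my-data-lake-bucket",   # hypothetical bucket
    prefix="clues/latest/",         # hypothetical prefix written by the upstream DAG
    aws_conn_id="aws_default",
)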

Adding this allowed us to diagnose issues faster. As we all know, the fewer clicks it takes to fix a problem, the better. It also prevented us from creating clusters (an expensive process) when the data was not there.

Figure 2 shows how this check can be added on top of the dependency status checks.

Figure 3: Pipeline visual with added data constraint control (Source: Author)

Before using dependency data, it is crucial to verify that all assumptions about it are correct. It is also essential to review the requirements of the final data deliverables to ensure that they are met. To do this, an operator submits a Spark step to an EMR cluster, and that job uses Deequ to validate these requirements and assumptions.
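A sketch of how that submission might look with the Amazon provider’s EMR operators (the step name, Spark entry point, jar location, and the create_emr_cluster task the cluster id is pulled from are all hypothetical, and the tasks live inside the DAG definition):

from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

# Run the Deequ validation job as a Spark step on a cluster created earlier in the DAG
validation_step = [
    {
        "Name": "validate_clue_data",                        # hypothetical step name
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--class", "com.example.ClueDataValidation",     # hypothetical Deequ job entry point
                "s3://my-artifacts-bucket/data-validation.jar",  # hypothetical jar location
            ],
        },
    }
]

add_validation_step = EmrAddStepsOperator(
    task_id="add_validation_step",
    job_flow_id="{{ ti.xcom_pull(task_ids='create_emr_cluster') }}",  # cluster id from an earlier task
    steps=validation_step,
    aws_conn_id="aws_default",
)

# Fail the pipeline if the step fails, i.e. if an error-level Deequ constraint was violated
watch_validation_step = EmrStepSensor(
    task_id="watch_validation_step",
    job_flow_id="{{ ti.xcom_pull(task_ids='create_emr_cluster') }}",
    step_id="{{ ti.xcom_pull(task_ids='add_validation_step')[0] }}",
    aws_conn_id="aws_default",
)

add_validation_step >> watch_validation_step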

Assumptions can be things like the number of rows being in an expected range, columns being non-null, columns being unique, etc. There are many possible things to check, and Amazon Deequ is the tool we use to check them.

Deequ lets you add Checks, which are specific constraints on the data, such as ensuring that a column is never null, as mentioned above, or that its values come from an allowed list. Error-level checks cause the pipeline to fail and progress to stop. There is also a warning level for less critical checks.

// An error-level check: these constraints cause the run to fail if violated
val schemaCheck = Check(CheckLevel.Error, "Schema Failure Constraints")
  .isComplete("user_id")    // no null user_ids
  .isComplete("job_title")  // no null job titles
  .isUnique("user_id")      // no duplicate user_ids
// Attached to the data with VerificationSuite().onData(data).addCheck(schemaCheck).run()

Additional metrics can be computed alongside the checks; these are called analyzers. For example, we add an analyzer for the number of distinct values in a column. I use analyzers frequently to build dashboards, and they don’t cause our pipelines to fail.

// Analyzers compute metrics (here, the uniqueness of job_title) without failing the run
val state = VerificationSuite()
  .onData(data)
  .addRequiredAnalyzers(Seq(Uniqueness("job_title")))
  .run()

There is a third option: anomaly detection. It adds an extra layer of quality checks that are broader than very specific constraints. This can include checking that the null count stays within a certain percentage range, or that the row count doesn’t change by more than a given amount between runs. Like checks, anomaly checks can cause pipelines to fail.

// metricsRepository stores metrics across runs (e.g. an InMemoryMetricsRepository or
// FileSystemMetricsRepository) and key identifies this run; the strategy compares the
// current Size (row count) against that history and fails the check if it grows too fast
VerificationSuite()
  .onData(data)
  .useRepository(metricsRepository)
  .saveOrAppendResult(key)
  .addAnomalyCheck(
    RelativeRateOfChangeStrategy(maxRateIncrease = Some(100000.0)),
    Size())
  .run()

Deequ is not only used for data acquired from other teams; I use it after every transformation and before delivering data to clients. This is to ensure that no bugs were introduced at any point during the pipeline.

When a failure occurs, the team should check the logs to see why the quality constraints failed. Deequ does not delete bad data; it only generates an analysis of the data profile.

Note: Deequ can verify that the rows exist, but it’s best to confirm that the data exists before running quality checks. Submitting a job to an EMR cluster just to have it fail due to missing data is an unnecessary expense.

There are a multitude of tools out there offering a fuller range of functionality. What is described above does not cover everything about data quality or observability, but it can help those who are new to the space start improving their data quality.

Our team is currently expanding the use of Deequ for data profiling and metrics reporting to create dashboards for easier metrics analysis.

If you want to join Ancestry, we’re hiring! Feel free to check out our careers page for more information. Please also see our Medium page to learn what Ancestry does and read more articles like this.
