Dirty Data Flows Downstream: Fix It at the Source with Shift Left

Sean Falconer
9 min read · Feb 11, 2025

We’ve built a system where every team hacks together its own data pipelines, reinventing the wheel for every use case. The result? Layers of redundant ETL jobs, cascading schema mismatches, and duplicated processing logic, creating a fragile, costly, and inefficient data ecosystem.

The burden of this inefficiency falls unfairly on data teams. Instead of building innovative solutions, they spend their time cleaning up after everyone else, fixing schema mismatches, reconciling duplicate records, and untangling the mess that gets worse with every new pipeline. It’s an exhausting, thankless cycle, and it’s not sustainable.

Instead of propagating this mess downstream, shift it left to the operational layer.

Do schema enforcement, deduplication, and transformation once, at the source, rather than five times in five different pipelines. Push processing upstream, closer to where the data is generated, rather than relying on a brittle patchwork of batch jobs.

For a deeper look at the challenges with traditional data architectures and why they are being reconsidered, check out this article on InfoQ.

The Traditional Data Pipeline: A Maintenance Nightmare

Multi-hop architectures are slow, costly, and error-prone. They depend on reactive data consumers pulling data, cleaning it, and shaping it after the fact.

Multi-hop ELT to Reverse ETL to Push Insights Back to the Operational Estate

Every team builds its own ETL, leading to:

  • Data duplication: The same data is transformed multiple times in different places.
  • Schema mismatches: Different teams define their own rules, leading to inconsistency.
  • Expensive compute costs: Redundant processing adds unnecessary load.
  • Slow insights: Waiting for batch jobs to finish delays decision-making.

The Unfair Burden on Data Teams

Today’s data teams are stuck in an endless loop of cleaning up messes they didn’t create. Instead of being strategic partners in business innovation, they’re treated like janitors for the data pipeline.

Dirty data flows downstream — raw, messy data cascades through multiple pipelines, forcing every team to solve the same problems over and over. If data needs to be deduplicated, every consumer must deduplicate it independently, leading to:

  • Wasted compute: Running deduplication logic across multiple pipelines increases cloud costs.
  • Resource inefficiency: Engineers are reinventing the wheel rather than focusing on innovation.
  • Inconsistent results: Different teams may apply different deduplication logic, leading to discrepancies.

Instead of forcing data teams to fix the same problems repeatedly, the solution is simple: shift these responsibilities left and enforce data quality at the source.

Shift Left: Learning from Other Movements

Shifting left has transformed multiple domains in software engineering by addressing problems earlier in the process, reducing cost, and improving efficiency. This principle has already proven invaluable in:

Testing & QA

Early in my career, I would write software and then throw it over the fence to a human QA team. They would manually test it, find bugs, and throw it back over the fence to me to fix.

This back-and-forth was slow, expensive, and grossly inefficient.

Initially, testing was done post-build, resulting in costly bug fixes, long release cycles, and unpredictable failures in production. When automated testing was first proposed as a way to shift testing left, many engineers resisted. They saw it as extra work, a distraction from building features.

But as testing became part of the development process, it became clear how much time and pain it saved. Today, automated testing is the standard because it prevents costly mistakes from reaching production.

Security & Privacy

Security followed a similar path: it was once a disconnected afterthought, with audits and vulnerability assessments performed only before a release.

The result? Critical flaws discovered too late, leading to expensive remediation efforts and security breaches.

Shift Left Security (DevSecOps) embeds security checks early, ensuring vulnerabilities are detected and mitigated before they cause harm.

Similarly, privacy concerns have traditionally been addressed as a reactive compliance burden, patched in at the last moment. But Shift Left Privacy means building privacy into the architecture from the beginning.

The Shift Left Data Approach

Just as testing, security, and privacy improved by shifting left, data engineering must adopt the same approach. Today, data teams inherit messy, inconsistent data, forcing them to clean it up in costly downstream ETL pipelines.

The solution is to Shift Left by enforcing schema, deduplication, and transformation at the source instead of cleaning up after the fact.

Instead of cleaning up data problems downstream, prevent them from happening in the first place. This means:

  • Treating data like an API contract: Producers define structured, well-formed event streams.
  • Moving transformation upstream: Instead of batch ETL jobs, data is processed in real time as it’s generated.
  • Using stream processing (Flink, Kafka Streams) instead of relying on brittle batch jobs.

With this approach, data quality is ensured at the point of creation, rather than being cleaned up later.
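
As a rough illustration, here is a minimal Python sketch of what enforcing the contract at the source can look like: a producer-side guard that validates and normalizes events before they are ever published. The library choice (confluent_kafka), topic names, and field rules are assumptions for the example, not a prescribed implementation; in practice this logic often lives in a stream processor such as Flink or Kafka Streams, or behind a schema registry.

```python
# A minimal sketch, assuming the confluent_kafka client and hypothetical
# topic and field names. Events are validated and normalized *before* they
# are published, so downstream consumers never see malformed records.
import json
import time

from confluent_kafka import Producer

REQUIRED_FIELDS = {"event_id", "user_id", "event_type", "ts"}  # assumed contract

producer = Producer({"bootstrap.servers": "localhost:9092"})

def publish_clean(raw_event: dict) -> bool:
    """Validate and normalize a raw event, then publish it; reject it otherwise."""
    missing = REQUIRED_FIELDS - raw_event.keys()
    if missing:
        # Route malformed records to a dead-letter topic instead of the clean stream.
        producer.produce("clicks.dlq", value=json.dumps(
            {"error": f"missing fields: {sorted(missing)}", "payload": raw_event}))
        return False

    clean = {
        "event_id": str(raw_event["event_id"]),
        "user_id": str(raw_event["user_id"]),
        "event_type": raw_event["event_type"].strip().lower(),  # normalize once, here
        "ts": int(raw_event["ts"]),  # epoch millis, enforced at the source
        "ingested_at": int(time.time() * 1000),
    }
    producer.produce("clicks.cleaned", key=clean["user_id"], value=json.dumps(clean))
    return True

# One well-formed event and one that violates the contract.
publish_clean({"event_id": 1, "user_id": 42, "event_type": " Page_View ", "ts": 1739260800000})
publish_clean({"user_id": 42})  # rejected: missing event_id, event_type, and ts
producer.flush()
```

The point isn’t the specific library; it’s that the guard runs once, at the producer, so every consumer inherits the same guarantee instead of re-implementing it.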

Data as a Product

A key part of shifting left is adopting the data-as-a-product mindset.

A data product isn’t just another dataset; it’s a well-defined, reusable, high-quality asset designed for long-term usability and reliability. Instead of raw, messy data flowing through pipelines, a data product delivers structured, enriched, and ready-to-use data that serves multiple use cases across an organization.

This shift mirrors transformations in other areas of software development.

Think about how documentation-as-a-product changed the way teams manage technical knowledge, turning documentation into a first-class asset rather than an afterthought.

Data-as-a-product follows the same principle: data should be designed, maintained, and served with the same rigor as production software.

By formalizing ownership, versioning, and usability standards, data products provide a consistent, scalable approach to data management. That means fewer headaches, less rework, and a foundation of trust in the data teams rely on every day.

A data product can take various forms, such as:

  • A real-time event stream (Kafka topic) that provides clean, enriched events.
  • A queryable table (Iceberg, Delta Lake) that integrates structured data from multiple sources.
  • An API that serves validated, up-to-date data for applications and analytics.

The Data Product is Driving Both Analytical and Operational Use Cases

Model First, Pipeline Second, Party Third

Matthew O’Keefe’s article on shift left emphasizes a key principle: data modeling comes first.

  • Without a structured model, data pipelines become brittle.
  • Standardized schemas prevent unnecessary rework.
  • Event-driven models enable real-time, high-quality data products that can be used directly.

Instead of treating data as an afterthought, start with a clear data model, then build pipelines that respect it, and finally, enjoy seamless, scalable data operations.
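
To make “model first” concrete, here is one possible shape for such a model: a versioned, Avro-style event schema defined in Python before any pipeline exists. The entity, namespace, and field names are illustrative assumptions, not a standard.

```python
# A hypothetical, versioned event schema that exists before any pipeline does.
# Explicit types, documentation, and a version make the contract reviewable.
ORDER_PLACED_V1 = {
    "type": "record",
    "name": "OrderPlaced",
    "namespace": "commerce.orders",
    "doc": "Emitted once when a customer places an order. Version 1.",
    "fields": [
        {"name": "order_id",    "type": "string"},
        {"name": "customer_id", "type": "string"},
        {"name": "amount",      "type": "double"},
        {"name": "currency",    "type": "string", "default": "USD"},
        {"name": "placed_at",   "type": "long", "doc": "Epoch milliseconds, UTC"},
    ],
}
```

Registering a schema like this with a schema registry gives producers and consumers a single, versioned definition to build against, so pipelines conform to the model rather than the other way around.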

Answering Key Questions and Concerns

As organizations move toward a Shift Left approach for data, some common questions arise. Here’s how to think about the transition and its impact.

What’s the ROI? How Do We Justify This Change?

Shifting left doesn’t mean refactoring 10,000 ETL jobs overnight. Instead:

  • Start with high-value, high-pain areas where data quality issues create real business problems.
  • Introduce real-time data products alongside batch workflows — gradually embedding them into your data stack.
  • Prove ROI by eliminating redundant ETL, cutting batch compute costs, and enabling faster, more accurate data.

Legacy pipelines already cost millions in rework, delays, and compliance risk. Poor data management often leads to breaches, which have direct financial and reputational costs.

Businesses need to ask: how long can we afford to keep patching the same problems before competitors pass us by?

Who Owns Data Quality and Standardization?

In an ideal setup:

  • Domain teams own their data products, ensuring accuracy at the source.
  • Data engineers shift from reactive cleanup to defining data contracts and validation rules upfront.
  • Governance focuses on standards and protocols, rather than approvals and batch cleanups.

Instead of centralized teams acting as bottlenecks, they become enablers — helping teams produce well-formed, high-quality data from the start.

Is This Just ELT With a Fancy Name?

Not quite.

ELT loads raw data into a warehouse first, then transforms it downstream. Shift Left ensures transformation happens before data hits the lake, reducing redundant processing.

To see why this matters, consider an example.

A company may collect billions of click events from its website, streaming them into a central data lake. Each team that needs this data (marketing, product analytics, customer experience) starts by creating its own copy and writing its own deduplication logic to clean it up. This means multiple teams are doing the exact same work in different places, wasting compute, storage, and engineering effort.

With Shift Left, deduplication happens once at the source, before the data even reaches the lake. Instead of every team writing its own cleanup scripts, teams consume pre-cleaned, standardized event streams. This lowers costs, speeds up insights, and improves consistency across all analytics.

This approach cuts down on wasted compute cycles and enables real-time analytics without the friction of massive batch jobs.
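
A simplified sketch of that one-time deduplication step: a keyed filter that remembers recently seen event IDs and drops repeats before the stream ever reaches the lake. In production this state would live in a stream processor (Flink keyed state, a Kafka Streams state store) with a retention window; the in-process dict and field names here are assumptions for illustration.

```python
# A toy illustration of dedup-at-the-source: drop events whose event_id has
# already been seen within a time window. Real deployments keep this state
# in Flink or Kafka Streams rather than an in-process dict.
import time

class Deduplicator:
    def __init__(self, window_seconds: int = 3600):
        self.window_seconds = window_seconds
        self._seen: dict[str, float] = {}  # event_id -> first-seen timestamp

    def is_new(self, event: dict) -> bool:
        now = time.time()
        # Evict entries older than the window so state stays bounded.
        self._seen = {k: t for k, t in self._seen.items()
                      if now - t < self.window_seconds}
        event_id = event["event_id"]
        if event_id in self._seen:
            return False
        self._seen[event_id] = now
        return True

dedup = Deduplicator(window_seconds=3600)
clicks = [
    {"event_id": "a1", "user_id": "42", "page": "/pricing"},
    {"event_id": "a1", "user_id": "42", "page": "/pricing"},  # duplicate, dropped
    {"event_id": "b7", "user_id": "99", "page": "/docs"},
]
clean_stream = [c for c in clicks if dedup.is_new(c)]
print(len(clean_stream))  # 2: the duplicate a1 event is filtered exactly once
```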

What Happens When Teams Don’t Know Their Data Rules?

Let’s take a real-world scenario.

Imagine the Marketing team needs to publish customer engagement data. Ideally, they define a well-structured data product that includes business rules for deduplication, standardization, and enrichment. A Kafka connector could then stream that dataset, with stream processing applying data quality (DQ) checks and data contracts (DC) before ingestion into the warehouse or data lake.

But what happens if the Marketing team doesn’t have a data expert or clear rules?

  • Domain teams should own their data products, ensuring consistency at the source.
  • Data engineers shift from fixing ETL failures to partnering with teams: not writing ETL for them, but helping them structure contracts, validation, and publishing mechanisms.
  • If no clear rules exist, a minimum viable contract (standard identifiers, timestamps, and formats) is required so the data remains well-formed and traceable; a sketch of such a contract follows this list.
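
A minimum viable contract can be as small as a few required fields and format checks. The field names and rules below are assumptions for illustration, not a standard:

```python
# A hypothetical "minimum viable contract": required identifiers, an ISO-8601
# timestamp, and basic traceability, checked before an event is accepted.
from datetime import datetime

REQUIRED = ("record_id", "source_system", "occurred_at")

def meets_minimum_contract(event: dict) -> tuple[bool, str]:
    """Return (ok, reason). Keeps data well-formed and traceable even when
    the owning team hasn't defined richer business rules yet."""
    for field in REQUIRED:
        if not event.get(field):
            return False, f"missing required field: {field}"
    try:
        datetime.fromisoformat(event["occurred_at"])  # e.g. 2025-02-11T09:30:00+00:00
    except (TypeError, ValueError):
        return False, "occurred_at is not a valid ISO-8601 timestamp"
    return True, "ok"

print(meets_minimum_contract({
    "record_id": "c-123",
    "source_system": "marketing_crm",
    "occurred_at": "2025-02-11T09:30:00+00:00",
}))  # (True, 'ok')
print(meets_minimum_contract({"record_id": "c-124"}))  # missing source_system
```

Even this thin contract keeps the stream joinable and auditable while the domain team grows into richer rules.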

A common misconception is that dumping everything into a warehouse solves the lack of domain expertise. It doesn’t.

Collecting raw data and hoping it will be cleaned up later just defers the problem. If a team doesn’t understand its data well enough to define it upstream, it will struggle to extract value from it downstream, leading to the same inconsistent mess.

Shift Left forces teams to think about data quality from the start, reducing downstream complexity and rework.

How Is This Different from a Data Mesh Approach?

Data Mesh and Shift Left share some ideas: both advocate for decentralizing data ownership and making teams responsible for their data. However, Shift Left is more about execution; it ensures that data is validated, transformed, and enriched before it moves downstream.

Key differences:

  • Data Mesh is primarily an organizational model: it distributes data ownership across teams but doesn’t prescribe how data should be processed.
  • Shift Left is an execution model: it ensures that data is cleaned, structured, and standardized at the source before it reaches consumers.

You can adopt Data Mesh without Shift Left, but you’ll still have the same data quality issues if transformation happens too late. Shift Left makes a Data Mesh approach more effective by reducing redundant work across pipelines.

Won’t This Overload Transactional Systems?

No, because:

  • Real-time stream processing (Flink, Kafka Streams) handles transformation, not the OLTP database.
  • CDC extracts updates non-intrusively, ensuring minimal impact on transactional workloads.
  • Event-driven validation ensures data is structured correctly before landing in a warehouse or lake.

The goal isn’t to push analytical workloads onto OLTP databases; it’s to validate and enrich data as it moves, leveraging event-driven architectures to handle scale efficiently.
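
For instance, a change-data-capture event can be validated and lightly enriched in flight, off the transactional database’s critical path. The sketch below assumes a Debezium-style change envelope and hypothetical field names; it shows where the work happens, not how any specific connector behaves.

```python
# Validate and enrich a CDC change event in the stream, not in the OLTP database.
# Assumes a Debezium-style envelope ({"op", "after", "ts_ms", ...}); field names
# are illustrative.
def handle_change_event(change: dict) -> dict | None:
    if change.get("op") not in ("c", "u"):  # only handle creates and updates here
        return None
    row = change.get("after") or {}
    if not row.get("customer_id"):
        return None                          # drop rows that break the contract
    return {
        "customer_id": str(row["customer_id"]),
        "email": (row.get("email") or "").strip().lower(),  # light enrichment in flight
        "changed_at": change.get("ts_ms"),
    }

event = {"op": "u", "ts_ms": 1739260800000,
         "after": {"customer_id": 7, "email": "  USER@Example.com "}}
print(handle_change_event(event))
# {'customer_id': '7', 'email': 'user@example.com', 'changed_at': 1739260800000}
```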

By addressing these concerns directly, organizations can confidently move toward a Shift Left model that optimizes their data infrastructure for the future.

The Future of Data Engineering is Shift Left

The old way (letting every team fend for itself, writing brittle ETL for a dozen variations of the same dataset) creates a maintenance nightmare and is unfair to the data teams stuck disentangling the mess. Shift left instead: make clean, high-quality data a first-class product.

Shifting left means treating data as a product, not an afterthought.

  • Standardization and transformation should happen at the source, not after the fact.
  • Governance should be proactive, using schema registries and contracts instead of cleanup teams.
  • Data teams should be in the critical path, not relegated to janitorial work.
  • Just like in security and DevOps, companies that adopt Shift Left will outpace their competition.

No one studied computer science so they could spend their work life cleaning data. So, why are we still defending architectures and processes built for the constraints of 20 years ago?
