7 Data Pipeline Mistakes That Cost You Thousands
Common data engineering pitfalls we've seen (and fixed) across 15+ client projects. From schema drift to cost overruns.
Data pipelines fail quietly. A broken ETL job runs for weeks, analysts trust stale dashboards, and by the time you notice, you've lost customer trust—or spent $10K on runaway queries.
We've built data platforms for e-commerce, fintech, and SaaS companies. Here are the 7 mistakes we see repeatedly (and how to avoid them).
1. No Schema Validation
Mistake: Your API changes a field from string to int. Your pipeline ingests it silently. Downstream jobs break days later.
Fix: Validate schemas at ingestion, and enforce contracts between producers and consumers. Use tools like:
- Great Expectations (Python; see the sketch after the dbt example below)
- dbt tests (SQL)
Example (dbt test):
# models/schema.yml
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
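Example (Great Expectations): a minimal sketch using the classic pandas API; newer GE releases use a different entry point, and orders.csv is a placeholder for your ingested batch. It mirrors the dbt tests above.

import great_expectations as ge
import pandas as pd

# Wrap the incoming batch so expectation methods are available
df = ge.from_pandas(pd.read_csv("orders.csv"))

df.expect_column_values_to_not_be_null("order_id")
df.expect_column_values_to_be_between("amount", min_value=0)

# Fail fast at ingestion instead of corrupting downstream tables
result = df.validate()
if not result.success:
    raise ValueError(f"Schema validation failed: {result}")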
Why it matters: Catch bad data early. Fail fast instead of corrupting downstream tables.
2. Full Table Scans Every Day
Mistake: Your nightly ETL job reads the entire users table (10M rows) even though only 100 rows changed.
Fix: Use incremental loads. Track updated_at timestamps or use change data capture (CDC).
Example (dbt incremental):
{{ config(
    materialized='incremental',
    unique_key='user_id'
) }}

SELECT * FROM raw.users

{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
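Outside dbt, the same high-water-mark pattern is easy to hand-roll. A minimal sketch with SQLAlchemy, assuming a hypothetical pipeline_state table that stores the last-seen timestamp (the DSN is a placeholder):

import sqlalchemy as sa

engine = sa.create_engine("postgresql://user:pass@host/db")  # placeholder DSN

with engine.begin() as conn:
    # Read the watermark from the last successful run
    # (assumes the state row already exists)
    last_seen = conn.execute(
        sa.text("SELECT last_updated_at FROM pipeline_state WHERE pipeline = 'users'")
    ).scalar()

    # Pull only rows that changed since then
    rows = conn.execute(
        sa.text("SELECT * FROM users WHERE updated_at > :ts"),
        {"ts": last_seen},
    ).fetchall()

    # ... load `rows` into the warehouse here ...

    # Advance the watermark to the max timestamp actually extracted,
    # not now(), so rows updated mid-run aren't skipped next time
    max_ts = max((r.updated_at for r in rows), default=last_seen)
    conn.execute(
        sa.text("UPDATE pipeline_state SET last_updated_at = :ts WHERE pipeline = 'users'"),
        {"ts": max_ts},
    )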
Why it matters: Faster pipelines. Lower compute costs. BigQuery bills by bytes scanned and Snowflake bills for warehouse runtime, and both drop when you stop rescanning 10M unchanged rows.
3. No Idempotency
Mistake: Your pipeline runs twice (Airflow retry, manual trigger). Now you have duplicate rows.
Fix: Make pipelines idempotent. Use MERGE (upsert) instead of INSERT. Track run IDs.
Example (Postgres):
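-- Upsert: insert new rows, update existing ones keyed on order_id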
INSERT INTO orders (order_id, amount, updated_at)
VALUES (123, 99.99, NOW())
ON CONFLICT (order_id)
DO UPDATE SET amount = EXCLUDED.amount, updated_at = EXCLUDED.updated_at;
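For the run-ID side, a minimal sketch, assuming a hypothetical pipeline_runs table with a unique run_id column:

import sqlalchemy as sa

def run_once(engine, run_id: str) -> None:
    with engine.begin() as conn:
        # The unique run_id makes a duplicate run a no-op: the insert
        # conflicts, rowcount is 0, and we skip the work
        inserted = conn.execute(
            sa.text(
                "INSERT INTO pipeline_runs (run_id) VALUES (:rid) "
                "ON CONFLICT (run_id) DO NOTHING"
            ),
            {"rid": run_id},
        ).rowcount
        if inserted == 0:
            return
        # ... do the actual load inside the same transaction ...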
Why it matters: Retries are safe. No manual cleanup after failures.
4. Ignoring Cost Monitoring
Mistake: You set up a BigQuery pipeline. Three months later, your bill is $8K. Turns out a dev was running SELECT * against a 5TB table for testing.
Fix:
- Set query cost alerts (BigQuery, Snowflake dashboards)
- Use LIMIT in dev (note: in BigQuery, LIMIT alone doesn't reduce bytes scanned; cap queries with maximum_bytes_billed instead, as in the sketch below). Partition tables by date.
- Review your most expensive queries monthly.
Example (BigQuery cost tracking):
SELECT
    user_email,
    SUM(total_bytes_billed) / POW(10, 12) AS tb_billed,
    SUM(total_slot_ms) / (1000 * 60 * 60) AS slot_hours
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY user_email
ORDER BY tb_billed DESC
LIMIT 10;
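To enforce that cap programmatically, the google-cloud-bigquery Python client lets you set maximum_bytes_billed per query, so a runaway SELECT * fails before it bills you. A minimal sketch (the table name is a placeholder):

from google.cloud import bigquery

client = bigquery.Client()

# Reject the query up front if it would scan more than ~1 GB
job_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**9)

query = "SELECT order_id, amount FROM `project.dataset.orders` LIMIT 100"
rows = client.query(query, job_config=job_config).result()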
Why it matters: Catch cost overruns before they're a budget crisis.
5. No Data Lineage
Mistake: A metric changes. No one knows which upstream table caused it. You spend 2 hours tracing joins.
Fix: Document lineage. Tools:
- dbt docs (auto-generates lineage graphs from ref() calls; run dbt docs generate)
- OpenLineage (open standard)
- Airflow + custom DAG metadata
Why it matters: Debug faster. Trust your data.
6. Testing Only in Production
Mistake: You deploy a new dbt model. It breaks the CEO's dashboard. You roll back at 11 PM.
Fix: CI/CD for data pipelines. Run dbt tests on every PR. Use staging environments.
Example (GitHub Actions):
name: dbt CI
on: [pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dbt
        run: pip install dbt-snowflake
      - name: Install packages
        run: dbt deps  # needed for dbt_utils tests like accepted_range
      - name: Run tests
        run: dbt test --profiles-dir .
Why it matters: Catch breaks before production. Ship confidently.
7. No Alerting on Failures
Mistake: Your nightly ETL fails. Analysts notice 3 days later when dashboards show old data.
Fix: Alert on:
- Job failures (Airflow, dbt Cloud; see the callback sketch below)
- Data freshness (dbt source freshness tests)
- Schema changes (Great Expectations)
Example (dbt Cloud webhook → Slack):
# In Airflow or dbt Cloud
if job.status == 'failed':
    send_slack_alert(f"Pipeline {job.name} failed. Check logs: {job.url}")
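In Airflow specifically, the idiomatic hook is on_failure_callback. A minimal sketch, assuming a Slack incoming webhook (the URL, DAG, and task names are placeholders):

from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.bash import BashOperator

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # placeholder

def send_slack_alert(text: str) -> None:
    # Slack incoming webhooks accept a simple {"text": ...} payload
    requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=10)

def notify_failure(context):
    ti = context["task_instance"]
    send_slack_alert(f"Task {ti.task_id} failed. Logs: {ti.log_url}")

with DAG(
    dag_id="nightly_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    # Applied to every task in the DAG via default_args
    default_args={"on_failure_callback": notify_failure},
) as dag:
    BashOperator(task_id="dbt_build", bash_command="dbt build")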
Why it matters: Fix issues in hours, not days.
Bonus: The Hardest Problem
Data quality isn't just tests—it's culture. We've seen teams with perfect pipelines but dirty source data. Fix it upstream (API validation, DB constraints) or accept that cleaning data is part of the job.
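On the upstream side, even a thin validation layer at the API boundary helps. A minimal sketch with pydantic v2 (the Order model is illustrative; it mirrors the dbt tests from mistake #1):

from pydantic import BaseModel, Field, ValidationError

class Order(BaseModel):
    order_id: int
    amount: float = Field(ge=0)  # same rule as the accepted_range test

try:
    Order.model_validate({"order_id": 123, "amount": -5.0})
except ValidationError as e:
    # Reject at the boundary instead of letting bad data into the pipeline
    print(e)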
Conclusion
Data pipelines are production software. Treat them like it: version control, CI/CD, monitoring, and on-call rotations.
Need help? We've built data platforms for companies processing 100M+ events/day. Talk to us about your pipeline challenges.