7 Data Pipeline Mistakes That Cost You Thousands
Common data engineering pitfalls we've seen (and fixed) across 15+ client projects. From schema drift to cost overruns.
Data pipelines fail quietly. A broken ETL job runs for weeks, analysts trust stale dashboards, and by the time you notice, you've lost customer trust—or spent $10K on runaway queries.
We've built data platforms for e-commerce, fintech, and SaaS companies. Here are the 7 mistakes we see repeatedly (and how to avoid them).
1. No Schema Validation
Mistake: Your API changes a field from string to int. Your pipeline ingests it silently. Downstream jobs break days later.
Fix: Validate schemas at ingestion, and enforce contracts between producers and consumers. Use tools like:
- Great Expectations (Python; see the sketch after the dbt example below)
- dbt tests (SQL)
Example (dbt test):
# models/schema.yml
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
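Example (Great Expectations): a minimal sketch using the classic pandas API; newer GE releases use a different entry point, and orders.csv is a placeholder for your ingested batch. It mirrors the dbt tests above.

import great_expectations as ge
import pandas as pd

# Wrap the incoming batch so expectation methods are available
df = ge.from_pandas(pd.read_csv("orders.csv"))

df.expect_column_values_to_not_be_null("order_id")
df.expect_column_values_to_be_between("amount", min_value=0)

# Fail fast at ingestion instead of corrupting downstream tables
result = df.validate()
if not result.success:
    raise ValueError(f"Schema validation failed: {result}")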
Why it matters: Catch bad data early. Fail fast instead of corrupting downstream tables.
2. Full Table Scans Every Day
Mistake: Your nightly ETL job reads the entire users table (10M rows) even though only 100 rows changed.
Fix: Use incremental loads. Track updated_at timestamps or use change data capture (CDC).
Example (dbt incremental):
{{ config(
    materialized='incremental',
    unique_key='user_id'
) }}

SELECT * FROM raw.users

{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
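Outside dbt, the same high-water-mark pattern is easy to hand-roll. A minimal sketch with SQLAlchemy, assuming a hypothetical pipeline_state table that stores the last-seen timestamp (the DSN is a placeholder):

import sqlalchemy as sa

engine = sa.create_engine("postgresql://user:pass@host/db")  # placeholder DSN

with engine.begin() as conn:
    # Read the watermark from the last successful run
    # (assumes the state row already exists)
    last_seen = conn.execute(
        sa.text("SELECT last_updated_at FROM pipeline_state WHERE pipeline = 'users'")
    ).scalar()

    # Pull only rows that changed since then
    rows = conn.execute(
        sa.text("SELECT * FROM users WHERE updated_at > :ts"),
        {"ts": last_seen},
    ).fetchall()

    # ... load `rows` into the warehouse here ...

    # Advance the watermark to the max timestamp actually extracted,
    # not now(), so rows updated mid-run aren't skipped next time
    max_ts = max((r.updated_at for r in rows), default=last_seen)
    conn.execute(
        sa.text("UPDATE pipeline_state SET last_updated_at = :ts WHERE pipeline = 'users'"),
        {"ts": max_ts},
    )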
Why it matters: Faster pipelines. Lower compute costs. BigQuery bills by bytes scanned and Snowflake bills for warehouse runtime, and both drop when you stop rescanning 10M unchanged rows.
3. No Idempotency
Mistake: Your pipeline runs twice (Airflow retry, manual trigger). Now you have duplicate rows.
Fix: Make pipelines idempotent. Use MERGE (upsert) instead of INSERT. Track run IDs.
Example (Postgres):
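-- Upsert: insert new rows, update existing ones keyed on order_id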
INSERT INTO orders (order_id, amount, updated_at)
VALUES (123, 99.99, NOW())
ON CONFLICT (order_id)
DO UPDATE SET amount = EXCLUDED.amount, updated_at = EXCLUDED.updated_at;
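For the run-ID side, a minimal sketch, assuming a hypothetical pipeline_runs table with a unique run_id column:

import sqlalchemy as sa

def run_once(engine, run_id: str) -> None:
    with engine.begin() as conn:
        # The unique run_id makes a duplicate run a no-op: the insert
        # conflicts, rowcount is 0, and we skip the work
        inserted = conn.execute(
            sa.text(
                "INSERT INTO pipeline_runs (run_id) VALUES (:rid) "
                "ON CONFLICT (run_id) DO NOTHING"
            ),
            {"rid": run_id},
        ).rowcount
        if inserted == 0:
            return
        # ... do the actual load inside the same transaction ...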
Why it matters: Retries are safe. No manual cleanup after failures.
4. Ignoring Cost Monitoring
Mistake: You set up a BigQuery pipeline. Three months later, your bill is $8K. Turns out a dev was running SELECT * against a 5TB table for testing.
Fix:
- Set query cost alerts (BigQuery, Snowflake dashboards)
- Use LIMIT in dev (note: in BigQuery, LIMIT alone doesn't reduce bytes scanned; cap queries with maximum_bytes_billed instead, as in the sketch below). Partition tables by date.
- Review your most expensive queries monthly.
Example (BigQuery cost tracking):
SELECT
    user_email,
    SUM(total_bytes_billed) / POW(10, 12) AS tb_billed,
    SUM(total_slot_ms) / (1000 * 60 * 60) AS slot_hours
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY user_email
ORDER BY tb_billed DESC
LIMIT 10;
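To enforce that cap programmatically, the google-cloud-bigquery Python client lets you set maximum_bytes_billed per query, so a runaway SELECT * fails before it bills you. A minimal sketch (the table name is a placeholder):

from google.cloud import bigquery

client = bigquery.Client()

# Reject the query up front if it would scan more than ~1 GB
job_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**9)

query = "SELECT order_id, amount FROM `project.dataset.orders` LIMIT 100"
rows = client.query(query, job_config=job_config).result()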
Why it matters: Catch cost overruns before they're a budget crisis.
5. No Data Lineage
Mistake: A metric changes. No one knows which upstream table caused it. You spend 2 hours tracing joins.
Fix: Document lineage. Tools:
- dbt docs (auto-generates lineage graphs from ref() calls; run dbt docs generate)
- OpenLineage (open standard)
- Airflow + custom DAG metadata
Why it matters: Debug faster. Trust your data.
6. Testing Only in Production
Mistake: You deploy a new dbt model. It breaks the CEO's dashboard. You roll back at 11 PM.
Fix: CI/CD for data pipelines. Run dbt tests on every PR. Use staging environments.
Example (GitHub Actions):
name: dbt CI
on: [pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dbt
        run: pip install dbt-snowflake
      - name: Install packages
        run: dbt deps  # needed for dbt_utils tests like accepted_range
      - name: Run tests
        run: dbt test --profiles-dir .
Why it matters: Catch breaks before production. Ship confidently.
7. No Alerting on Failures
Mistake: Your nightly ETL fails. Analysts notice 3 days later when dashboards show old data.
Fix: Alert on:
- Job failures (Airflow, dbt Cloud; see the callback sketch below)
- Data freshness (dbt source freshness tests)
- Schema changes (Great Expectations)
Example (dbt Cloud webhook → Slack):
# In Airflow or dbt Cloud
if job.status == 'failed':
    send_slack_alert(f"Pipeline {job.name} failed. Check logs: {job.url}")
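In Airflow specifically, the idiomatic hook is on_failure_callback. A minimal sketch, assuming a Slack incoming webhook (the URL, DAG, and task names are placeholders):

from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.bash import BashOperator

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # placeholder

def send_slack_alert(text: str) -> None:
    # Slack incoming webhooks accept a simple {"text": ...} payload
    requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=10)

def notify_failure(context):
    ti = context["task_instance"]
    send_slack_alert(f"Task {ti.task_id} failed. Logs: {ti.log_url}")

with DAG(
    dag_id="nightly_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    # Applied to every task in the DAG via default_args
    default_args={"on_failure_callback": notify_failure},
) as dag:
    BashOperator(task_id="dbt_build", bash_command="dbt build")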
Why it matters: Fix issues in hours, not days.
Bonus: The Hardest Problem
Data quality isn't just tests—it's culture. We've seen teams with perfect pipelines but dirty source data. Fix it upstream (API validation, DB constraints) or accept that cleaning data is part of the job.
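On the upstream side, even a thin validation layer at the API boundary helps. A minimal sketch with pydantic v2 (the Order model is illustrative; it mirrors the dbt tests from mistake #1):

from pydantic import BaseModel, Field, ValidationError

class Order(BaseModel):
    order_id: int
    amount: float = Field(ge=0)  # same rule as the accepted_range test

try:
    Order.model_validate({"order_id": 123, "amount": -5.0})
except ValidationError as e:
    # Reject at the boundary instead of letting bad data into the pipeline
    print(e)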
Conclusion
Data pipelines are production software. Treat them like it: version control, CI/CD, monitoring, and on-call rotations.
Need help? We've built data platforms for companies processing 100M+ events/day. Talk to us about your pipeline challenges.