Databricks Data Engineer Professional Exam — Week 2: MERGE, SCD, and Why Transformations Are Permanent Decisions

Week 2: ETL Mastery — Transformations, MERGE Semantics, and SCD Modeling

Jan 05, 2026

Welcome to week 2 of Databricks Data Engineer Professional exam guide.

Week 1 focused on how Databricks expects you to think about data systems: layering, replayability, failure modes, and governance.
Week 2 is where those mental models are applied under pressure—in transformation logic.

This is the part of the exam where:

experienced Spark engineers often over-optimize
incremental logic looks correct but behaves incorrectly over time
pipelines succeed technically while failing operationally

The Professional exam gives the highest weight to transformations because this is where production systems quietly drift from “working” to “untrustworthy.”

1. Why Transformations Dominate the Professional Exam

Transformations are not about moving data.

They are about encoding assumptions:

what constitutes a valid record
how change is interpreted
whether history matters
how late or corrected data behaves

Once encoded, these assumptions are extremely difficult to undo.

The exam reflects this reality.
Most transformation questions are not asking how to implement logic—they are asking:

“Does this logic still hold when data changes, arrives late, or must be reprocessed?”

If the answer is “it depends,” the exam wants to know whether you identified what it depends on.

2. Transformations Are All About Intent

The exam assumes you already know how to write:

joins
filters
aggregations
window functions

What it evaluates is whether your transformations:

preserve semantic meaning
remain correct under reprocessing
fail in observable ways
scale predictably as data grows

A transformation is correct only if it remains correct tomorrow, not just today.

This is why exam questions often describe:

pipelines that “used to work”
jobs that “gradually slowed down”
aggregates that “occasionally differ”

These are not operational but design issues.

3. Delta Lake MERGE: What Actually Happens

MERGE is the most powerful—and most misunderstood—operation in Delta Lake.

Key fact the exam expects you to internalize

MERGE rewrites files, not rows.

When a MERGE executes:

Delta identifies candidate files in the target table
rows are matched against the join condition
entire Parquet files are rewritten, even if a single row changes

This has several implications the exam tests repeatedly.

What MERGE Is Good At

MERGE is appropriate when:

processing CDC feeds
applying conditional updates
handling SCD logic
updating a small subset of a well-partitioned table

In these cases, MERGE expresses intent clearly and safely.

Where MERGE Becomes Dangerous

MERGE becomes a liability when:

tables are frequently updated
file layout is uncontrolled
a large portion of the table matches the join condition
MERGE is treated as a default write strategy

A classic exam framing:

“This MERGE job keeps getting slower over time, even though data volume is stable.”

The correct answer is rarely “increase cluster size.”

The exam expects you to recognize write amplification caused by file rewrites.

4. Choosing the Right Write Strategy

Delta Lake gives you multiple ways to mutate data.
Each one encodes a different long-term assumption.

UPDATE / DELETE

rewrite entire files
expensive at scale
appropriate only for small, targeted corrections

MERGE

conditional logic
CDC-friendly
cost grows with table size and file count

INSERT OVERWRITE

replaces entire partitions
predictable cost
ideal for batch recomputation

The exam often prefers recompute cleanly over incrementally patch forever.

This is a subtle but important shift from older big-data thinking.

5. Slowly Changing Dimensions (SCD): The Most Tested Pattern

SCDs are a formal way to model how meaning changes over time.

SCD Type 1 — Overwrite History

What it does

updates records in place
removes historical values

Correct when

fixing incorrect data
history does not affect analysis

Dangerous when

historical state matters
reporting depends on point-in-time accuracy

The exam penalizes Type 1 when history is implicitly required—even if not stated explicitly.

SCD Type 2 — Preserve History

What it does

closes existing records
inserts new versions
tracks validity windows

Trade-offs

increased storage
more complex queries
higher MERGE cost

The exam consistently prefers correct historical meaning over simplicity when trade-offs exist.

Critical Exam Insight

SCD choice is not technical—it is semantic.

The correct SCD strategy depends on:

how the data is consumed
whether changes redefine meaning
whether analysts expect historical accuracy

Choosing the wrong SCD type produces data that looks correct—but answers the wrong question.

6. Schema Evolution Inside Transformations

Schema evolution is one of the most common silent failure modes.

Safe evolution

adding optional columns
backward-compatible changes
isolated downstream impact

Dangerous evolution

changing column semantics
tightening nullability
allowing new fields to propagate blindly

The exam tests whether schema evolution is:

intentional
localized
governed

Automatic evolution everywhere is treated as a risk, not a feature.

7. Deduplication and Late-Arriving Data

Deduplication is not DISTINCT.

Correct deduplication requires:

a clear definition of uniqueness
deterministic ordering
explicit handling of late records

Poor deduplication logic leads to:

inconsistent aggregates
non-idempotent pipelines
incorrect reprocessing results

The exam expects window-based reasoning over shortcuts.

8. Incremental Processing vs Reprocessing

Incremental logic assumes:

stable keys
bounded change
correct upstream data

When those assumptions break, incremental pipelines become brittle.

Reprocessing:

costs more
is simpler
is often safer

The exam frequently rewards slower but correct over fast but fragile.

Exam-Style Scenarios

Scenario 1: MERGE Performance Degradation

A Silver table is updated hourly using MERGE.
Data volume is stable, but job duration increases steadily.

Most likely cause?
→ File rewrite amplification due to layout, not compute shortage.

Scenario 2: MERGE vs INSERT OVERWRITE

A daily job recomputes the last 7 days of data.

Better strategy?
→ INSERT OVERWRITE on affected partitions for predictable cost.

Scenario 3: SCD Choice

Customer attributes change and historical reports depend on prior state.

Correct model?
→ SCD Type 2, despite higher cost.

Scenario 4: Schema Evolution Breaks Dashboards

A new column propagates automatically and downstream logic fails.

Design flaw?
→ Schema enforcement happened too late.

Common Pitfalls the Exam Targets

Using MERGE everywhere
Defaulting to SCD Type 1
Ignoring file layout
Treating schema evolution as free
Optimizing before semantics are stable

These are experienced-engineer traps.

Closing Perspective

Transformations are where data engineering stops being mechanical.

They are where judgment becomes permanent.

The Databricks Data Engineer Professional exam is designed to expose pipelines that work today—but fail quietly tomorrow.

Design transformations that survive change, and the exam becomes far more predictable.

Hands-On Companion (GitHub)

This post is accompanied by a small set of Databricks notebooks that surface the behaviors discussed here—MERGE file rewrites, SCD trade-offs, schema evolution risks, and write-strategy decisions.

They are optional, but strongly recommended if you are preparing seriously for the Databricks Data Engineer Professional exam.

→ GitHub repo: Databricks Data Engineer Professional – Study Guide (Week 2)

Up Next

Week 3: Performance & Cost Control
Why “fast” pipelines still bleed DBUs—and how the exam expects you to reason about optimization.

Data Launchpad India

Discussion about this post

Ready for more?