If you've spent any time in the modern data ecosystem over the past few years, you've likely encountered the medallion architecture. Originally popularized by Databricks but now widely adopted across the industry, this Bronze-Silver-Gold pattern has become the default way data teams structure their data lakes and lakehouses.
But here's the thing: while the concept is simple on the surface, the implementation details matter enormously. I've seen teams struggle with over-engineered medallion architectures that create more problems than they solve, and I've seen teams use it to build remarkably clean, performant data platforms. The difference comes down to understanding not just what medallion architecture is, but why it exists and how to apply it thoughtfully.
What Is Medallion Architecture?
At its core, medallion architecture is a data design pattern that organizes data in a lakehouse or data lake into three progressive layers of data quality:
- Bronze Layer: Raw data ingestion and persistence
- Silver Layer: Cleaned, validated, and enriched data
- Gold Layer: Business-level aggregates and curated datasets
Think of it as a refinement process, much like ore processing in metallurgy (hence the metal-based naming). Each layer adds value by improving data quality, structure, and usability while maintaining a clear lineage from source to consumption.
The Bronze Layer: Your Source of Truth
The Bronze layer is where raw data lands in your lakehouse, unchanged from its source format. This is your historical archive and recovery point.
What Goes in Bronze
Bronze should contain:
- Exact copies of source data with minimal or no transformations
- Original data types and schema from the source system
- Ingestion metadata (timestamps, source identifiers, batch IDs)
- Both full snapshots and incremental updates, depending on your ingestion pattern
The key principle here is immutability. Once data lands in Bronze, it shouldn't be modified. If you need to re-ingest, you append new data with updated ingestion timestamps. This gives you the ability to replay your entire pipeline from any point in time—a capability that will save you when (not if) something goes wrong downstream.
Common Bronze Patterns
In practice, Bronze typically uses formats optimized for write performance and schema flexibility:
bronze/
source_system=salesforce/
table=accounts/
date=2024-01-15/
batch_001.parquet
source_system=postgres_prod/
table=orders/
date=2024-01-15/
batch_001.parquetI'm opinionated here: use Parquet or Delta Lake format even in Bronze. Yes, you could use JSON or CSV to preserve the "raw" nature, but you're just creating unnecessary pain. Modern columnar formats handle schema evolution beautifully and give you much better performance when you inevitably need to reprocess Bronze data.
The Silver Layer: Where the Real Work Happens
Silver is where your data engineering discipline shows. This layer transforms Bronze data into cleaned, conformed, and enriched datasets ready for analytics.
What Happens in Silver
Typical Silver transformations include:
- Data quality enforcement: Removing duplicates, handling nulls, validating constraints
- Schema standardization: Converting to consistent data types, renaming columns to standards
- Deduplication: Implementing SCD (Slowly Changing Dimensions) patterns where appropriate
- Enrichment: Adding derived columns, parsing nested fields, joining reference data
- Filtering: Removing soft-deleted records or test data
Here's a critical distinction: Silver should still be relatively granular. You're cleaning and structuring, not heavily aggregating. If your Bronze has individual order line items, your Silver should too—just with better data quality.
Silver Design Decisions
One question teams always face: should Silver be normalized or denormalized? My take: it depends on your consumption patterns, but lean toward modest denormalization in Silver.
Create domain-aligned tables that make sense for your business (customers, orders, products) rather than trying to perfectly normalize or create one giant wide table. You want Silver to be usable by multiple downstream use cases, not optimized for just one.
-- Example Silver transformation
CREATE OR REPLACE TABLE silver.orders AS
SELECT
order_id,
customer_id,
CAST(order_date AS DATE) as order_date,
CAST(order_total AS DECIMAL(10,2)) as order_total,
UPPER(TRIM(status)) as status,
-- Add data quality metadata
CURRENT_TIMESTAMP() as processed_at,
'silver_v2' as pipeline_version
FROM bronze.raw_orders
WHERE order_id IS NOT NULL -- Basic quality filter
AND status != 'TEST' -- Remove test data
QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY ingestion_time DESC) = 1 -- DedupeThe Gold Layer: Business-Ready Analytics
Gold is your presentation layer. This is where you create the business-level aggregates, metrics, and curated datasets that directly power dashboards, reports, and ML models.
What Lives in Gold
Gold tables are purpose-built for consumption:
- Aggregated metrics: Daily revenue by region, monthly active users, etc.
- Wide denormalized tables: Everything a dashboard needs in one place
- Feature stores: ML-ready feature sets with proper temporal alignment
- Business entities: Customer 360 views, product hierarchies with all relevant attributes
The key principle: optimize for query performance and user experience. Gold tables should be fast and intuitive. If a business analyst needs to join five tables to answer a basic question, you haven't done your job.
Gold Organization
I strongly recommend organizing Gold by business domain or use case:
gold/
finance/
revenue_daily
expenses_monthly
budget_vs_actual
marketing/
campaign_performance
customer_acquisition_cost
funnel_metrics
product/
feature_usage_weekly
user_retention_cohortsThis makes it immediately clear to consumers what data is available and who owns it.
Common Pitfalls and How to Avoid Them
Over-Engineering Bronze
Don't try to fix data quality issues in Bronze. It's tempting to "just clean this one field" during ingestion, but resist. Bronze should be a simple, reliable landing zone. Save the complexity for Silver.
Creating Too Many Silver Variants
I've seen Silver layers with tables like orders_silver_v1, orders_silver_cleaned, orders_silver_final. This is a code smell. You should have one authoritative Silver version of each entity. Use proper versioning and migration strategies instead of proliferating tables.
Skipping Layers
Sometimes teams go directly from Bronze to Gold for "simple" use cases. In my experience, this creates technical debt. That "simple" report eventually needs better data quality, and now you're maintaining transformation logic in Gold that should be in Silver, and you can't reuse it across use cases.
Not Having Clear Ownership
Each layer needs clear ownership. Typically: data engineers own Bronze and Silver, while analytics engineers or data analysts own Gold. Without this clarity, you get chaos.
When Medallion Architecture Makes Sense
Medallion architecture shines when you have:
- Multiple data sources feeding into analytics
- Multiple downstream consumers with different needs
- Requirements for data quality and governance
- Need for historical data replay and auditing
- A team large enough to maintain distinct layers
For very small teams or simple pipelines (one source, one dashboard), medallion might be overkill. A well-designed two-tier system (raw + curated) could suffice. Don't cargo-cult the pattern just because it's trendy.
The Bottom Line
Medallion architecture isn't magic, but it's a solid, battle-tested pattern for organizing data platforms. The Bronze-Silver-Gold progression gives you clear separation of concerns, data quality checkpoints, and flexibility for multiple use cases.
The key to success is understanding that medallion architecture is a guideline, not a rigid specification. Adapt the pattern to your team's size, your data volumes, and your organizational needs. Start simple, add layers of sophistication as they provide clear value, and always prioritize clarity and maintainability over architectural purity.
Your future self—and your teammates—will thank you.