I've seen it happen countless times: a company migrates to BigQuery, celebrates the flexibility and performance, then gets hit with a five-figure bill at the end of the month. One team I worked with was scanning 47 terabytes of data daily—most of it completely unnecessary. After implementing the strategies I'm about to share, we reduced their monthly BigQuery spend from $18,000 to $2,400.
BigQuery's pricing model is elegant but unforgiving. You pay for what you scan (on-demand) or reserve capacity upfront (flat-rate). Both models reward efficiency and punish carelessness. Let's dig into how to optimize your costs without sacrificing the insights your business needs.
Understanding BigQuery's Pricing Models
Before optimizing, you need to understand what you're paying for. BigQuery offers two primary pricing models:
- On-Demand Pricing: $5 per TB of data processed (first 1TB per month is free)
- Flat-Rate Pricing: Reserve dedicated query processing capacity starting at $2,000/month for 100 slots
For most small to medium workloads, on-demand makes sense. Once you're consistently processing more than 400TB per month, flat-rate becomes economical. But here's the thing: most organizations can cut their processing volume so dramatically that they never need to consider flat-rate pricing.
The Low-Hanging Fruit: Partitioning and Clustering
If you take away only one thing from this article, make it this: partition your tables by date. This single change has delivered the most dramatic cost reductions I've ever seen.
Partitioning: Your First Line of Defense
Partitioning divides your table into segments based on a column value—typically a date. When you query a partitioned table and filter by the partition column, BigQuery only scans the relevant partitions.
CREATE TABLE `project.dataset.events_partitioned`
PARTITION BY DATE(event_timestamp)
AS SELECT * FROM `project.dataset.events`;Let's say you have a 10TB events table spanning two years. Without partitioning, a query for yesterday's data scans all 10TB. With daily partitioning and a proper WHERE clause, you scan maybe 15GB. That's a 99.85% reduction in data processed.
Critical gotcha: Your WHERE clause must directly filter on the partition column. This works: WHERE DATE(event_timestamp) = CURRENT_DATE(). This doesn't optimize: WHERE event_timestamp >= TIMESTAMP(CURRENT_DATE()) (in older BigQuery versions).
Clustering: The Power Multiplier
Clustering sorts data within each partition based on specified columns. This allows BigQuery to skip irrelevant blocks during query execution.
CREATE TABLE `project.dataset.events_optimized`
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id, event_type
AS SELECT * FROM `project.dataset.events`;If you frequently filter or join on user_id, clustering can reduce costs by another 50-80% on top of partitioning. You can specify up to four clustering columns, ordered by cardinality (highest to lowest generally works best).
Query Optimization: Writing Efficient SQL
Bad SQL is expensive SQL. Here are the patterns that waste money:
Stop Using SELECT *
This is data engineering 101, but I still see it everywhere. SELECT * scans every column in your table, even if you only need two fields.
-- BAD: Scans entire table
SELECT * FROM `project.dataset.large_table`
WHERE user_id = '12345';
-- GOOD: Scans only needed columns
SELECT user_id, event_type, event_timestamp
FROM `project.dataset.large_table`
WHERE user_id = '12345';On a table with 100 columns, this simple change can reduce costs by 95% or more.
Filter Early, Filter Often
Apply WHERE clauses before JOINs and aggregations whenever possible. BigQuery's optimizer is good, but explicit filtering helps it make better decisions.
-- BETTER APPROACH
WITH filtered_events AS (
SELECT user_id, event_type, revenue
FROM `project.dataset.events`
WHERE DATE(event_timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
AND event_type IN ('purchase', 'refund')
)
SELECT user_id, SUM(revenue) as total_revenue
FROM filtered_events
GROUP BY user_id;Avoid Self-Joins on Large Tables
Self-joins can explode your query costs. Consider using window functions or ARRAY_AGG instead:
-- EXPENSIVE: Self-join approach
SELECT a.user_id, a.event_timestamp, b.event_timestamp as previous_event
FROM events a
LEFT JOIN events b
ON a.user_id = b.user_id
AND b.event_timestamp < a.event_timestamp;
-- CHEAPER: Window function approach
SELECT
user_id,
event_timestamp,
LAG(event_timestamp) OVER (PARTITION BY user_id ORDER BY event_timestamp) as previous_event
FROM events;Materialized Views and BI Engine
For queries that run repeatedly, materialized views cache results and automatically refresh them. You pay storage costs (dirt cheap) instead of repeatedly scanning large tables.
CREATE MATERIALIZED VIEW `project.dataset.daily_metrics`
AS
SELECT
DATE(event_timestamp) as event_date,
event_type,
COUNT(*) as event_count,
COUNT(DISTINCT user_id) as unique_users
FROM `project.dataset.events`
GROUP BY 1, 2;BigQuery automatically uses the materialized view when you query the base table with compatible filters. This is particularly valuable for dashboard queries that run hundreds of times per day.
BI Engine takes this further by keeping frequently accessed data in memory. At $100/month for 100GB capacity, it's incredibly cost-effective for analyst-heavy workloads.
Table Design and Storage Optimization
Use Appropriate Data Types
Smaller data types mean less data scanned. Use INT64 instead of STRING for IDs when possible. Use DATE instead of TIMESTAMP if you don't need time precision.
Denormalization vs. Normalization
This is where I get opinionated: denormalize for BigQuery. Traditional database normalization optimizes for storage and write performance. BigQuery optimizes for analytical query performance.
Instead of joining five tables on every query, denormalize common dimensions into your fact table. Yes, you'll store redundant data. Storage costs $20 per TB per month—query costs are $5,000 per TB scanned. Do the math.
Nested and Repeated Fields
BigQuery's native support for STRUCT and ARRAY types is powerful for cost optimization. Instead of separate tables, nest related data:
-- Instead of a separate user_attributes table
CREATE TABLE `project.dataset.users` (
user_id STRING,
name STRING,
attributes ARRAY>
); Monitoring and Governance
You can't optimize what you don't measure. Set up cost monitoring immediately:
Query Execution Monitoring
Use the INFORMATION_SCHEMA views to track query costs:
SELECT
user_email,
query,
total_bytes_processed / POW(10, 12) as tb_processed,
(total_bytes_processed / POW(10, 12)) * 5 as estimated_cost_usd
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
AND job_type = 'QUERY'
AND state = 'DONE'
ORDER BY total_bytes_processed DESC
LIMIT 100;Run this weekly. You'll identify expensive queries and the people running them.
Set Up Cost Controls
Use custom quotas to prevent runaway queries:
- Set per-user daily query limits
- Implement maximum bytes billed for expensive queries
- Create separate projects for development and production with different quotas
Establish a Review Process
Make query optimization part of your code review process. Before any query goes into production, it should be reviewed for:
- Proper partitioning and clustering usage
- Column selection (no SELECT *)
- Appropriate use of materialized views
- Estimated cost using
--dry-runflag
The 80/20 Rule for BigQuery Optimization
Start with these high-impact actions that deliver 80% of the cost savings:
- Partition all large tables by date
- Eliminate SELECT * from your codebase
- Add clustering to frequently queried columns
- Create materialized views for repeated analytical queries
- Set up cost monitoring and alerts
I've guided teams through this process dozens of times. These five steps consistently reduce costs by 60-85% within the first month. The remaining optimizations—query refactoring, advanced partitioning strategies, flat-rate pricing evaluation—can wait until you've captured the low-hanging fruit.
Remember: BigQuery's pricing model rewards thoughtful data engineering. Every optimization you make pays dividends every single time that query runs. Start tracking, start optimizing, and watch your bills shrink while your query performance improves.