Optimal Datatypes to Reduce Storage & Compute Costs

Every column definition is a permanent commitment about how data is stored, compressed, and processed — across billions of rows, thousands of queries, and years of platform life. The most expensive mistakes are also the easiest to prevent.

Platform coverage: Snowflake · BigQuery · Databricks · Redshift · DuckDB

The numbers upfront

Type decision	Cost impact
INT vs BOOLEAN for a flag	8× more storage
VARCHAR(4000) for a short code	40× worse compression
Type mismatch on cluster key predicate	100% partitions scanned (20× compute)
VARIANT vs native typed column (hot field)	5–20× slower queries

Why your cloud bill starts at CREATE TABLE

Most cost conversations in data engineering jump straight to warehouse sizing, clustering strategies, and pipeline scheduling. These are legitimate levers — but they all operate on data that has already been stored, already been compressed, and is already on its way to the query engine.

The decision with the longest-lasting cost consequence happens much earlier, and much more quietly: the CREATE TABLE statement.

Every column definition you write is a contract. You are telling the storage engine how to encode the data, the compression algorithm what to expect, the query planner what metadata it can trust, and setting the physical footprint that every downstream query will pay to read. Get the type right and all four systems work in your favour simultaneously. Get it wrong and all four penalise you — on every query, every pipeline run, and every billing cycle, for the entire lifetime of the table.

A wrong datatype is not a schema detail. It is a permanent cost commitment made at table creation, paid on every query that runs against that table for the rest of its life.

How columnar engines use type information

To understand why datatypes cost money, you need to understand how columnar storage works. In a row-oriented system, each row is stored contiguously — all columns of row 1 together, then row 2. In a columnar engine, all values of one column are stored together — every value of status_code in one file, every value of order_date in another.

This layout enables a powerful optimisation chain. Before the compression algorithm runs, the engine applies column-level encodings. The declared type is central to which encodings are available.

The encoding pipeline

Raw column values
        ↓
Step 1: Type encoding
  BOOLEAN    → 1 bit per value
  DATE       → 4-byte integer (days since epoch)
  SMALLINT   → 2 bytes per value
  VARCHAR(N) → actual length bytes + 1–4 byte length prefix

        ↓
Step 2: Structural encoding
  Dictionary encoding  — low-cardinality columns: replace repeating
                         values with small integer IDs
                         (4 distinct statuses → 2-bit IDs)
  Run-length encoding  — consecutive identical values: store (value, count)
  Delta encoding       — monotonic sequences: store differences
  Bit-packing          — pack small integers into minimal bits

        ↓
Step 3: Compression codec (ZSTD, Snappy, LZ4, Gzip)
  Applied to already-encoded bytes — encoding dramatically
  improves what compression can achieve

        ↓
Stored column chunk + per-column MIN/MAX statistics

The critical insight: encoding happens before compression. A status_code column with 4 possible values becomes a stream of 2-bit integers before ZSTD ever touches it. The compressor achieves excellent ratios on {0, 1, 2, 3}. Stack both together and a 10 GB CSV arrives on disk as roughly 1 GB. That compound effect is only possible when the type declaration signals bounded, predictable structure to the encoder.

The high-impact type decisions

1. BOOLEAN vs INTEGER for flag columns

A column that stores true/false, yes/no, active/inactive — typed as INTEGER — uses 4 bytes per row where 1 bit would suffice. On a 500 million-row fact table, this means 2 billion wasted bytes before compression applies. BOOLEAN also compresses dramatically better: a column of 0s and 1s encoded as integers is near-random to a compressor; a column of BOOLEAN values is recognised as 1-bit data and packed trivially.

-- ❌ Anti-pattern
CREATE TABLE fact_orders (
  is_active  INTEGER,      -- 4 bytes. Always 0 or 1. No encoding benefit.
  ...
);

-- ✅ Correct
CREATE TABLE fact_orders (
  is_active  BOOLEAN,      -- 1 bit. Bit-packed trivially. ~60 MB vs 2 GB.
  ...
);

Real impact on 500M rows: ~2 GB raw → ~60 MB compressed. A single column change.

2. DATE vs TIMESTAMP when time is irrelevant

TIMESTAMP stores 8 bytes per row. DATE stores 4 bytes. If the time component on your order_date column is always midnight — as it typically is on any date-grained fact table — you are storing 4 completely redundant bytes per row.

But the storage cost is secondary. The primary cost of TIMESTAMP on a date-grained column is what it does to partition skipping.

Every columnar engine stores per-column MIN/MAX statistics alongside each data file. When a query filter arrives, the engine compares the predicate value against each file's range and skips non-matching files entirely — zero I/O for skipped files.

This only works when the predicate type matches the column type exactly. A BI tool passes a DATE literal against a TIMESTAMP column — or vice versa. The engine must resolve the type mismatch before comparing. In many engine versions, this comparison fails silently and the engine reads every partition.

⚠️ Partition skipping failure: A query that should skip 95% of a 10 TB table reads all of it. That is a 20× compute cost increase on every execution — permanently, until the type is corrected. BI tools and ORM frameworks routinely pass parameters without explicit types. Audit your generated SQL.

-- ❌ TIMESTAMP on a date-grained fact table
order_date  TIMESTAMP   -- 8 bytes, type mismatch risk on date predicates

-- ✅ DATE — correct type, correct size, correct statistics
order_date  DATE        -- 4 bytes, enables clean date-range partition skipping

3. VARCHAR bounds: the most underestimated cost lever

This is where the mechanism needs to be stated precisely.

VARCHAR is variable-length. On disk, a row storing 'ACTIVE' uses exactly 6 bytes plus a short length prefix — regardless of whether the column is declared VARCHAR(8) or VARCHAR(4000). The declared maximum does not pre-allocate storage per row. A claim that "VARCHAR(4000) allocates 4,000 bytes per row in sort buffers" is overstated for modern columnar engines like Snowflake, BigQuery, and Databricks, which operate on actual value widths in working memory. That behaviour exists in some older RDBMS engines (certain SQL Server and MySQL configurations), but it is not a universal truth.

The genuine, certain costs of VARCHAR(4000) for a column that stores 6–9 character codes are:

Mechanism 1 — Dictionary encoding is skipped

When the engine sees VARCHAR(4000), it cannot safely assume bounded cardinality. It skips dictionary encoding entirely. The compressor receives raw variable-length strings with poor repetition structure.

Mechanism 2 — Compressed file size is larger

Without encoding, the compression algorithm works on raw string bytes and achieves far lower ratios than on 2-bit integer IDs. The column chunk on disk is 8–40× larger. Every query decompresses more bytes per scan.

Mechanism 3 — Partition skipping fails on type mismatch

A type mismatch between the column's declared type and an incoming predicate (common from BI tools and ORMs that pass untyped strings) forces an implicit cast. The engine cannot compare the result against partition MIN/MAX statistics and reads every partition.

-- ❌ Over-declared — dictionary encoding skipped
status_code  VARCHAR(4000)   -- actual max value: 'CANCELLED' = 9 chars

-- ✅ Tight bound — dictionary encoding applies
status_code  VARCHAR(9)      -- 4 values, 2-bit dictionary IDs, 8× smaller chunk

Storage on 500M rows:

Declaration	Dictionary encoding	Compressed size	Per 100 queries
VARCHAR(4000)	Skipped	~480 MB	48 GB decompressed
VARCHAR(255)	Skipped	~480 MB	48 GB decompressed
VARCHAR(9)	Applied	~60 MB	6 GB decompressed

Note that VARCHAR(255) produces the same outcome as VARCHAR(4000). The encoding decision is not a linear function of the bound — once the declared maximum signals that arbitrary-length data may arrive, the encoder conservatively skips dictionary encoding. Tightening to the actual data maximum is what crosses the threshold.

Finding the correct bound:

-- Run this on your actual data
SELECT MAX(LENGTH(status_code)) FROM fact_orders;
-- Set VARCHAR to exactly that value. No padding. No rounding up.

4. SMALLINT vs BIGINT for low-range codes

Status codes, category IDs, and country codes rarely exceed 32,767. SMALLINT uses 2 bytes. BIGINT uses 8 bytes — four times the storage for the same information. Use BIGINT only when values genuinely exceed 2,147,483,647.

5. DECIMAL vs FLOAT for financial values

IEEE 754 binary floating-point cannot represent most decimal fractions exactly. 0.1 does not exist in binary floating-point — it is stored as the closest representable approximation. Aggregate millions of such values and the errors accumulate silently.

-- ❌ FLOAT: rounding error accumulates at scale
SELECT SUM(revenue_float) FROM fact_transactions;
-- Result: 14,522.990000000182
-- $182 discrepancy on 1M rows. No error. No warning.

-- ✅ DECIMAL(18,4): exact decimal arithmetic
SELECT SUM(revenue_decimal) FROM fact_transactions;
-- Result: 14,522.9900
-- Reconciles exactly with source system.

For financial columns, FLOAT is not just a cost choice — it is a correctness choice. Use DECIMAL(p, s) where p is total digits and s is scale. DECIMAL(18, 4) covers any realistic revenue, price, or margin value.

6. VARIANT / nested columns — the query processing tax

Semi-structured columns — VARIANT in Snowflake, STRUCT/ARRAY in BigQuery, complex types in Spark — appear to save storage by collapsing sparse attributes into a single blob. For any field queried with meaningful frequency, they consistently cost more in compute than they save in storage.

Why: a field nested inside a JSON or struct column requires reading the entire parent blob per row, deserialising the structure at runtime, and extracting the field — even if you only access one key. No column-level pruning. No partition statistics on nested paths. No type encoding.

The cost crossover: storage saving from nesting is fixed and bounded. Query processing penalty scales linearly with query frequency. At roughly 5+ queries per month against a nested field on a 500M-row table, query cost exceeds the storage saving. For a dashboard column queried 100 times a day, nesting can cost 10–50× more in compute than it saves in storage over a year.

The hybrid pattern — the right answer:

CREATE TABLE fact_orders_hybrid (
  order_id          VARCHAR(36),
  order_date        DATE,              -- cluster key

  -- HIGH-FREQUENCY: flat typed columns
  -- Any field in WHERE / GROUP BY / JOIN → always flatten
  prod_category_l1  VARCHAR(40),      -- filtered in every report
  prod_brand        VARCHAR(80),      -- grouped in margin dashboards
  prod_price_tier   VARCHAR(8),       -- 'PREMIUM','MID','BUDGET'

  -- LOW-FREQUENCY: VARIANT overflow for sparse attributes
  -- Queried fewer than 5×/month, never in WHERE/JOIN → nesting justified
  product_attrs_ext VARIANT
  -- color, weight, material, warranty, dimensions...
)
CLUSTER BY (order_date);

The rule: if a field appears in WHERE, JOIN, or GROUP BY — promote it to a flat native typed column. If it is only selected in detail views and queried fewer than 5 times per month — nesting is justified. Review quarterly as consumption patterns evolve.

Partition skipping — the deeper mechanism

Four patterns that silently kill partition skipping, beyond the type mismatch already covered:

-- Pattern 1: Function-wrapped cluster key — always a full scan
-- ❌ Wrong
WHERE YEAR(order_date) = 2024
WHERE DATE_TRUNC('month', order_date) = '2024-01-01'

-- ✅ Correct: range predicate, no function on the column
WHERE order_date BETWEEN '2024-01-01' AND '2024-12-31'

-- Pattern 2: String literal against integer column (ORMs)
-- ❌ Wrong
WHERE customer_id = '100045'     -- string vs BIGINT

-- ✅ Correct: typed literal
WHERE customer_id = 100045

-- Pattern 3: BI tool passes TIMESTAMP against DATE column
-- ❌ Wrong
WHERE order_date = '2024-01-15 00:00:00'

-- ✅ Correct
WHERE order_date = '2024-01-15'  -- or DATE '2024-01-15'

-- Pattern 4: CAST on cluster key column
-- ❌ Wrong
WHERE CAST(order_date AS VARCHAR) = '2024-01-15'

-- ✅ Correct: filter on the column directly
WHERE order_date = DATE '2024-01-15'

Validate pruning in Snowflake:

SELECT
  partitions_scanned,
  partitions_total,
  ROUND(100 * partitions_scanned / partitions_total, 1) AS pct_scanned
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('minute', -5, CURRENT_TIMESTAMP)
ORDER BY start_time DESC;

-- Target: below 10% for selective date queries
-- Red flag: 100% means zero pruning — investigate immediately

The one-afternoon audit

Find your top 20 tables by storage size. For each, look for these five patterns:

INTEGER columns always 0 or 1 → BOOLEAN candidates SELECT COUNT(DISTINCT col) FROM table WHERE col NOT IN (0, 1) — if zero rows, it's a boolean.
TIMESTAMP columns where time is always midnight → DATE candidates SELECT COUNT(*) FROM table WHERE TIME(col) != TIME '00:00:00' — if zero rows, use DATE.
VARCHAR columns with declared bound far above actual max → over-allocated SELECT MAX(LENGTH(col)) FROM table — set bound to exactly that value.
VARIANT columns with consistent keys on every row → promotion candidates SELECT COUNT(*), COUNT(col:field_name) FROM table — near-equal counts mean the field is always present; flatten it.
FLOAT columns in financial aggregations → DECIMAL candidates Both a correctness fix and a compression improvement.

-- Snowflake: top 20 tables by storage
SELECT
  table_name,
  ROUND(active_bytes / 1e9, 2)       AS active_gb,
  ROUND(time_travel_bytes / 1e9, 2)  AS time_travel_gb,
  row_count
FROM snowflake.account_usage.table_storage_metrics
WHERE deleted IS NULL
ORDER BY active_bytes DESC
LIMIT 20;

-- For each table, check column types and actual max lengths
SELECT column_name, data_type, character_maximum_length
FROM information_schema.columns
WHERE table_name = 'FACT_ORDERS'
ORDER BY ordinal_position;

💡 On modern columnar engines, altering a column type is a metadata-only operation — no data rewrite required. New micro-partitions written after the change benefit from the tighter encoding immediately. Run ALTER TABLE fact_orders ALTER COLUMN status_code SET DATA TYPE VARCHAR(9) and the savings begin on the next write.

Quick reference — type selection rules

If the column holds	Use	Why
True/false, yes/no	BOOLEAN	1 bit. 32× smaller than INT. Trivial encoding.
A date, no time component	DATE	4 bytes vs 8. Enables clean date-range partition skipping.
Short codes (status, type, region)	VARCHAR(actual max)	Tight bound enables dictionary encoding.
Small numeric codes (≤ 32,767)	SMALLINT	2 bytes vs 8 for BIGINT.
Money, prices, financial amounts	DECIMAL(18, 4)	Exact decimal arithmetic. FLOAT accumulates rounding errors silently.
Field in VARIANT queried daily in WHERE / GROUP BY	Flat native typed column	Never in VARIANT. Cannot benefit from column pruning or partition statistics.
Sparse field, queried < 5×/month	VARIANT	Storage saving justified. Review quarterly.
Cluster key column	Must match predicate type exactly	Any mismatch forces a full scan on every query.

Closing

What makes datatype optimisation different from most platform cost levers is its compounding, permanent character. A warehouse auto-suspend policy saves credits until someone changes it. A clustering key benefits queries until the data distribution shifts. But a correct datatype benefits every query, every pipeline run, and every billing cycle for the entire lifetime of the table — with no ongoing maintenance and no degradation over time.

The inverse is equally true. A wrong type penalises every query that touches that column from the day the table is created to the day it is retired. The developer who writes VARCHAR(4000) for a status field has quietly committed the platform to paying the encoding and compression penalty thousands of times per day, indefinitely. That cost will never appear on a dashboard line item. It will simply accumulate in aggregate compute spend, invisible and unchallenged.

Datatype decisions are made once. Their cost consequences last forever. The audit takes an afternoon. The savings compound for years.

The column type is not a schema detail. It is the first line of your cost optimisation strategy.

Tags: #DataEngineering #CloudFinOps #DataModeling #CostOptimization #SQLTips #Snowflake #BigQuery #Databricks #AnalyticsEngineering

Selecting Optimal Datatypes: How Schema Decisions Drive Storage and Compute Cost

The numbers upfront

Why your cloud bill starts at CREATE TABLE

How columnar engines use type information

The encoding pipeline

The high-impact type decisions

1. BOOLEAN vs INTEGER for flag columns

2. DATE vs TIMESTAMP when time is irrelevant

3. VARCHAR bounds: the most underestimated cost lever

4. SMALLINT vs BIGINT for low-range codes

5. DECIMAL vs FLOAT for financial values

6. VARIANT / nested columns — the query processing tax

Partition skipping — the deeper mechanism

The one-afternoon audit

Quick reference — type selection rules

Closing

Comments

More from this blog

Snowflake Micro-Partitions: The storage unit that powers everything

The Silent Multiplier: How Snowflake Time Travel Is Quietly Tripling Your Bill

Command Palette

The numbers upfront

Why your cloud bill starts at CREATE TABLE

How columnar engines use type information

The encoding pipeline

The high-impact type decisions

1. BOOLEAN vs INTEGER for flag columns

2. DATE vs TIMESTAMP when time is irrelevant

3. VARCHAR bounds: the most underestimated cost lever

4. SMALLINT vs BIGINT for low-range codes

5. DECIMAL vs FLOAT for financial values

6. VARIANT / nested columns — the query processing tax

Partition skipping — the deeper mechanism

The one-afternoon audit

Quick reference — type selection rules

Closing

Comments

More from this blog