AI Infrastructure Economics Series · Tool 4

Cost Per Prediction (CPP) Calculator

AI's relevance is not determined by hype, but by its unit economics. This calculator decomposes a single inference into its real cost drivers — electricity, hardware amortization, infrastructure, vendor margin — and shows where one deployment mode beats another. Built for the buy-vs-build conversation, not the production benchmark: what one prediction actually costs, which line item is doing the work, and a number documented well enough to survive a CFO review.

Deployment Mode

Three economic structures, three different shapes of bill. Self-hosted carries capex and energy; cloud GPU rental hides hardware behind an hourly rate with margin; public API hides everything behind a per-token line item.

Workload Preset

Three reference inference workloads. Pick the closest match, then edit any value.

Workload Profile

Volume, latency, and the prediction shape. Latency is the wall-clock seconds of GPU time per inference — for an LLM, this is dominated by output token count divided by tokens-per-second throughput.

Predictions per month i

Latency per prediction (sec) i

Prediction shape i

Compute Configuration

GPU class, server overhead, and the capex you are amortizing across the lifetime. For cloud GPU rental, leave the capex fields populated to match the underlying machine — vendor margin is applied separately below.

GPU class i

GPU power draw (W) i

Server overhead (W) i

Effective utilization (%) i

GPU capex ($) i

Server overhead capex ($) i

Useful lifetime (years) i

Electricity & Cooling

The line item the rest of the industry pretends is invisible. PUE folds in cooling, lighting, and facility losses — a 1.5 PUE means every watt of compute drags a half watt of overhead behind it.

Electricity rate ($/kWh) i

PUE (cooling overhead) i

Carbon intensity (gCO₂/kWh) i

Infrastructure & Margin

Network egress, storage, orchestration, and — for cloud GPU mode — the vendor markup applied on top of the underlying compute.

Network/storage cost (¢/prediction) i

Software / orchestration ($/mo) i

Cloud vendor margin (%) i

Public API Comparison

The flat-rate alternative. Per-token pricing converts directly to a per-prediction cost when you fix the input/output shape. Used in API mode as the primary cost; used in self-hosted mode as the breakeven reference line.

Input tokens / prediction

Output tokens / prediction

Input rate ($/M tokens)

Output rate ($/M tokens)

Headline KPIs

Cost per Prediction

—

Predictions per kWh

—

efficiency per watt

Annual Cost

—

at projected volume

Energy share

—

of CPP

Annual Carbon

—

tCO₂e

Self-host vs API breakeven

—

predictions per month

—

Decomposition & Sensitivity

CPP composition

Where one prediction's cost actually goes. The bars no vendor invoice ever shows separately.

Scalability curve

CPP versus monthly volume. Self-host is hyperbolic — fixed costs amortize away as volume rises. API is flat. Where they cross is your breakeven.

Energy decomposition

GPU draw, server overhead, and PUE-implied cooling. The cooling fraction is the line item edge data centers can't escape.

Sensitivity tornado

Annual cost delta from each individual lever. Tells you which assumption is doing the most work in your model.

How to use this — Step-by-step walkthrough

A worked example with sample data. Replace any value with your own as you go.

What this walks through. A 200-person SaaS company runs an internal RAG assistant: ~5M predictions per month, 2K input / 500 output tokens, currently on the public API. They want to know what self-hosting would do to the bill — and at what volume the answer would flip. Two runs (Public API → Self-hosted GPU), the delta between them, exportable as JSON or CSV at the bottom of this page. End result: $810K/year baseline collapses to $48K/year self-hosted, with a breakeven at ~92K predictions/month documenting where each path wins.

Step 1Pick deployment mode

Three economic structures. Run the API baseline first to set the comparison number — every other configuration gets measured against it.

Action	Sample value	Why
Click mode button	Public API	Establish baseline before exploring alternatives. The API is the path of least resistance and what most teams default to without examining unit economics.

Step 2Pick workload preset

Sets workload volume, latency, hardware defaults, electricity, and API rates in one click. Every value remains editable.

Action	Sample value	Why
Click preset button	Enterprise RAG · mid model	Closest match to a 5M/month internal assistant on a Sonnet-class model. Other presets cover edge inference (100K/mo) and large-scale production (100M/mo).

Step 3Verify workload profile

The numbers that drive every other calculation. Override the preset where it doesn't match measured reality.

Field	Definition	Sample	Source of value
Predictions per month	Inferences served at steady state	5,000,000	Product analytics: ~165K/day
Latency per prediction (sec)	Wall-clock GPU time per inference	1.5	p50 measured against current API
Prediction shape	Informational label only	LLM generation	Token-streaming workload

Step 4Verify API rates

Token shape × per-token rate = the API CPP that becomes the breakeven reference line. This is the number self-hosting must beat.

Field	Sample	Source of value
Input tokens / prediction	2,000	System prompt + RAG context + user query
Output tokens / prediction	500	Average model response length
Input rate ($/M)	3	Anthropic Sonnet-class list price
Output rate ($/M)	15	Anthropic Sonnet-class list price

Step 5Read Run 1 — API baseline

Scroll up to the KPI tile row. With the values above, the calculator returns:

Run 1 — Public API

Cost per prediction	1.35¢
Annual cost @ 5M/mo	$810,000
Predictions per kWh	— (mode-dependent)
Energy share	— (hidden in vendor margin)
Annual carbon	— (mode-dependent)

What you can't see in this run. Where the watts went. Where the margin went. Whether the bill is 80% energy or 80% vendor profit. The point of switching modes next is not necessarily to migrate — it's to make those line items visible so the API decision stops being a black box.

Step 6Switch to self-hosted mode

The compute, electricity, and infrastructure sections light up. Cloud margin auto-resets to 0%. The calculator now models the full unit cost of producing one prediction in-house.

Action	Sample value	Why
Click mode button	Self-hosted GPU	Make the line items visible. Run the same workload through a transparent cost stack instead of a black-box per-token rate.

Step 7Verify compute configuration

GPU class drives both power draw and capex; both are editable independently. Utilization is the most-fudged input — be honest with it. An idle GPU still amortizes capex but produces no predictions.

Field	Sample	Source of value
GPU class	H100	Standard inference workhorse for mid-tier LLMs
GPU power draw (W)	700	NVIDIA H100 SXM5 datasheet TDP
Server overhead (W)	200	Per-GPU share of host CPU, RAM, NICs, fans, storage
Effective utilization (%)	70	Steady-state production midpoint; bursty workloads run lower
GPU capex ($)	30,000	Enterprise channel street price for H100 SXM5
Server overhead capex ($)	5,000	1/8 of typical 8-GPU HGX node non-GPU cost
Useful lifetime (years)	4	Standard depreciation horizon for production GPUs

Step 8Verify electricity & cooling

The line item industry pricing pretends doesn't exist. PUE folds in cooling and facility losses; carbon intensity is optional and only affects the carbon KPI.

Field	Sample	Source of value
Electricity rate ($/kWh)	0.12	US industrial average; PPA contracts can cut this in half
PUE (cooling overhead)	1.4	Typical enterprise data center; hyperscale runs 1.1–1.3
Carbon intensity (gCO₂/kWh)	380	US grid average; varies dramatically by region and time of day

Step 9Verify infrastructure & margin

The smaller line items. At enterprise volume they round to noise; at edge volume the software fixed cost can dominate.

Field	Sample	Source of value
Network/storage (¢/prediction)	0.005	Order of magnitude for vector DB read + egress + blob I/O
Software / orchestration ($/mo)	500	Serving framework + observability + gateway
Cloud vendor margin (%)	0	Self-hosted = no margin; cloud GPU rental would set this to 30–50%

Step 10Read Run 2 — Self-hosted readout

The KPI row updates immediately. Scroll up to read it.

Run 2 — Self-hosted GPU

Cost per prediction	0.081¢
Annual cost @ 5M/mo	$48,454
Predictions per kWh	~1,900
Energy share	7.8% of CPP
Annual carbon	~12 tCO₂e (31.5 MWh × 380 g)
Breakeven volume	~92,000 predictions/mo

The breakeven is the framework's payoff. Self-host CPP equals API CPP at ~92K predictions per month. Below that, the API wins — capex amortizes over too few inferences. Above it, self-host wins by a margin that compounds with every additional prediction. At 5M/mo, the company is 54× past breakeven. The decision isn't whether to self-host — it's how fast.

Step 11Compare runs side by side

Both numbers in one frame. The delta is the dollar value of the deployment decision at this volume.

Run 1 — Public API

CPP	1.35¢
Annual	$810,000
Energy share	— (hidden)
Carbon	— (hidden)

Run 2 — Self-hosted

CPP	0.081¢
Annual	$48,454
Energy share	7.8%
Carbon	~12 tCO₂e

Delta: $761,546 saved annually — a 94% reduction off the API baseline. CPP is 16.7× lower per prediction.

What the framework just made visible. Hardware amortization is 74% of self-hosted CPP. Energy is 8%. The headline isn't "electricity is cheap" — at this volume it isn't even the largest line item. The headline is that the API was charging 16× the underlying production cost, and the markup was invisible until you ran the alternative model. Self-hosting is the right answer here, but the deeper finding is that without CPP you would not have known by how much.

Step 12Read the four charts

Numbers tell you the size of the bill. Charts tell you what's driving it and where the levers are.

CPP composition

Hardware bar dominates. Energy is small but visible. Cooling (PUE overhead) is roughly half the IT-load energy — the line item often forgotten in back-of-envelope math.

Scalability curve

Self-host CPP collapses as volume rises (hyperbolic). API line is flat. Crossing point is breakeven; the height of the gap above breakeven is the savings rate per prediction.

Energy decomposition

Doughnut: GPU draw / server overhead / cooling. At PUE 1.4, cooling is 29% of total energy — every watt of compute drags 0.4W of facilities behind it.

Sensitivity tornado

Annual $ delta from each lever applied individually. Tells you which assumption matters most — and where to focus measurement effort before committing.

Step 13Stress-test the assumption you trust least

Triple the electricity rate from $0.12 to $0.36/kWh — simulating a high-cost grid, no PPA, peak pricing exposure.

Field	Stress value	Result
Electricity rate ($/kWh)	0.36	CPP rises from 0.081¢ to 0.093¢ (+15.6%); annual goes from $48,454 to $56,022; energy share jumps from 8% to 20%

Why electricity belongs on the balance sheet anyway. CPP is per-prediction. Aggregate matters. This 5M/month workload burns 31.5 MWh/year — invisible at the unit, real at the meter. Move to 5B/month with frontier models and the same configuration burns 31.5 GWh — a PPA conversation, a balance-sheet item, a procurement workstream. CPP is the lens that catches both regimes in one number.

Step 14Export

Capture configuration and computed results for finance review or board memo.

Format	What it's for
Download JSON	Full nested structure. Re-importable into the calculator if archived. Best for programmatic use or version control.
Download CSV	Flat key/value file. Opens directly in Excel or Sheets — no Power Query needed. Best for finance modeling, board memos, audit trails.

What you walk away with

An annual operating-cost number for both deployment paths, with documented assumptions. A breakeven volume that tells you whether the migration is on the table or theoretical. A sensitivity reading that surfaces which input matters most. A reproducible artifact for finance review. Token economics, capital allocation, and operating cost are now connected through a single unit.

Export

Download the current configuration. JSON preserves the full structure for re-import. CSV is a flat key/value file that opens directly in Excel or Sheets — useful for finance modeling and audit trails.

Full Assumptions & Sourcing

Show every default value, its source, and its rationale

Value	Default	Source & rationale
H100 SXM power draw	700 W	NVIDIA H100 SXM5 datasheet TDP. Real production draw under sustained inference load is typically 70–90% of TDP — the GPU rarely runs at thermal limit continuously.
H200 SXM power draw	700 W	NVIDIA H200 datasheet. Same TDP envelope as H100; difference is HBM3e capacity and bandwidth, not power.
A100 80GB power draw	400 W	NVIDIA A100 SXM4 80GB datasheet TDP.
L40S power draw	350 W	NVIDIA L40S datasheet TDP. Common choice for cost-optimized inference where HBM is overkill.
L4 power draw	72 W	NVIDIA L4 datasheet TDP. Edge-class inference accelerator.
H100 capex	$30,000	List/street pricing typical for SXM5 modules through enterprise channels. Negotiated volume pricing varies considerably.
Server overhead capex	$5,000 / GPU	Per-GPU share of chassis, dual CPU, 1–2TB RAM, NVMe, networking. For an 8-GPU HGX node, divide total non-GPU server cost by 8.
Server overhead power	200 W / GPU	Per-GPU share of host CPU, RAM, NICs, fans, storage. 8-GPU nodes typically draw 1.5–2 kW outside the GPUs themselves.
Useful lifetime	4 years	Common depreciation horizon for production GPU clusters. Tax-accounting useful life often differs (Meta and Microsoft have publicly extended to 6 years for accounting purposes).
Effective utilization	70%	Steady-state production. Idle time is time GPUs amortize capex with no predictions returned. High-volume API providers achieve 80–90%; bursty enterprise workloads run 40–60%.
Electricity rate	$0.12 / kWh	Approximate US industrial average. Hyperscalers with long-term PPAs reach $0.04–0.07; high-cost regions and edge sites exceed $0.20.
PUE	1.4	Typical enterprise data center. Modern hyperscale runs 1.1–1.3 (Google reports fleet PUE ~1.10). Edge / on-premises closets often 1.6–1.8+.
Grid carbon intensity	380 gCO₂ / kWh	US grid average; varies dramatically by region and time of day. CAISO daytime can be 200; coal-heavy grids exceed 700. Ember and ElectricityMap publish location-specific data.
Cloud vendor margin	0% (self-host) / 30–50% (cloud)	Markup over underlying energy + hardware cost in cloud GPU rental. Reverse-engineered from public hourly rates against datasheet TDP and street capex.
Network / storage / prediction	0.005¢	Order-of-magnitude estimate for vector DB read + egress + blob I/O on a typical RAG inference. Highly implementation-dependent; replace with measured value.
Software / orchestration	$500/mo	Order-of-magnitude for serving framework + observability + gateway. Material for small workloads, negligible at scale.
API rates (Sonnet-class default)	$3 in / $15 out per M tokens	Anthropic published list price for mid-tier model. Editable to match negotiated rates or alternative vendors.
Edge preset	100K/mo, L4, 0.3s, $0.10/kWh	Small classifier or embedded chatbot at branch / store-level. Low volume; hardware amortization dominates CPP.
Enterprise preset	5M/mo, H100, 1.5s, $0.12/kWh	Internal RAG assistant or coding tool at mid-to-large company. The most common shape.
Production preset	100M/mo, H100, 2s, $0.15/kWh	Customer-facing product feature at scale. Energy starts to register as a real line item.