AI Infrastructure Economics Series · Tool 4

Cost Per Prediction (CPP) Calculator

AI's relevance is not determined by hype, but by its unit economics. This calculator decomposes a single inference into its real cost drivers — electricity, hardware amortization, infrastructure, vendor margin — and shows where one deployment mode beats another. Built for the buy-vs-build conversation, not the production benchmark: what one prediction actually costs, which line item is doing the work, and a number documented well enough to survive a CFO review.

Why this tool exists. Token pricing tells you what a model vendor charges. CPP tells you what a prediction actually costs to produce — including the watts your competitors are pretending are free. The defaults below are starting points, not claims; the assumptions table at the bottom shows the source of every number. Replace any value with your own measurement and the model recomputes live.

Three economic structures, three different shapes of bill. Self-hosted carries capex and energy; cloud GPU rental hides hardware behind an hourly rate with margin; public API hides everything behind a per-token line item.

Three reference inference workloads. Pick the closest match, then edit any value.

Volume, latency, and the prediction shape. Latency is the wall-clock seconds of GPU time per inference — for an LLM, this is dominated by output token count divided by tokens-per-second throughput.
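As a rough illustration of that relationship, the sketch below estimates decode-dominated latency from output length and throughput. The function name and the 333 tokens/s throughput are hypothetical, chosen only so that a 500-token response lands near the 1.5 s used in the worked example further down; prefill and queueing time are ignored.

```python
def decode_latency_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Approximate GPU seconds per inference when token generation dominates."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


# Hypothetical throughput; replace with a measured value for your model and GPU.
latency = decode_latency_seconds(output_tokens=500, tokens_per_second=333.0)
print(f"{latency:.2f} s per prediction")
```

A measured p50 latency is still the better input when you have one; this only helps sanity-check a number before it goes into the model.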

GPU class, server overhead, and the capex you are amortizing across the lifetime. For cloud GPU rental, leave the capex fields populated to match the underlying machine — vendor margin is applied separately below.

The line item the rest of the industry pretends is invisible. PUE folds in cooling, lighting, and facility losses — a 1.5 PUE means every watt of compute drags a half watt of overhead behind it.

Network egress, storage, orchestration, and — for cloud GPU mode — the vendor markup applied on top of the underlying compute.

The flat-rate alternative. Per-token pricing converts directly to a per-prediction cost when you fix the input/output shape. Used in API mode as the primary cost; used in self-hosted mode as the breakeven reference line.

Predictions per kWh: efficiency per unit of energy
Annual cost: at projected volume
Energy share: of CPP
Annual carbon: tCO₂e
Self-host vs API breakeven: predictions per month

CPP composition

Where one prediction's cost actually goes. The bars no vendor invoice ever shows separately.

Scalability curve

CPP versus monthly volume. Self-host is hyperbolic — fixed costs amortize away as volume rises. API is flat. Where they cross is your breakeven.

Energy decomposition

GPU draw, server overhead, and PUE-implied cooling. The cooling fraction is the line item edge data centers can't escape.

Sensitivity tornado

Annual cost delta from each individual lever. Tells you which assumption is doing the most work in your model.

A worked example with sample data. Replace any value with your own as you go.

What this walks through. A 200-person SaaS company runs an internal RAG assistant: ~5M predictions per month, 2K input / 500 output tokens, currently on the public API. They want to know what self-hosting would do to the bill — and at what volume the answer would flip. Two runs (Public API → Self-hosted GPU), the delta between them, exportable as JSON or CSV at the bottom of this page. End result: $810K/year baseline collapses to $48K/year self-hosted, with a breakeven at ~92K predictions/month documenting where each path wins.
Step 1 · Pick deployment mode

Three economic structures. Run the API baseline first to set the comparison number — every other configuration gets measured against it.

Action | Sample value | Why
Click mode button | Public API | Establish the baseline before exploring alternatives. The API is the path of least resistance and what most teams default to without examining unit economics.

Step 2 · Pick workload preset

Sets workload volume, latency, hardware defaults, electricity, and API rates in one click. Every value remains editable.

Action | Sample value | Why
Click preset button | Enterprise RAG · mid model | Closest match to a 5M/month internal assistant on a Sonnet-class model. Other presets cover edge inference (100K/mo) and large-scale production (100M/mo).

Step 3 · Verify workload profile

The numbers that drive every other calculation. Override the preset where it doesn't match measured reality.

Field | Definition | Sample | Source of value
Predictions per month | Inferences served at steady state | 5,000,000 | Product analytics: ~165K/day
Latency per prediction (sec) | Wall-clock GPU time per inference | 1.5 | p50 measured against current API
Prediction shape | Informational label only | LLM generation | Token-streaming workload

Step 4 · Verify API rates

Token shape × per-token rate = the API CPP that becomes the breakeven reference line. This is the number self-hosting must beat.

Field | Sample | Source of value
Input tokens / prediction | 2,000 | System prompt + RAG context + user query
Output tokens / prediction | 500 | Average model response length
Input rate ($/M) | 3 | Anthropic Sonnet-class list price
Output rate ($/M) | 15 | Anthropic Sonnet-class list price
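The token-shape arithmetic is short enough to check by hand. A minimal sketch, using the sample values from the table; the function name is illustrative and not part of the calculator:

```python
def api_cost_per_prediction(input_tokens: int, output_tokens: int,
                            input_rate_per_m: float, output_rate_per_m: float) -> float:
    """Dollar cost of one prediction at per-million-token list rates."""
    return (input_tokens * input_rate_per_m
            + output_tokens * output_rate_per_m) / 1_000_000


api_cpp = api_cost_per_prediction(2_000, 500, 3.0, 15.0)
annual_api_cost = api_cpp * 5_000_000 * 12  # 5M predictions per month

print(f"API CPP: {api_cpp * 100:.2f} cents")   # API CPP: 1.35 cents
print(f"Annual: ${annual_api_cost:,.0f}")      # Annual: $810,000
```

Note that output tokens carry more than half of the cost here (0.75¢ of 1.35¢) despite being a quarter of the input length.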
Step 5 · Read Run 1: API baseline

Scroll up to the KPI tile row. With the values above, the calculator returns:

Run 1: Public API
Cost per prediction | 1.35¢
Annual cost @ 5M/mo | $810,000
Predictions per kWh | n/a (mode-dependent)
Energy share | n/a (hidden in vendor margin)
Annual carbon | n/a (mode-dependent)
What you can't see in this run. Where the watts went. Where the margin went. Whether the bill is 80% energy or 80% vendor profit. The point of switching modes next is not necessarily to migrate — it's to make those line items visible so the API decision stops being a black box.
Step 6 · Switch to self-hosted mode

The compute, electricity, and infrastructure sections light up. Cloud margin auto-resets to 0%. The calculator now models the full unit cost of producing one prediction in-house.

Action | Sample value | Why
Click mode button | Self-hosted GPU | Make the line items visible. Run the same workload through a transparent cost stack instead of a black-box per-token rate.

Step 7 · Verify compute configuration

GPU class drives both power draw and capex; both are editable independently. Utilization is the most-fudged input — be honest with it. An idle GPU still amortizes capex but produces no predictions.

Field | Sample | Source of value
GPU class | H100 | Standard inference workhorse for mid-tier LLMs
GPU power draw (W) | 700 | NVIDIA H100 SXM5 datasheet TDP
Server overhead (W) | 200 | Per-GPU share of host CPU, RAM, NICs, fans, storage
Effective utilization (%) | 70 | Steady-state production midpoint; bursty workloads run lower
GPU capex ($) | 30,000 | Enterprise channel street price for H100 SXM5
Server overhead capex ($) | 5,000 | 1/8 of typical 8-GPU HGX node non-GPU cost
Useful lifetime (years) | 4 | Standard depreciation horizon for production GPUs
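One way to turn these fields into a per-prediction hardware cost is straight-line amortization over GPU-seconds, scaled up by utilization so idle time is charged to the predictions that do get served. The page does not publish its formula, so treat this as an inferred sketch; the function name and the 365-day year are assumptions.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: no leap-year adjustment


def hardware_cost_per_prediction(capex_usd: float, lifetime_years: float,
                                 latency_s: float, utilization: float) -> float:
    """Amortized capex per prediction; idle time is spread over served predictions."""
    cost_per_gpu_second = capex_usd / (lifetime_years * SECONDS_PER_YEAR)
    return cost_per_gpu_second * latency_s / utilization


# GPU capex plus per-GPU server overhead capex from the table above.
hw_cpp = hardware_cost_per_prediction(30_000 + 5_000, 4, 1.5, 0.70)
print(f"Hardware: {hw_cpp * 100:.4f} cents per prediction")
```

Dropping utilization from 70% to 40% in this sketch raises the hardware line by 75%, which is why an optimistic utilization figure flatters the whole model.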
Step 8 · Verify electricity & cooling

The line item industry pricing pretends doesn't exist. PUE folds in cooling and facility losses; carbon intensity is optional and only affects the carbon KPI.

Field | Sample | Source of value
Electricity rate ($/kWh) | 0.12 | US industrial average; PPA contracts can cut this in half
PUE (cooling overhead) | 1.4 | Typical enterprise data center; hyperscale runs 1.1–1.3
Carbon intensity (gCO₂/kWh) | 380 | US grid average; varies dramatically by region and time of day
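The energy line can be sketched from the Step 7 power figures and the values above. This version charges only the active seconds of each prediction (no idle draw), an assumption that happens to match the 31.5 MWh/year quoted later in the walkthrough; the function name is illustrative.

```python
def energy_kwh_per_prediction(gpu_w: float, server_w: float,
                              pue: float, latency_s: float) -> float:
    """Facility-level energy for one prediction, in kWh."""
    facility_watts = (gpu_w + server_w) * pue    # IT load scaled by PUE
    return facility_watts * latency_s / 3_600_000  # watt-seconds to kWh


kwh = energy_kwh_per_prediction(gpu_w=700, server_w=200, pue=1.4, latency_s=1.5)
energy_cost = kwh * 0.12                      # dollars per prediction
annual_kwh = kwh * 5_000_000 * 12             # 5M predictions per month
annual_tco2e = annual_kwh * 380 / 1_000_000   # grams to tonnes

print(f"Energy per prediction: {kwh * 1000:.3f} Wh")
print(f"Energy cost: {energy_cost * 100:.4f} cents")
print(f"Predictions per kWh: {1 / kwh:,.0f}")
print(f"Annual: {annual_kwh / 1000:.1f} MWh, {annual_tco2e:.2f} tCO2e")
```

If your GPUs draw meaningful power while idle, divide the energy term by utilization as well; that is a modeling choice this sketch deliberately leaves out.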
Step 9 · Verify infrastructure & margin

The smaller line items. At enterprise volume they round to noise; at edge volume the software fixed cost can dominate.

Field | Sample | Source of value
Network/storage (¢/prediction) | 0.005 | Order of magnitude for vector DB read + egress + blob I/O
Software / orchestration ($/mo) | 500 | Serving framework + observability + gateway
Cloud vendor margin (%) | 0 | Self-hosted = no margin; cloud GPU rental would set this to 30–50%
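With every input now on the table, the full cost stack can be assembled. The sketch below is a model inferred from the sample inputs, not the calculator's published source: hardware is amortized over utilized GPU-seconds, energy is charged for active seconds only, and the vendor margin is applied to hardware plus energy (as the assumptions table describes it). All names are illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: 365-day year


def cpp_components(monthly_volume: int, latency_s: float = 1.5,
                   utilization: float = 0.70, capex_usd: float = 35_000,
                   lifetime_years: float = 4, it_watts: float = 900,
                   pue: float = 1.4, usd_per_kwh: float = 0.12,
                   network_usd: float = 0.00005,
                   software_usd_per_month: float = 500.0,
                   vendor_margin: float = 0.0) -> dict:
    """Per-prediction cost in dollars, broken out by line item."""
    hardware = (capex_usd / (lifetime_years * SECONDS_PER_YEAR)
                * latency_s / utilization)
    energy = it_watts * pue * latency_s / 3_600_000 * usd_per_kwh
    markup = 1 + vendor_margin  # 0.0 for self-hosted, e.g. 0.4 for cloud rental
    return {
        "hardware": hardware * markup,
        "energy": energy * markup,
        "network_storage": network_usd,
        "software": software_usd_per_month / monthly_volume,
    }


parts = cpp_components(monthly_volume=5_000_000)
cpp = sum(parts.values())
for name, usd in parts.items():
    print(f"{name:16s} {usd * 100:.4f} cents  {usd / cpp:6.1%}")
print(f"CPP {cpp * 100:.3f} cents, annual ${cpp * 5_000_000 * 12:,.0f}")
```

Under these assumptions the total comes out at about 0.081¢ per prediction, with hardware near 74% and energy near 8% of the stack. Passing `vendor_margin=0.4` approximates cloud GPU rental with the same underlying machine.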
Step 10 · Read Run 2: self-hosted readout

The KPI row updates immediately. Scroll up to read it.

Run 2: Self-hosted GPU
Cost per prediction | 0.081¢
Annual cost @ 5M/mo | $48,454
Predictions per kWh | ~1,900
Energy share | 7.8% of CPP
Annual carbon | ~12 tCO₂e (31.5 MWh × 380 g/kWh)
Breakeven volume | ~92,000 predictions/mo
The breakeven is the framework's payoff. Self-host CPP equals API CPP at ~92K predictions per month. Below that, the API wins because capex amortizes over too few inferences. Above it, self-host wins, and the total savings grow with every additional prediction. At 5M/mo, the company is 54× past breakeven. The decision isn't whether to self-host; it's how fast.
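The page does not spell out how the breakeven is computed. One reading that lands on the ~92K figure treats a single GPU's monthly amortization and the software bill as fixed costs (you need at least one machine no matter how little traffic you serve) and only energy and network/storage as per-prediction costs. The sketch below follows that reading; it is an inference from the published numbers, and the names are illustrative.

```python
def breakeven_monthly_volume(api_cpp: float, capex_usd: float = 35_000,
                             lifetime_years: float = 4,
                             software_usd_per_month: float = 500.0,
                             energy_usd: float = 0.000063,
                             network_usd: float = 0.00005) -> float:
    """Monthly volume at which self-host CPP equals API CPP."""
    fixed_per_month = capex_usd / (lifetime_years * 12) + software_usd_per_month
    variable = energy_usd + network_usd
    if api_cpp <= variable:
        return float("inf")  # the API is cheaper at every volume
    return fixed_per_month / (api_cpp - variable)


volume = breakeven_monthly_volume(api_cpp=0.0135)
print(f"Breakeven: ~{volume:,.0f} predictions/month")
print(f"5M/month is {5_000_000 / volume:.0f}x past breakeven")
```

A different treatment of hardware (for example, charging it per utilized GPU-second at every volume) moves the crossing point, so record which convention you used alongside the number.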
Step 11 · Compare runs side by side

Both numbers in one frame. The delta is the dollar value of the deployment decision at this volume.

Metric | Run 1: Public API | Run 2: Self-hosted
CPP | 1.35¢ | 0.081¢
Annual | $810,000 | $48,454
Energy share | n/a (hidden) | 7.8%
Carbon | n/a (hidden) | ~12 tCO₂e
Delta: $761,546 saved annually — a 94% reduction off the API baseline. CPP is 16.7× lower per prediction.
What the framework just made visible. Hardware amortization is 74% of self-hosted CPP. Energy is 8%. The headline isn't "electricity is cheap"; at this volume energy isn't even the largest line item. The headline is that the API was charging nearly 17× the underlying production cost, and the markup was invisible until you ran the alternative model. Self-hosting is the right answer here, but the deeper finding is that without CPP you would not have known by how much.
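The comparison figures follow directly from the two readouts. A quick check, using the rounded values as displayed:

```python
api_annual, self_host_annual = 810_000, 48_454       # dollars per year
api_cpp_cents, self_host_cpp_cents = 1.35, 0.081     # as displayed in the readouts

savings = api_annual - self_host_annual
reduction = savings / api_annual
ratio = api_cpp_cents / self_host_cpp_cents

print(f"Saved annually: ${savings:,}")
print(f"Reduction: {reduction:.0%}")
print(f"CPP ratio: {ratio:.1f}x")
```

The ratio is sensitive to rounding in the self-hosted CPP, so quote it from the unrounded model output when precision matters.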
Step 12 · Read the four charts

Numbers tell you the size of the bill. Charts tell you what's driving it and where the levers are.

CPP composition: The hardware bar dominates. Energy is small but visible. Cooling (PUE overhead) adds 40% on top of IT-load energy at PUE 1.4, the line item often forgotten in back-of-envelope math.
Scalability curve: Self-host CPP collapses as volume rises (hyperbolic). The API line is flat. The crossing point is breakeven; the height of the gap above breakeven is the savings rate per prediction.
Energy decomposition: Doughnut of GPU draw / server overhead / cooling. At PUE 1.4, cooling is 29% of total energy: every watt of compute drags 0.4 W of facilities behind it.
Sensitivity tornado: Annual $ delta from each lever applied individually. Shows which assumption matters most, and where to focus measurement effort before committing.
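The doughnut split follows from the power figures and PUE alone. A minimal sketch, assuming cooling is everything PUE adds on top of the IT load; the function name is illustrative:

```python
def energy_shares(gpu_w: float, server_w: float, pue: float) -> dict:
    """Fraction of facility energy attributable to each slice of the doughnut."""
    it_load = gpu_w + server_w
    total = it_load * pue
    return {
        "gpu": gpu_w / total,
        "server_overhead": server_w / total,
        "cooling": it_load * (pue - 1) / total,
    }


shares = energy_shares(gpu_w=700, server_w=200, pue=1.4)
for name, share in shares.items():
    print(f"{name:16s} {share:.1%}")
```

Cooling's share of the total is (PUE − 1) / PUE, so it depends only on the facility, not on which GPU you pick.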
Step 13 · Stress-test the assumption you trust least

Triple the electricity rate from $0.12 to $0.36/kWh — simulating a high-cost grid, no PPA, peak pricing exposure.

Field | Stress value | Result
Electricity rate ($/kWh) | 0.36 | CPP rises from 0.081¢ to 0.093¢ (+15.6%); annual goes from $48,454 to about $56,014; energy share jumps from 8% to 20%
Why electricity belongs on the balance sheet anyway. CPP is per-prediction. Aggregate matters. This 5M/month workload burns 31.5 MWh/year — invisible at the unit, real at the meter. Move to 5B/month with frontier models and the same configuration burns 31.5 GWh — a PPA conversation, a balance-sheet item, a procurement workstream. CPP is the lens that catches both regimes in one number.
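The same stress test, and a tornado-style sweep over other levers, can be sketched by overriding one input at a time. The cost model here is inferred from the sample inputs rather than taken from the calculator, and the ±20% swing is an arbitrary illustration; it reproduces the CPP and energy-share movement above and an annual cost of about $56K at $0.36/kWh.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: 365-day year

BASELINE = {
    "monthly_volume": 5_000_000, "latency_s": 1.5, "utilization": 0.70,
    "capex_usd": 35_000, "lifetime_years": 4, "it_watts": 900, "pue": 1.4,
    "usd_per_kwh": 0.12, "network_usd": 0.00005, "software_usd_per_month": 500.0,
}


def cpp_usd(p: dict) -> float:
    """Self-hosted cost per prediction in dollars (inferred model)."""
    hardware = (p["capex_usd"] / (p["lifetime_years"] * SECONDS_PER_YEAR)
                * p["latency_s"] / p["utilization"])
    energy = p["it_watts"] * p["pue"] * p["latency_s"] / 3_600_000 * p["usd_per_kwh"]
    return (hardware + energy + p["network_usd"]
            + p["software_usd_per_month"] / p["monthly_volume"])


def annual_cost(p: dict) -> float:
    return cpp_usd(p) * p["monthly_volume"] * 12


def tornado(levers: list, swing: float = 0.20) -> list:
    """Annual cost delta for each lever moved down and up by `swing`, largest first."""
    base = annual_cost(BASELINE)
    rows = []
    for name in levers:
        low = annual_cost({**BASELINE, name: BASELINE[name] * (1 - swing)}) - base
        high = annual_cost({**BASELINE, name: BASELINE[name] * (1 + swing)}) - base
        rows.append((name, low, high))
    return sorted(rows, key=lambda r: max(abs(r[1]), abs(r[2])), reverse=True)


stressed = {**BASELINE, "usd_per_kwh": 0.36}
print(f"Annual at $0.36/kWh: ${annual_cost(stressed):,.0f}")
for name, low, high in tornado(["utilization", "capex_usd", "usd_per_kwh", "pue"]):
    print(f"{name:12s} {low:+10,.0f} {high:+10,.0f}")
```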
Step 14 · Export

Capture configuration and computed results for finance review or board memo.

Format | What it's for
Download JSON | Full nested structure. Re-importable into the calculator if archived. Best for programmatic use or version control.
Download CSV | Flat key/value file. Opens directly in Excel or Sheets; no Power Query needed. Best for finance modeling, board memos, audit trails.
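To make the difference between the two formats concrete, here is a minimal sketch that writes the same configuration both ways. The field names and nesting are hypothetical; the calculator's actual export schema is not documented on this page.

```python
import csv
import io
import json


def flatten(nested: dict, prefix: str = "") -> dict:
    """Collapse a nested dict into dotted key/value pairs for a flat CSV."""
    flat = {}
    for key, value in nested.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


# Hypothetical structure, for illustration only.
config = {
    "mode": "self_hosted",
    "workload": {"predictions_per_month": 5_000_000, "latency_s": 1.5},
    "results": {"cpp_cents": 0.081, "annual_cost_usd": 48_454},
}

json_text = json.dumps(config, indent=2)  # nested, round-trips via json.loads

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["key", "value"])
writer.writerows(flatten(config).items())
csv_text = buffer.getvalue()              # one row per setting

print(csv_text)
```

The JSON form keeps the hierarchy and types intact, which is what makes re-import possible; the CSV form trades that away for something a spreadsheet opens without any transformation.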
What you walk away with

An annual operating-cost number for both deployment paths, with documented assumptions. A breakeven volume that tells you whether the migration is on the table or theoretical. A sensitivity reading that surfaces which input matters most. A reproducible artifact for finance review. Token economics, capital allocation, and operating cost are now connected through a single unit.

Download the current configuration. JSON preserves the full structure for re-import. CSV is a flat key/value file that opens directly in Excel or Sheets — useful for finance modeling and audit trails.

Show every default value, its source, and its rationale
Value | Default | Source & rationale
H100 SXM power draw | 700 W | NVIDIA H100 SXM5 datasheet TDP. Real production draw under sustained inference load is typically 70–90% of TDP; the GPU rarely runs at its thermal limit continuously.
H200 SXM power draw | 700 W | NVIDIA H200 datasheet. Same TDP envelope as H100; the difference is HBM3e capacity and bandwidth, not power.
A100 80GB power draw | 400 W | NVIDIA A100 SXM4 80GB datasheet TDP.
L40S power draw | 350 W | NVIDIA L40S datasheet TDP. Common choice for cost-optimized inference where HBM is overkill.
L4 power draw | 72 W | NVIDIA L4 datasheet TDP. Edge-class inference accelerator.
H100 capex | $30,000 | List/street pricing typical for SXM5 modules through enterprise channels. Negotiated volume pricing varies considerably.
Server overhead capex | $5,000 / GPU | Per-GPU share of chassis, dual CPU, 1–2 TB RAM, NVMe, networking. For an 8-GPU HGX node, divide total non-GPU server cost by 8.
Server overhead power | 200 W / GPU | Per-GPU share of host CPU, RAM, NICs, fans, storage. 8-GPU nodes typically draw 1.5–2 kW outside the GPUs themselves.
Useful lifetime | 4 years | Common depreciation horizon for production GPU clusters. Tax-accounting useful life often differs (Meta and Microsoft have publicly extended server useful life to roughly 5–6 years for accounting purposes).
Effective utilization | 70% | Steady-state production. Idle time is time GPUs amortize capex with no predictions returned. High-volume API providers achieve 80–90%; bursty enterprise workloads run 40–60%.
Electricity rate | $0.12 / kWh | Approximate US industrial average. Hyperscalers with long-term PPAs reach $0.04–0.07; high-cost regions and edge sites exceed $0.20.
PUE | 1.4 | Typical enterprise data center. Modern hyperscale runs 1.1–1.3 (Google reports fleet PUE ~1.10). Edge / on-premises closets often 1.6–1.8+.
Grid carbon intensity | 380 gCO₂ / kWh | US grid average; varies dramatically by region and time of day. CAISO daytime can be 200; coal-heavy grids exceed 700. Ember and Electricity Maps publish location-specific data.
Cloud vendor margin | 0% (self-host) / 30–50% (cloud) | Markup over underlying energy + hardware cost in cloud GPU rental. Reverse-engineered from public hourly rates against datasheet TDP and street capex.
Network / storage per prediction | 0.005¢ | Order-of-magnitude estimate for vector DB read + egress + blob I/O on a typical RAG inference. Highly implementation-dependent; replace with measured value.
Software / orchestration | $500/mo | Order-of-magnitude for serving framework + observability + gateway. Material for small workloads, negligible at scale.
API rates (Sonnet-class default) | $3 in / $15 out per M tokens | Anthropic published list price for mid-tier model. Editable to match negotiated rates or alternative vendors.
Edge preset | 100K/mo, L4, 0.3 s, $0.10/kWh | Small classifier or embedded chatbot at branch / store level. Low volume; hardware amortization dominates CPP.
Enterprise preset | 5M/mo, H100, 1.5 s, $0.12/kWh | Internal RAG assistant or coding tool at mid-to-large company. The most common shape.
Production preset | 100M/mo, H100, 2 s, $0.15/kWh | Customer-facing product feature at scale. Energy starts to register as a real line item.