AI's relevance is not determined by hype, but by its unit economics. This calculator decomposes a single inference into its real cost drivers — electricity, hardware amortization, infrastructure, vendor margin — and shows where one deployment mode beats another. Built for the buy-vs-build conversation, not the production benchmark: what one prediction actually costs, which line item is doing the work, and a number documented well enough to survive a CFO review.
Three economic structures, three different shapes of bill. Self-hosted carries capex and energy; cloud GPU rental hides hardware behind an hourly rate with margin; public API hides everything behind a per-token line item.
Three reference inference workloads. Pick the closest match, then edit any value.
Volume, latency, and the prediction shape. Latency is the wall-clock seconds of GPU time per inference — for an LLM, this is dominated by output token count divided by tokens-per-second throughput.
GPU class, server overhead, and the capex you are amortizing across the lifetime. For cloud GPU rental, leave the capex fields populated to match the underlying machine — vendor margin is applied separately below.
The line item the rest of the industry pretends is invisible. PUE folds in cooling, lighting, and facility losses — a 1.5 PUE means every watt of compute drags a half watt of overhead behind it.
Network egress, storage, orchestration, and — for cloud GPU mode — the vendor markup applied on top of the underlying compute.
The flat-rate alternative. Per-token pricing converts directly to a per-prediction cost when you fix the input/output shape. Used in API mode as the primary cost; used in self-hosted mode as the breakeven reference line.
A worked example with sample data. Replace any value with your own as you go.
Three economic structures. Run the API baseline first to set the comparison number — every other configuration gets measured against it.
| Action | Sample value | Why |
|---|---|---|
| Click mode button | Public API | Establish baseline before exploring alternatives. The API is the path of least resistance and what most teams default to without examining unit economics. |
Sets workload volume, latency, hardware defaults, electricity, and API rates in one click. Every value remains editable.
| Action | Sample value | Why |
|---|---|---|
| Click preset button | Enterprise RAG · mid model | Closest match to a 5M/month internal assistant on a Sonnet-class model. Other presets cover edge inference (100K/mo) and large-scale production (100M/mo). |
The numbers that drive every other calculation. Override the preset where it doesn't match measured reality.
| Field | Definition | Sample | Source of value |
|---|---|---|---|
| Predictions per month | Inferences served at steady state | 5,000,000 | Product analytics: ~165K/day |
| Latency per prediction (sec) | Wall-clock GPU time per inference | 1.5 | p50 measured against current API |
| Prediction shape | Informational label only | LLM generation | Token-streaming workload |
Token shape × per-token rate = the API CPP that becomes the breakeven reference line. This is the number self-hosting must beat.
| Field | Sample | Source of value |
|---|---|---|
| Input tokens / prediction | 2,000 | System prompt + RAG context + user query |
| Output tokens / prediction | 500 | Average model response length |
| Input rate ($/M) | 3 | Anthropic Sonnet-class list price |
| Output rate ($/M) | 15 | Anthropic Sonnet-class list price |
Scroll up to the KPI tile row. With the values above, the calculator returns:
| Cost per prediction | 1.35¢ |
| Annual cost @ 5M/mo | $810,000 |
| Predictions per kWh | — (mode-dependent) |
| Energy share | — (hidden in vendor margin) |
| Annual carbon | — (mode-dependent) |
The compute, electricity, and infrastructure sections light up. Cloud margin auto-resets to 0%. The calculator now models the full unit cost of producing one prediction in-house.
| Action | Sample value | Why |
|---|---|---|
| Click mode button | Self-hosted GPU | Make the line items visible. Run the same workload through a transparent cost stack instead of a black-box per-token rate. |
GPU class drives both power draw and capex; both are editable independently. Utilization is the most-fudged input — be honest with it. An idle GPU still amortizes capex but produces no predictions.
| Field | Sample | Source of value |
|---|---|---|
| GPU class | H100 | Standard inference workhorse for mid-tier LLMs |
| GPU power draw (W) | 700 | NVIDIA H100 SXM5 datasheet TDP |
| Server overhead (W) | 200 | Per-GPU share of host CPU, RAM, NICs, fans, storage |
| Effective utilization (%) | 70 | Steady-state production midpoint; bursty workloads run lower |
| GPU capex ($) | 30,000 | Enterprise channel street price for H100 SXM5 |
| Server overhead capex ($) | 5,000 | 1/8 of typical 8-GPU HGX node non-GPU cost |
| Useful lifetime (years) | 4 | Standard depreciation horizon for production GPUs |
The line item industry pricing pretends doesn't exist. PUE folds in cooling and facility losses; carbon intensity is optional and only affects the carbon KPI.
| Field | Sample | Source of value |
|---|---|---|
| Electricity rate ($/kWh) | 0.12 | US industrial average; PPA contracts can cut this in half |
| PUE (cooling overhead) | 1.4 | Typical enterprise data center; hyperscale runs 1.1–1.3 |
| Carbon intensity (gCO₂/kWh) | 380 | US grid average; varies dramatically by region and time of day |
The smaller line items. At enterprise volume they round to noise; at edge volume the software fixed cost can dominate.
| Field | Sample | Source of value |
|---|---|---|
| Network/storage (¢/prediction) | 0.005 | Order of magnitude for vector DB read + egress + blob I/O |
| Software / orchestration ($/mo) | 500 | Serving framework + observability + gateway |
| Cloud vendor margin (%) | 0 | Self-hosted = no margin; cloud GPU rental would set this to 30–50% |
The KPI row updates immediately. Scroll up to read it.
| Cost per prediction | 0.081¢ |
| Annual cost @ 5M/mo | $48,454 |
| Predictions per kWh | ~1,900 |
| Energy share | 7.8% of CPP |
| Annual carbon | ~12 tCO₂e (31.5 MWh × 380 g) |
| Breakeven volume | ~92,000 predictions/mo |
Both numbers in one frame. The delta is the dollar value of the deployment decision at this volume.
| CPP | 1.35¢ |
| Annual | $810,000 |
| Energy share | — (hidden) |
| Carbon | — (hidden) |
| CPP | 0.081¢ |
| Annual | $48,454 |
| Energy share | 7.8% |
| Carbon | ~12 tCO₂e |
Numbers tell you the size of the bill. Charts tell you what's driving it and where the levers are.
Triple the electricity rate from $0.12 to $0.36/kWh — simulating a high-cost grid, no PPA, peak pricing exposure.
| Field | Stress value | Result |
|---|---|---|
| Electricity rate ($/kWh) | 0.36 | CPP rises from 0.081¢ to 0.093¢ (+15.6%); annual goes from $48,454 to $56,022; energy share jumps from 8% to 20% |
Capture configuration and computed results for finance review or board memo.
| Format | What it's for |
|---|---|
| Download JSON | Full nested structure. Re-importable into the calculator if archived. Best for programmatic use or version control. |
| Download CSV | Flat key/value file. Opens directly in Excel or Sheets — no Power Query needed. Best for finance modeling, board memos, audit trails. |
An annual operating-cost number for both deployment paths, with documented assumptions. A breakeven volume that tells you whether the migration is on the table or theoretical. A sensitivity reading that surfaces which input matters most. A reproducible artifact for finance review. Token economics, capital allocation, and operating cost are now connected through a single unit.
Download the current configuration. JSON preserves the full structure for re-import. CSV is a flat key/value file that opens directly in Excel or Sheets — useful for finance modeling and audit trails.
| Value | Default | Source & rationale |
|---|---|---|
| H100 SXM power draw | 700 W | NVIDIA H100 SXM5 datasheet TDP. Real production draw under sustained inference load is typically 70–90% of TDP — the GPU rarely runs at thermal limit continuously. |
| H200 SXM power draw | 700 W | NVIDIA H200 datasheet. Same TDP envelope as H100; difference is HBM3e capacity and bandwidth, not power. |
| A100 80GB power draw | 400 W | NVIDIA A100 SXM4 80GB datasheet TDP. |
| L40S power draw | 350 W | NVIDIA L40S datasheet TDP. Common choice for cost-optimized inference where HBM is overkill. |
| L4 power draw | 72 W | NVIDIA L4 datasheet TDP. Edge-class inference accelerator. |
| H100 capex | $30,000 | List/street pricing typical for SXM5 modules through enterprise channels. Negotiated volume pricing varies considerably. |
| Server overhead capex | $5,000 / GPU | Per-GPU share of chassis, dual CPU, 1–2TB RAM, NVMe, networking. For an 8-GPU HGX node, divide total non-GPU server cost by 8. |
| Server overhead power | 200 W / GPU | Per-GPU share of host CPU, RAM, NICs, fans, storage. 8-GPU nodes typically draw 1.5–2 kW outside the GPUs themselves. |
| Useful lifetime | 4 years | Common depreciation horizon for production GPU clusters. Tax-accounting useful life often differs (Meta and Microsoft have publicly extended to 6 years for accounting purposes). |
| Effective utilization | 70% | Steady-state production. Idle time is time GPUs amortize capex with no predictions returned. High-volume API providers achieve 80–90%; bursty enterprise workloads run 40–60%. |
| Electricity rate | $0.12 / kWh | Approximate US industrial average. Hyperscalers with long-term PPAs reach $0.04–0.07; high-cost regions and edge sites exceed $0.20. |
| PUE | 1.4 | Typical enterprise data center. Modern hyperscale runs 1.1–1.3 (Google reports fleet PUE ~1.10). Edge / on-premises closets often 1.6–1.8+. |
| Grid carbon intensity | 380 gCO₂ / kWh | US grid average; varies dramatically by region and time of day. CAISO daytime can be 200; coal-heavy grids exceed 700. Ember and ElectricityMap publish location-specific data. |
| Cloud vendor margin | 0% (self-host) / 30–50% (cloud) | Markup over underlying energy + hardware cost in cloud GPU rental. Reverse-engineered from public hourly rates against datasheet TDP and street capex. |
| Network / storage / prediction | 0.005¢ | Order-of-magnitude estimate for vector DB read + egress + blob I/O on a typical RAG inference. Highly implementation-dependent; replace with measured value. |
| Software / orchestration | $500/mo | Order-of-magnitude for serving framework + observability + gateway. Material for small workloads, negligible at scale. |
| API rates (Sonnet-class default) | $3 in / $15 out per M tokens | Anthropic published list price for mid-tier model. Editable to match negotiated rates or alternative vendors. |
| Edge preset | 100K/mo, L4, 0.3s, $0.10/kWh | Small classifier or embedded chatbot at branch / store-level. Low volume; hardware amortization dominates CPP. |
| Enterprise preset | 5M/mo, H100, 1.5s, $0.12/kWh | Internal RAG assistant or coding tool at mid-to-large company. The most common shape. |
| Production preset | 100M/mo, H100, 2s, $0.15/kWh | Customer-facing product feature at scale. Energy starts to register as a real line item. |