
H200 Price in 2026: What You Actually Pay Per Hour

15 min read
Dhayabaran V
Barrack AI

The NVIDIA H200 rents for anywhere from $2.00 to $13.78 per GPU-hour in April 2026, a nearly 7x spread driven almost entirely by whether you rent a hyperscaler VM or a specialist bare-metal machine. That gap matters because the H200 is the same Hopper silicon as the H100 with a bigger, faster memory stack, so every dollar saved on the hardware flows straight to your tokens-per-dollar math. Since CoreWeave first brought H200 to general availability in August 2024, supply has broadened to 20+ providers, yet median on-demand pricing has actually climbed roughly 28% year-over-year as Llama 3 70B and Mixtral 8x22B inference workloads keep demand stubbornly high. The practical question is no longer "can I get an H200?" but "which tier of the market should I buy from, and is it still the right chip with Blackwell shipping?" This post answers both with current numbers.

What the H200 actually is, and why its memory is the whole point

The H200 is a memory refresh of the H100, not a new architecture. Both GPUs use the identical GH100 die with 16,896 CUDA cores and 528 fourth-generation Tensor Cores, and both deliver the same 1,979 TFLOPS of FP8 dense compute (3,958 TFLOPS with sparsity). What changed is the HBM stack: H200 SXM ships with 141 GB of HBM3e at 4.8 TB/s, versus the H100 SXM's 80 GB HBM3 at 3.35 TB/s. That's 76% more capacity and 43% more bandwidth inside the same 700W power envelope and SXM5 mechanical form factor.

The PCIe variant, H200 NVL, keeps the 141 GB HBM3e and 4.8 TB/s bandwidth but lowers max TDP to 600W, drops Tensor Core throughput roughly 15% (1,671 TFLOPS FP16 sparse, 3,341 TFLOPS FP8 sparse), and ships as a double-wide passive card designed for air-cooled enterprise servers. NVIDIA bundles a 5-year NVIDIA AI Enterprise subscription with every NVL card. Both variants use PCIe Gen5 x16 and fourth-generation NVLink at 900 GB/s bidirectional, and both support MIG partitioning (up to seven instances at ~18 GB each on SXM).

| Spec | H100 SXM | H200 SXM | H200 NVL (PCIe) | B200 (for context) |
|---|---|---|---|---|
| Memory | 80 GB HBM3 | 141 GB HBM3e | 141 GB HBM3e | 192 GB HBM3e |
| Bandwidth | 3.35 TB/s | 4.8 TB/s | 4.8 TB/s | 8.0 TB/s |
| TDP | 700 W | 700 W | 600 W | ~1,000 W |
| FP8 Tensor (sparse) | 3,958 TFLOPS | 3,958 TFLOPS | 3,341 TFLOPS | 10,000 TFLOPS |
| FP16/BF16 (sparse) | 1,979 TFLOPS | 1,979 TFLOPS | 1,671 TFLOPS | 5,000 TFLOPS |
| NVLink | 900 GB/s | 900 GB/s | 900 GB/s (via bridge) | 1,800 GB/s |
| PCIe | Gen5 x16 | Gen5 x16 | Gen5 x16 | Gen5 x16 |

The practical consequence is stark. For compute-bound prefill on models that already fit in 80 GB, H200 offers essentially zero uplift over H100. For memory-bound decode and any workload where the weights or KV cache exceed 80 GB, H200 is a different class of chip entirely.
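
A quick roofline-style sanity check makes the decode side concrete. The sketch below is our own napkin math, not one of NVIDIA's benchmark configs: it assumes a 70B-parameter model at FP8 and that each decoded token must stream every weight from HBM once, so the memory bus sets a hard floor on per-token latency. The ratio between the two floors is exactly the 43% bandwidth gap.

```python
# Back-of-envelope check on why decode is bandwidth-bound.
# Assumptions (ours, not NVIDIA's benchmark config): a 70B-parameter model
# quantized to FP8 (~70 GB of weights), batch size 1, and every decoded
# token streaming all weights from HBM once. Treat the result as a floor
# on per-token latency, not a throughput prediction.

MODEL_BYTES = 70e9  # ~70B params at 1 byte/param (FP8)

GPUS = {
    "H100 SXM": 3.35e12,  # HBM3 bandwidth in bytes/s
    "H200 SXM": 4.80e12,  # HBM3e bandwidth in bytes/s
}

for name, bw in GPUS.items():
    ms_per_token = MODEL_BYTES / bw * 1e3
    print(f"{name}: >= {ms_per_token:.1f} ms/token "
          f"(~{1e3 / ms_per_token:.0f} tok/s ceiling at batch 1)")
```

Batching hides some of that floor during prefill, but decode keeps re-reading the weights, which is why the bandwidth delta, not the identical compute, is what you are paying for.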

H200 cloud pricing across 20+ providers, April 2026

Pricing sorts cleanly into three tiers. The data below reflects on-demand rates, not reserved or committed pricing, as of mid-April 2026. Where providers sell 8-GPU nodes exclusively, we've normalized to per-GPU-hour.

| Provider | $/GPU-hr | Form factor | Billing | Egress |
|---|---|---|---|---|
| Barrack AI (US, Canada, EU, India, APAC) | from $2.00 | Bare metal, single-tenant | Monthly/Yearly | Free |
| FluidStack | $2.30 (commit) | Bare metal | Hourly | Free |
| GMI Cloud | $2.50-3.35 | Bare metal | Hourly | -- |
| Genesis Cloud | $2.80 | Bare metal (8-GPU min) | Hourly | Free |
| Paperspace / DO Gradient | $3.44 | VM + bare metal | Hourly | Free |
| Civo | $3.49 | K8s VM | Hourly | Free |
| Nebius | $3.50 | KubeVirt VM (near-BM) | Per-minute | $0.015/GiB std |
| Jarvislabs | $3.80 | VM (1-GPU) | Per-minute | -- |
| RunPod (Secure) | $3.99 | Container | Per-second | Varies |
| Crusoe | $4.29 | Light VM | Per-minute | Free |
| Spheron | $4.54 | Bare metal | Per-minute | -- |
| AWS p5e.48xlarge | $4.98 (Capacity Blocks) | Nitro VM | 1-day min | Standard AWS |
| Together AI | $4.99 | Dedicated VM | Per-minute | -- |
| Fireworks AI | ~$6.00 | Dedicated VM | Per-second | -- |
| CoreWeave | $6.31 | Bare-metal K8s | Hourly/minute | -- |
| Oracle BM.GPU.H200.8 | $10.00 | Bare metal | Per-second | 10 TB/mo free |
| Azure ND H200 v5 | $10.60-13.78 | Hyper-V VM | Per-second | Standard Azure |
| GCP a3-ultragpu-8g | ~$10.87 | VM | Per-second | Standard GCP |

A note on Vast.ai: Vast.ai lists H200 rates around ~$2.29-2.32/hr, but it operates as a peer-to-peer marketplace where individual hosts list their own hardware. You're renting from strangers, not from a company operating its own data centers. Hardware quality, uptime guarantees, security posture, and network performance vary wildly by host. There is no single-tenant isolation. For production inference or anything touching sensitive data, this matters. For batch experimentation on throwaway workloads, it can be fine.

The cross-provider median tracked by GetDeploying is $3.89/GPU-hr on-demand, with reserved averaging $3.09 and spot $2.91. AWS notably does not offer true on-demand H200 at all. P5e instances sell only via EC2 Capacity Blocks with a one-day minimum, and AWS raised those rates roughly 15% in January 2026. Azure's ND H200 v5 is the most expensive meaningful option on the market. Among specialist clouds, Nebius at $3.50 is the cheapest self-serve listed price with no waitlist, while Barrack AI's from $2.00/hr is the lowest rate on a dedicated bare-metal single-tenant H200 with free egress, available across US, Canada, EU, India, and APAC.

Benchmarks that actually justify the H200 premium

NVIDIA's headline number is 1.9x faster Llama 2 70B inference versus H100, measured with TensorRT-LLM using FP8 precision, 2,048-token input and 128-token output. Critically, that config runs batch size 32 on H200 versus batch size 8 on H100. The uplift comes largely from the H200's ability to hold more of the KV cache in memory, not from raw compute. On GPT-3 175B with 8-way tensor parallelism, NVIDIA measures 1.6x higher throughput. First-token latency barely moves, because prefill is compute-bound on identical silicon.

MLPerf Inference v4.1 (August 2024) put specific numbers on those claims: an 8x H200 system delivered 32,790 tokens/sec server mode on Llama 2 70B, versus roughly 24,525 tok/s for 8x H100, a 1.34x per-round uplift once both generations received equivalent software optimization. CoreWeave's submission hit 33,000 tok/s. In MLPerf Training v4.0, H200 finished Llama 2 70B LoRA fine-tuning in 24.7 minutes on a single 8-GPU node, a 14% improvement over H100, and ran 47% faster than H100 on the GNN benchmark.

Third-party production benchmarks tell a more nuanced story. RunPod measures a 0-11% uplift for models under 80 GB, 47% for BF16 large-batch inference that overflows H100 memory, and up to 3.4x when the H100 literally cannot fit the workload. A Medium/Data Science Collective test at 8K input + 8K output context measured 1.83x to 2.14x H200 advantage across three large models. The honest summary: the H200's real-world uplift is roughly 1.3-1.5x for workloads H100 can run comfortably, and 1.8-3.4x for workloads where memory becomes the wall.

Where 141 GB actually matters

The H200's value proposition collapses to one sentence: it lets a single GPU do what an H100 needs two GPUs to do, and that eliminates tensor-parallel communication overhead. The models that benefit most are the ones whose weights plus realistic KV caches sit in the 80-141 GB window:

  • Llama 3/3.1/3.3 70B at FP16 weighs roughly 140 GB and barely fits on a single H200. On H100 you need two GPUs with tensor parallelism, losing ~10-20% to NVLink synchronization.
  • Mixtral 8x22B (141B total parameters, 39B active) fits on one H200 at FP8 with headroom for KV cache; on H100 it requires two-GPU TP even in FP8.
  • Llama 3 70B at FP8 serving long context: a single 128K request's KV cache is ~40 GB, and four concurrent 128K requests push the cache alone past 160 GB. H200 keeps single-request and low-concurrency long-context serving on one GPU, as the sizing sketch after this list shows; H100 forces you to either shard or cap both context and concurrency.
  • DeepSeek-V3 and R1 (671B parameters) fit on a single 8x H200 HGX node at FP8 (~1,128 GB aggregate HBM) with room for KV cache. An 8x H100 node at 640 GB is too tight.
  • Llama 3.1 405B fits in BF16 on 8x H200 NVL without spanning nodes, a clean single-node deployment that 8x H100 cannot achieve without FP8 quantization.
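
To put numbers behind the long-context bullet, here is a small sizing sketch. It is our own arithmetic under stated assumptions, not a measurement: Llama 3 70B's published architecture (80 layers, 8 grouped-query KV heads, 128-dim heads), FP8 weights (~70 GB), an FP16 KV cache, and no paged-attention savings.

```python
# KV-cache sizing sketch for Llama 3 70B long-context serving.
# Assumed architecture: 80 layers, 8 KV heads (GQA), head dim 128.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
KV_BYTES_PER_TOKEN = 2 * LAYERS * KV_HEADS * HEAD_DIM * 2  # K and V, FP16 (2 bytes)

def kv_cache_gb(context_tokens: int, concurrent_requests: int = 1) -> float:
    """Total KV-cache footprint in GB under the assumed architecture."""
    return KV_BYTES_PER_TOKEN * context_tokens * concurrent_requests / 1e9

WEIGHTS_GB = 70.0  # ~70B params at FP8 (1 byte/param)

print(f"Single 128K request: {kv_cache_gb(128 * 1024):.0f} GB of KV cache")
for n in (1, 2, 4):
    total = WEIGHTS_GB + kv_cache_gb(128 * 1024, n)
    print(f"Weights + {n}x 128K context: {total:.0f} GB "
          f"(fits one H200: {total <= 141}, fits one H100: {total <= 80})")
```

With an FP8 KV cache, which vLLM and TensorRT-LLM both support, roughly halve the cache numbers; that is what pushes a couple of concurrent 128K requests back onto a single H200 while the H100 still needs to shard.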

Conversely, if you're running Llama 3 8B, Mistral 7B, Gemma 9B, or any 70B model that's already INT4-quantized at short context, H100 is usually 20-30% cheaper and performs within 10%. For compute-bound prefill, pay for H100.

Bare metal versus virtualized: the H200 case is unusually strong

Hypervisor overhead typically costs 5-15% of GPU performance, with vendor-sponsored benchmarks claiming up to 30% for large transformer training and Character.AI publicly attributing a 13.5x cost advantage in inference economics to bare metal. That overhead matters more for H200 than for almost any other accelerator, because the chip's entire premium over H100 is memory bandwidth and NVLink fabric access, precisely what virtualization layers tax the most.

The market has bifurcated accordingly. Oracle is the only major hyperscaler offering bare-metal H200 (BM.GPU.H200.8 at $10/GPU-hr); AWS p5e, Azure ND H200 v5, and GCP a3-ultragpu-8g all sit behind Nitro, Hyper-V, or Google's VM stack respectively. Specialist neoclouds have taken the opposite path: CoreWeave, Lambda, Crusoe, Nebius, GMI Cloud, and Barrack AI all deliver dedicated or near-dedicated hardware, and most are cheaper than any hyperscaler VM. The "bare metal is premium-priced" assumption that held in 2022 is dead for H200. Today bare metal is both cheaper and faster across most of the specialist tier.

Egress compounds the gap. A production inference endpoint shipping 50 TB/month of tokens to end users runs roughly $4,200-5,000/month in egress alone on AWS, Azure, or GCP. Providers with zero egress fees (Crusoe, Lambda, Barrack AI, Genesis Cloud, DigitalOcean, FluidStack) remove that line item entirely. For high-throughput serving, egress can eclipse compute spend; Gartner estimates egress consumes 10-15% of typical cloud bills and occasionally 40%.
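
For a rough sense of scale, the arithmetic below assumes a flat ~$0.09/GB internet egress rate (ballpark hyperscaler list pricing before tiered discounts) and an 8-GPU node running a full ~730-hour month, using the hourly rates from the table above.

```python
# Egress vs. compute for a 50 TB/month inference endpoint (rough sketch).
EGRESS_TB = 50
EGRESS_RATE_PER_GB = 0.09            # assumed flat rate, before tiered discounts
egress_cost = EGRESS_TB * 1000 * EGRESS_RATE_PER_GB   # ~= $4,500/month

HOURS_PER_MONTH = 730
for name, hourly_rate in [("hyperscaler VM at $10.60/GPU-hr", 10.60),
                          ("specialist bare metal at $2.00/GPU-hr", 2.00)]:
    compute = 8 * hourly_rate * HOURS_PER_MONTH        # 8-GPU node
    share = egress_cost / (compute + egress_cost) * 100
    print(f"{name}: ${compute:,.0f} compute; a metered egress bill "
          f"would add ${egress_cost:,.0f} ({share:.0f}% of total)")
```

The cheaper the compute tier, the larger the share a metered egress bill would claim, which is why zero-egress providers matter most at the bottom of the price table.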

Is H200 still the right buy with Blackwell shipping?

B200 reached volume availability through most major clouds by mid-2025, and B300 (Blackwell Ultra) began shipping in January 2026 with 288 GB HBM3e, 8 TB/s bandwidth, and roughly 1.5x the FP4 compute of B200. On-demand B200 now rents for $4.99-6.25/GPU-hr at RunPod, Lambda, and Modal, with spot rates as low as $2.05. B300 appears at $4.95-18/hr depending on provider, with Spheron spot at $2.90.

That creates three decision zones. First, if your workload uses FP4 inference (native Blackwell Transformer Engine), B200 delivers 2-3x H200 throughput and wins on $/token despite the higher hourly rate. Second, if you're running FP8 or BF16 at scale (which covers most production LLM inference today), H200 at $2-4/hr on specialist clouds remains the best price-performance point, because Blackwell's advantage narrows to roughly 1.2-1.5x in those precisions. Third, if you're fine-tuning or training on Hopper-optimized stacks (most of the open-source ecosystem), H200 remains the lowest-friction choice.
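
The decision zones reduce to a $/million-tokens calculation you can run with your own numbers. The sketch below is illustrative, not a benchmark: it assumes roughly 4,100 tok/s per H200 on a Llama-70B-class workload (about one eighth of the 8-GPU MLPerf figure cited earlier) and treats the Blackwell uplift as a variable you plug in.

```python
# $/million-tokens sketch for the H200 vs. B200 decision.
def usd_per_million_tokens(hourly_rate: float, tokens_per_sec: float) -> float:
    """Cost per 1M output tokens for a single GPU at a given sustained rate."""
    return hourly_rate / (tokens_per_sec * 3600) * 1e6

H200_TPS = 4_100  # assumed per-GPU throughput, Llama-70B-class workload

print(f"H200 @ $2.50/hr:  ${usd_per_million_tokens(2.50, H200_TPS):.3f} per 1M tokens")
print(f"H200 @ $10.60/hr: ${usd_per_million_tokens(10.60, H200_TPS):.3f} per 1M tokens")
for uplift in (1.3, 2.0, 3.0):  # FP8-ish vs. FP4-native Blackwell scenarios
    cost = usd_per_million_tokens(5.50, H200_TPS * uplift)
    print(f"B200 @ $5.50/hr, {uplift:.1f}x H200 throughput: ${cost:.3f} per 1M tokens")
```

Run with those assumptions, a cheap specialist H200 only loses the $/token race once Blackwell's uplift approaches the FP4-native end of the range, which is the same conclusion the three zones above reach.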

Counterintuitively, H200 pricing has risen slightly through 2025-2026 rather than dropping, because Llama 3 70B and Mixtral inference demand has outpaced the Blackwell rollout and NVIDIA received conditional January 2026 export clearance for H200 sales to China. The historical pattern of 15% list-price cuts within six months of a successor's launch has not materialized here. Expect H200 to remain a workhorse tier through at least 2027.

Conclusion: the H200 price you pay is a choice, not a market rate

The nearly 7x spread from $2.00 to $13.78 per hour is not a pricing failure. It's the market segmenting by what you actually need. Hyperscaler VMs at $10+ buy you enterprise procurement, compliance checkboxes, and integration with the rest of AWS/Azure/GCP. Mid-tier specialists at $4-6 buy you mature tooling and managed Kubernetes. The sub-$3 bare-metal tier, where Barrack AI's from-$2.00/hr offering (monthly/yearly billing, zero egress, availability across US, Canada, EU, India, and APAC) sits alongside Genesis Cloud and FluidStack, buys you the raw silicon with the hypervisor overhead removed and the egress tax deleted.

The H200 itself has become the defining chip for the memory-bound era of LLM inference: the one that turns 70B-class models into single-GPU problems and makes Mixtral 8x22B a single-card deployment. Blackwell will eventually displace it for FP4-native workloads, but for most of 2026 the question isn't whether H200 is worth renting. It's whether you're overpaying for it. Given that a bare-metal H200 from $2.00/hr delivers the same silicon as a $10.60/hr Azure VM, with 5-15% better performance and no egress meter running, the answer for most workloads points decisively toward the specialist tier.


FAQ

How much does an NVIDIA H200 GPU cost per hour in 2026?

On-demand H200 pricing ranges from approximately $2.00/hr on specialist bare-metal providers like Barrack AI to $13.78/hr on Azure ND H200 v5 VMs. The cross-provider median is $3.89/hr. Hyperscalers (AWS, Azure, GCP) charge $5-14/hr. Specialist bare-metal clouds charge $2-4/hr. The price depends on whether you're renting virtualized or bare-metal, and whether the provider charges egress fees.

What is the difference between H200 SXM and H200 NVL?

Both have 141 GB HBM3e at 4.8 TB/s bandwidth. The SXM variant is a 700W mezzanine card designed for liquid-cooled HGX baseboard systems with full NVLink mesh. The NVL variant is a 600W double-wide PCIe card for air-cooled servers, with roughly 15% lower Tensor Core throughput (3,341 vs 3,958 TFLOPS FP8 sparse). NVL includes a 5-year NVIDIA AI Enterprise subscription. Most cloud providers offer the SXM variant.

How much faster is the H200 compared to the H100?

For memory-bound workloads (LLM inference decode, large KV caches), H200 delivers 1.3-1.9x higher throughput than H100. For workloads that exceed 80 GB VRAM and require tensor parallelism on H100 but fit on a single H200, the advantage reaches 1.8-3.4x. For compute-bound workloads on models under 80 GB, the uplift is 0-11%. The H200 uses the same compute silicon as H100; all gains come from the larger, faster HBM3e memory.

Is the H200 worth renting over the H100?

Yes, if your model's weights plus KV cache exceed 80 GB at your target precision. Llama 3 70B at FP16 (~140 GB), Mixtral 8x22B at FP8, and any 70B model serving long-context (128K+) requests benefit substantially. No, if you're running sub-80 GB workloads like Llama 3 8B, Mistral 7B, or INT4-quantized 70B at short context. In that case, H100 at $1.49-2.69/hr delivers near-identical performance for less money.

Should I rent H200 or wait for B200/B300?

B200 is already available at $4.99-6.25/hr on-demand. B300 started shipping in January 2026. If your workload uses FP4 inference natively, B200 wins on $/token. If you're running FP8 or BF16, which covers most production inference today, H200 at $2-4/hr on specialist clouds still delivers the best price-performance. The open-source ecosystem (vLLM, TensorRT-LLM, SGLang) is fully optimized for Hopper; Blackwell-specific optimizations are still maturing.

What is the H200 hardware purchase price?

Individual H200 SXM GPUs are not sold separately. The standard procurement unit is the HGX H200 8-GPU server, with system prices reported between $250,000 and $500,000 depending on vendor and configuration. The H200 NVL PCIe card lists around $32,000-40,000 per card through enterprise OEMs like PNY, Exxact, and Supermicro.

What models fit on a single H200 that don't fit on a single H100?

Llama 3/3.1/3.3 70B at FP16 (~140 GB), Mixtral 8x22B at FP8 with KV cache headroom, and Llama 3 70B at FP8 with 4+ concurrent 128K-context requests. On H100 (80 GB), all of these require two-GPU tensor parallelism, which adds 10-20% communication overhead and doubles your GPU cost.

What is the cheapest way to rent an H200 GPU?

As of April 2026, the lowest-priced dedicated bare-metal H200 is Barrack AI at approximately $2.00/GPU-hr, billed monthly or yearly, with zero egress fees, with availability across US, Canada, EU, India, and APAC regions. Vast.ai lists lower headline rates (~$2.29/hr) but operates as a peer-to-peer marketplace where you rent from individual community hosts with no single-tenant isolation or consistent uptime guarantees. FluidStack offers $2.30/hr with commitment. Spot pricing on various providers can go below $2.00 but with preemption risk.

Does bare metal actually perform better than virtualized H200?

Yes. Hypervisor overhead typically costs 5-15% of GPU performance, with some benchmarks showing up to 30% loss for large transformer workloads. This matters disproportionately for H200 because the chip's advantage over H100 is entirely memory bandwidth and NVLink access, which are the resources virtualization layers tax most heavily. Oracle is the only hyperscaler offering bare-metal H200; most specialist clouds (CoreWeave, Lambda, Crusoe, Barrack AI) default to dedicated or near-dedicated hardware.

Why hasn't H200 pricing dropped with Blackwell available?

Two factors. First, Llama 3 70B and Mixtral inference demand has grown faster than Blackwell supply has ramped, keeping H200 utilization high. Second, NVIDIA received conditional January 2026 export clearance for H200 shipments to China, creating additional demand. The historical pattern of 15% list-price declines within six months of a successor GPU's launch has not materialized for H200.