The 4090 Bare Metal Playbook for 2026
Renting an RTX 4090 in the cloud costs between $0.16 and $0.69 per hour on container and marketplace platforms, with dedicated bare metal priced separately through providers like Barrack AI. That gap matters. A bare metal 4090 gives you the full 16,384 CUDA cores, 24 GB of GDDR6X, and ~1 TB/s of memory bandwidth with no hypervisor between your code and the silicon, a setup the community has measured at roughly 5 to 10 percent faster on sustained GPU workloads than virtualized equivalents. For 7B to 13B model inference, SDXL and Flux image generation, and LoRA fine-tuning, a 4090 roughly matches an A100 80GB on single-stream throughput at 2 to 5x lower hourly cost. The ceiling is 24 GB of VRAM and the lack of NVLink, which together push training of 70B+ models and high-concurrency production serving onto H100 or B200 hardware. This post lays out the numbers, the provider landscape, and where the 4090 actually earns its keep in an AI pipeline.
What an RTX 4090 actually is under the hood
The RTX 4090 launched October 12, 2022 at a $1,599 MSRP, built on NVIDIA's Ada Lovelace architecture using TSMC's custom 4N process. The AD102 die at its center carries 76.3 billion transistors on 608 mm² of silicon, though the shipping 4090 is a partially cut version of the full die. Core counts are specific and worth memorizing if you do GPU work: 16,384 CUDA cores, 512 fourth-generation Tensor Cores, 128 third-generation RT Cores, 128 streaming multiprocessors, and 72 MB of L2 cache. The card runs at a 2,235 MHz base clock and boosts to 2,520 MHz.
Memory is 24 GB of GDDR6X on a 384-bit bus running at 21 Gbps, delivering 1,008 GB/s of bandwidth. That is roughly half the bandwidth of an A100 80GB and about 30 percent of an H100 80GB SXM, which is the single most important number to understand when comparing the 4090 to data center cards for LLM inference. GeForce cards do not expose ECC on GDDR6X, so silent bit-flips remain a real, if small, concern for multi-day training runs.
Power draw is 450W TGP through a single 16-pin 12VHPWR connector, and NVIDIA recommends an 850W PSU. The card uses PCIe 4.0 x16 and, critically, has no NVLink. Ada Lovelace consumer SKUs dropped NVLink entirely, which means multi-GPU 4090 builds communicate only over the PCIe bus at roughly 32 GB/s per direction, compared to 900 GB/s on H100 SXM. That single architectural choice is what kills tensor-parallel scaling on 4090 clusters for 70B models.
Compute throughput on paper is substantial. FP32 runs at 82.6 TFLOPS, BF16 and FP16 Tensor Core performance lands around 165 TFLOPS dense (330 TFLOPS with 2:4 sparsity) when using FP32 accumulate, and FP8 Tensor performance is 330 TFLOPS dense / 660 TFLOPS sparse under the same condition. One important caveat the marketing pages skip: on consumer Ada, FP8 and FP16 matmul throughput with FP32 accumulate is throttled to half the rate of FP16 accumulate, so the 660 peak numbers assume accumulation modes that production training frameworks rarely use. Real-world cuBLASLt FP8 on a 4090 lands closer to 330 TFLOPS.
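To sanity-check where a rented card actually lands, here is a minimal PyTorch micro-benchmark sketch that times dense BF16 matmuls with CUDA events; the matrix size and iteration count are illustrative assumptions, not a calibrated benchmark. Because PyTorch accumulates BF16 matmuls in FP32, a 4090 should come out near the ~165 TFLOPS dense figure, not the sparse peak:

```python
# Minimal sketch: measure dense BF16 matmul throughput with CUDA events.
# N and iters are illustrative; larger N keeps the GPU compute-bound.
import torch

def measure_bf16_tflops(n: int = 8192, iters: int = 50) -> float:
    a = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)
    b = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)
    # Warm up so clocks ramp and cuBLAS selects its kernel before timing.
    for _ in range(10):
        torch.matmul(a, b)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        torch.matmul(a, b)
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1000  # elapsed_time is in milliseconds
    flops = 2 * n**3 * iters                  # 2*N^3 FLOPs per N x N matmul
    return flops / seconds / 1e12

print(f"~{measure_bf16_tflops():.0f} TFLOPS dense BF16")
```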
| Spec | RTX 4090 | A100 80GB SXM | H100 80GB SXM | L40S |
|---|---|---|---|---|
| CUDA cores | 16,384 | 6,912 | 16,896 | 18,176 |
| VRAM | 24 GB GDDR6X | 80 GB HBM2e ECC | 80 GB HBM3 ECC | 48 GB GDDR6 ECC |
| Memory bandwidth | 1.01 TB/s | 2.04 TB/s | 3.35 TB/s | 864 GB/s |
| FP32 TFLOPS | 82.6 | 19.5 | 67 | 91.6 |
| FP16/BF16 Tensor dense TFLOPS | ~165 | 312 | 989 | ~362 |
| FP8 Tensor dense TFLOPS | ~330 | none | 1,979 | 733 |
| NVLink | no | 600 GB/s | 900 GB/s | no |
| ECC | no | yes | yes | yes |
| TDP | 450W | 400W | 700W | 350W |
4090 cloud pricing across providers, April 2026
The 4090 cloud market is fragmented, and the spread is wide enough that picking the wrong provider can triple your bill. The table below consolidates on-demand pricing from provider pricing pages and aggregators.
Dedicated bare metal providers
| Provider | $/hr | Regions | Billing | Egress |
|---|---|---|---|---|
| Barrack AI | Contact for pricing | US, Canada, EU, India, APAC | Monthly/Yearly | Free |
Container and VM providers
| Provider | $/hr | Billing | Notes |
|---|---|---|---|
| TensorDock | $0.37 on-demand, $0.20 spot | Per-minute | KVM with GPU passthrough |
| RunPod Community | $0.34 | Per-second | Preemptible |
| CoreWeave | $0.42 | Hourly | Enterprise orchestration |
| Fluence | $0.53 | Three-hour min | Sources from TensorDock, Sesterce |
| RunPod Secure | $0.59-0.69 | Per-second | Enterprise SLAs |
| Paperspace | Limited inventory | Per-hour | Hourly minimum |
Marketplace tier (community-supplied hardware)
| Provider | $/hr | Notes |
|---|---|---|
| Vast.ai | $0.27-0.40 | Peer-to-peer marketplace. Median ~$0.31 |
| Salad | $0.16 | Distributed consumer hardware |
A note on Vast.ai and Salad: Both operate as peer-to-peer networks where individual hosts list their own consumer hardware. You're renting from strangers, not from a company operating its own data centers. Hardware quality, uptime guarantees, security posture, and network performance vary wildly by host. There is no single-tenant isolation. For production inference or anything touching sensitive data, this matters. For batch experimentation on throwaway workloads, it can be fine.
For context on the spread, GetDeploying tracks 114 RTX 4090 listings across 11 cloud providers with a market range of $0.18 to $1.61 per hour. Median on-demand pricing sits near $0.34/hr. Reserved multi-month commitments pull that down another 20 to 40 percent across most providers.
Why bare metal matters specifically for the 4090
Virtualization overhead on GPU workloads is not a rumor. Industry benchmarks peg the hypervisor tax at roughly 5 to 10 percent on sustained GPU throughput, and some testing reports ~20 to 25 percent better sustained-load performance on bare metal versus VPS-style sharing for Llama inference. For a consumer GPU already working against its 24 GB ceiling at a fixed hourly rate, a 5 to 25 percent throughput tax goes straight into your cost per token.
Bare metal also removes the multi-tenant risks that matter more on consumer silicon than on MIG-partitioned data center cards. The 4090 cannot be sliced with MIG or vGPU. One physical card is always one tenant. On a container platform, multiple customers share the underlying host CPU, RAM, PCIe, and NVMe even if the GPU is dedicated, which produces noisy-neighbor effects on data loading, checkpointing, and any CPU-bound preprocessing. Recent GPU-side Rowhammer research on GDDR6 (GDDRHammer, GPUBreach) adds a security layer to this argument: dedicated hardware is the cleanest mitigation for cross-tenant attacks on consumer GPUs that lack the security features of data center silicon.
Kernel-level control is the other bare metal benefit that often gets left off pricing pages. You pick the driver version, the CUDA toolkit, the kernel, and you can load NVIDIA's open GPU kernel modules or exotic distro configurations. On VM or container platforms the host admin controls the driver branch and you inherit whatever nvidia-smi behavior and persistence mode they set.
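As a concrete illustration, here is a minimal sketch using the nvidia-ml-py bindings to read the driver-level state that a hosted tenant simply inherits from the provider. On bare metal, every one of these settings is yours to change:

```python
# Minimal sketch with nvidia-ml-py (pip install nvidia-ml-py): query the
# driver-level settings a container tenant inherits from the host admin.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
print("driver version:", pynvml.nvmlSystemGetDriverVersion())
print("persistence mode:", pynvml.nvmlDeviceGetPersistenceMode(handle))  # 1 = enabled
# Power limit is reported in milliwatts; a stock 4090 reads 450000.
print("power limit (W):", pynvml.nvmlDeviceGetPowerManagementLimit(handle) / 1000)
pynvml.nvmlShutdown()
```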
The restrictions are real and worth naming. The 4090 does not support ECC, so long training runs carry some bit-flip risk that data center cards with HBM ECC do not. There is no NVLink, so tensor-parallel scaling for models larger than 24 GB is capped by PCIe 4.0 bandwidth. The 12VHPWR connector has a documented melting history at sustained load. And consumer cooler designs were never built for 24/7 rack duty cycles. A provider doing bare metal 4090 correctly is re-housing cards in server chassis with proper airflow, not bolting retail coolers into 1U boxes.
Where the 4090 beats A100, H100, and L40S on price-performance
For Llama 3 8B inference at batch size 1 on llama.cpp, an RTX 4090 produces 127.7 tokens/sec at Q4_K_M and 54.3 tokens/sec at FP16. An A100 PCIe 80GB manages 138.3 and 54.6 tokens/sec on the same workload. An H100 PCIe delivers 144.5 and 67.8 tokens/sec. For single-stream inference on a model that fits in VRAM, the 4090 lands within 8 percent of A100 at Q4 and matches it at FP16. The H100 pulls ahead only when memory bandwidth becomes the limit, which happens at FP16 with long contexts.
For prompt processing (the prefill phase, which is compute-bound rather than memory-bound), the 4090 hits 9,056 tokens/sec on FP16 Llama 3 8B, compared to 10,343 on H100 PCIe. That is a 14 percent gap, much tighter than the price gap between the two cards.
Quantized inference is where the 4090 genuinely shines. Llama 2 7B with AWQ INT4 hits 194 tokens/sec on a 4090 at batch 1, a 3.7x speedup over FP16. Llama 2 13B at AWQ INT4 hits 110 tokens/sec on the same card.
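For reference, a minimal vLLM sketch for serving an AWQ INT4 checkpoint on a single 4090 looks like the following; the model ID and context cap are illustrative assumptions, not a tuned configuration:

```python
# Hedged sketch: serving an AWQ INT4 checkpoint with vLLM on one 4090.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-Chat-AWQ",  # assumed checkpoint name
    quantization="awq",
    max_model_len=4096,            # cap context to leave VRAM for KV cache
    gpu_memory_utilization=0.90,
)
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain KV cache in one paragraph."], params)
print(outputs[0].outputs[0].text)
```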
Stable Diffusion and Flux are a near-tie with A100 out of the box. Head-to-head testing across 12 benchmark variants covering SD, SDXL, and Flux found the two cards perform nearly identically, with each card winning roughly half the tests. On community clouds, a 4090 generates SDXL 1024x1024 images at ~15.6 seconds average including overhead, which works out to about 769 SDXL images per dollar at $0.30/hr. Flux.1 Schnell with FP8 optimizations on a 4090 runs in the 1 to 3 second range per image, yielding up to 5,243 Flux images per dollar.
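A minimal diffusers sketch for SDXL at 1024x1024 in FP16, which keeps peak VRAM near the ~10 GB figure above; the prompt and step count are placeholders:

```python
# Minimal sketch: SDXL 1024x1024 generation in FP16 with diffusers.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

image = pipe(
    prompt="a server rack full of GPUs, studio lighting",
    height=1024,
    width=1024,
    num_inference_steps=30,  # illustrative step count
).images[0]
image.save("sdxl_test.png")
```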
Fine-tuning economics favor the 4090 for anything that fits. A typical QLoRA fine-tune of Llama 3 8B on a domain dataset runs 3 to 4 hours on a 4090. At $0.44/hr that is $1.30 to $1.75 per full fine-tune run. The same job on an A100 80GB runs 1.5 to 2x faster but at 3 to 4x the hourly rate, so net cost favors the 4090.
What fits in 24 GB, and what does not
The rule of thumb for LLM inference is 2 GB per billion parameters at FP16, 1 GB/B at FP8 or INT8, and roughly 0.5 GB/B at 4-bit. Add KV cache scaling with batch size and context length.
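Here is that rule of thumb as a back-of-envelope estimator; the KV cache formula assumes a standard decoder with grouped-query attention, and the example plugs in Llama 3 8B's published config (32 layers, 8 KV heads, head dim 128):

```python
# Back-of-envelope VRAM estimator under the rules of thumb above.
def weights_gb(params_b: float, bytes_per_param: float) -> float:
    # 2.0 for FP16, 1.0 for FP8/INT8, 0.5 for 4-bit
    return params_b * bytes_per_param

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, batch: int, bytes_per_elem: int = 2) -> float:
    # 2x for the K and V tensors, per layer, per token, per sequence.
    return 2 * layers * kv_heads * head_dim * context * batch * bytes_per_elem / 1e9

# Llama 3 8B at FP16, 8K context, batch 1:
total = weights_gb(8, 2.0) + kv_cache_gb(32, 8, 128, 8192, 1)
print(f"{total:.1f} GB")  # ~17.1 GB -> fits in 24 GB with headroom
```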
| Model | FP16 on 4090 | Notes |
|---|---|---|
| Phi-3 mini 3.8B | Fits easily | ~7.6 GB weights |
| Mistral 7B, Qwen 2.5 7B | Fits | ~14 GB, room for KV cache |
| Llama 3 / 3.1 8B | Fits | ~16 GB, tight at 128K context |
| Gemma 2 9B | Fits, tight | ~18 GB |
| Flux.1 Dev (12B) | Borderline | ~22 GB peak, T5 encoder often offloaded to CPU |
| Llama 2 13B | Does not fit at FP16 | ~26 GB, needs FP8 or 4-bit |
| Llama 3 70B | Does not fit | 140 GB at FP16 |
| Mixtral 8x7B | Does not fit at FP16 | ~93 GB, even Q4 (~26 GB) slightly overflows |
| SDXL, SD 1.5 | Fits easily | ~10 GB peak |
Quantization extends the envelope but not as far as the marketing implies. GPTQ and AWQ at 4-bit roughly halve memory versus FP8, putting 30B dense models in the 16 to 20 GB range with room for KV cache. GGUF Q4_K_M for Llama 3 70B is 42.5 GB, which does not fit on a 24 GB card even with all tricks enabled. Claims that "Llama 3 70B runs on a 4090" always rely on partial CPU offload through llama.cpp, which produces single-digit tokens/sec throughput. For 70B+ workloads, move to H100, H200, or B200 hardware.
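To make the offload point concrete, here is a hedged llama-cpp-python sketch; the GGUF filename is a placeholder, and n_gpu_layers controls how many of the model's 80 transformer layers live in VRAM while the rest run on CPU:

```python
# Sketch: partial CPU offload with llama-cpp-python. This is what makes
# "70B on a 4090" claims technically true and practically slow.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-3-70b-instruct.Q4_K_M.gguf",  # ~42.5 GB, assumed local file
    n_gpu_layers=35,  # roughly the share of an 80-layer model that fits in 24 GB
    n_ctx=4096,
)
out = llm("Q: Why is partial offload slow?\nA:", max_tokens=128)
print(out["choices"][0]["text"])
```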
QLoRA fine-tuning works comfortably up to 13B on a 4090 and can reach 30B with gradient checkpointing and small sequence lengths. Full-precision LoRA caps out around 7B with checkpointing. Flux.1 LoRA training in BF16 fits in 24 GB using Kohya.
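A minimal QLoRA setup sketch with transformers, peft, and bitsandbytes follows; the rank, alpha, and target modules are illustrative defaults rather than a tuned recipe:

```python
# Minimal QLoRA setup sketch: 4-bit NF4 base model plus trainable adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", quantization_config=bnb, device_map="auto"
)
# Enables gradient checkpointing and casts norms for stable 4-bit training.
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # illustrative defaults
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # a fraction of a percent of weights train
```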
4090 versus 5090: the 2026 question
The RTX 5090 launched January 30, 2025 at a $1,999 MSRP and promptly disappeared into a months-long supply crunch. On paper it is a real upgrade: 21,760 CUDA cores, 32 GB of GDDR7, 1,792 GB/s of memory bandwidth, fifth-generation Tensor Cores with native FP4 support, and a 575W TDP. Memory bandwidth is the headline number, up 78 percent over the 4090.
Real-world uplift over the 4090 sits at about 27 percent on gaming and FP16 AI workloads, with larger gains on memory-bandwidth-bound tasks. Flux FP8 image generation is roughly 75 percent faster, and with native FP4 the 5090 reaches over 4x the 4090's FP8 performance on some benchmarks.
So is the 4090 still worth renting with 5090s available? For most workloads, yes. Cloud 5090s run roughly 1.5 to 2x the rental rate of 4090s, which closes the gap on throughput-per-dollar. Both cards cap below the 80 GB threshold where data center hardware takes over, so a workload that does not fit on a 4090 usually does not fit on a 5090 either. Two years of Ada optimization also means the quantization kernels, vLLM paths, and Flash-Attention implementations on the 4090 are production-hardened, whereas Blackwell consumer software support has only recently matured. The 5090 wins cleanly when your workload sits in the 24 to 32 GB band, is memory-bandwidth limited, or can exploit native FP4.
The China ban and why cloud rental became the primary access path
On November 17, 2023 the RTX 4090 became subject to a US export ban to China after the Bureau of Industry and Security revised its controls using a Total Processing Performance metric. The 4090 scored a TPP of 5,285 against a 4,800 threshold. NVIDIA launched the RTX 4090D with 14,592 CUDA cores as a compliant alternative.
The ban created a grey market with Chinese retail pricing spiking to roughly $3,600 to $4,150. Chinese factories began stripping retail 4090s and resoldering the AD102 die onto blower-cooled reference PCBs suited to AI servers. Compounding the supply picture, NVIDIA reportedly ceased 4090 production in late 2024 to clear capacity for Blackwell. New retail stock has been thinning since. Used pricing in the US runs $1,100 to $1,400, new retail sits at $1,500 to $2,800 depending on AIB partner.
For teams that need 4090s at scale, purchasing dozens of cards from a shrinking supply is not realistic, and cloud rental became the primary access path.
Conclusion
The RTX 4090 in 2026 is the best price-performance GPU in the cloud for a specific and large category of work: inference and fine-tuning of models that fit in 24 GB, image generation at almost any scale, and research cycles where cost per experiment is the binding constraint. Production ended in late 2024, retail pricing inflated, and the China export ban removed a chunk of global supply, which is why cloud rental became the dominant access path rather than a convenience.
Pricing starts at $0.16/hr on decentralized consumer networks and tops out with dedicated bare metal from providers like Barrack AI, which offers single-tenant 4090 hardware across US, Canada, EU, India, and APAC with monthly/yearly billing and zero egress fees. Bare metal matters more on a consumer GPU than on a data center card because the margins for hypervisor overhead, noisy neighbors, and security isolation are thinner. For dedicated 4090 bare metal pricing, contact Barrack AI.
FAQ
How much does it cost to rent an RTX 4090 in the cloud?
On-demand 4090 pricing ranges from $0.16/hr on decentralized consumer networks like Salad to $0.69/hr on RunPod Secure. Marketplace rates on Vast.ai average around $0.31/hr but come from individual community hosts with variable reliability. Dedicated bare metal providers charge more for single-tenant hardware with guaranteed isolation. The cross-provider median tracked by GetDeploying is approximately $0.34/hr.
What is the difference between bare metal and container 4090 rental?
Bare metal gives you the full physical GPU card with no hypervisor or container runtime between your code and the silicon. You get kernel-level control, full CUDA access, and no noisy-neighbor effects. Container platforms like RunPod and Vast.ai typically dedicate the GPU but share the host CPU, RAM, and NVMe across tenants, which introduces 5 to 10 percent overhead on sustained GPU workloads and potential noisy-neighbor effects on data loading.
What AI models fit on a 4090's 24 GB VRAM?
At FP16: Llama 3 8B (~16 GB), Mistral 7B (~14 GB), Qwen 2.5 7B, Gemma 2 9B (~18 GB), SDXL (~10 GB), and Flux.1 Dev at the limit (~22 GB). At 4-bit quantization: models up to roughly 30B parameters. Llama 3 70B does not fit even at 4-bit (42.5 GB at Q4_K_M). Claims of running 70B on a 4090 always involve CPU offloading at single-digit tokens/sec.
Is the 4090 faster than the A100 for LLM inference?
For models that fit in 24 GB, it is within 8 percent of A100 at Q4 quantization and matches it at FP16 on single-stream inference. The A100 wins on workloads above 24 GB (where the 4090 cannot run them at all), on multi-GPU scaling (NVLink vs PCIe), and on concurrent batch serving where its 80 GB and 2 TB/s bandwidth provide headroom the 4090 lacks.
Should I rent a 4090 or an H100?
If your model fits in 24 GB and you are optimizing for cost per token or cost per image, the 4090 is 2 to 5x cheaper per hour and delivers comparable single-stream throughput. If your model exceeds 24 GB, needs tensor parallelism across GPUs, requires ECC memory for multi-day training, or serves concurrent requests at batch sizes above 32, the H100 is the correct card.
Is the 4090 still worth renting with the 5090 available?
For most workloads, yes. Cloud 5090s cost roughly 1.5 to 2x more per hour, and the 27 percent average performance uplift does not overcome the price gap on throughput-per-dollar for FP16 workloads. The 5090 wins when your workload sits in the 24 to 32 GB band, is memory-bandwidth limited, or can exploit native FP4 precision.
Why can't I just buy 4090s for my own cluster?
NVIDIA reportedly ceased 4090 production in late 2024. New retail stock is thinning and prices run $1,500 to $2,800+. The China export ban removed supply from the global market. Additionally, NVIDIA's GeForce EULA restricts data center deployment of consumer GPU drivers, though enforcement has been limited. Cloud rental solves the procurement, compliance, power, and cooling problems simultaneously.
Does Barrack AI offer bare metal 4090 instances?
Barrack AI provides dedicated bare metal 4090 instances with single-tenant isolation, monthly/yearly billing, and zero egress fees across US, Canada, EU, India, and APAC regions. For pricing, contact Barrack AI.
What workloads should NOT use a 4090?
Any model exceeding 24 GB at your target precision (Llama 70B, Mixtral 8x22B, DeepSeek-V3). Multi-GPU tensor-parallel training. Production inference endpoints requiring ECC reliability guarantees. High-concurrency serving where KV cache exceeds 24 GB. For these workloads, move to H100 ($2.00+/hr), H200, or B200 hardware.
How does QLoRA fine-tuning work on a 4090?
QLoRA fine-tuning works comfortably up to 13B parameters on a 4090. A typical Llama 3 8B QLoRA run takes 3 to 4 hours. At community cloud rates (~$0.44/hr) that is $1.30 to $1.75 per run. Full-precision LoRA caps out around 7B with gradient checkpointing. Flux.1 LoRA training in BF16 fits in 24 GB using Kohya.
