Serverless GPUs: RTX Pro 6000, H200, and B200 Now Available on Koyeb
Today, we’re announcing that NVIDIA RTX Pro 6000, H200, and B200 GPUs are available serverlessly on Koyeb.
These new GPU instances enable high-performance inference for compute-intensive workloads that are memory-bound, latency-sensitive, or throughput-constrained, including long-context and large-model serving.
With prices ranging from $2.20/hr to $5.50/hr, all GPU instances are billed by the second, so you can build, experiment, and autoscale with no infrastructure management, no idle cost, and no unexpected bills.
| | RTX Pro 6000 | H200 | B200 |
|---|---|---|---|
| Architecture | NVIDIA Blackwell | NVIDIA Hopper | NVIDIA Blackwell |
| GPU Memory (VRAM) | 96 GB | 141 GB | 180 GB |
| VRAM Bandwidth | 1.6 TB/s | 4.8 TB/s | 8 TB/s |
| RAM | 220 GB | 240 GB | 240 GB |
| Dedicated vCPU | 15 | 15 | 15 |
| Disk | 240 GB | 320 GB | 320 GB |
| Price | $2.20/hr | $3.00/hr | $5.50/hr |
H200 and RTX Pro 6000 instances are available instantly. Start running your workloads right away!
Get started with GPUs
The fastest way to get started is to deploy one of the ready-to-run models available in the one-click catalog. Just pick a model, choose a GPU, and Koyeb handles the infrastructure for you: no setup required.
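For instance, if the model you deploy from the catalog exposes an OpenAI-compatible API (as vLLM-based deployments typically do), you can query it with the standard `openai` Python client. This is a minimal sketch: the public URL, API key, and model name below are placeholders for your own deployment, not values defined by Koyeb.

```python
# Minimal sketch: querying a model deployed from the one-click catalog.
# Assumes the deployment exposes an OpenAI-compatible API (e.g. vLLM);
# the base URL, API key, and model name are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://your-service-your-org.koyeb.app/v1",  # your Koyeb public URL
    api_key="placeholder",  # only needed if you enabled authentication
)

response = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Hello from a serverless GPU!"}],
)
print(response.choices[0].message.content)
```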

You can also deploy your own models and workloads using the Koyeb CLI or the Koyeb Dashboard. As always, we support multiple deployment paths:
- Pre-built containers with your model and inference server (a minimal sketch follows this list)
- Connect your GitHub repository and let Koyeb automatically handle the build and deployment
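As an illustration of the container path, here is a minimal sketch of an inference server you could package into an image and deploy on a GPU instance. It assumes `fastapi`, `uvicorn`, `transformers`, and a CUDA-enabled `torch` are installed in the image; the model name and route are placeholders, not a Koyeb requirement.

```python
# Minimal sketch of a containerized inference server for a Koyeb GPU instance.
# Run with: uvicorn main:app --host 0.0.0.0 --port 8000
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()

# Load the model once at startup and keep it on GPU 0.
generator = pipeline("text-generation", model="your-org/your-model", device=0)

class Prompt(BaseModel):
    text: str
    max_new_tokens: int = 128

@app.post("/generate")
def generate(prompt: Prompt):
    out = generator(prompt.text, max_new_tokens=prompt.max_new_tokens)
    return {"completion": out[0]["generated_text"]}
```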
RTX Pro 6000
With 96 GB of VRAM and native FP4 support, the RTX Pro 6000 Server Edition is ideal for agentic AI systems, generative applications, AI-driven rendering, and data-heavy analytics workloads where latency, cost efficiency, and flexibility matter.
| | RTX Pro 6000 |
|---|---|
| GPU Memory (VRAM) | 96 GB |
| VRAM Bandwidth | 1.6 TB/s |
| FP4 AI Compute (Peak) | 4 PFLOPS |
| FP32 | 120 TFLOPS |
| CUDA Cores | 24,064 |
| Tensor Cores | 752 |
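To put the 96 GB and native FP4 support in perspective, here is a rough, illustrative estimate of weight footprints at FP4 versus FP16. It ignores KV cache, activations, and runtime overhead, so treat it as a sizing sketch rather than a guarantee.

```python
# Rough FP4 vs. FP16 weight-footprint estimate against the RTX Pro 6000's 96 GB.
# Parameter counts are illustrative; real memory use also includes the KV cache,
# activations, and framework overhead.
def weights_gb(params_billion: float, bits: int) -> float:
    return params_billion * 1e9 * bits / 8 / 1e9

for params in (8, 32, 70):
    print(f"{params}B params: ~{weights_gb(params, 4):.0f} GB at FP4, "
          f"~{weights_gb(params, 16):.0f} GB at FP16")
# A 70B model at FP4 (~35 GB of weights) leaves roughly 60 GB of the 96 GB
# for KV cache and batching headroom; at FP16 (~140 GB) it would not fit.
```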
H200
The H200 is an evolution of the well-known H100, delivering approximately 75% more GPU memory and 40% higher memory bandwidth to support larger-scale AI models.
H200 excels at running very large models (100B+ parameters), sustaining high batch throughput for latency-sensitive inference, and efficiently processing long input sequences with tens of thousands of tokens.
| | H100 | H200 |
|---|---|---|
| GPU Memory (VRAM) | 94 GB HBM3 | 141 GB HBM3e |
| VRAM Bandwidth | 3.9 TB/s | 4.8 TB/s |
| Interconnect (NVIDIA NVLink and NVSwitch) | 600 GB/s | 900 GB/s |
| FP16 Compute (Dense) | 835.5 TFLOPS | 835.5 TFLOPS |
| BFLOAT16 Compute (Dense) | 835.5 TFLOPS | 835.5 TFLOPS |
| FP8 Compute (Dense) | 1,670.5 TFLOPS | 1,670.5 TFLOPS |
| INT8 Compute (Dense) | 1,670.5 TOPS | 1,670.5 TOPS |
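To make the long-context point concrete, here is a back-of-the-envelope KV-cache estimate for a hypothetical 70B-class model (80 layers, 8 KV heads, head dimension 128, FP16 cache). The configuration and the resulting figures are assumptions for illustration, not measurements.

```python
# Back-of-the-envelope KV-cache sizing for long-context inference on H200.
# Assumed 70B-class configuration: 80 layers, 8 KV heads, head dim 128,
# FP16 (2-byte) cache entries. Numbers are illustrative only.
layers, kv_heads, head_dim, bytes_per_elem = 80, 8, 128, 2

# K and V are each cached per layer, per KV head, per head dimension.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem

context_tokens = 64_000
kv_cache_gb = kv_bytes_per_token * context_tokens / 1e9
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")
print(f"KV cache for {context_tokens:,} tokens: {kv_cache_gb:.1f} GB")
# Roughly 0.33 MB per token, i.e. ~21 GB for a 64K-token context, which
# fits alongside FP8 weights (~70 GB for 70B params) within 141 GB of HBM3e.
```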
B200
NVIDIA B200, powered by the Blackwell architecture, is built for ultra-large model inference at scale, combining 180 GB of HBM3e memory with 8 TB/s of bandwidth and exceptional FP8 and FP4 Tensor Core performance. This makes it a strong fit for high-throughput, low-latency inference on the largest LLMs and multimodal models, especially in multi-GPU setups where performance per node directly reduces cost and time-to-response.
| | B200 |
|---|---|
| GPU Memory (VRAM) | 180 GB HBM3e |
| VRAM Bandwidth | 8 TB/s |
| FP8 Tensor (Dense) | 36 PFLOPS |
| FP4 Tensor (Dense) | 72 PFLOPS |
Multi-GPU configurations
The H200 instance is available in multi-GPU configurations (2×, 4×, and 8×). These configurations make it easy to scale vertically for distributed inference, large-batch processing, or multi-model deployment, while keeping Koyeb’s serverless experience with no idle cost and predictable, usage-based pricing.
| | VRAM | vCPU | RAM | Disk | Price |
|---|---|---|---|---|---|
| H200 | 141 GB | 15 | 240 GB | 320 GB | $3.00/hr |
| 2× H200 | 282 GB | 30 | 480 GB | 640 GB | $6.00/hr |
| 4× H200 | 564 GB | 60 | 960 GB | 1,280 GB | $12.00/hr |
| 8× H200 | 1,128 GB | 120 | 1,920 GB | 2,560 GB | $24.00/hr |
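If you serve with vLLM (one common option among many), distributing a model across a multi-GPU H200 instance is mostly a matter of setting `tensor_parallel_size` to the GPU count. This is a sketch: the model name is a placeholder, and it assumes the weights actually fit across the selected GPUs.

```python
# Sketch: distributed inference with vLLM on a multi-GPU H200 instance.
# The model name is a placeholder; tensor_parallel_size should match the
# number of GPUs on the instance (2, 4, or 8).
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/your-large-model",
    tensor_parallel_size=4,   # e.g. a 4x H200 instance
    max_model_len=32_768,     # long-context serving
)

params = SamplingParams(max_tokens=256, temperature=0.7)
outputs = llm.generate(["Summarize the benefits of tensor parallelism."], params)
print(outputs[0].outputs[0].text)
```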
Why Koyeb for serverless GPUs?
These new GPUs build on what makes Koyeb different. All GPU instances are fully serverless, and come with:
- Docker container deployment from any registry using the Koyeb API, CLI, or control panel.
- Scale-to-Zero with reactive autoscaling based on requests per second, concurrent connections, or P95 response time.
- Pay only for what you use, per second.
- Built-in observability, metrics and more.
- Dedicated GPU performance without managing underlying infrastructure.
With support for RTX 4000 SFF Ada, RTX A6000, L40S, A100, H100, H200, RTX Pro 6000, B200, and next-gen AI accelerators like Tenstorrent, Koyeb provides a broad range of compute options for AI inference, fine-tuning, and other AI-driven or data-analytics workloads, all available through a serverless platform.
Run your AI workloads on high-performance serverless GPUs. Enjoy native autoscaling and Scale-to-Zero.
Launch your workloads today
Over the past few months, we’ve continued to expand what teams can build on Koyeb.
Two months ago, we slashed GPU prices by up to 24% to make high-performance AI infrastructure more accessible. A week ago, we restocked RTX A6000 instances to support growing demand. Today, we’re expanding our lineup again with RTX Pro 6000, H200, and B200, bringing the latest GPUs to Koyeb’s serverless platform.
The new GPU instances are available today. Deploy in minutes, scale automatically, and build AI without infrastructure friction.
- Sign up and get started today
- Explore our documentation and tutorials
- Interested in building the future of AI and cloud? We’re hiring!