Apr 29, 2025
4 min read

Achieve 5x Faster Inference Speeds on Serverless GPUs with Pruna AI and Koyeb

Today, we are excited to announce our partnership with Pruna AI. Pruna AI is the optimization engine built to simplify and accelerate scalable inference. Koyeb offers a serverless cloud platform for teams to deploy ML and AI models on high-performance GPUs, CPUs, and accelerators, globally.

By combining Pruna with Koyeb, you can optimize your models, achieve 5x faster inference speeds, and run them on scalable, high-performance serverless infrastructure. Our collaboration makes AI workloads faster, more efficient, and easier to scale.

5x faster inference speeds

Complex models slow down inference, increase costs, and require more resources to run. Pruna specializes in making large models faster and more efficient without compromising output quality.

Their compression techniques reduce model size and dramatically cut inference time and resource usage (a code sketch of each technique follows the list):

  • Pruning: Remove unnecessary parameters without affecting output quality
  • Quantization: Reduce memory requirements and speed up inference with lower-precision weights
  • Compilation: Compile models so they run as efficiently as possible, maximizing both speed and resource use
  • Batching: Handle more requests in parallel, especially in inference-heavy environments
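
Pruna’s own API isn’t shown here, but as a rough sketch, here is what each technique can look like in plain PyTorch (generic equivalents for illustration, not Pruna’s implementation; each step is shown independently):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A toy model standing in for a real network.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Pruning: zero out the 30% of first-layer weights with the smallest
# L1 magnitude, then make the change permanent.
prune.l1_unstructured(model[0], name="weight", amount=0.3)
prune.remove(model[0], "weight")

# Quantization: store Linear weights as int8 to cut memory use and
# speed up inference (dynamic quantization targets CPU inference).
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Compilation: let the PyTorch 2.x compiler fuse and optimize the graph.
compiled = torch.compile(model)

# Batching: run 64 inputs in one forward pass instead of 64 separate calls.
batch = torch.randn(64, 512)
with torch.no_grad():
    outputs = compiled(batch)
```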

With these optimizations, large models run seamlessly on smaller, more cost-effective Koyeb GPU instances. You get faster inference, reduced infrastructure costs, and top-tier performance at both the model and infrastructure levels.

Optimize Whisper, Stable Diffusion, Video Generation, LLMs, and more

From large language models to multimodal models, Pruna’s optimization engine delivers top performance across all types of models: LLMs, image and video generation, audio, computer vision, and more.

Optimize and deploy models like Whisper, Stable Diffusion, Flux, and more with Pruna on Koyeb.

Fast Deployments on Serverless GPUs

You can run any model optimized by Pruna on best-in-class infrastructure without managing any of the complexity. With built-in autoscaling and scale-to-zero, Koyeb maximizes infrastructure efficiency and enables you to scale from zero to hundreds of instances in seconds.

Run optimized models on Koyeb’s serverless GPUs, including:

  • NVIDIA H100
  • NVIDIA A100
  • NVIDIA L40S
  • NVIDIA L4
  • NVIDIA RTX A6000
  • NVIDIA RTX 4000

Multi-GPU instances are available as well, in 2x, 4x, and 8x configurations for both H100 and A100 GPUs.

Get started today: Deploy Pruna AI Flux.1 [dev] Juiced on Koyeb in one click

The Pruna AI Flux.1 [dev] Juiced model is a highly optimized version of Black Forest Labs' FLUX.1 [dev], enhanced by Pruna AI to achieve 5x to 9x faster inference speeds without sacrificing quality.

Pruna AI's advanced optimization techniques, rooted in cutting-edge AI efficiency research, ensure that this model maintains high fidelity while significantly boosting performance.

The default configuration ("juiced") provides a safe balance between speed and quality. Additional settings are also available, letting you either prioritize consistent output quality ("lightly juiced") or push for even faster inference times ("extra juiced").

5x faster Flux.1 [dev] inference speed

Enjoy a dedicated API endpoint on Koyeb GPUs for high-performance, low-latency, and efficient inference with the Pruna-optimized version of the Flux.1 [dev] model.
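
As a hypothetical sketch, calling such an endpoint from Python could look like the following (the URL, route, and payload field names, including the preset field, are illustrative assumptions rather than the actual API schema):

```python
import requests

# Hypothetical endpoint; replace with the URL Koyeb shows for your deployment.
ENDPOINT = "https://my-flux-app.koyeb.app/generate"

# Illustrative payload: field names depend on the deployed model server.
# "preset" stands in for choosing between the "lightly juiced", "juiced",
# and "extra juiced" configurations described above.
payload = {
    "prompt": "A watercolor lighthouse at dawn, soft morning light",
    "preset": "juiced",
}

response = requests.post(ENDPOINT, json=payload, timeout=300)
response.raise_for_status()

# Assume the server returns raw PNG bytes; adapt if it returns base64 or JSON.
with open("output.png", "wb") as f:
    f.write(response.content)
```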

Deploy Now

Faster, scalable, and efficient AI models

We’re thrilled to team up with Pruna to bring you lightning-fast, scalable inference — without the complexity.

Koyeb’s platform is built to accelerate, scale, and optimize AI deployments. Pruna’s optimization engine makes models faster and more efficient. Combining the two lets you run the most efficient, fastest, and most performant models in production.

Sign up for Koyeb and start deploying Pruna-optimized models in one click.

We can’t wait to see what you build and scale with Pruna and Koyeb!

Resources

Celebrate with us & AMA!

Join us for a live webinar with Pruna, where we'll show how to combine their model compression techniques with Koyeb’s infrastructure to deploy faster, leaner AI — without compromising performance.

We’ll dive into real-world examples, walk through the setup, and answer your questions on optimizing both models and infrastructure.

RSVP and save your spot!

