PaddlePaddle FastDeploy
Overview
FastDeploy is an inference and deployment toolkit for large language models and visual language models based on PaddlePaddle. It delivers production-ready, out-of-the-box deployment solutions with core acceleration technologies:
🚀 Load-Balanced PD Disaggregation: Industrial-grade solution featuring context caching and dynamic instance role switching. Optimizes resource utilization while balancing SLO compliance and throughput.
🔄 Unified KV Cache Transmission: Lightweight high-performance transport library with intelligent NVLink and RDMA selection.
🤝 OpenAI API Server and vLLM Compatible: One-command deployment with vLLM interface compatibility (see the sketch after this list).
🧮 Comprehensive Quantization Format Support: W8A16, W8A8, W4A16, W4A8, W2A16, FP8, and more.
⏩ Advanced Acceleration Techniques: Speculative decoding, Multi-Token Prediction (MTP), and Chunked Prefill.
🖥️ Multi-Hardware Support: NVIDIA GPU, Kunlunxin XPU, Hygon DCU, Iluvatar GPU, Enflame GCU, MetaX GPU, Intel Gaudi, and more.
For the complete project and full documentation, see the FastDeploy GitHub repository.