DeepSparse Server

DeepSparse is an inference runtime that takes advantage of sparsity in neural networks to offer GPU-class performance on CPUs.


DeepSparse is a CPU inference runtime that takes advantage of sparsity to accelerate neural network inference. DeepSparse Server lets you set up a model-serving endpoint running DeepSparse, so you can send raw data over HTTP and receive the post-processed predictions.


DeepSparse Server supports any task available through DeepSparse Pipelines, including NLP, image classification, and object detection tasks. An up-to-date list of available tasks can be found in the DeepSparse Pipelines Introduction.

The default configuration of this app initializes DeepSparse Server with the zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none BERT model, launched with the following command:

deepsparse.server --task sentiment_analysis --model_path zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none

You can customize the configuration of the DeepSparse server by adjusting the Docker args in the Koyeb Service settings page. For example, to perform object detection using a YOLOv8 model, change the model_path to zoo:cv/detection/yolov8-s/pytorch/ultralytics/coco/pruned50_quant-none and the task to yolov8.
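As a minimal sketch of that change, the snippet below assembles the arguments for the YOLOv8 example using only the two flags shown in the default command (--task and --model_path). The helper name server_args is hypothetical and exists only for illustration. How the resulting arguments are entered in the Koyeb Service settings depends on the settings page itself.

```python
import shlex


def server_args(task: str, model_path: str) -> list[str]:
    """Return deepsparse.server arguments using the two flags from the default command."""
    # server_args is a hypothetical helper, not part of DeepSparse.
    return ["--task", task, "--model_path", model_path]


# Values taken from the YOLOv8 example in the paragraph above.
yolo_args = server_args(
    "yolov8",
    "zoo:cv/detection/yolov8-s/pytorch/ultralytics/coco/pruned50_quant-none",
)

# Join into a single command line with the same shape as the default command.
command = shlex.join(["deepsparse.server", *yolo_args])
print(command)
```

The printed line has the same form as the default command, with only the task and model path swapped.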

Try it out

Once DeepSparse Server is deployed, you can start sending requests to the /v2/models/sentiment_analysis/infer endpoint to get predictions. For example, to get BERT's prediction of the sentiment of a Tweet, you can send the following request:

$ curl https://<YOUR_APP_NAME>-<YOUR_KOYEB_ORG> -X POST \
  -H "Content-Type: application/json" \
  -d '{"sequences": "Just deployed my @neuralmagic DeepSparse Server on @gokoyeb and I must say! Match made in heaven 😍"}'
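The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl call: a JSON body with a "sequences" field, posted to the /v2/models/sentiment_analysis/infer endpoint. BASE_URL is a placeholder assumption and must be replaced with the public URL of your own deployment. The function names build_request and predict are hypothetical, and the shape of the returned JSON is not assumed here.

```python
import json
import urllib.request
from typing import Any

# Placeholder assumption: replace with the public URL of your own deployment.
BASE_URL = "https://example.invalid"

# Endpoint path as given in the text above.
INFER_PATH = "/v2/models/sentiment_analysis/infer"


def build_request(text: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request carrying the same JSON body as the curl example."""
    payload = json.dumps({"sequences": text}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + INFER_PATH,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text: str, base_url: str = BASE_URL, timeout: float = 30.0) -> Any:
    """Send the request and return the decoded JSON response as-is."""
    request = build_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling predict("Match made in heaven") against a running deployment returns whatever JSON the server sends back. Building the request in its own function makes it possible to inspect the URL and body without touching the network.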


