Deploy Llama 3.1 8B Instruct One-Click App

Deploy Llama 3.1 8B Instruct on Koyeb's high-performance infrastructure. Get a dedicated endpoint running on GPU in seconds to handle inference requests with zero configuration.

Scale to millions of requests with built-in autoscaling. Scale up with demand and scale down to zero during idle periods.

Deploy Llama 3.1 8B Instruct for free

Get up to $200 in credit to get started!


Overview of Llama 3.1 8B Instruct

Meta’s Llama 3.1 8B model is a state-of-the-art, open-source large language model built for high-quality text generation. With 8 billion parameters, it is ideal for content generation, conversational AI, and text summarization.

Llama 3.1 8B Instruct is served with vLLM, an inference engine designed for high-throughput, low-latency model serving.

The default GPU for running this model is the NVIDIA A100 instance type. You can change the GPU instance type to fit your workload requirements.

Quickstart

The one-click deployment serves Llama 3.1 8B Instruct with vLLM. Optimized for large language models, vLLM provides efficient performance and compatibility with the OpenAI API.

After you deploy the model, copy the Koyeb App public URL, which looks like https://<YOUR_DOMAIN_PREFIX>.koyeb.app. Then create a Python file named main.py with the following content to start interacting with the model.

import os

from openai import OpenAI

# The SDK requires an API key; a placeholder works until the endpoint is secured.
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY", "fake"),
    base_url="https://<YOUR_DOMAIN_PREFIX>.koyeb.app/v1",
)

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Tell me a joke.",
        }
    ],
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_tokens=30,  # caps the length of the generated reply
)

# Print the full response as formatted JSON.
print(chat_completion.to_json(indent=4))

The snippet above uses the OpenAI SDK to interact with the Llama 3.1 8B Instruct model, which works because vLLM exposes an OpenAI-compatible API.

Replace the base_url value in the snippet with your Koyeb App public URL, keeping the /v1 suffix.

Running the script prints the model's response to the input message.

$ python main.py

{
    "id": "chatcmpl-a94edf120cb74cc995d93ec82afc4b53",
    "choices": [
        {
            "finish_reason": "length",
            "index": 0,
            "logprobs": null,
            "message": {
                "content": "A man walks into a library and asks the librarian, \"Do you have any books on Pavlov's dogs and Schrödinger's cat",
                "role": "assistant",
                "tool_calls": []
            },
            "stop_reason": null
        }
    ],
    "created": 1732135919,
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "object": "chat.completion",
    "usage": {
        "completion_tokens": 30,
        "prompt_tokens": 40,
        "total_tokens": 70,
        "prompt_tokens_details": null
    },
    "prompt_logprobs": null
}
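
In the sample response, finish_reason is "length" and completion_tokens equals the max_tokens value of 30, which indicates the reply was cut off at the token limit instead of ending naturally. Below is a minimal sketch of checking for this, assuming the response has been parsed into a plain dictionary. The helper name summarize_choice is hypothetical, and the sample is a trimmed copy of the output above.

```python
import json

# Trimmed copy of the sample response shown above (content shortened).
SAMPLE = json.loads("""
{
    "choices": [
        {
            "finish_reason": "length",
            "index": 0,
            "message": {"content": "A man walks into a library", "role": "assistant"}
        }
    ],
    "usage": {"completion_tokens": 30, "prompt_tokens": 40, "total_tokens": 70}
}
""")


def summarize_choice(response: dict) -> tuple[str, bool]:
    """Return the first choice's text and whether it hit the token limit."""
    choice = response["choices"][0]
    text = choice["message"]["content"]
    truncated = choice["finish_reason"] == "length"
    return text, truncated


text, truncated = summarize_choice(SAMPLE)
if truncated:
    print("Response was cut off; consider raising max_tokens.")
print(text)
```

If truncated replies are a problem for your workload, raising max_tokens in the request is the first thing to try.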

Securing the Inference Endpoint

To ensure that only authenticated requests are processed, we recommend setting up an API key to secure your inference endpoint. Follow these steps to configure the API key:

1. Generate a strong, unique API key to use for authentication.
2. Navigate to your Koyeb Service settings.
3. Add a new environment variable named VLLM_API_KEY and set its value to your secret API key.
4. Save the changes and redeploy to update the service.
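
For the first step, one option is Python's standard secrets module, which draws from the operating system's cryptographically secure random source. This is a sketch of one reasonable approach, not a Koyeb or vLLM requirement. The 32-byte length is an assumption chosen for illustration.

```python
import secrets

# 32 random bytes, URL-safe Base64 encoded (43 characters).
# The length is an illustrative choice, not a vLLM requirement.
api_key = secrets.token_urlsafe(32)
print(api_key)
```

Use the printed value for VLLM_API_KEY and keep it out of source control.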

Once the service is updated, all requests to the inference endpoint will require the API key.

When making requests, ensure the API key is included in the headers. If you are using the OpenAI SDK, you can provide the API key through the api_key parameter when instantiating the OpenAI client. Alternatively, you can set the API key using the OPENAI_API_KEY environment variable. For example:

$ OPENAI_API_KEY=<YOUR_API_KEY> python main.py
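
The OpenAI SDK sends the key as a bearer token in the Authorization header, so other HTTP clients can do the same. The sketch below uses only the Python standard library and assumes the /v1/chat/completions route that the SDK's chat.completions.create call targets. The helper build_chat_request is hypothetical. It builds the request without sending it, so you can inspect what goes over the wire.

```python
import json
import os
import urllib.request

BASE_URL = "https://<YOUR_DOMAIN_PREFIX>.koyeb.app/v1"  # replace with your URL


def build_chat_request(base_url: str, api_key: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated chat completion request."""
    body = {
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 30,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # same key as VLLM_API_KEY
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_chat_request(
    BASE_URL, os.environ.get("OPENAI_API_KEY", "fake"), "Tell me a joke."
)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen sends it. A request without the correct key should be rejected once VLLM_API_KEY is set on the service.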

Benchmarks

The following table compares Llama 3.1 8B Instruct performance on available GPUs:

GPU	Token Shape	Batch Size	Wall-clock Time (s)	Input Tokens	Output Tokens	Throughput (tokens/s)	Avg Latency (s)
H200	512x512	1	3.04	511	512	168.65	3.04
H100	512x512	1	5.52	1487	512	92.72	5.52
A100SXM	512x512	1	6.50	1484	512	78.77	6.50
A100	512x512	1	6.37	1501	512	80.31	6.38
RTX PRO 6000	512x512	1	6.07	511	511	84.32	6.07
L40S	512x512	1	11.69	1490	512	43.79	11.69
H200	512x512	8	3.15	4088	3814	1210.37	0.39
H100	512x512	8	5.95	11888	4096	688.26	0.74
A100SXM	512x512	8	6.79	11896	4096	603.29	0.85
A100	512x512	8	6.90	11816	4096	593.25	0.86
RTX PRO 6000	512x512	8	6.93	4088	3495	504.47	0.87
L40S	512x512	8	12.59	12008	4096	325.14	1.57
H200	512x512	32	3.61	16349	14277	3952.77	0.11
H100	512x512	32	6.66	47264	16384	2461.67	0.21
A100SXM	512x512	32	8.48	48416	16384	1932.73	0.26
A100	512x512	32	8.71	47232	16384	1878.95	0.27
RTX PRO 6000	512x512	32	8.64	16349	14402	1666.80	0.27
L40S	512x512	32	14.57	47264	16384	1124.41	0.46
H200	1024x1024	1	6.14	1023	1024	166.79	6.14
H100	1024x1024	1	10.98	2926	1024	93.29	10.98
A100SXM	1024x1024	1	5.84	2948	467	79.99	5.84
A100	1024x1024	1	5.93	2945	467	78.65	5.94
RTX PRO 6000	1024x1024	1	12.24	1023	1024	83.67	12.24
L40S	1024x1024	1	10.75	2929	466	43.33	10.75
H200	1024x1024	8	6.48	8184	7405	1142.07	0.81
H100	1024x1024	8	6.01	23448	3815	634.55	0.75
A100SXM	1024x1024	8	8.53	23504	3934	461.46	1.07
A100	1024x1024	8	14.34	23416	8192	571.40	1.79
RTX PRO 6000	1024x1024	8	14.20	8184	7409	521.64	1.78
L40S	1024x1024	8	11.80	23584	3736	316.36	1.48
H200	1024x1024	32	8.52	32736	26016	3053.78	0.26
H100	1024x1024	32	6.80	92608	15332	2254.03	0.21
A100SXM	1024x1024	32	17.22	94176	23798	1382.08	0.54
A100	1024x1024	32	9.26	93760	15602	1685.24	0.29
RTX PRO 6000	1024x1024	32	18.48	32736	26715	1445.81	0.58
L40S	1024x1024	32	27.16	93440	16662	613.30	0.85
H200	4096x1024	1	6.23	4095	1024	164.32	6.23
H100	4096x1024	1	0.42	11575	20	47.49	0.42
A100SXM	4096x1024	1	5.97	11525	449	75.26	5.97
A100	4096x1024	1	13.51	11590	1024	75.78	13.51
RTX PRO 6000	4096x1024	1	12.65	4095	1024	80.98	12.65
L40S	4096x1024	1	16.90	11615	681	40.31	16.90
H200	4096x1024	8	7.85	32760	8192	1043.82	0.98
H100	4096x1024	8	8.76	92464	3150	359.64	1.09
A100SXM	4096x1024	8	13.67	92608	5923	433.16	1.71
A100	4096x1024	8	15.70	92832	5879	374.55	1.96
RTX PRO 6000	4096x1024	8	17.28	32760	8192	474.11	2.16
L40S	4096x1024	8	17.45	92760	3385	193.95	2.18
H200	4096x1024	32	12.61	131040	29071	2305.10	0.39
H100	4096x1024	32	7.37	370944	14648	1987.95	0.23
A100SXM	4096x1024	32	1.34	370592	640	477.82	0.04
A100	4096x1024	32	25.06	369184	21075	840.88	0.78
RTX PRO 6000	4096x1024	32	29.30	131040	29001	989.85	0.92
L40S	4096x1024	32	31.26	371904	15748	503.76	0.98
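
The derived columns can be cross-checked against the raw ones. From the figures, throughput appears to equal output tokens divided by wall-clock time, and average latency appears to equal wall-clock time divided by batch size. This relationship is inferred from the numbers and is not stated by the benchmark itself. The sketch below checks it against the three H200 rows from the 512x512 group.

```python
# (batch size, wall-clock s, output tokens, reported throughput, reported latency)
# Values copied from the H200 512x512 rows of the table above.
ROWS = [
    (1, 3.04, 512, 168.65, 3.04),
    (8, 3.15, 3814, 1210.37, 0.39),
    (32, 3.61, 14277, 3952.77, 0.11),
]

for batch, wall_s, out_tokens, reported_tps, reported_latency in ROWS:
    derived_tps = out_tokens / wall_s   # tokens generated per second
    derived_latency = wall_s / batch    # seconds per request in the batch
    # Wall-clock time is rounded to two decimals, so allow a small tolerance.
    assert abs(derived_tps - reported_tps) / reported_tps < 0.01
    assert abs(derived_latency - reported_latency) < 0.01
    print(f"batch={batch}: {derived_tps:.1f} tokens/s, {derived_latency:.2f} s")
```

Read this way, larger batches raise total throughput, while the latency column reports time amortized per request, not the time a single request waits.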

For more information on available GPUs, view the documentation for NVIDIA A100, H100, and L40S.
