Gemma 2 9B
Deploy Gemma 2 9B on a Koyeb high-performance GPU.
Deploy the Gemma 2 9B large language model on high-performance infrastructure. With one click, get a dedicated endpoint on a GPU that’s ready to handle inference requests instantly, with zero configuration.
Automatically scale your infrastructure based on real-time traffic: scale up during peaks and scale down during idle periods.
Overview of Gemma 2 9B
Part of the Gemma family, Gemma 2 9B is an advanced open-source language model purpose-built by Google for natural language processing tasks such as content creation, conversational interfaces, and summarization. With 9 billion parameters, it provides high-quality, contextual text generation for diverse applications.
On Koyeb, Gemma 2 9B is served with the vLLM inference engine, which ensures high-throughput, low-latency model serving. This deployment lets users accelerate their AI workflows on high-performance infrastructure with zero configuration.
The default GPU for running this model is the Nvidia A100 instance type. You are free to adjust the GPU instance type to fit your workload requirements.
Quickstart
The Gemma 2 9B one-click model uses the vLLM inference engine, optimized to handle large language models. With seamless integration and OpenAI API compatibility, vLLM offers efficient and powerful model serving.
After you deploy the model, copy the Koyeb App public URL (similar to https://<YOUR_DOMAIN_PREFIX>.koyeb.app) and create a simple Python file with the following content to start interacting with the model.
import os

from openai import OpenAI

# Point the OpenAI client at the vLLM server running on Koyeb.
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY", "fake"),
    base_url="https://<YOUR_DOMAIN_PREFIX>.koyeb.app/v1",
)

# Send a chat completion request to the Gemma 2 9B model.
chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Tell me a joke.",
        }
    ],
    model="google/gemma-2-9b-it",
    max_tokens=30,
)

print(chat_completion.to_json(indent=4))
The snippet above uses the OpenAI SDK to interact with the Gemma 2 9B model thanks to vLLM's OpenAI API compatibility. Take care to replace the base_url value in the snippet with your Koyeb App public URL.
Executing the Python script will return the model's response to the input message.
python main.py
{
    "id": "chatcmpl-a94edf120cb74cc995d93ec82afc4b53",
    "choices": [
        {
            "finish_reason": "length",
            "index": 0,
            "logprobs": null,
            "message": {
                "content": "A man walks into a library and asks the librarian, \"Do you have any books on Pavlov's dogs and Schrödinger's cat",
                "role": "assistant",
                "tool_calls": []
            },
            "stop_reason": null
        }
    ],
    "created": 1732135919,
    "model": "google/gemma-2-9b-it",
    "object": "chat.completion",
    "usage": {
        "completion_tokens": 30,
        "prompt_tokens": 40,
        "total_tokens": 70,
        "prompt_tokens_details": null
    },
    "prompt_logprobs": null
}
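Because vLLM exposes an OpenAI-compatible API, you can also stream tokens as they are generated by passing stream=True to the same SDK call. The snippet below is a minimal sketch under that assumption, reusing the endpoint and model name from the example above.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY", "fake"),
    base_url="https://<YOUR_DOMAIN_PREFIX>.koyeb.app/v1",
)

# Stream the response token by token instead of waiting for the full completion.
stream = client.chat.completions.create(
    messages=[{"role": "user", "content": "Tell me a joke."}],
    model="google/gemma-2-9b-it",
    max_tokens=30,
    stream=True,
)

for chunk in stream:
    # Each chunk carries a delta containing newly generated text, if any.
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()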
Securing the Inference Endpoint
To ensure that only authenticated requests are processed, we recommend setting up an API key to secure your inference endpoint. Follow these steps to configure the API key:
- Generate a strong, unique API key to use for authentication
- Navigate to your Koyeb Service settings
- Add a new environment variable named VLLM_API_KEY and set its value to your secret API key
- Save the changes and redeploy to update the service
Once the service is updated, all requests to the inference endpoint will require the API key.
When making requests, ensure the API key is included in the headers. If you are using the OpenAI SDK, you can provide the API key through the api_key
parameter when instantiating the OpenAI client. Alternatively, you can set the API key using the OPENAI_API_KEY
environment variable. For example:
OPENAI_API_KEY=<YOUR_API_KEY> python main.py
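If you prefer to pass the key explicitly rather than through the environment variable, you can supply it via the api_key parameter when creating the client. This is a minimal sketch; <YOUR_API_KEY> and the domain prefix are placeholders for your own values.
from openai import OpenAI

# Authenticate against the secured endpoint with the same key set in VLLM_API_KEY.
client = OpenAI(
    api_key="<YOUR_API_KEY>",
    base_url="https://<YOUR_DOMAIN_PREFIX>.koyeb.app/v1",
)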