gpt-oss-20b
Deploy OpenAI gpt-oss-20b with SGLang on Koyeb GPU for high-performance, low-latency, and efficient inference.
Deploy gpt-oss-20b on Koyeb's high-performance cloud infrastructure.
With one click, get a dedicated GPU-powered inference endpoint ready to handle requests with built-in autoscaling and Scale-to-Zero.
Get up to $200 in credit to get started!
Overview of gpt-oss-20b
The gpt-oss-20b model is a state-of-the-art open-weight language model that delivers strong real-world performance at low cost. Available under the flexible Apache 2.0 license, this model outperforms similarly sized open models on reasoning tasks, demonstrates strong tool use capabilities, and is optimized for efficient deployment on consumer hardware. It was trained using a mix of reinforcement learning and techniques informed by OpenAI’s most advanced internal models, including o3 and other frontier systems.
The gpt-oss-20b model is served with the SGLang inference engine, optimized for high-throughput, low-latency model serving.
The default Instance type for running this model is an NVIDIA H100 GPU. You are free to adjust the GPU Instance type to fit your workload requirements.
Quickstart
The gpt-oss-20b one-click model is served using the SGLang engine. SGLang is a fast serving framework for large language models and vision language models. It makes your interaction with models faster and more controllable by co-designing the backend runtime and frontend language.
After you deploy gpt-oss-20b, copy the Koyeb App public URL (similar to https://<YOUR_DOMAIN_PREFIX>.koyeb.app) and create a simple Python file, for example main.py, with the following content to start interacting with the model.
import os

from openai import OpenAI
from transformers import AutoTokenizer

# Point the OpenAI client at your Koyeb App public URL
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY", "fake"),
    base_url="https://<YOUR_DOMAIN_PREFIX>.koyeb.app/v1",
)

# Load the gpt-oss-20b tokenizer so the prompt follows the model's chat template
tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b", trust_remote_code=True)

messages = [
    {"role": "system", "content": "You are ChatGPT."},
    {"role": "user", "content": "Tell me a joke."},
]

# Render the chat messages into a single prompt string
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Send the prompt to the completions endpoint exposed by the Service
response = client.completions.create(
    prompt=prompt,
    model="/models/openai/gpt-oss-20b",
    max_tokens=30,
)

print(response.choices[0].text)
The snippet above uses the OpenAI SDK to interact with the gpt-oss-20b model thanks to SGLang's OpenAI API compatibility. Take care to replace the base_url value in the snippet with your Koyeb App public URL.
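Since SGLang also exposes the standard OpenAI chat completions route, you can optionally skip the manual chat template and let the server format the conversation for you. Below is a minimal sketch, assuming the same client and model path as in the snippet above:

# Alternative: call the chat completions endpoint and let the server apply the chat template
chat_response = client.chat.completions.create(
    model="/models/openai/gpt-oss-20b",
    messages=[
        {"role": "system", "content": "You are ChatGPT."},
        {"role": "user", "content": "Tell me a joke."},
    ],
    max_tokens=100,
)
print(chat_response.choices[0].message.content)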
Executing the Python script will return the model's response to the input message.
python main.py
"analysisUser wants a joke. Simple.assistantfinalWhy don’t skeletons ever go out on Halloween?
Because they’re afraid of “spook‑tacular” surprises — they don’t have the guts for it! 😄
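Note that the raw completion includes the model's reasoning ("analysis") channel before the final answer, separated by the assistantfinal marker, as shown above. If you only want the final answer, a small post-processing step can strip the reasoning channel. The following sketch continues the script above and assumes the marker name matches the output you received:

import re

# Keep only the text after the "assistantfinal" channel marker, if present
match = re.search(r"assistantfinal(.*)", response.choices[0].text, re.DOTALL)
final_answer = match.group(1).strip() if match else response.choices[0].text
print(final_answer)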
Securing the Inference Endpoint
To ensure that only authenticated requests are processed, we recommend setting up an API key to secure your inference endpoint. Follow these steps to configure the API key:
- Generate a strong, unique API key to use for authentication
- Navigate to your Koyeb Service settings
- Add a new environment variable named VLLM_API_KEY and set its value to your secret API key
- Save the changes and redeploy to update the service
Once the service is updated, all requests to the inference endpoint will require the API key.
When making requests, ensure the API key is included in the request headers. If you are using the OpenAI SDK, you can provide the API key through the api_key parameter when instantiating the OpenAI client. Alternatively, you can set the API key using the OPENAI_API_KEY environment variable. For example:
OPENAI_API_KEY=<YOUR_API_KEY> python main.py
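If you call the endpoint without the SDK, send the key as a standard Bearer token in the Authorization header. For example, here is a minimal sketch using the requests library, assuming the default OpenAI-compatible routes such as /v1/models:

import requests

# The API key is expected as a Bearer token in the Authorization header
resp = requests.get(
    "https://<YOUR_DOMAIN_PREFIX>.koyeb.app/v1/models",
    headers={"Authorization": "Bearer <YOUR_API_KEY>"},
)
print(resp.json())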