
GPU LLM Benchmarks

Koyeb provides benchmarks for large language models running on Koyeb across multiple GPU types (NVIDIA A100, H100, and L40S) to show how performance scales with different workloads. The benchmarks capture latency, throughput, and token usage under controlled input/output sizes, enabling clear comparisons between models and hardware.

Benchmarking purpose and methodology

By systematically measuring model behavior across a range of input and output sizes, we can identify scaling characteristics, throughput limits, and potential performance bottlenecks to help you make informed decisions about which models and GPU types are best suited for specific applications.

What we measure

Wall-clock time (s) – the real elapsed time of the benchmark run, useful for spotting scaling issues.

Token counts (Input tokens/Output tokens) – the number of input tokens, output tokens, and total tokens actually processed by the model.

Throughput (t/s) – the number of tokens processed per second, often broken down into input tokens, output tokens, and combined totals.

Avg Latency (s) – the time required to complete a single request from submission to completion.
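These metrics reduce to a small aggregation over per-request records. The sketch below is illustrative, with hypothetical field names rather than Koyeb's actual harness; note that in the tables on this page, the reported throughput corresponds to output tokens per second of wall-clock time.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    input_tokens: int    # prompt tokens actually processed by the model
    output_tokens: int   # tokens generated in the response
    latency_s: float     # submission-to-completion time for this request


def summarize(records: list[RequestRecord], wall_clock_s: float) -> dict:
    """Aggregate per-request records into the metrics listed above."""
    total_out = sum(r.output_tokens for r in records)
    return {
        "wall_clock_s": wall_clock_s,
        "input_tokens": sum(r.input_tokens for r in records),
        "output_tokens": total_out,
        # Output tokens per second of wall-clock time, matching the
        # throughput column in the benchmark tables on this page.
        "throughput_tps": total_out / wall_clock_s,
        "avg_latency_s": sum(r.latency_s for r in records) / len(records),
    }
```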

How we measure

  1. We define token shapes (e.g. 1024x1024) to represent target input and output lengths. These shapes specify the intended workload.

  2. The benchmarking harness generates synthetic prompts of the requested shape and submits them to the deployed model endpoints.

  3. Each request is wrapped in a standardized system prompt and instruction template. This means the actual input length (i.e. Input Tokens) will typically exceed the requested token shape due to added benchmarking overhead.

  4. For every request, the benchmark records token counts, throughput, and latency.
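The steps above can be sketched as a minimal harness. Everything here is illustrative: the filler-word prompt builder assumes roughly one token per word, and `send_request` is a placeholder for whatever client actually calls the deployed model endpoint.

```python
import time

SYSTEM_PROMPT = "You are a benchmark target."  # stands in for the real template


def make_synthetic_prompt(target_input_tokens: int) -> str:
    """Build a filler prompt of roughly the requested length.

    Assumes ~1 token per repeated word; a real harness would use the
    model's tokenizer to hit the target length exactly.
    """
    return " ".join(["benchmark"] * target_input_tokens)


def run_shape(send_request, shape: str, batch_size: int) -> list[dict]:
    """Run one input x output shape (e.g. "1024x1024") at a given batch size.

    `send_request(system, prompt, max_tokens)` is a stand-in for the real
    endpoint client and must return (input_tokens, output_tokens) as counted
    by the model. Requests run sequentially here for simplicity; a real
    harness would issue the batch concurrently.
    """
    target_in, target_out = (int(n) for n in shape.split("x"))
    prompt = make_synthetic_prompt(target_in)
    records = []
    for _ in range(batch_size):
        start = time.perf_counter()
        input_tokens, output_tokens = send_request(SYSTEM_PROMPT, prompt, target_out)
        records.append({
            "input_tokens": input_tokens,   # exceeds target_in: template overhead
            "output_tokens": output_tokens,
            "latency_s": time.perf_counter() - start,
        })
    return records
```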

Why this approach

By controlling the workload and measuring real responses, you can:

  • Compare performance across different models and GPUs on equal terms.
  • Observe how latency and throughput scale with increasing input and output sizes.
  • Detect mismatches between expected and actual token usage.
  • Provide reproducible data for performance tuning and capacity planning.
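Detecting mismatches between expected and actual token usage comes down to comparing the requested shape against the counts the model reports. A small helper, illustrative rather than part of the published harness, might look like:

```python
def token_overhead(requested_input: int, actual_input: int) -> float:
    """Fraction by which actual input tokens exceed the requested shape.

    The system prompt and instruction template inflate every request, so a
    positive overhead is expected; a negative value would signal a mismatch
    worth investigating.
    """
    return (actual_input - requested_input) / requested_input


# Example with a row from the tables below: a requested 1024-token input
# on an H100 actually consumed 2926 input tokens.
print(round(token_overhead(1024, 2926), 2))  # 1.86, i.e. ~186% overhead
```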

LLM inference benchmarks

The following tables show the performance of each LLM running on specific hardware:

Llama 3.1 8B Instruct

The following table compares Llama 3.1 8B Instruct performance on different GPUs. Deploy Llama 3.1 8B Instruct on Koyeb.

| GPU | Token Shape | Batch Size | Wall-clock Time (s) | Input Tokens | Output Tokens | Throughput (t/s) | Avg Latency (s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| H100 | 1024x1024 | 1 | 10.98 | 2926 | 1024 | 93.29 | 10.98 |
| A100 SXM | 1024x1024 | 1 | 5.84 | 2948 | 467 | 79.99 | 5.84 |
| A100 | 1024x1024 | 1 | 5.93 | 2945 | 467 | 78.65 | 5.94 |
| L40S | 1024x1024 | 1 | 10.75 | 2929 | 466 | 43.33 | 10.75 |
| H100 | 1024x1024 | 8 | 6.01 | 23448 | 3815 | 634.55 | 0.75 |
| A100 SXM | 1024x1024 | 8 | 8.53 | 23504 | 3934 | 461.46 | 1.07 |
| A100 | 1024x1024 | 8 | 14.34 | 23416 | 8192 | 571.40 | 1.79 |
| L40S | 1024x1024 | 8 | 11.80 | 23584 | 3736 | 316.36 | 1.48 |
| H100 | 1024x1024 | 32 | 6.80 | 92608 | 15332 | 2254.03 | 0.21 |
| A100 SXM | 1024x1024 | 32 | 17.22 | 94176 | 23798 | 1382.08 | 0.54 |
| A100 | 1024x1024 | 32 | 9.26 | 93760 | 15602 | 1685.24 | 0.29 |
| L40S | 1024x1024 | 32 | 27.16 | 93440 | 16662 | 613.30 | 0.85 |
| H100 | 4096x1024 | 1 | 0.42 | 11575 | 20 | 47.49 | 0.42 |
| A100 SXM | 4096x1024 | 1 | 5.97 | 11525 | 449 | 75.26 | 5.97 |
| A100 | 4096x1024 | 1 | 4.94 | 11601 | 362 | 73.32 | 4.94 |
| L40S | 4096x1024 | 1 | 0.77 | 11586 | 20 | 25.82 | 0.77 |
| H100 | 4096x1024 | 8 | 8.76 | 92464 | 3150 | 359.64 | 1.09 |
| A100 SXM | 4096x1024 | 8 | 13.67 | 92608 | 5923 | 433.16 | 1.71 |
| A100 | 4096x1024 | 8 | 13.02 | 93056 | 5340 | 409.88 | 1.63 |
| L40S | 4096x1024 | 8 | 15.94 | 92720 | 3061 | 191.98 | 1.99 |
| H100 | 4096x1024 | 32 | 7.37 | 370944 | 14648 | 1987.95 | 0.23 |
| A100 SXM | 4096x1024 | 32 | 1.34 | 370592 | 640 | 477.82 | 0.04 |
| A100 | 4096x1024 | 32 | | 371296 | 18097 | 869.91 | 0.65 |
| L40S | 4096x1024 | 32 | | 370848 | 26163 | 725.48 | 1.13 |
| H100 | 512x512 | 1 | 5.52 | 1487 | 512 | 92.72 | 5.52 |
| A100 SXM | 512x512 | 1 | 6.50 | 1484 | 512 | 78.77 | 6.50 |
| A100 | 512x512 | 1 | 6.37 | 1501 | 512 | 80.31 | 6.38 |
| L40S | 512x512 | 1 | 11.69 | 1490 | 512 | 43.79 | 11.69 |
| H100 | 512x512 | 8 | 5.95 | 11888 | 4096 | 688.26 | 0.74 |
| A100 SXM | 512x512 | 8 | 6.79 | 11896 | 4096 | 603.29 | 0.85 |
| A100 | 512x512 | 8 | 6.90 | 11816 | 4096 | 593.25 | 0.86 |
| L40S | 512x512 | 8 | 12.59 | 12008 | 4096 | 325.14 | 1.57 |
| H100 | 512x512 | 32 | 6.66 | 47264 | 16384 | 2461.67 | 0.21 |
| A100 SXM | 512x512 | 32 | 8.48 | 48416 | 16384 | 1932.73 | 0.26 |
| A100 | 512x512 | 32 | 8.71 | 47232 | 16384 | 1878.95 | 0.27 |
| L40S | 512x512 | 32 | 14.57 | 47264 | 16384 | 1124.41 | 0.46 |
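One practical way to read these numbers is to check how close throughput gets to linear batch scaling. For example, the H100 1024x1024 rows above go from 93.29 t/s at batch 1 to 2254.03 t/s at batch 32; a quick helper makes the comparison explicit:

```python
def batch_scaling_efficiency(tps_lo: float, tps_hi: float,
                             batch_lo: int, batch_hi: int) -> float:
    """Realized throughput gain as a fraction of perfect linear scaling."""
    return (tps_hi / tps_lo) / (batch_hi / batch_lo)


# H100, Llama 3.1 8B Instruct, 1024x1024 shape: batch 1 -> 32 delivers a
# ~24x throughput gain, about 76% of perfect linear scaling.
print(round(batch_scaling_efficiency(93.29, 2254.03, 1, 32), 2))  # 0.76
```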

DeepSeek-R1 Distill Llama 8B

The following table compares DeepSeek-R1 Distill Llama 8B performance on different GPUs. Deploy DeepSeek-R1 Distill Llama 8B on Koyeb.

| GPU | Token Shape | Batch Size | Wall-clock Time (s) | Input Tokens | Output Tokens | Throughput (t/s) | Avg Latency (s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| H100 | 1024x1024 | 1 | 10.96 | 2906 | 1024 | 93.42 | 10.96 |
| A100 SXM | 1024x1024 | 1 | 12.75 | 2895 | 1024 | 80.30 | 12.75 |
| A100 | 1024x1024 | 1 | 12.78 | 2898 | 1024 | 80.11 | 12.78 |
| L40S | 1024x1024 | 1 | 23.53 | 2891 | 1024 | 43.52 | 23.53 |
| H100 | 1024x1024 | 8 | 12.43 | 23408 | 8192 | 658.91 | 1.55 |
| A100 SXM | 1024x1024 | 8 | 14.31 | 23144 | 8192 | 572.29 | 1.79 |
| A100 | 1024x1024 | 8 | 14.25 | 23048 | 8088 | 567.76 | 1.78 |
| L40S | 1024x1024 | 8 | 25.77 | 23176 | 8070 | 313.13 | 3.22 |
| H100 | 1024x1024 | 32 | 13.93 | 92768 | 32768 | 2353.16 | 0.44 |
| A100 SXM | 1024x1024 | 32 | 18.39 | 92544 | 30788 | 1673.81 | 0.57 |
| A100 | 1024x1024 | 32 | 19.10 | 94208 | 32594 | 1706.64 | 0.60 |
| L40S | 1024x1024 | 32 | 31.29 | 93408 | 32768 | 1047.12 | 0.98 |
| H100 | 4096x1024 | 1 | 11.65 | 11567 | 1024 | 87.89 | 11.65 |
| A100 SXM | 4096x1024 | 1 | 12.75 | 11555 | 961 | 75.38 | 12.75 |
| A100 | 4096x1024 | 1 | 13.55 | 11536 | 1024 | 75.57 | 13.55 |
| L40S | 4096x1024 | 1 | 24.06 | 11554 | 977 | 40.60 | 24.06 |
| H100 | 4096x1024 | 8 | 15.29 | 91792 | 8087 | 528.94 | 1.91 |
| A100 SXM | 4096x1024 | 8 | 17.89 | 92152 | 8160 | 456.14 | 2.24 |
| A100 | 4096x1024 | 8 | 17.14 | 92824 | 7412 | 432.51 | 2.14 |
| L40S | 4096x1024 | 8 | 29.24 | 92728 | 8192 | 280.16 | 3.65 |
| H100 | 4096x1024 | 32 | 15.25 | 368160 | 32194 | 2110.54 | 0.48 |
| A100 SXM | 4096x1024 | 32 | 26.09 | 369568 | 32169 | 1232.90 | 0.82 |
| A100 | 4096x1024 | 32 | 25.73 | 369056 | 31516 | 1225.06 | 0.80 |
| L40S | 4096x1024 | 32 | 34.64 | 370912 | 25953 | 749.24 | 1.08 |
| H100 | 512x512 | 1 | 5.48 | 1455 | 512 | 93.50 | 5.48 |
| A100 SXM | 512x512 | 1 | 6.33 | 1448 | 512 | 80.91 | 6.33 |
| A100 | 512x512 | 1 | 6.50 | 1444 | 512 | 78.74 | 6.50 |
| L40S | 512x512 | 1 | 11.68 | 1469 | 512 | 43.83 | 11.68 |
| H100 | 512x512 | 8 | 5.99 | 11568 | 4096 | 684.31 | 0.75 |
| A100 SXM | 512x512 | 8 | 6.88 | 11656 | 4096 | 595.22 | 0.86 |
| A100 | 512x512 | 8 | 6.86 | 11608 | 4096 | 597.33 | 0.86 |
| L40S | 512x512 | 8 | 12.63 | 11472 | 4096 | 324.29 | 1.58 |
| H100 | 512x512 | 32 | 6.71 | 46176 | 16384 | 2442.23 | 0.21 |
| A100 SXM | 512x512 | 32 | 8.62 | 46464 | 16384 | 1900.13 | 0.27 |
| A100 | 512x512 | 32 | 8.76 | 46368 | 16384 | 1870.70 | 0.27 |
| L40S | 512x512 | 32 | 14.67 | 46400 | 16384 | 1116.65 | 0.46 |