
GPU LLM Benchmarks

Koyeb provides benchmarks for large language models running on Koyeb across multiple GPU types (NVIDIA A100, H100, and L40S) to show how performance scales with different workloads. The benchmarks capture latency, throughput, and token usage under controlled input/output sizes, enabling clear comparisons between models and hardware.

Benchmarking purpose and methodology

By systematically measuring model behavior across a range of input and output sizes, we can identify scaling characteristics, throughput limits, and potential performance bottlenecks to help you make informed decisions about which models and GPU types are best suited for specific applications.

What we measure

Wall-clock time (s) – the real elapsed time of the benchmark run, useful for spotting scaling issues.

Token counts (Input tokens/Output tokens) – the number of input tokens, output tokens, and total tokens actually processed by the model.

Throughput (t/s) – the number of tokens processed per second, often broken down into input tokens, output tokens, and combined totals.

Avg Latency (s) – the time required to complete a single request from submission to completion.
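These metrics reduce to a small aggregation over per-request records. The sketch below is illustrative, with hypothetical field names rather than Koyeb's actual harness; note that in the tables on this page, the reported throughput corresponds to output tokens per second of wall-clock time.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    input_tokens: int    # prompt tokens actually processed by the model
    output_tokens: int   # tokens generated in the response
    latency_s: float     # submission-to-completion time for this request


def summarize(records: list[RequestRecord], wall_clock_s: float) -> dict:
    """Aggregate per-request records into the metrics listed above."""
    total_out = sum(r.output_tokens for r in records)
    return {
        "wall_clock_s": wall_clock_s,
        "input_tokens": sum(r.input_tokens for r in records),
        "output_tokens": total_out,
        # Output tokens per second of wall-clock time, matching the
        # throughput column in the benchmark tables on this page.
        "throughput_tps": total_out / wall_clock_s,
        "avg_latency_s": sum(r.latency_s for r in records) / len(records),
    }
```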

How we measure

  1. We define token shapes (e.g. 1024x1024) to represent target input and output lengths. These shapes specify the intended workload.

  2. The benchmarking harness generates synthetic prompts of the requested shape and submits them to the deployed model endpoints.

  3. Each request is wrapped in a standardized system prompt and instruction template. This means the actual input length (i.e. Input Tokens) will typically exceed the requested token shape due to added benchmarking overhead.

  4. For every request, the benchmark records token counts, throughput, and latency.
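The steps above can be sketched as a minimal harness. Everything here is illustrative: the filler-word prompt builder assumes roughly one token per word, and `send_request` is a placeholder for whatever client actually calls the deployed model endpoint.

```python
import time

SYSTEM_PROMPT = "You are a benchmark target."  # stands in for the real template


def make_synthetic_prompt(target_input_tokens: int) -> str:
    """Build a filler prompt of roughly the requested length.

    Assumes ~1 token per repeated word; a real harness would use the
    model's tokenizer to hit the target length exactly.
    """
    return " ".join(["benchmark"] * target_input_tokens)


def run_shape(send_request, shape: str, batch_size: int) -> list[dict]:
    """Run one input x output shape (e.g. "1024x1024") at a given batch size.

    `send_request(system, prompt, max_tokens)` is a stand-in for the real
    endpoint client and must return (input_tokens, output_tokens) as counted
    by the model. Requests run sequentially here for simplicity; a real
    harness would issue the batch concurrently.
    """
    target_in, target_out = (int(n) for n in shape.split("x"))
    prompt = make_synthetic_prompt(target_in)
    records = []
    for _ in range(batch_size):
        start = time.perf_counter()
        input_tokens, output_tokens = send_request(SYSTEM_PROMPT, prompt, target_out)
        records.append({
            "input_tokens": input_tokens,   # exceeds target_in: template overhead
            "output_tokens": output_tokens,
            "latency_s": time.perf_counter() - start,
        })
    return records
```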

Why this approach

By controlling the workload and measuring real responses, you can:

  • Compare performance across different models and GPUs on equal terms.
  • Observe how latency and throughput scale with increasing input and output sizes.
  • Detect mismatches between expected and actual token usage.
  • Provide reproducible data for performance tuning and capacity planning.
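Detecting mismatches between expected and actual token usage comes down to comparing the requested shape against the counts the model reports. A small helper, illustrative rather than part of the published harness, might look like:

```python
def token_overhead(requested_input: int, actual_input: int) -> float:
    """Fraction by which actual input tokens exceed the requested shape.

    The system prompt and instruction template inflate every request, so a
    positive overhead is expected; a negative value would signal a mismatch
    worth investigating.
    """
    return (actual_input - requested_input) / requested_input


# Example with a row from the tables below: a requested 1024-token input
# on an H100 actually consumed 2926 input tokens.
print(round(token_overhead(1024, 2926), 2))  # 1.86, i.e. ~186% overhead
```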

LLM inference benchmarks

The following tables show the performance of each LLM running on specific hardware:

Llama 3.1 8B Instruct

The following table compares Llama 3.1 8B Instruct performance on different GPUs. Deploy Llama 3.1 8B Instruct on Koyeb.

| GPU | Token Shape | Batch Size | Wall-clock Time (s) | Input Tokens | Output Tokens | Throughput (t/s) | Avg Latency (s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| H100 | 1024x1024 | 1 | 10.98 | 2926 | 1024 | 93.29 | 10.98 |
| A100 SXM | 1024x1024 | 1 | 5.84 | 2948 | 467 | 79.99 | 5.84 |
| A100 | 1024x1024 | 1 | 5.93 | 2945 | 467 | 78.65 | 5.94 |
| L40S | 1024x1024 | 1 | 10.75 | 2929 | 466 | 43.33 | 10.75 |
| H100 | 1024x1024 | 8 | 6.01 | 23448 | 3815 | 634.55 | 0.75 |
| A100 SXM | 1024x1024 | 8 | 8.53 | 23504 | 3934 | 461.46 | 1.07 |
| A100 | 1024x1024 | 8 | 14.34 | 23416 | 8192 | 571.40 | 1.79 |
| L40S | 1024x1024 | 8 | 11.80 | 23584 | 3736 | 316.36 | 1.48 |
| H100 | 1024x1024 | 32 | 6.80 | 92608 | 15332 | 2254.03 | 0.21 |
| A100 SXM | 1024x1024 | 32 | 17.22 | 94176 | 23798 | 1382.08 | 0.54 |
| A100 | 1024x1024 | 32 | 9.26 | 93760 | 15602 | 1685.24 | 0.29 |
| L40S | 1024x1024 | 32 | 27.16 | 93440 | 16662 | 613.30 | 0.85 |
| H100 | 4096x1024 | 1 | 0.42 | 11575 | 20 | 47.49 | 0.42 |
| A100 SXM | 4096x1024 | 1 | 5.97 | 11525 | 449 | 75.26 | 5.97 |
| A100 | 4096x1024 | 1 | 4.94 | 11601 | 362 | 73.32 | 4.94 |
| L40S | 4096x1024 | 1 | 0.77 | 11586 | 20 | 25.82 | 0.77 |
| H100 | 4096x1024 | 8 | 8.76 | 92464 | 3150 | 359.64 | 1.09 |
| A100 SXM | 4096x1024 | 8 | 13.67 | 92608 | 5923 | 433.16 | 1.71 |
| A100 | 4096x1024 | 8 | 13.02 | 93056 | 5340 | 409.88 | 1.63 |
| L40S | 4096x1024 | 8 | 15.94 | 92720 | 3061 | 191.98 | 1.99 |
| H100 | 4096x1024 | 32 | 7.37 | 370944 | 14648 | 1987.95 | 0.23 |
| A100 SXM | 4096x1024 | 32 | 1.34 | 370592 | 640 | 477.82 | 0.04 |
| A100 | 4096x1024 | 32 | | 371296 | 18097 | 869.91 | 0.65 |
| L40S | 4096x1024 | 32 | | 370848 | 26163 | 725.48 | 1.13 |
| H100 | 512x512 | 1 | 5.52 | 1487 | 512 | 92.72 | 5.52 |
| A100 SXM | 512x512 | 1 | 6.50 | 1484 | 512 | 78.77 | 6.50 |
| A100 | 512x512 | 1 | 6.37 | 1501 | 512 | 80.31 | 6.38 |
| L40S | 512x512 | 1 | 11.69 | 1490 | 512 | 43.79 | 11.69 |
| H100 | 512x512 | 8 | 5.95 | 11888 | 4096 | 688.26 | 0.74 |
| A100 SXM | 512x512 | 8 | 6.79 | 11896 | 4096 | 603.29 | 0.85 |
| A100 | 512x512 | 8 | 6.90 | 11816 | 4096 | 593.25 | 0.86 |
| L40S | 512x512 | 8 | 12.59 | 12008 | 4096 | 325.14 | 1.57 |
| H100 | 512x512 | 32 | 6.66 | 47264 | 16384 | 2461.67 | 0.21 |
| A100 SXM | 512x512 | 32 | 8.48 | 48416 | 16384 | 1932.73 | 0.26 |
| A100 | 512x512 | 32 | 8.71 | 47232 | 16384 | 1878.95 | 0.27 |
| L40S | 512x512 | 32 | 14.57 | 47264 | 16384 | 1124.41 | 0.46 |
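One practical way to read these numbers is to check how close throughput gets to linear batch scaling. For example, the H100 1024x1024 rows above go from 93.29 t/s at batch 1 to 2254.03 t/s at batch 32; a quick helper makes the comparison explicit:

```python
def batch_scaling_efficiency(tps_lo: float, tps_hi: float,
                             batch_lo: int, batch_hi: int) -> float:
    """Realized throughput gain as a fraction of perfect linear scaling."""
    return (tps_hi / tps_lo) / (batch_hi / batch_lo)


# H100, Llama 3.1 8B Instruct, 1024x1024 shape: batch 1 -> 32 delivers a
# ~24x throughput gain, about 76% of perfect linear scaling.
print(round(batch_scaling_efficiency(93.29, 2254.03, 1, 32), 2))  # 0.76
```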

DeepSeek-R1 Distill Llama 8B

The following table compares DeepSeek-R1 Distill Llama 8B performance on different GPUs. Deploy DeepSeek-R1 Distill Llama 8B on Koyeb.

| GPU | Token Shape | Batch Size | Wall-clock Time (s) | Input Tokens | Output Tokens | Throughput (t/s) | Avg Latency (s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| H100 | 1024x1024 | 1 | 10.96 | 2906 | 1024 | 93.42 | 10.96 |
| A100 SXM | 1024x1024 | 1 | 12.75 | 2895 | 1024 | 80.30 | 12.75 |
| A100 | 1024x1024 | 1 | 12.78 | 2898 | 1024 | 80.11 | 12.78 |
| L40S | 1024x1024 | 1 | 23.53 | 2891 | 1024 | 43.52 | 23.53 |
| H100 | 1024x1024 | 8 | 12.43 | 23408 | 8192 | 658.91 | 1.55 |
| A100 SXM | 1024x1024 | 8 | 14.31 | 23144 | 8192 | 572.29 | 1.79 |
| A100 | 1024x1024 | 8 | 14.25 | 23048 | 8088 | 567.76 | 1.78 |
| L40S | 1024x1024 | 8 | 25.77 | 23176 | 8070 | 313.13 | 3.22 |
| H100 | 1024x1024 | 32 | 13.93 | 92768 | 32768 | 2353.16 | 0.44 |
| A100 SXM | 1024x1024 | 32 | 18.39 | 92544 | 30788 | 1673.81 | 0.57 |
| A100 | 1024x1024 | 32 | 19.10 | 94208 | 32594 | 1706.64 | 0.60 |
| L40S | 1024x1024 | 32 | 31.29 | 93408 | 32768 | 1047.12 | 0.98 |
| H100 | 4096x1024 | 1 | 11.65 | 11567 | 1024 | 87.89 | 11.65 |
| A100 SXM | 4096x1024 | 1 | 12.75 | 11555 | 961 | 75.38 | 12.75 |
| A100 | 4096x1024 | 1 | 13.55 | 11536 | 1024 | 75.57 | 13.55 |
| L40S | 4096x1024 | 1 | 24.06 | 11554 | 977 | 40.60 | 24.06 |
| H100 | 4096x1024 | 8 | 15.29 | 91792 | 8087 | 528.94 | 1.91 |
| A100 SXM | 4096x1024 | 8 | 17.89 | 92152 | 8160 | 456.14 | 2.24 |
| A100 | 4096x1024 | 8 | 17.14 | 92824 | 7412 | 432.51 | 2.14 |
| L40S | 4096x1024 | 8 | 29.24 | 92728 | 8192 | 280.16 | 3.65 |
| H100 | 4096x1024 | 32 | 15.25 | 368160 | 32194 | 2110.54 | 0.48 |
| A100 SXM | 4096x1024 | 32 | 26.09 | 369568 | 32169 | 1232.90 | 0.82 |
| A100 | 4096x1024 | 32 | 25.73 | 369056 | 31516 | 1225.06 | 0.80 |
| L40S | 4096x1024 | 32 | 34.64 | 370912 | 25953 | 749.24 | 1.08 |
| H100 | 512x512 | 1 | 5.48 | 1455 | 512 | 93.50 | 5.48 |
| A100 SXM | 512x512 | 1 | 6.33 | 1448 | 512 | 80.91 | 6.33 |
| A100 | 512x512 | 1 | 6.50 | 1444 | 512 | 78.74 | 6.50 |
| L40S | 512x512 | 1 | 11.68 | 1469 | 512 | 43.83 | 11.68 |
| H100 | 512x512 | 8 | 5.99 | 11568 | 4096 | 684.31 | 0.75 |
| A100 SXM | 512x512 | 8 | 6.88 | 11656 | 4096 | 595.22 | 0.86 |
| A100 | 512x512 | 8 | 6.86 | 11608 | 4096 | 597.33 | 0.86 |
| L40S | 512x512 | 8 | 12.63 | 11472 | 4096 | 324.29 | 1.58 |
| H100 | 512x512 | 32 | 6.71 | 46176 | 16384 | 2442.23 | 0.21 |
| A100 SXM | 512x512 | 32 | 8.62 | 46464 | 16384 | 1900.13 | 0.27 |
| A100 | 512x512 | 32 | 8.76 | 46368 | 16384 | 1870.70 | 0.27 |
| L40S | 512x512 | 32 | 14.67 | 46400 | 16384 | 1116.65 | 0.46 |