Michael Daniel Kuc

Software Engineer

llama.cpp hardware performance

🔗Comparison Motivation

For some background, I recently acquired dual P100 16GiB GPUs for local LLM inferencing. They hold up well, and the combined 32GiB of VRAM comfortably fits most of the models I want to run, making them a good fit for my use-case.

However, I always wondered how model inferencing would scale across some alternative hardware options. Using vast.ai's low-cost GPU hosts, I was able to test a couple of generations of hardware head-to-head on models that are more relevant for local development.

A more comprehensive analysis has already been performed in this llama.cpp issue, but it doesn't cover the models I was interested in.

I wanted performance numbers for models I would actually use, measured against recent llama.cpp commits, given the constant stream of performance improvements landing there.

🔗Results

For all results:

  • 2xP100 was run locally on bare-metal OS
  • All other GPU options were run through Vast's container runtime

I don't expect the container runtime to add any meaningful overhead, but it seemed worth mentioning.

I also didn't bother measuring prompt processing speed, since that's a little less relevant for my use-case.
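For anyone wanting to reproduce a TG-only measurement, a command along these lines with llama.cpp's llama-bench tool is the usual approach (this is a sketch, not the exact invocation I used, and the model path is a placeholder):

```shell
# Sketch of a token-generation-only benchmark with llama-bench:
#   -p 0    skips the prompt-processing (pp) test
#   -n 128  measures 128 generated tokens per run (reported as tg128)
#   -ngl 99 offloads all layers to the GPU(s)
# The .gguf path below is a placeholder for whichever model you're testing.
llama-bench -m ./model-Q4_K_M.gguf -p 0 -n 128 -ngl 99
```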

🔗mradermacher/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-i1-GGUF:Q4_K_M

A decent fine-tune of the excellent Qwen3.5-35B-A3B base model.

  • 2xP100 (CUDA 12.9)
    • TG: 40.07 t/s
  • 2xV100 (CUDA 12.9)
    • TG: 95.73 t/s
  • 1xA100 (CUDA 13.1)
    • TG: 125.51 t/s
  • 2xRTX 5070 Ti (CUDA 12.8)
    • TG: 152.60 t/s

🔗unsloth/Nemotron-3-Nano-30B-A3B-GGUF:UD-Q4_K_XL

Nvidia's MoE model, built for fast inferencing.

  • 2xP100 (CUDA 12.9)
    • TG: 66.54 t/s
  • 2xV100 (CUDA 12.9)
    • TG: 128.07 t/s
  • 1xA100 (CUDA 13.1)
    • TG: 177.19 t/s
  • 2xRTX 5070 Ti (CUDA 12.8)
    • TG: 194.78 t/s

🔗unsloth/gpt-oss-20b-GGUF:UD-Q4_K_XL

  • 2xP100 (CUDA 12.9)
    • TG: 65.47 t/s
  • 2xV100 (CUDA 12.9)
    • TG: 145.22 t/s
  • 1xA100 (CUDA 13.1)
    • TG: 183.54 t/s
  • 2xRTX 5070 Ti (CUDA 12.8)
    • TG: 219.96 t/s
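To put the numbers above in perspective, a quick Python sketch normalises each result against the 2xP100 baseline:

```python
# Token-generation (TG) results from the runs above, in tokens/second.
results = {
    "Qwen3.5-35B-A3B distill Q4_K_M": {
        "2xP100": 40.07, "2xV100": 95.73, "1xA100": 125.51, "2xRTX 5070 Ti": 152.60,
    },
    "Nemotron-3-Nano-30B-A3B UD-Q4_K_XL": {
        "2xP100": 66.54, "2xV100": 128.07, "1xA100": 177.19, "2xRTX 5070 Ti": 194.78,
    },
    "gpt-oss-20b UD-Q4_K_XL": {
        "2xP100": 65.47, "2xV100": 145.22, "1xA100": 183.54, "2xRTX 5070 Ti": 219.96,
    },
}

# Normalise each GPU's throughput against the 2xP100 baseline for that model.
for model, tg in results.items():
    baseline = tg["2xP100"]
    relative = {gpu: round(rate / baseline, 2) for gpu, rate in tg.items()}
    print(model, relative)
```

Depending on the model, the V100 pair lands at roughly 1.9-2.4x the P100s, and the 5070 Ti pair at roughly 2.9-3.8x.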

🔗Conclusion

For the £160 that I paid for them, the dual P100s punch above their weight. If I ever need faster inferencing, the V100s at their current pricing look like a reasonable upgrade.

Meanwhile, the A100 option in particular doesn't come close to being price-competitive.