llama.cpp hardware performance
🔗Comparison Motivation
For some background, I recently acquired dual P100 16GiB GPUs for local LLM inferencing. These GPUs hold up well, and the combined 32GiB of VRAM comfortably covers most of the models I run, making them a good fit for my use-case.
However, I always wondered how model inferencing would scale across some alternative hardware options. By using vast.ai's low-cost GPU hosts, I was able to test a couple of generations of hardware head-to-head on models that are more relevant for local development.
Note that a more comprehensive analysis has been performed on this llama.cpp issue. For my purposes, though, it doesn't cover the models I was actually interested in.
I wanted performance numbers for models I would genuinely use, measured against recent llama.cpp commits, given the constant stream of performance enhancements landing there.
🔗Results
For all results:
- 2xP100 was run locally on bare-metal OS
- All other GPU options were run through Vast's container runtime
I don't expect the container runtime to add any real performance overhead, but it's worth mentioning.
I also didn't bother measuring prompt processing speed, since that's a little less relevant for my use-case; the TG figures below are token-generation throughput in tokens per second.
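The post doesn't record the exact command used, but for anyone wanting to reproduce this kind of measurement, llama.cpp ships a `llama-bench` tool for exactly this. A minimal sketch (the model filename is illustrative, not one of the files tested above):

```shell
# Hypothetical llama-bench invocation for a token-generation-only benchmark:
#   -ngl 99  offload all model layers to the GPU(s)
#   -p 0     skip the prompt-processing test
#   -n 128   measure token generation (TG) over 128 tokens
./llama-bench -m ./model-q4_k_m.gguf -ngl 99 -p 0 -n 128
```

`llama-bench` prints a table including a `t/s` column, which is where figures like the ones below come from. For multi-GPU runs, llama.cpp splits layers across devices by default; flags such as `-sm` and `-ts` can adjust that split.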
🔗mradermacher/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-i1-GGUF:Q4_K_M
A decent fine-tune of Qwen's fantastic Qwen3.5-35B-A3B model.
- 2xP100 (Cuda 12.9)
- TG: 40.07 t/s
- 2xV100 (Cuda 12.9)
- TG: 95.73 t/s
- 1xA100 (Cuda 13.1)
- TG: 125.51 t/s
- 2xRTX 5070 Ti (Cuda 12.8)
- TG: 152.60 t/s
🔗unsloth/Nemotron-3-Nano-30B-A3B-GGUF:UD-Q4_K_XL
Nvidia's MoE model trained to be pretty fast at inferencing.
- 2xP100 (Cuda 12.9)
- TG: 66.54 t/s
- 2xV100 (Cuda 12.9)
- TG: 128.07 t/s
- 1xA100 (Cuda 13.1)
- TG: 177.19 t/s
- 2xRTX 5070 Ti (Cuda 12.8)
- TG: 194.78 t/s
🔗unsloth/gpt-oss-20b-GGUF:UD-Q4_K_XL
- 2xP100 (Cuda 12.9)
- TG: 65.47 t/s
- 2xV100 (Cuda 12.9)
- TG: 145.22 t/s
- 1xA100 (Cuda 13.1)
- TG: 183.54 t/s
- 2xRTX 5070 Ti (Cuda 12.8)
- TG: 219.96 t/s
🔗Conclusion
For the £160 that I paid for them, the dual P100s do punch above their weight. If I did need faster inferencing, the V100s at their current pricing look like a reasonable upgrade.
The A100, meanwhile, doesn't come anywhere close to being price-competitive.