r/LocalLLaMA Jan 08 '24

AMD Radeon 7900 XT/XTX Inference Performance Comparisons

I recently picked up a 7900 XTX card and was updating my AMD GPU guide (now w/ ROCm info). I also ran some benchmarks, and considering how Instinct cards aren't generally available, I figured that having Radeon 7900 numbers might be of interest for people. I compared the 7900 XT and 7900 XTX inferencing performance vs my RTX 3090 and RTX 4090.

I used TheBloke's Llama2-7B quants for benchmarking (Q4_0 GGUF and GS128 No Act Order GPTQ) with both llama.cpp and ExLlamaV2:

llama.cpp

|                  | 7900 XT | 7900 XTX | RTX 3090 | RTX 4090 |
|------------------|--------:|---------:|---------:|---------:|
| Memory (GB)      | 20      | 24       | 24       | 24       |
| Memory BW (GB/s) | 800     | 960      | 936.2    | 1008     |
| FP32 TFLOPS      | 51.48   | 61.42    | 35.58    | 82.58    |
| FP16 TFLOPS      | 103.0   | 122.8    | 71/142*  | 165.2/330.3* |
| Prompt tok/s     | 2065    | 2424     | 2764     | 4650     |
| Prompt %         | -14.8%  | 0%       | +14.0%   | +91.8%   |
| Inference tok/s  | 96.6    | 118.9    | 136.1    | 162.1    |
| Inference %      | -18.8%  | 0%       | +14.5%   | +36.3%   |
  • Tested 2024-01-08 with llama.cpp b737982 (1787) and latest ROCm (dkms amdgpu/6.3.6-1697589.22.04, rocm 6.0.0.60000-91~22.04 ) and CUDA (dkms nvidia/545.29.06, 6.6.7-arch1-1, nvcc cuda_12.3.r12.3/compiler.33492891_0 ) on similar platforms (5800X3D for Radeons, 5950X for RTXs)

ExLlamaV2

|                  | 7900 XT | 7900 XTX | RTX 3090 | RTX 4090 |
|------------------|--------:|---------:|---------:|---------:|
| Memory (GB)      | 20      | 24       | 24       | 24       |
| Memory BW (GB/s) | 800     | 960      | 936.2    | 1008     |
| FP32 TFLOPS      | 51.48   | 61.42    | 35.58    | 82.58    |
| FP16 TFLOPS      | 103.0   | 122.8    | 71/142*  | 165.2/330.3* |
| Prompt tok/s     | 3457    | 3928     | 5863     | 13955    |
| Prompt %         | -12.0%  | 0%       | +49.3%   | +255.3%  |
| Inference tok/s  | 57.9    | 61.2     | 116.5    | 137.6    |
| Inference %      | -5.4%   | 0%       | +90.4%   | +124.8%  |
  • Tested 2024-01-08 with ExLlamaV2 3b0f523 and latest ROCm (dkms amdgpu/6.3.6-1697589.22.04, rocm 6.0.0.60000-91~22.04 ) and CUDA (dkms nvidia/545.29.06, 6.6.7-arch1-1, nvcc cuda_12.3.r12.3/compiler.33492891_0 ) on similar platforms (5800X3D for Radeons, 5950X for RTXs)

I gave vLLM a try and failed.

One other note is that llama.cpp segfaults if you try to run the 7900XT + 7900XTX together, but ExLlamaV2 seems to run multi-GPU fine (on Ubuntu 22.04.03 HWE + ROCm 6.0).

For inferencing (and likely fine-tuning, which I'll test next), your best bang/buck would likely still be 2 x used 3090's.

Note, on Linux, the default Power Limit on the 7900 XT and 7900 XTX is 250W and 300W respectively. Those might be able to be changed via rocm-smi but I haven't poked around. If anyone has, feel free to post your experience in the comments.
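If you do want to poke at it, something like this should be the general shape with rocm-smi (I haven't verified these exact flags on the 7900s, so treat it as a sketch):

```
# Show the current power cap and average power draw (flag names may vary by ROCm version)
rocm-smi --showmaxpower
rocm-smi --showpower

# Raise the power cap on GPU 0 to 350W (requires root; check your card's supported range first)
sudo rocm-smi -d 0 --setpoweroverdrive 350
```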

\* EDIT: As pointed out by FireSilicon in the comments, the RTX cards have much better FP16/BF16 Tensor FLOPS performance, which the inferencing engines are taking advantage of. Updated FP16 FLOPS (32-bit/16-bit accumulation numbers, marked with \* in the tables) sourced from Nvidia docs ([3090](https://images.nvidia.com/aem-dam/en-zz/Solutions/geforce/ampere/pdf/NVIDIA-ampere-GA102-GPU-Architecture-Whitepaper-V1.pdf), 4090).

96 Upvotes

60 comments

24

u/noiserr Jan 08 '24

Note, on Linux, the default Power Limit on the 7900 XT and 7900 XTX is 250W and 300W respectively. Those might be able to be changed via rocm-smi but I haven't poked around. If anyone has, feel free to post your experience in the comments.

The upcoming kernel 6.7 should have some additional power controls for RDNA3 GPUs. So we should be able to undervolt once that's out.

I had opted for a AsRock Taichi 7900xtx which has a dual vBIOS switch on the card. One BIOS is factory overclock, and the other is a cool and quiet vBIOS with relaxed voltages and clocks. This is the one I'm using since I like to be closer to that efficiency bell curve.

Also there is a bug that I experienced where 7900xtx had high idle power consumption with the model loaded. The workaround is to provide an environment variable:

```
# RDNA3
export HSA_OVERRIDE_GFX_VERSION=11.0.0
# workaround for high idle power
export GPU_MAX_HW_QUEUES=1
```

This fixed the issue for me, with no performance impact. AMD is aware of it, and they are working on it, you can follow the issue here: https://github.com/ROCm/ROCK-Kernel-Driver/issues/153
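You can also apply them per-invocation instead of exporting them, e.g. with llama.cpp (model path and flags here are just an example):

```
# One-off run with the workaround applied (example model path and flags)
GPU_MAX_HW_QUEUES=1 HSA_OVERRIDE_GFX_VERSION=11.0.0 ./main -m ./llama-2-7b.Q4_0.gguf -ngl 99
```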

2

u/AgeOfAlgorithms Jan 08 '24

This is great to know. Thanks for sharing!

1

u/Combinatorilliance Jan 08 '24

Have you experienced an issue with severe screen flickering? I started having the issue after updating to ROCm 6.0. Nothing particularly interesting appears in my logs, but it does seem related to running out of VRAM.

Others have noticed the issue recently too. I haven't seen anything on AMD's ROCm GitHub repos, although I'm not sure I've been looking in the right repository since the one you posted is different from the one I was primarily looking at.

https://www.reddit.com/r/LocalLLaMA/comments/18nfwy5/screen_flickering_in_linux_when_offloading_layers/

I'll try the GPU_MAX_HW_QUEUES env var for sure and see if it makes a difference; the GitHub issue linked does seem related to what I'm experiencing. 100% power draw might be the cause of the instability.

2

u/noiserr Jan 08 '24 edited Jan 08 '24

I have not experienced issues with screen blanking. When I run out of VRAM my koboldCPP (ROCm fork) just segfaults.

Maybe it's related to the version of kernel you have and ROCm 6.

I'm on 6.6.6-76060606-generic (latest with Pop!_OS) so many 6s spooky :)

This is the version of ROCm I'm running:

```
$ apt list | grep rocm6
rocm6.0.0/jammy 6.0.0.60000-91~22.04 amd64
```

Or like you said it could be related to power.

2

u/Combinatorilliance Jan 08 '24

Thanks for the kernel info, I'm on a little bit older kernel since I'm using vanilla Ubuntu. I'll try updating the kernel or try a dual-boot.

I'm genuinely considering switching to nix for reproducible builds for the issues I'm having with ROCm alone 😅

1

u/noiserr Jan 08 '24

I'm loving Pop!_OS personally, and it's debian/ubuntu based so if you're familiar with ubuntu you'll feel right at home. I like it because updates are a bit more forthcoming, particularly the Kernel updates. And the default UI / desktop setup is more appealing to me personally.

I wrote a guide on how to get ROCm 6 installed on Pop!_OS, if you decide to give it a try. It's written for RDNA2, but the same steps work for RDNA3; just use the env variables I provided in the top post of this thread: https://www.reddit.com/r/ROCm/comments/18z29l6/rx_6650_xt_running_pytoch_on_arch_linux_possible/kghsexq/

2

u/Combinatorilliance Feb 18 '24

BTW, my issue was fixed after upgrading to an officially supported kernel version, everything is running smoothly again

2

u/noiserr Feb 18 '24

Nice! Thanks for letting me know.

16

u/fallingdowndizzyvr Jan 08 '24

The 7900 XTX has the advantage in memory bandwidth and TFLOPS over the 3090, yet it's slower. There's a lot of optimization that still needs to be done on Team Red's side.

11

u/noiserr Jan 08 '24 edited Jan 08 '24

A lot of work is being done here, based on the GitHub activity. Also, the datacenter GPUs take precedence with their CDNA architecture, but things are slowly trickling down to RDNA.

vLLM should work, and really it should be the best-performing backend either way.

6

u/randomfoo2 Jan 08 '24

As mentioned, I was not able to get vLLM working. Currently vLLM only supports MI200s, and despite doing a fair amount of digging (eg, compiling an RDNA3-compatible Flash Attention from a branch of the ROCm repo) on a completely clean system (brand new Ubuntu 22.04.3 HWE + ROCm 6.0 setup yesterday for poking at these Radeon cards) in a mamba env, I was unable to get vLLM running properly.

If you're able to get it running, please share your secret.

5

u/noiserr Jan 08 '24

Aye saw that. Will give it a shot over the next few weeks, and will share my findings if I do get it to work. Thanks for putting this together, very informative post!

2

u/FireSilicon Jan 08 '24 edited Jan 08 '24

Or maybe this is the reason?

```
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
```

(From Nvidia's 3090 datasheet) Peak FP16 Tensor TFLOPS: 119

6

u/CardAnarchist Jan 08 '24

Honestly 7900 XTX is tempting..

God the thought of going to AMD though..

Thanks very much for this information. I'll greedily ask for the same tests with a Yi 34B model and a Mixtral model, as I think that with a 24GB card those models are generally the best mix of quality and speed, making them the most usable options atm.

11

u/randomfoo2 Jan 08 '24

Here's the 7900 XTX running a Yi 34B. It actually performs a bit better than expected - if it were purely bandwidth limited you'd expect ~25 tok/s, but it actually manages to close the gap a bit to the 3090:

7900 XTX:

```
CUDA_VISIBLE_DEVICES=1 ./llama-bench -m /data/models/gguf/bagel-dpo-34b-v0.2.Q4_0.gguf -p 3968
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
  Device 0: Radeon RX 7900 XTX, compute capability 11.0, VMM: no
| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama 30B Q4_0                 |  18.13 GiB |    34.39 B | ROCm       |  99 | pp 3968    |    595.19 ± 0.97 |
| llama 30B Q4_0                 |  18.13 GiB |    34.39 B | ROCm       |  99 | tg 128     |     32.51 ± 0.02 |

build: b7e7982 (1787)
```

RTX 3090:

```
CUDA_VISIBLE_DEVICES=1 ./llama-bench -m /models/llm/gguf/bagel-dpo-34b-v0.2.Q4_0.gguf -p 3968
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama 30B Q4_0                 |  18.13 GiB |    34.39 B | CUDA       |  99 | pp 3968    |   753.01 ± 11.62 |
| llama 30B Q4_0                 |  18.13 GiB |    34.39 B | CUDA       |  99 | tg 128     |     35.48 ± 0.02 |

build: 1fc2f26 (1794)
```

The 20GB 7900 XT OOMs of course.
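(For reference, the "~25 tok/s if purely bandwidth limited" expectation above is just scaling the 7B result by model size - I'm assuming the 7B Q4_0 file is roughly 3.6 GiB vs 18.13 GiB for the 34B:)

```
# Back-of-the-envelope: token generation should scale ~inversely with model size if
# memory bandwidth were the only limit. 118.9 tok/s on 7B Q4_0 (~3.6 GiB, assumed) scales to:
echo "118.9 * 3.6 / 18.13" | bc -l   # ~23.6 tok/s expected; the XTX actually does 32.5
```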

Personally, if you're looking for a 24GB-class card, I still find it a bit hard to recommend the 7900 XTX over a used 3090 - you should still be able to find the latter a bit cheaper, you'll get better performance, and unless you enjoy fighting with driver/compile issues, you'll simply have much better across-the-board compatibility with an Nvidia card.

2

u/shing3232 Jan 09 '24

Why does the ROCm run use an older build? If I recall correctly, a recent PR gave a small improvement on my system.

1

u/iamkucuk Jan 09 '24

Actually, you can still go for a used 3090 at a MUCH better price, with the same amount of RAM and better performance. It's also better for cutting-edge features and the upcoming optimizations.

2

u/[deleted] Mar 22 '24

The used market is slim/scammy right now and prices are up due to lots of people competing.

Prices seem to be about $850 cash for unknown-quality 3090 cards with years of use vs $920 for a brand new XTX with warranty - when all is said and done, the XTX does everything I need (today), the rough edges are going away, and I don't have the hassle of sourcing used gear and having to move to water blocks or redo fans/thermals because of abused 3090s ;)

1

u/iamkucuk Mar 22 '24

Seems like you already convinced yourself. Just be happy with your choice. Hope it can do well for you.

1

u/[deleted] Mar 22 '24

If price reality weren't so skewed, Nvidia would be the easy path, but the Radeon side is quickly becoming worth it.

I wish I could find 3090s at an affordable price, but that isn't happening right now.

Thankfully we're seeing ROCm 6, vLLM, ollama, LM Studio and so many other tools finally catch up and get supported.

1

u/iamkucuk Mar 23 '24

I just got mine at 530 USD. I think that's the price it should be.

1

u/[deleted] Mar 23 '24

Here in Austin the used market is 800-900. eBay is at about that price. On Facebook you may get lucky, but you'll probably need to replace fans or move to a water block. I've never seen a 500 dollar 3090, but people keep saying they find them.

9

u/artelligence_consult Jan 08 '24

Jesh, this is bad - AMD really needs to put some juice into ROCm.

Given that bandwidth should be the limit, there is NO explanation for the 3090 beating the 7900 XTX, in particular not by that margin (ExLlamaV2) and in general. It could be the power budget, but still - quite disappointing. Really needs some work on that level.

16

u/randomfoo2 Jan 08 '24

I think ROCm itself isn't really the problem here - the performance (vs the raw hardware specs) obviously shows there is a lot of optimization that needs to happen in the ROCm kernels, but the performance difference really comes down to developer resources for AMD's architecture. Sure, there's improving documentation, improving HIPIFY, giving developers better tooling, etc, but honestly AMD should 1) send free GPUs/systems to developers to encourage them to tune for AMD cards, or 2) just straight out have some AMD engineers do a pass over the most popular open source projects, contributing fixes and documenting optimizations. llama.cpp is probably the most widely used inferencing engine in the world at this point, and dozens of downstream projects depend on it.

Also, while I get that AMD is focusing on the data center, the fact that I couldn't get vLLM or TensorFlow to work at all on the 7900s simply means that most developers won't bother with AMD at all. I'll just work/tune on my 3090/4090s and know that I can run the exact same code on A6000s, L40s, A100s, and H100s without any issues...

MK1's work on optimizing Instinct cards show that the optimization can be done: https://mkone.ai/blog/mk1-flywheel-amd

Casey Primozic did some poking back in July 2023 showing that with the right kernels, it's possible to hit the theoretical compute rates: https://cprimozic.net/notes/posts/machine-learning-benchmarks-on-the-7900-xtx/#tinygrad-rdna3-matrix-multiplication-benchmark

4

u/noiserr Jan 08 '24

There is also a recent paper from the Frontier folks training a 1T parameters model on 3000 mi250x GPUs: https://arxiv.org/abs/2312.12705

The paper goes into details about exactly what you have mentioned. Optimizing the underlying kernels.

3

u/shing3232 Jan 08 '24

Yee, I mean RAM bandwidth stays at 50% usage during 7900 XTX inference.

1

u/artelligence_consult Jan 08 '24

Something is off then. See, that would indicate the processing is the bottleneck, but I have a problem with a graphics card with programmable elements being essentially overloaded by a softmax. This indicates some really bad programming - either in the software or (quite likely) in the ROCm part. Which AMD will likely fix soon.

1

u/akostadi Apr 17 '24

They've been "fixing it soon" for a long time. They're not using the opportunity now that Intel is a little off their back. I think Intel will catch up to them soon on the GPU side and, in the process, help them with the ecosystem. But still, they'll have missed a lot of opportunity before that happens. I'm personally tired of them.

5

u/[deleted] Jan 08 '24

Very interesting, thanks for going through this.

Note, on Linux, the default Power Limit on the 7900 XT and 7900 XTX is 250W and 300W respectively.

It depends on the vendor; my Sapphire Pulse 7900 XT maxes out at 265W sustained, but occasionally boosts above 300W (ROCm 5.7).

3

u/Plusdebeurre Jan 08 '24

I just recently got a 7900 XTX because I really didn't want to go with Nvidia, and I've run into a lack of support in some pretty essential libraries: vLLM, flash-attention 2, and bitsandbytes. Of course there are others, but these 3 - which currently have open issues for ROCm support - have made it so I can't really do much work on it, except basic inferencing with non-quant models. Even the GPTQ versions have a bug where, after the first inference request, the GPU usage stays at 100% until you kill the kernel. I really hope that support comes soon.

1

u/randomfoo2 Jan 08 '24

Yeah, I wasn't able to get vLLM working either, and that really hurts, since I was hoping to at least be able to use this card for inferencing sweeps. I suppose I could FP16 GGUF my models, but it's still a PITA / new code I need to write for my test harness.

My plan this week, when I'm up for it, is to see if I can get any QLoRA script working. There is an ongoing merge for bitsandbytes: https://github.com/TimDettmers/bitsandbytes/pull/756 - I'll spend a few minutes looking into that, but I'm not willing to burn more than an hour or two on it before, you know, getting back to actual work...
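In case anyone else wants to poke at the same PR, checking it out locally is straightforward (the build/install step for ROCm is an assumption on my part - follow whatever the PR discussion says):

```
# Fetch and check out the open bitsandbytes PR locally (PR number from the link above)
git clone https://github.com/TimDettmers/bitsandbytes.git
cd bitsandbytes
git fetch origin pull/756/head:pr-756
git checkout pr-756
# The install step below is an assumption; the ROCm backend may need its own build targets
pip install -e .
```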

1

u/Plusdebeurre Jan 09 '24

Good luck, brother! I tried a bunch of workarounds for LoRA tuning, but I wasn't able to get around the bitsandbytes dependency. If you remember, please update! It would be really helpful to know how you managed.

Also, vLLM support seems pretty close; I've been following the GH issue for the last couple of weeks. Hopefully it won't be too long.

1

u/noiserr Jan 08 '24

the GPU usage stays at 100% until you kill the kernel. I really hope that support comes soon.

There is a workaround for this. I described it here: https://old.reddit.com/r/LocalLLaMA/comments/191srof/amd_radeon_7900_xtxtx_inference_performance/kgxvkzh/

Of course there are others, but these 3 - which currently have open issues for ROCm support - have made it so I can't really do much work on it, except basic inferencing with non-quant models.

Yeah, vLLM and flashattention not being supported on RDNA yet is annoying. llama.cpp works though, with quantized models.

2

u/Plusdebeurre Jan 09 '24

Thank you for this, it did indeed work! Hopefully this gets fixed/updated in the next ROCm version.

1

u/noiserr Jan 09 '24

Glad it fixed it! Yup.. they are aware of it and there is an open issue on it, so I'm sure it will get addressed at some point. Cheers!

3

u/Aaaaaaaaaeeeee Jan 08 '24

Multi-GPU in llama.cpp has worked fine in the past; you may need to search previous discussions for that.

There are only one or two collaborators in llama.cpp able to test and maintain the code, and the ExLlamaV2 developer does not use AMD GPUs yet.

For inferencing (and likely fine-tuning, which I'll test next), your best bang/buck would likely still be 2 x used 3090's.

That would be a treat to see! I don't think people generally post benchmarks for finetuning since there are all too many possible optimizations.

Here are some speeds I am getting with a 3090 with the bleeding-edge quantizations. ExLlamaV2 0.0.11 increased my 2.4 & 2.5 bpw models from 20 t/s to 26 t/s.

(This is inference speed, prompt processing not included, recorded in exui.)

3

u/Sabin_Stargem Jan 08 '24

Looking at those numbers for the 4090, I wish the Galax SG I ordered could fit into my super tower. Had to refund it, that thing looked mighty. It only cost $2,052 after taxes.

3

u/netikas Jan 08 '24

That's really interesting info.

I really hope that AMD steps up its game and catches up to at least the 3090 in speed within 6-12 months. Such great value products, crippled by bad software.

2

u/iamkucuk Jan 09 '24

The first ROCm release was 8 years ago, and they've only managed to come this far. I wouldn't count on it.

2

u/a_beautiful_rhind Jan 08 '24

What do you get on 70B in ExLlama? Also, it sucks that you don't get flash attention. Might be worth trying to compile it for ROCm and see what's missing.

2

u/randomfoo2 Jan 09 '24

There is a branch on ROCm's flash-attention fork that is supposed to have RDNA3 support: https://github.com/ROCmSoftwarePlatform/flash-attention/tree/howiejay/navi_support/

I can compile and install it without a problem, but when actually importing it, I get symbol resolution errors. Interestingly, I get different symbol resolution errors in my ExLlamaV2 vs my vLLM environments, and I'm not sure why.
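Roughly what that looks like, for anyone who wants to reproduce it (branch name is from the link above; the build command itself is an assumption and may need ROCm-specific env vars):

```
# Clone the ROCm flash-attention fork and switch to the RDNA3 (navi) branch
git clone https://github.com/ROCmSoftwarePlatform/flash-attention.git
cd flash-attention
git checkout howiejay/navi_support
# Build/install - this step is an assumption; it may need GPU arch env vars for gfx1100
pip install .
# This is where things fall over for me: the import fails with symbol resolution errors
python -c "import flash_attn; print(flash_attn.__version__)"
```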

I haven't tested larger models on the ROCm machine yet. I might queue some up out of curiosity later and will either post in that AMD GPU doc or my general testing doc: https://docs.google.com/spreadsheets/d/1kT4or6b0Fedd-W_jMwYpb63e1ZR3aePczz3zlbJW-Y4/edit#gid=752855929

1

u/a_beautiful_rhind Jan 09 '24

A100 not looking very impressive on that.

2

u/CasimirsBlake Jan 08 '24

Top work, sir - thanks for your efforts. It looks like the 7-series is at least an option that works well enough now; you couldn't say that a year ago. But compared to a couple of used 3090s? Still losing.

2

u/my_aggr Jan 08 '24 edited Jan 09 '24

Can you run these tests on 13B unquantized models, or 30B 4-bit quantized ones?

Your tests are small enough that they fit in 5GB of VRAM.

2

u/Obvious-River-100 Jan 09 '24

This means that two 7900 XTXs are more powerful than a single RTX 4090.

1

u/rorowhat Mar 17 '24

What is GTT?

1

u/Anh_Phu Apr 01 '24

So the RX 7900 XTX still has better price/performance?

1

u/gigaperson Jan 09 '24

Absolute LLM beginner question: does AMD need to fix/update ROCm so that all the LLM apps that run on Nvidia can at least run (even if slower - ROCm-CUDA emulation should in theory allow that)? Or do devs need to spend time actually making the 7900 XTX work, even once ROCm is fixed/updated?

1

u/allergic_to_profit Jan 21 '24

AMD GPUs will run any model that Nvidia GPUs can run. If the model code is written to depend on CUDA (Nvidia's proprietary API), then you need a version of the model that doesn't depend on CUDA, whether you write it or someone else does.
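For example, the ROCm build of PyTorch exposes the GPU through the same torch.cuda API, so a lot of "CUDA" Python code runs unmodified on a 7900 XTX (assuming you've installed the ROCm PyTorch wheel):

```
# With the ROCm build of PyTorch installed, the usual CUDA-style calls work as-is
python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
```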

1

u/riverdep Jan 08 '24

I checked the specs before reading the benchmark; I thought the XTX was going to at least beat the 3090 given its bigger bandwidth and FLOPS. How is it so bad? Furthermore, how can the 4090 nearly 2x it on prompt eval?

The ExLlamaV2 results seem weird though; it looks poorly optimized on both platforms. I just skimmed the ExLlamaV2 readme, and they claim close to 200 tok/s for both the Llama 7B GPTQ and Llama2 EXL2 4.0 bpw models, while you only get 137.6 tok/s. Am I missing something here?

5

u/randomfoo2 Jan 08 '24

ExLlama (and I assume V2 as well) has big CPU bottlenecks. I believe turboderp does his benchmarking on a 13900K, while my 4090 is on a 5950X (which is about 30% slower on single-threaded perf), which I assume explains the difference. Lots of people have GPUs, so they can post their own benchmarks if they want.

While ExLlamaV2 is a bit slower on inference than llama.cpp on my system, as you can see it crushes across the board on prompt evaluation - it's at least about 2X faster for every single GPU vs llama.cpp, and it's one of the reasons you should probably prefer ExLlamaV2 if you use LLMs for extended multi-turn conversations.

Why is it so fast? Well, you'll have to go through the commits yourself: https://github.com/turboderp/exllamav2/commits/master/exllamav2/exllamav2_ext/cuda

Why are the AMD cards so slow? At an architectural level, AMD's and Nvidia's GPU cores differ (duh) and would require separate low-level tuning, which most projects have not done. It's a bit of a catch-22: AMD hasn't been providing support for any cards developers would actually have access to, and most end-users can't use the code anyway (ROCm platform support has been, and while improving, remains terrible). I think that explains most of it.

1

u/riverdep Jan 08 '24

ohhh now it makes sense. Thank you for the informative reply!

1

u/OuchieOnChin Jan 08 '24

May I ask with what arguments you achieved 136t/s on a 3090 with llama.cpp? I can only do 116t/s generating 1024 tokens with this command line:

```
main.exe --model ./llama-2-7b.Q4_0.gguf --ignore-eos --ctx-size 1024 --n-predict 1024 --threads 10 --random-prompt --color --temp 0.0 --seed 42 --mlock --n-gpu-layers 999
```

2

u/randomfoo2 Jan 08 '24

For benchmarking you should use `llama-bench` not `main`. The only argument I use besides the model is `-p 3968` as I standardize on 3968 tokens of prompt (and the default 128 tokens of inference afterwards, 4K context) for my personal tests.
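So something like this (the model path is just an example):

```
# Standardized test: 3968-token prompt + the default 128 tokens of generation (4K context)
./llama-bench -m ./llama-2-7b.Q4_0.gguf -p 3968
```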

These benchmarks are all run on headless GPUs. Based on your "main.exe", you're running Windows, and likely running your GUI/frame buffer on the same GPU? That contention will inevitably drive down your inference performance. If you want maximum performance: 1) run Linux (CUDA is faster on Linux), and 2) don't run anything else on the GPU when you're running inference loads. If those don't help, upgrade your CPU, as it could be a bottleneck as well.

(Or don't worry about a 10-15% speed difference. batch=1 at 115t/s vs 135t/s is rather pointless IMO.)

1

u/OuchieOnChin Jan 09 '24

This is very informative; with the llama-bench executable and your parameters I managed 127 t/s. I guess I can easily handwave away any remaining difference as being caused by the points you mentioned.

1

u/iamkucuk Jan 09 '24

You put a myth to rest with only 1 post. Great work!

1

u/allergic_to_profit Jan 21 '24

Thanks for the guide. I just got it running on my machine and it was good to see your performance to compare against mine and see if I got it set up right.

1

u/ingarshaw Feb 18 '24

It is a good comparison of GPUs, but if you want to compare the inference speed of llama.cpp vs ExLlamaV2, it is not quite correct:
GPTQ is not 4 bpw, it is more - somewhere between GGUF Q4_K_M and Q4_K_L.