ROCm vs Vulkan on the RX 7900 XTX

I benchmarked both llama.cpp backends on a 24GB RX 7900 XTX. The answer isn't "pick one" — it depends entirely on whether your model is dense or MoE.

I finally swapped the old RTX 3060 for a 24GB RX 7900 XTX, and the obvious first question was one nobody answers cleanly: for local inference with llama.cpp, do you run ROCm or Vulkan? After a weekend of benchmarking, the answer is that it depends on the model architecture. Here’s what I found.

The two backends

Both backends ship in the same llama.cpp tree. ROCm uses AMD’s HIP layer, which maps closely to CUDA. It’s the “native” GPU path and generally what people mean when they say “AMD GPU acceleration.” Vulkan is the cross-platform graphics API path, less AMD-specific but increasingly competitive thanks to cooperative matrix kernel support landing in recent builds.

Build flags

The ROCm build needs the right GPU target for RDNA3. Getting this wrong is the most common mistake — the build succeeds but silently falls back to CPU at runtime.

# RDNA3 = gfx1100. Wrong target = silent CPU fallback
cmake -B build-rocm -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1100 \
      -DCMAKE_BUILD_TYPE=Release
cmake --build build-rocm --config Release -j $(nproc)
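
If you're not sure which target your card reports, one way to check before building is to ask ROCm directly. This is a minimal sketch, assuming `rocminfo` from the ROCm install is on your PATH; the `gfx_targets` helper name is made up for illustration.

```shell
# gfx_targets: hypothetical helper that pulls unique gfx target names
# (e.g. gfx1100) out of rocminfo-style text read from stdin.
gfx_targets() {
  grep -oE 'gfx[0-9a-f]+' | sort -u
}

# Only query the hardware if rocminfo is actually installed.
if command -v rocminfo >/dev/null 2>&1; then
  rocminfo | gfx_targets
fi
```

If more than one target prints (an integrated GPU can show up too), use the one that belongs to the discrete card in the cmake line above.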

The Vulkan build is simpler — no target to get wrong:

cmake -B build-vulkan -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build-vulkan --config Release -j $(nproc)

The results

Same prompt, 512-token generation, averaged over five runs. The pattern is consistent across multiple model pairs.

Model                  Backend  tok/s
Qwen3.6-35B-A3B (MoE)  Vulkan   71.4
Qwen3.6-35B-A3B (MoE)  ROCm     63.8
Qwen3.6-27B (dense)    Vulkan   28.1
Qwen3.6-27B (dense)    ROCm     34.6
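
To put those gaps in relative terms, the arithmetic on the table's tok/s figures can be sketched with a small awk-backed helper. The `speedup_pct` name is made up for illustration; the inputs are the four numbers from the table.

```shell
# speedup_pct: percentage by which the faster backend beats the slower one.
#   $1 = faster tok/s, $2 = slower tok/s
speedup_pct() {
  awk -v fast="$1" -v slow="$2" 'BEGIN { printf "%.1f\n", (fast / slow - 1) * 100 }'
}

speedup_pct 71.4 63.8   # MoE, Vulkan over ROCm: prints 11.9
speedup_pct 34.6 28.1   # dense, ROCm over Vulkan: prints 23.1
```

So on these runs Vulkan leads by roughly 12% on the MoE model, and ROCm leads by roughly 23% on the dense one.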

Why the split?

MoE models route tokens through sparse expert layers. Vulkan’s cooperative matrix kernels handle the irregular memory access patterns of MoE better on RDNA3. Dense models are a more uniform workload where ROCm’s tighter HIP-to-hardware mapping wins. Once I understood the pattern it became obvious: keep both builds around and pick per model type.
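
The "keep both builds around" habit can be sketched as a tiny shell helper. It assumes the two build directories from the cmake commands earlier (build-rocm and build-vulkan); the `pick_build` name and the model path in the usage comment are hypothetical.

```shell
# pick_build: map a model type to the build directory that was faster
# for that type in the benchmarks above.
#   $1 = "moe" or "dense"
pick_build() {
  case "$1" in
    moe)   echo "build-vulkan" ;;
    dense) echo "build-rocm" ;;
    *)     echo "unknown model type: $1" >&2; return 1 ;;
  esac
}

# Example usage (commented out so the snippet runs without a model file):
# "$(pick_build moe)/bin/llama-cli" -m /path/to/model.gguf -ngl 99
```

Nothing here detects the architecture automatically; you still tell it which kind of model you are loading.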

Takeaway

MoE models → Vulkan. Dense models → ROCm. IQ4_XS quantization beats Q4_K_M on both backends at the same VRAM budget — that’s a separate post, but worth knowing before you download anything. Next up: MTP speculative decoding, which roughly doubled dense throughput in my tests.