ROCm vs Vulkan on the RX 7900 XTX

I finally swapped the old RTX 3060 for a 24GB RX 7900 XTX, and the obvious first question was one nobody answers cleanly: for local inference with llama.cpp, do you run ROCm or Vulkan? After a weekend of benchmarking, the answer is — it depends on the model architecture. Here’s what I found.

The two backends

Both backends ship in the same llama.cpp tree. ROCm uses AMD’s HIP layer, which maps closely to CUDA; it’s the “native” GPU path and generally what people mean when they say “AMD GPU acceleration.” Vulkan is the cross-platform graphics API path, less AMD-specific but increasingly competitive thanks to cooperative matrix kernel support landing in recent builds.

Build flags

The ROCm build needs the right GPU target for RDNA3. Getting this wrong is the most common mistake — the build succeeds but silently falls back to CPU at runtime.

# RDNA3 = gfx1100. Wrong target = silent CPU fallback
cmake -B build-rocm -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1100 \
      -DCMAKE_BUILD_TYPE=Release
cmake --build build-rocm --config Release -j $(nproc)
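
Before building, it helps to confirm which target the card actually reports. A minimal sketch, assuming ROCm's rocminfo tool is installed and on your PATH; the helper name list_gfx_targets is made up for this post, and it only filters text, so it works on saved rocminfo output too:

```shell
# Print the unique gfx targets found in rocminfo-style text on stdin.
# list_gfx_targets is a hypothetical helper name, not part of ROCm.
list_gfx_targets() {
    grep -oE 'gfx[0-9a-f]+' | sort -u
}

# Usage on a machine with ROCm installed:
#   rocminfo | list_gfx_targets
```

On an RX 7900 XTX this should list gfx1100, which is the value to pass to AMDGPU_TARGETS.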

The Vulkan build is simpler — no target to get wrong:

cmake -B build-vulkan -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build-vulkan --config Release -j $(nproc)

The results

Same prompt, 512-token generation, averaged over five runs. The pattern is consistent across multiple model pairs.
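
If you want to run a comparable measurement yourself, llama-bench (built alongside the other binaries) is one option. This is a sketch under assumptions, not necessarily the exact command behind the numbers below: the function name bench_backend and the model path are placeholders, and -ngl 99 is just a common way to offload every layer.

```shell
# Run llama-bench from one build directory: 512 generated tokens, no prompt
# test, five repetitions, all layers offloaded to the GPU.
# $1 = build directory, $2 = path to a GGUF model (both supplied by you).
# bench_backend is a hypothetical wrapper name.
bench_backend() {
    "$1/bin/llama-bench" -m "$2" -p 0 -n 512 -r 5 -ngl 99
}

# Usage (model path is a placeholder):
#   bench_backend build-vulkan /path/to/model.gguf
#   bench_backend build-rocm   /path/to/model.gguf
```

Running the same model file through both build directories keeps the backend as the only variable.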

Model	Backend	tok/s
Qwen3.6-35B-A3B (MoE)	Vulkan	71.4
Qwen3.6-35B-A3B (MoE)	ROCm	63.8
Qwen3.6-27B (dense)	Vulkan	28.1
Qwen3.6-27B (dense)	ROCm	34.6
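
To put the gaps in relative terms, here is the arithmetic on the table values as a quick sketch (pct_faster is a throwaway helper name, not a tool from either backend):

```shell
# Percent by which tok/s value $1 exceeds $2, to one decimal place.
pct_faster() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (a / b - 1) * 100 }'
}

pct_faster 71.4 63.8   # MoE: Vulkan over ROCm
pct_faster 34.6 28.1   # dense: ROCm over Vulkan
```

That works out to Vulkan being roughly 12% faster on the MoE model and ROCm roughly 23% faster on the dense one.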

Why the split?

MoE models route tokens through sparse expert layers. Vulkan’s cooperative matrix kernels handle the irregular memory access patterns of MoE better on RDNA3. Dense models are a more uniform workload where ROCm’s tighter HIP-to-hardware mapping wins. Once I understood the pattern it became obvious: keep both builds around and pick per model type.
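
Keeping both builds around can be as simple as a tiny lookup. A minimal sketch, assuming the build-vulkan and build-rocm directories from the build steps; the function name pick_build and the "moe"/"dense" labels are my own, and nothing here inspects the model file, so you supply the type yourself:

```shell
# Map a model type to the build directory used in this post.
# "moe" and "dense" are labels you pass in; the GGUF file is not inspected.
pick_build() {
    case "$1" in
        moe)   echo "build-vulkan" ;;
        dense) echo "build-rocm" ;;
        *)     echo "unknown model type: $1" >&2; return 1 ;;
    esac
}

# Usage (model path is a placeholder):
#   "$(pick_build moe)/bin/llama-cli" -m /path/to/model.gguf -ngl 99
```

The mapping just encodes the results above; if your own benchmarks disagree on your hardware, swap the two directories.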

Takeaway

MoE models → Vulkan. Dense models → ROCm. IQ4_XS quantization beats Q4_K_M on both backends at the same VRAM budget — that’s a separate post, but worth knowing before you download anything. Next up: MTP speculative decoding, which roughly doubled dense throughput in my tests.
