Fast Inference: Running Qwen3.5-27B on a Single RTX 3090 with DFlash & DDTree

Technical Notes | 15 min read | GenAI Solutions Team
Tags: Qwen, DFlash, DDTree, LLM Inference, RTX 3090, Speculative Decoding, GGUF, Local LLM
TLDR

DFlash and DDTree are two cutting-edge inference acceleration techniques that, when combined, enable 3.4x faster token generation on the Qwen3.5-27B language model using only a single NVIDIA RTX 3090 (24GB VRAM). DFlash replaces the traditional draft model in speculative decoding with a block diffusion transformer that predicts multiple tokens in a single forward pass, conditioned on hidden states extracted from the target model (Z Lab, UC San Diego). DDTree extends DFlash by constructing an optimal draft tree from the diffusion output distributions, allowing the target model to verify multiple candidate paths in one pass via tree attention (Technion). Together, these techniques push inference throughput from ~38 tokens/second (autoregressive baseline) to over 129 tokens/second on HumanEval -- all within the tight 24GB memory budget of a consumer GPU, made possible by Q4_K_M GGUF quantization and custom CUDA kernels from the Lucebox project.

The Inference Bottleneck

Large language models generate text autoregressively: one token at a time, each dependent on all previous tokens. This sequential nature means the throughput is fundamentally limited by the time it takes to run a single forward pass through the model. For Qwen3.5-27B on an RTX 3090, that baseline is approximately 38 tokens/second -- acceptable for casual chat, but too slow for bulk code generation, batch data processing, or any workload requiring sustained output.

Several approaches exist to speed things up:

  • Quantization (INT8, INT4, GGUF) -- reduces memory and compute per forward pass
  • KV cache optimization -- reduces redundant computation across tokens
  • Speculative decoding -- draft multiple tokens cheaply, then verify them in one target forward pass

Speculative decoding has been the most practical approach for consumer hardware. Traditional methods like EAGLE-3 use a small autoregressive draft model to propose N tokens sequentially, which the target model then accepts or rejects in a single pass. But even the draft model is autoregressive -- generating N draft tokens still takes N forward passes through the draft, capping practical speedups at 2-3x.
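
To make the draft-then-verify loop concrete, here is a minimal sketch of greedy acceptance. It assumes the target's top-1 prediction at every draft position is already available from one forward pass. The function name and the toy token IDs are illustrative and not taken from any of the projects discussed.

```python
from typing import List


def accept_greedy(draft: List[int], target_argmax: List[int]) -> List[int]:
    """Return the tokens emitted by one greedy verify step.

    draft[i] is the i-th proposed token. target_argmax[i] is the token the
    target itself would emit at position i given the context plus draft[:i].
    It has len(draft) + 1 entries because the target also scores the position
    after the last draft token.
    """
    if len(target_argmax) != len(draft) + 1:
        raise ValueError("target_argmax must have one more entry than draft")
    emitted: List[int] = []
    for i, tok in enumerate(draft):
        if tok != target_argmax[i]:
            # First mismatch: keep the target's own token and stop.
            emitted.append(target_argmax[i])
            return emitted
        emitted.append(tok)
    # Every draft token matched, so the extra target prediction is a bonus.
    emitted.append(target_argmax[-1])
    return emitted
```

Each verify step therefore emits at least one token and at most len(draft) + 1, which is why longer accepted prefixes translate directly into higher throughput.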

DFlash and DDTree fundamentally rethink both the drafting and verification stages.

What is DFlash?

DFlash (arXiv:2602.06036, February 2026) is a speculative decoding framework from Z Lab at UC San Diego (Jian Chen, Yesheng Liang, Zhijian Liu) that replaces the autoregressive draft model entirely with a block diffusion model. The core insight from the paper is straightforward but powerful: "The target knows best."

The Core Insight

During the prefill phase of autoregressive decoding, the target model processes the entire prompt context. The hidden states across all layers of the transformer encode rich information about what tokens are likely to come next. DFlash extracts these hidden representations and uses them to condition a small draft model that can predict an entire block of tokens (e.g., 16 tokens) in a single forward pass.

Compare the drafting costs:

| Method | Draft Cost for N Tokens |
| --- | --- |
| Autoregressive (EAGLE-3) | N forward passes through draft |
| DFlash (block diffusion) | 1 forward pass (regardless of N) |

A 5-layer DFlash draft generating 16 tokens has lower latency than a 1-layer EAGLE-3 draft generating 8 tokens, because the diffusion model processes all positions in parallel.

How It Works -- Key Algorithms

1. Context Feature Extraction

During prefill, DFlash extracts hidden states from layers uniformly sampled across the target model's depth. For Qwen3.5-27B (which has 64 layers), DFlash extracts from layers {1, 16, 31, 46, 61}. These representations are concatenated and projected through a linear layer + RMSNorm into a compact target context feature -- a dense vector that summarizes what the target model "expects" to generate next.
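
The fusion step can be sketched in a few lines of plain Python. This is a toy illustration for a single token, assuming an RMSNorm without a learned gain and random weights. The dimensions, `fuse_context`, and `rms_norm` are hypothetical stand-ins, not the DFlash implementation.

```python
import math
import random
from typing import Dict, List

TAP_LAYERS = [1, 16, 31, 46, 61]  # layer indices quoted in the text

Vector = List[float]


def rms_norm(x: Vector, eps: float = 1e-6) -> Vector:
    """Scale x so its root-mean-square is approximately 1."""
    scale = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / scale for v in x]


def fuse_context(hidden_by_layer: Dict[int, Vector], proj: List[Vector]) -> Vector:
    """Concatenate the tapped hidden states of one token and project them.

    proj is a d_ctx x (len(TAP_LAYERS) * d_model) matrix stored as rows.
    """
    concat: Vector = []
    for layer in TAP_LAYERS:
        concat.extend(hidden_by_layer[layer])
    projected = [sum(w * v for w, v in zip(row, concat)) for row in proj]
    return rms_norm(projected)


# Toy sizes (hypothetical); real hidden sizes are in the thousands.
random.seed(0)
d_model, d_ctx = 8, 6
hidden = {layer: [random.gauss(0, 1) for _ in range(d_model)] for layer in TAP_LAYERS}
proj = [[random.gauss(0, 1) for _ in range(len(TAP_LAYERS) * d_model)]
        for _ in range(d_ctx)]
feature = fuse_context(hidden, proj)
```

The output is one compact vector per token, regardless of how many layers were tapped.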

2. KV Injection (Not Input Fusion)

Traditional speculative decoding methods like EAGLE-3 fuse target features with token embeddings at the input. This signal dilutes as it propagates through deep networks. DFlash takes a different approach: it injects the fused target context directly into the Key and Value projections of every draft layer. This provides persistent conditioning throughout the draft model, ensuring the target's signal remains strong at every depth.

3. Parallel Block Diffusion Drafting

The draft model is a small non-causal transformer (only 5 layers for Qwen3.5-27B) trained as a block diffusion model. It takes a partially masked input block like [bonus_token, mask, mask, ..., mask] and predicts all masked positions simultaneously in one forward pass. Diffusion training teaches the model to denoise a masked block into coherent text, conditioned on the extracted target context features.
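
The input layout and the top-1 readout can be sketched as follows. The draft network itself is replaced by a hand-written logits table, and `MASK_ID`, `build_block`, and `fill_block` are hypothetical names used only for illustration.

```python
from typing import List, Sequence

MASK_ID = -1  # placeholder id for a masked slot (hypothetical)


def build_block(bonus_token: int, num_masked: int) -> List[int]:
    """Lay out one draft block: the known bonus token followed by masks."""
    return [bonus_token] + [MASK_ID] * num_masked


def fill_block(block: Sequence[int], logits: Sequence[Sequence[float]]) -> List[int]:
    """Replace every masked slot with the argmax of its own logits row.

    logits[i] is the score row for block position i. Rows belonging to
    unmasked positions are ignored.
    """
    filled: List[int] = []
    for pos, tok in enumerate(block):
        if tok == MASK_ID:
            row = logits[pos]
            filled.append(max(range(len(row)), key=row.__getitem__))
        else:
            filled.append(tok)
    return filled
```

All masked positions are filled from the same set of logits, which is what makes the draft cost independent of the block length.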

4. Training Innovations

DFlash introduces several training techniques that improve quality:

  • Random anchor sampling instead of uniform block division -- the model learns to generate from arbitrary starting positions
  • Position-dependent loss weighting: w_k = exp(-(k-1)/gamma) -- emphasizes early positions in the block, which are more likely to be accepted
  • Shared embedding and LM head with the target model (frozen) -- ensures the draft and target operate in the same vocabulary space
  • Only draft transformer layers are trained -- the draft adds just ~3.46 GB of parameters for Qwen3.5-27B
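
As a quick illustration of the position-dependent weighting, the snippet below evaluates w_k = exp(-(k-1)/gamma) for a 16-token block. The value gamma = 4.0 and the normalization by the weight sum are assumptions made for the example. The text does not state which values the paper uses.

```python
import math
from typing import List


def position_weights(block_size: int, gamma: float) -> List[float]:
    """w_k = exp(-(k - 1) / gamma) for k = 1..block_size."""
    return [math.exp(-(k - 1) / gamma) for k in range(1, block_size + 1)]


def weighted_loss(per_position_loss: List[float], weights: List[float]) -> float:
    """Weighted mean of per-position losses (normalization is an assumption)."""
    total = sum(w * l for w, l in zip(weights, per_position_loss))
    return total / sum(weights)


weights = position_weights(block_size=16, gamma=4.0)  # gamma is hypothetical
```

The first position always gets weight 1, and a smaller gamma concentrates the loss more sharply on the early positions.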

DFlash Performance (Research Benchmarks)

The original DFlash paper benchmarks on H200 GPUs show significant improvements over autoregressive drafting:

| Model | Autoregressive | EAGLE-3 | DFlash | Notes |
| --- | --- | --- | --- | --- |
| Qwen3-8B | 1x | 2.4x | 6.1x | 2.5x faster than EAGLE-3 |
| Qwen3-4B | 1x | ~2x | 4.91x | tau=6.54 acceptance length |

Average acceptance length reaches 6-8 tokens per cycle versus 3-4 for EAGLE-3, meaning the target model accepts significantly more draft tokens per verification pass.

What is DDTree?

DDTree (arXiv:2604.12989, April 2026) is a tree-based verification method from Technion (Liran Ringel, Yaniv Romano) that builds on DFlash to extract even more throughput from the same forward pass.

The Core Insight

Vanilla DFlash generates a full block of draft tokens but only verifies a single linear trajectory. The block diffusion forward pass produces a probability distribution over tokens at each position, but vanilla DFlash discards all but the top-1 prediction at each position. DDTree recovers this wasted information.

Instead of a single chain, DDTree uses the per-position distributions to construct an optimal draft tree, then verifies all branches in one target model forward pass using tree attention.

How It Works -- Key Algorithms

1. Factorized Distribution

A single block diffusion pass produces marginal distributions {q_i} for each position. These are not path-conditioned probabilities (the diffusion model treats positions independently), but they still contain useful signal. DDTree defines a factorized distribution:

Q(y|c, b) = product of q_i(y_i | c, b)

where c is the context, b is the bonus token, and y is the sequence of draft tokens.

2. Surrogate Objective with Mathematical Guarantee

The ideal objective is to maximize expected acceptance length under the target model's true probabilities, but those are unavailable during tree construction. DDTree uses the factorized distribution as a surrogate and proves (Proposition 1 in the paper) that this decomposes into an additive sum:

E[alpha_T(Y)] = sum_{u in T} q(u | c, b)

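Here q(u | c, b) is the factorized probability of prefix u, i.e. the product of the marginals along it. Selecting an optimal draft tree therefore reduces to choosing the B highest-probability prefixes, where B is the node budget -- a tractable problem.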
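
A small worked example may help. With two positions and made-up marginals, the surrogate value of a tree is the sum of the factorized probabilities of its nodes. The token names and numbers below are invented for illustration.

```python
from math import prod
from typing import Dict, List, Sequence, Tuple

Prefix = Tuple[str, ...]


def prefix_prob(prefix: Prefix, marginals: Sequence[Dict[str, float]]) -> float:
    """Factorized probability of a prefix: product of per-position marginals."""
    return prod(marginals[i][tok] for i, tok in enumerate(prefix))


def surrogate_value(tree: List[Prefix], marginals: Sequence[Dict[str, float]]) -> float:
    """Sum of prefix probabilities over all nodes of a prefix-closed tree."""
    return sum(prefix_prob(u, marginals) for u in tree)


# Toy marginals for two draft positions (made-up numbers).
marginals = [{"A": 0.6, "B": 0.3}, {"C": 0.5, "D": 0.4}]
tree = [("A",), ("B",), ("A", "C"), ("A", "D")]
value = surrogate_value(tree, marginals)  # 0.6 + 0.3 + 0.30 + 0.24
```

Because each node contributes independently to the sum, the best tree for a given budget is simply the set of most probable prefixes.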

3. Best-First Heap Algorithm

Rather than enumerating exponentially many prefixes, DDTree uses a max-heap to efficiently extract the top-B prefixes:

  1. Start with the most probable token at position 1
  2. When popping a prefix from the heap, push its next sibling (alternative token at the last position) and first child (extend with the most probable next token)
  3. Repeat B times

Complexity: O(B log B) -- negligible overhead compared to the target model forward pass.
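
The three steps above can be sketched with Python's heapq. This is a minimal sketch written from the description here, not the authors' code. It assumes strictly positive marginals and represents each prefix by the rank of its token at every position.

```python
import heapq
import math
from typing import Dict, List, Tuple

Prefix = Tuple[str, ...]


def build_draft_tree(marginals: List[Dict[str, float]],
                     budget: int) -> List[Tuple[Prefix, float]]:
    """Return up to `budget` prefixes with the highest factorized probability."""
    # Sort each position's candidates once, most probable first.
    ranked = [sorted(q.items(), key=lambda kv: -kv[1]) for q in marginals]
    if budget <= 0 or not ranked or not ranked[0]:
        return []

    # Heap entries: (negative log-prob, ranks), where ranks[i] indexes ranked[i].
    heap: List[Tuple[float, Tuple[int, ...]]] = [(-math.log(ranked[0][0][1]), (0,))]
    tree: List[Tuple[Prefix, float]] = []

    while heap and len(tree) < budget:
        neg_logp, ranks = heapq.heappop(heap)
        depth = len(ranks)
        tokens = tuple(ranked[i][r][0] for i, r in enumerate(ranks))
        tree.append((tokens, math.exp(-neg_logp)))

        # Next sibling: swap the last token for the next most probable one.
        last = ranks[-1]
        if last + 1 < len(ranked[depth - 1]):
            sib_logp = (-neg_logp
                        - math.log(ranked[depth - 1][last][1])
                        + math.log(ranked[depth - 1][last + 1][1]))
            heapq.heappush(heap, (-sib_logp, ranks[:-1] + (last + 1,)))

        # First child: extend with the most probable token at the next position.
        if depth < len(ranked) and ranked[depth]:
            child_logp = -neg_logp + math.log(ranked[depth][0][1])
            heapq.heappush(heap, (-child_logp, ranks + (0,)))

    return tree


# Toy marginals (made-up numbers) and a budget of four nodes.
toy = [{"A": 0.7, "B": 0.2, "C": 0.1}, {"x": 0.6, "y": 0.3, "z": 0.1}]
nodes = build_draft_tree(toy, budget=4)
```

Every prefix has exactly one predecessor (its previous sibling, or its parent if it is a first child), so nothing is pushed twice and a parent is always selected before its children. The up-front sort over each position is a simplification. A real implementation would likely keep only the top few candidates per position.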

4. Tree Verification

The selected tree is flattened into input tensors and verified with tree attention (ancestor-only mask). The target model walks through the tree, determining which path to accept. Because all branches are verified in parallel, the cost is essentially the same as verifying a single chain.
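
Both halves of this step fit in a short sketch: building the ancestor-only mask from a parent array, and walking the tree under greedy acceptance. The flattened layout (a `parents` list with -1 for nodes that hang directly off the context) and the function names are assumptions made for illustration.

```python
from typing import List, Sequence


def ancestor_mask(parents: Sequence[int]) -> List[List[bool]]:
    """mask[i][j] is True when node i may attend to node j.

    Node i sees itself and its ancestors only. parents[i] is the index of
    node i's parent, or -1 if the node hangs directly off the context.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True
            j = parents[j]
    return mask


def accept_path(parents: Sequence[int], tokens: Sequence[str],
                root_next: str, node_next: Sequence[str]) -> List[int]:
    """Follow the target's greedy choices through the tree.

    root_next is the target's top-1 token after the context. node_next[i] is
    its top-1 token after the path ending at node i. Returns accepted nodes.
    """
    accepted: List[int] = []
    current, wanted = -1, root_next
    while True:
        match = next((i for i, p in enumerate(parents)
                      if p == current and tokens[i] == wanted), None)
        if match is None:
            return accepted
        accepted.append(match)
        current, wanted = match, node_next[match]
```

For a tree with nodes A, A-x, A-y and B, `parents` is `[-1, 0, 0, -1]`. If the target wants "A" first and "y" after it, the walk accepts nodes 0 and 2. In a real decoder every node also attends to the already committed context, which this sketch leaves out.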

DDTree Performance Gains

DDTree improves every single benchmark entry across all 60 dataset-model-temperature combinations in the paper:

| Benchmark | Model | DFlash Alone | DFlash + DDTree | Improvement |
| --- | --- | --- | --- | --- |
| HumanEval | Qwen3-8B | 4.84x | 6.90x | tau: 6.61 -> 9.67 |
| MATH-500 | Qwen3-8B | 5.56x | 7.52x | tau: 7.79 -> 10.73 |
| LiveCodeBench | Qwen3-8B | 5.02x | 7.10x | -- |

In the rows above, DDTree adds roughly 35-43% more speedup on top of DFlash by turning a single draft chain into a rich search tree.

Running on Consumer Hardware: The Lucebox Project

The original DFlash implementation from Z Lab targets BF16 precision on B200 GPUs (60+ GB VRAM). Making this work on a single RTX 3090 with 24GB of VRAM required significant engineering effort. The Lucebox project (github.com/Luce-Org/lucebox-hub) is the first implementation to bridge this gap.

The Memory Budget Challenge

Fitting Qwen3.5-27B plus the DFlash draft model plus DDTree verification state into 24GB requires careful memory accounting:

| Component | Format | Memory Usage |
| --- | --- | --- |
| Qwen3.5-27B target model | Q4_K_M GGUF | ~16 GB |
| DFlash draft model | BF16 | ~3.46 GB |
| DDTree verify state (budget=22) | BF16 | ~0.5 GB |
| KV cache + overhead | Mixed | ~4 GB |
| Total | | ~24 GB |

This fits exactly -- and only just. BF16 for the target model would require ~54 GB total. AWQ INT4 leaves no room for DDTree state. Q4_K_M GGUF is the sweet spot, providing the best quality-to-memory ratio.
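
Adding up the approximate figures quoted above shows how little headroom is left. The numbers are the rounded values from the table, so the result is indicative only.

```python
BUDGET_GB = 24.0

# Approximate sizes quoted in the memory table (GB).
components_gb = {
    "target model (Q4_K_M)": 16.0,
    "DFlash draft (BF16)": 3.46,
    "DDTree verify state": 0.5,
    "KV cache + overhead": 4.0,
}

total_gb = sum(components_gb.values())
headroom_gb = BUDGET_GB - total_gb

print(f"total {total_gb:.2f} GB, headroom {headroom_gb:.2f} GB")
```

With these figures the pipeline uses about 23.96 GB, which is why anything else resident on the GPU can tip it over.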

Why Q4_K_M

Q4_K_M is a 4-bit k-quant format used by llama.cpp/GGUF. Q4_K tensors store weights in super-blocks of 256, each split into eight blocks of 32 weights with their own scale and minimum, which works out to about 4.5 bits per weight. The "M" (medium) mix keeps some of the tensors that are more sensitive to quantization error at higher precision. For Qwen3.5-27B:

  • Q4_K_M: ~16 GB (fits)
  • Q8_0: ~32 GB total (does not fit)
  • BF16: ~54 GB total (does not fit)

The quality loss from Q4_K_M is minimal for inference -- typically imperceptible in generated text -- while enabling the entire pipeline to run on consumer hardware.
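
The ~16 GB figure can be sanity-checked from the Q4_K block layout in ggml: one super-block covers 256 weights and stores 128 bytes of 4-bit values, 12 bytes of packed 6-bit scales and minimums, and two FP16 super-block scales. The 27e9 parameter count below is a nominal round number, so treat the result as a rough estimate.

```python
WEIGHTS_PER_SUPER_BLOCK = 256

quant_bytes = WEIGHTS_PER_SUPER_BLOCK * 4 // 8   # 4-bit values: 128 bytes
scale_bytes = 12                                  # packed 6-bit scales and mins
super_scale_bytes = 2 * 2                         # two FP16 values (d, dmin)

bytes_per_super_block = quant_bytes + scale_bytes + super_scale_bytes
bits_per_weight = bytes_per_super_block * 8 / WEIGHTS_PER_SUPER_BLOCK

nominal_params = 27e9  # assumption: round parameter count
approx_gb = nominal_params * bits_per_weight / 8 / 1e9

print(f"{bits_per_weight} bits/weight, ~{approx_gb:.1f} GB if every tensor were Q4_K")
```

Pure Q4_K would land a little above 15 GB. The tensors that the "M" mix keeps at higher precision account for the rest of the ~16 GB file.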

Custom CUDA Kernels

Lucebox implements approximately 2000 lines of C++/CUDA on top of ggml (the tensor library underlying llama.cpp), including three new tree-mode operations in a forked llama.cpp:

  • ggml_ssm_conv_tree -- tree-aware SSM convolution for handling Mamba/SSM layers in tree mode
  • ggml_gated_delta_net_tree -- tree-aware gated delta network operations
  • ggml_gated_delta_net_tree_persist -- persistent state variant for long-context inference

These kernels enable the target model to process tree-structured input during the verification phase, something that standard llama.cpp does not support.

128K Context on 24GB

Lucebox achieves 128K context length on the 24GB RTX 3090 through:

  • Q4_0 KV cache compression -- compresses the key-value cache to 4-bit
  • Sliding target_feat ring -- maintains a 4096-slot ring buffer of extracted target features

This achieves 128K context with only ~3% acceptance length degradation compared to FP16 KV cache, while saving 8x memory on the cache.
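
One way to picture the sliding target_feat ring is a fixed-capacity buffer indexed by token position modulo its size, so new features overwrite the oldest ones. The sketch below is an illustration under that assumption. `FeatureRing` and its methods are hypothetical names, not Lucebox's actual data structure.

```python
from typing import List, Optional, Tuple


class FeatureRing:
    """Keep target features for only the most recent `capacity` positions."""

    def __init__(self, capacity: int = 4096) -> None:
        self.capacity = capacity
        self.slots: List[Optional[Tuple[int, List[float]]]] = [None] * capacity

    def push(self, position: int, feature: List[float]) -> None:
        """Store the feature for a token position, evicting the oldest entry."""
        self.slots[position % self.capacity] = (position, feature)

    def get(self, position: int) -> Optional[List[float]]:
        """Return the stored feature, or None if it has been overwritten."""
        entry = self.slots[position % self.capacity]
        if entry is None or entry[0] != position:
            return None
        return entry[1]
```

Memory use stays constant at `capacity` slots no matter how long the context grows, at the cost of losing features for positions that have slid out of the window.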

Benchmarks on RTX 3090 (Qwen3.5-27B)

The Lucebox project provides comprehensive benchmarks comparing DFlash+DDTree against the autoregressive baseline and alternative approaches:

Throughput Comparison

| Benchmark | Autoregressive (tok/s) | DFlash+DDTree (tok/s) | Acceptance Length | Speedup |
| --- | --- | --- | --- | --- |
| HumanEval (10 prompts, n_gen=256) | 37.78 | 129.52 | 8.31 | 3.43x |
| Math500 | 37.71 | 110.51 | 7.04 | 2.93x |
| GSM8K | 37.65 | 96.15 | 6.14 | 2.55x |
| Demo peak (single prompt) | 38.0 | 207.6 | -- | 5.46x |

The demo peak of 207.6 tok/s shows the ceiling of DFlash+DDTree under ideal conditions. The benchmark workloads (HumanEval, Math500, GSM8K) sustain 96-130 tok/s depending on the task's token distribution.

Comparison with Other Approaches on RTX 3090

| Approach | Throughput (HumanEval) | Notes |
| --- | --- | --- |
| DFlash+DDTree (Lucebox) | 129.5 tok/s | Best overall |
| Chain speculative decoding (EAGLE) | ~113 tok/s | DDTree adds ~15% |
| SGLang AWQ | ~46 tok/s | 2.8x slower |
| llama.cpp autoregressive | ~38 tok/s | Baseline |

DFlash+DDTree significantly outperforms SGLang with AWQ quantization and even chain-based speculative decoding, making it the fastest of the approaches compared here for running Qwen3.5-27B on consumer hardware.

Context Length Impact

Longer contexts naturally slow down inference due to increased attention computation. Here's how decode speed varies with context length:

| Context Length | Decode Speed (tok/s) | Prefill Time |
| --- | --- | --- |
| 520 tokens | ~130 | 0.06s |
| 32K | ~85 | 38s |
| 64K | ~18 | 126s |
| 128K | ~15-20 | ~10 min |

For short-to-medium context workloads (up to 32K), DFlash+DDTree maintains excellent throughput. At 128K context, the prefill phase becomes the bottleneck, but decode speed remains usable at ~15-20 tok/s.

Qwen3.6-27B Cross-Compatibility

Qwen3.6-27B shares the same architecture as Qwen3.5-27B, allowing it to load as a drop-in target model. However, the DFlash draft was trained on Qwen3.5 hidden states, resulting in reduced acceptance:

| Target Model | Benchmark | Acceptance Length | Accept Rate | tok/s |
| --- | --- | --- | --- | --- |
| Qwen3.5-27B | HumanEval | 8.33 | ~65% | 134.78 |
| Qwen3.6-27B | HumanEval | 4.74 | 30.6% | 73.67 |

Even with reduced acceptance, Qwen3.6-27B still achieves 73.67 tok/s -- nearly 2x the autoregressive baseline. Training a Qwen3.6-specific draft model would restore full performance.

Setup Instructions (RTX 3090)

Prerequisites

  • NVIDIA RTX 3090 (24GB VRAM, sm_86 architecture)
  • CUDA 12+ toolkit
  • CMake 3.18+
  • Python 3.10+

Installation

# Clone the repository with submodules
git clone --recurse-submodules https://github.com/Luce-Org/lucebox-hub
cd lucebox-hub/dflash

# Build the CUDA kernels (~3 minutes on RTX 3090)
cmake -B build -S . \
  -DCMAKE_CUDA_ARCHITECTURES=86 \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --target test_dflash -j

Download Models

# Download Qwen3.5-27B target model (Q4_K_M GGUF, ~16 GB)
huggingface-cli download unsloth/Qwen3.5-27B-GGUF \
  Qwen3.5-27B-Q4_K_M.gguf \
  --local-dir models/

# Download DFlash draft model (BF16, ~3.46 GB)
huggingface-cli download z-lab/Qwen3.5-27B-DFlash \
  model.safetensors \
  --local-dir models/draft/

Total download: approximately 19.5 GB.

Running Inference

# Single prompt inference
python3 scripts/run.py --prompt "def fibonacci(n):"

# Start an OpenAI-compatible API server
python3 scripts/server.py --port 8000 --daemon

# Test with curl
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-27b",
    "messages": [{"role": "user", "content": "Write a Python function to compute Fibonacci numbers"}],
    "temperature": 0.7
  }'
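
The same request can be sent from Python using only the standard library. This sketch mirrors the curl call above and assumes the server returns the usual OpenAI chat completions response shape. The helper names are illustrative.

```python
import json
import urllib.request


def build_request(prompt: str,
                  base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the POST request for the chat completions endpoint."""
    payload = {
        "model": "qwen3.5-27b",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, timeout: float = 120.0) -> str:
    """Send one prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    # Assumed response shape: OpenAI-style choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Write a Python function to compute Fibonacci numbers"))` should print the generated reply.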

Configuration Tuning

The DDTree budget controls the tradeoff between verification cost and acceptance length. For RTX 3090:

  • Budget=22 (recommended) -- fits within 24GB, optimal for this hardware
  • Budget=256-512 -- optimal for H200+ GPUs with more memory
  • Lower budget = less memory, shorter acceptance length
  • Higher budget = more memory, longer acceptance length (diminishing returns)

Complementary Optimizations

TurboQuant KV Cache Compression

Google DeepMind's TurboQuant (ICLR 2026) compresses KV cache to 3-bit using Walsh-Hadamard Transform + Lloyd-Max quantization (github.com/AmesianX/TurboQuant). When combined with DFlash, TurboQuant enables even longer contexts on 24GB VRAM by reducing the KV cache footprint beyond what Q4_0 achieves.

Why This Matters for Production

The combination of DFlash, DDTree, Q4_K_M GGUF quantization, and custom CUDA kernels represents a significant milestone: 27B-parameter models are now practical for real-time applications on a single consumer GPU. Use cases include:

  • Local code generation -- 129 tok/s sustained throughput makes interactive coding assistance practical
  • Batch data processing -- bulk classification, summarization, or extraction tasks complete in a fraction of the time
  • Self-hosted AI assistants -- run a 27B model locally with response times competitive with cloud APIs
  • Edge deployment -- the same approach should carry over to other sm_86 or newer GPUs once the kernels are recompiled, enabling deployment in environments without cloud access

Trade-offs and Limitations

DFlash+DDTree is not without trade-offs:

  • Hardware specificity -- the custom CUDA kernels are built for sm_86 (RTX 3090/3080 Ti). Other architectures require recompilation and may need kernel adjustments
  • Draft model dependency -- the DFlash draft is trained specifically for Qwen3.5-27B. Using a different target model (Qwen3.6, other architectures) results in reduced acceptance
  • Memory tightness -- the 24GB budget is tight. Running any additional processes on the GPU (system compositor, other workloads) may cause OOM errors
  • Long context prefill -- at 128K context, the prefill phase takes ~10 minutes. DFlash+DDTree accelerates the decode phase, not the prefill
  • Software maturity -- the Lucebox implementation is early-stage. It forks llama.cpp with custom ops not yet upstreamed

Conclusion

DFlash and DDTree change both halves of speculative decoding. By replacing autoregressive drafting with block diffusion and linear verification with tree-based search, they achieve a 3.4x speedup on HumanEval for Qwen3.5-27B on a single RTX 3090 -- a consumer GPU with 24GB VRAM.

The key technical breakthroughs are:

  1. DFlash's block diffusion draft -- predicts 16 tokens in one forward pass by conditioning on target model hidden states, eliminating the sequential bottleneck of autoregressive drafting
  2. DDTree's tree verification -- recovers the information discarded by vanilla DFlash's top-1 selection, building an optimal draft tree that the target verifies in a single pass
  3. Q4_K_M GGUF quantization -- compresses the 27B target model to ~16 GB, making the entire pipeline fit within 24 GB alongside the draft model and verification state

The Lucebox project demonstrates that research-grade inference acceleration is no longer exclusive to data centers with H200 clusters. With approximately 2000 lines of custom CUDA code and careful memory management, these techniques work on hardware available to anyone with ~$2,000 and a PCIe slot.

For developers and organizations deploying large language models on consumer or edge hardware, DFlash+DDTree is the current state of the art. As draft models become available for more target architectures and the custom kernels mature, expect these techniques to become the default for local LLM serving.

Resources: