MUX: Continuous Reasoning via Multiplexed Tokens

TL;DR

MUX compresses discrete reasoning chains into continuous latent tokens via lossless linear superposition — achieving state-of-the-art latent reasoning across all 32 evaluation settings while using 2.4–5.9× fewer reasoning tokens than standard chain-of-thought.

Abstract

Language models solve complex problems by articulating intermediate reasoning steps in natural language. While effective, this process is computationally bottlenecked: each reasoning step conveys only a single subword, and many of those subwords go toward phrasing a thought rather than carrying out computation.

We propose MUX, a simple method for high-bandwidth and compact reasoning that distills discrete reasoning into continuous multiplexed tokens in a latent space. Each latent token is trained to represent a weighted linear superposition (multiplexing) of a span of discrete reasoning subwords. The superposition is lossless by construction, so the span can be fully recovered from it (demultiplexing).

We prove that simple position-dependent weightings, such as suitable geometric decay, support lossless multiplexing, which in turn prevents shortcut behaviors caused by latent collapse. We further show that multiplexed reasoning can perform parallel exploration in problems that require search.

Across 32 evaluation settings spanning four language models, MUX outperforms strong latent reasoning baselines. Ablation and probing analyses further show that the learned latent tokens encode faithful and interpretable reasoning. Our results suggest that using lossless superpositions as local learning targets is sufficient for strong and efficient continuous latent reasoning.

Key Contributions

01

Latent reasoning via multiplexed tokens

A local distillation method for continuous latent reasoning based on multiplexed targets. For each latent token, we define a vocabulary-space target by taking a position-weighted linear superposition of the one-hot encodings of the subwords in its corresponding discrete reasoning span.

02

Lossless multiplexing

We identify simple classes of positional weightings (geometric, sinusoidal, rotary) that guarantee lossless multiplexing via a subset-sum separation condition. Lossless multiplexing prevents shortcut behaviors caused by latent collapse.

03

Parallel search via multiplexing

Multiplexed tokens are expressive enough to represent and update multiple hypotheses simultaneously, implementing each step of a breadth-first search (BFS) with a single latent token. Parallel search can emerge naturally from serial supervision.

04

State-of-the-art results

Best latent reasoning method across 32 mathematical reasoning settings spanning two training corpora, four language models, and four test sets. Surpasses strong discrete and continuous reasoning baselines on two search benchmarks.
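The subset-sum separation condition mentioned under contribution 02 can be illustrated with a brute-force check. The sketch below assumes one natural reading of the condition: every subset of positional weights must have a distinct sum, so that the total weight on a vocabulary entry identifies exactly which positions it occupied. The function names, the tolerance, and the decay ratio of 1/2 are illustrative choices, not taken from the paper.

```python
from itertools import combinations

def geometric(span_len, ratio):
    """Geometric-decay weights, normalized to sum to 1 (illustrative choice)."""
    raw = [ratio ** j for j in range(span_len)]
    total = sum(raw)
    return [w / total for w in raw]

def subset_sums_are_separated(weights, tol=1e-9):
    """True if all subsets of `weights` have pairwise distinct sums."""
    n = len(weights)
    sums = []
    for k in range(n + 1):
        for idx in combinations(range(n), k):
            sums.append(sum(weights[i] for i in idx))
    sums.sort()
    # Adjacent sorted sums must differ by more than the tolerance.
    return all(b - a > tol for a, b in zip(sums, sums[1:]))

print(subset_sums_are_separated(geometric(6, 0.5)))  # geometric decay
print(subset_sums_are_separated([0.25] * 4))         # uniform weights
```

Under this reading, uniform weights fail because any two positions are interchangeable: the spans "5+3" and "3+5" would produce the same superposition. The check is exponential in the span length and is only meant for small examples.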

Method

Lossless multiplexing of a span «5+3=8» through position-weighted linear superposition.

Given a discrete reasoning span (r₁, ..., r_S), MUX constructs a vocabulary-space target via:

\[ \operatorname{mux}(r) = \sum_{j=1}^{S} \alpha_j \, \operatorname{onehot}(r_j) \]

where the coefficients α_j are position-dependent weights, normalized so that the target mux(r) lies on the vocabulary simplex. The model is trained to match these targets by minimizing a KL divergence through a linear-softmax head.
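As a rough illustration of the target and loss described above, the following NumPy sketch builds a multiplexed target and evaluates a KL divergence against a linear-softmax prediction. The direction KL(target || prediction), the head parameters `W` and `b`, the token ids, and the example weights are all assumptions made for this sketch; the page does not specify them.

```python
import numpy as np

def mux_target(span, alpha, vocab_size):
    """Superpose one-hot encodings of `span` with weights `alpha`."""
    target = np.zeros(vocab_size)
    np.add.at(target, span, alpha)  # repeated subwords accumulate weight
    return target

def softmax(logits):
    shifted = logits - logits.max()  # subtract max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def kl_loss(target, hidden, W, b):
    """KL(target || softmax(W @ hidden + b)), taking 0 * log 0 as 0."""
    pred = softmax(W @ hidden + b)
    support = target > 0
    return float(np.sum(target[support] * np.log(target[support] / pred[support])))

rng = np.random.default_rng(0)
vocab_size, hidden_dim = 12, 8
span = [5, 10, 3]                      # hypothetical token ids
alpha = np.array([4.0, 2.0, 1.0]) / 7  # example weights summing to 1
target = mux_target(span, alpha, vocab_size)

W = rng.normal(size=(vocab_size, hidden_dim))  # stand-in for a learned head
b = np.zeros(vocab_size)
hidden = rng.normal(size=hidden_dim)           # stand-in for a latent token
loss = kl_loss(target, hidden, W, b)
print(f"KL loss for a random head: {loss:.3f}")
```

The loss is zero only when the predicted distribution reproduces every coefficient of the target, which is what ties each latent token to its full span rather than to a single subword.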

We prove that geometric, sinusoidal, and rotary weightings all support lossless multiplexing—the original span can be exactly recovered from the superposition.
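A minimal round trip for the geometric case might look like the sketch below. It assumes a decay ratio of at most 1/2, chosen here so that each weight exceeds the sum of all later weights and a greedy argmax can peel off one position at a time. The paper's exact condition and decoding procedure may differ. The toy character vocabulary and the function names are hypothetical, and the span length is assumed to be known at decoding time.

```python
import numpy as np

def geometric_weights(span_len, ratio=0.5):
    raw = ratio ** np.arange(span_len)  # 1, ratio, ratio^2, ...
    return raw / raw.sum()

def mux(span, vocab_size, ratio=0.5):
    """Position-weighted superposition of the span's one-hot encodings."""
    target = np.zeros(vocab_size)
    np.add.at(target, span, geometric_weights(len(span), ratio))
    return target

def demux(target, span_len, ratio=0.5):
    """Greedy recovery: the largest residual entry owns the next position."""
    residual = np.array(target, dtype=float)
    span = []
    for weight in geometric_weights(span_len, ratio):
        token = int(np.argmax(residual))
        span.append(token)
        residual[token] -= weight  # remove this position's contribution
    return span

vocab = list("0123456789+=")  # toy character-level vocabulary
span = [vocab.index(ch) for ch in "5+3=8"]
recovered = demux(mux(span, len(vocab)), len(span))
print("".join(vocab[i] for i in recovered))  # prints 5+3=8
```

Repeated subwords are handled as well: their weights add up on one vocabulary entry, and the greedy pass selects that entry once per occupied position. With a ratio above 1/2 this greedy argument no longer holds in general.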

Positioning of MUX

Method	Supervision	Lossless	Shortcut-free	Train Eff.	Infer. Eff.	Interpretable
SFT-CoT	Discrete	✓	✓	✓	✗	✓
CODI	Global	✗	✗	✓	✓	✗
SIM-CoT	Local	✓	✓	✗	✓	✓
KaVa	Local	✗	✓	✓	✓	✓
MUX	Local	✓	✓	✓	✓	✓

Results

32/32

Best latent reasoning across all evaluation settings

15

Settings where MUX outperforms discrete SFT-CoT

2.4–5.9×

Fewer reasoning tokens than SFT-CoT

Mathematical Reasoning

Test accuracies (%). Underlined when MUX outperforms SFT-CoT. MUX reports ±1 std over 3 seeds.

Training corpus	GSM8K-AUG				GSM8K-AUG-NL
Method	ID	SVAMP	GSM-Hard	MultiArith	ID	SVAMP	GSM-Hard	MultiArith
GPT-2
SFT-CoT	44.1	41.8	9.8	90.7	34.2	36.9	7.1	88.7
CODI	43.7	42.9	9.9	92.8	34.1	30.8	6.8	58.9
SIM-CoT	42.6	42.6	9.4	92.8	30.9	27.5	6.5	53.9
MUX	48.1	45.0	10.6	93.0	37.4	36.7	8.9	72.4
LLaMA 3.2 1B-Instruct
SFT-CoT	61.6	66.7	15.6	99.3	53.2	62.9	13.3	98.5
Coconut	45.3	48.8	9.9	90.1	24.2	—	—	—
CODI	55.6	61.1	12.8	96.1	47.9	55.3	11.3	96.7
SIM-CoT	56.1	61.5	12.7	96.2	28.4	43.0	6.6	59.4
MUX	56.7	63.6	13.0	98.5	50.3	57.5	11.6	96.9

Scaling to Larger Models (GSM8K-AUG)

Model	LLaMA 3.2 3B				LLaMA 3.1 8B
Method	ID	SVAMP	GSM-Hard	MultiArith	ID	SVAMP	GSM-Hard	MultiArith
SFT-CoT	71.5	71.0	17.0	98.3	71.7	73.1	16.5	98.3
CODI	60.8	73.3	14.3	98.7	61.1	78.1	15.5	99.5
SIM-CoT	62.3	74.9	14.6	98.8	64.1	79.4	16.3	100.0
MUX	65.0	77.1	15.2	100.0	68.1	80.1	17.1	100.0

Parallel Search

Search accuracies (%) averaged over 3 seeds.

Method	MNNS	Game of 24
No-CoT	68.4	74.4
SFT-CoT	84.6	84.3
Coconut	92.8	78.6
CoT²	98.9	85.0
MUX	99.6	88.7
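The idea that a single token can carry an entire BFS frontier can be illustrated with a hand-written toy. The sketch below is not the learned mechanism and does not model MNNS or Game of 24. It only shows how a normalized superposition over graph nodes can be expanded by one linear step; the graph, the adjacency convention, and the function name are assumptions.

```python
import numpy as np

def bfs_step(frontier, adjacency):
    """Expand every node in a superposed frontier in one step.

    `adjacency[s, t] == 1` means there is an edge s -> t.
    """
    reached = adjacency.T @ frontier        # mass flows along outgoing edges
    expanded = (reached > 0).astype(float)  # keep each reached node once
    total = expanded.sum()
    return expanded / total if total > 0 else expanded

# Toy directed graph: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3, 3 -> 4
adjacency = np.zeros((5, 5))
for source, dest in [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]:
    adjacency[source, dest] = 1.0

frontier = np.zeros(5)
frontier[0] = 1.0  # start the search at node 0
for depth in range(1, 3):
    frontier = bfs_step(frontier, adjacency)
    print(depth, np.flatnonzero(frontier))
```

After the first step the frontier holds nodes 1 and 2 with equal weight, so both hypotheses are advanced together in the following step instead of being visited one after another.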

Interpretability Analysis

Through probing analysis, we show that MUX latent tokens encode faithful and interpretable reasoning content. When latent tokens are projected through the LM head, the top decoded subwords closely match the aligned discrete reasoning spans.
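The probing step can be sketched as a top-k readout of LM-head logits. In the toy below the head is an identity matrix, so the logits equal the superposition coefficients. A real LM head is a learned projection, and the vocabulary, the weights, and the function name here are hypothetical.

```python
import numpy as np

def top_k_decode(latent, lm_head, vocab, k=5):
    """Return the k subwords with the largest LM-head logits."""
    logits = lm_head @ latent
    order = np.argsort(logits)[::-1][:k]  # indices sorted by descending logit
    return [vocab[i] for i in order]

vocab = list("0123456789+=")  # toy character-level vocabulary
lm_head = np.eye(len(vocab))  # toy head: logits equal the latent coordinates

# A latent that superposes "5", "+", "3" with decaying weights 4/7, 2/7, 1/7.
latent = np.zeros(len(vocab))
for weight, ch in zip([4 / 7, 2 / 7, 1 / 7], "5+3"):
    latent[vocab.index(ch)] += weight

print(top_k_decode(latent, lm_head, vocab, k=3))  # prints ['5', '+', '3']
```

In this toy setting the ranking of decoded subwords mirrors the positional weights, which is why the top entries can be compared against the aligned discrete span.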

Qualitative interpretability results on mathematical reasoning

Mathematical reasoning (GSM8K-AUG): Top-5 LM-head decoded subwords per latent token. MUX recovers interpretable reasoning content.

Parallel search (MNNS): Latent tokens encode the full search frontier, demonstrating parallel exploration of multiple hypotheses in superposition.

Quantitative interpretability metrics. Panels: mathematical reasoning metrics; parallel search metrics (trace & frontier).

Attention Analysis

Attention routing through continuous reasoning tokens. Attention flows from the answer token through the latent reasoning sequence, showing that latent tokens actively contribute to the final prediction.

BibTeX

@misc{suleymanzade2026mux,
  title={{MUX}: Continuous Reasoning via Multiplexed Tokens},
  author={Suleymanzade, Ayhan and Gozeten, Halil Alperen and Bronstein, Michael and Ceylan, \.{I}smail \.{I}lkan and Kim, Jinwoo},
  year={2026},
  note={Forthcoming}
}