dybilar

2506.21734 HIERARCHICAL REASONING MODEL

paper

hrm

Note to self dyb:

Strip away the prose and HRM boils down to:

Two reusable “loops” (L1 & L2)
  • L1 = fast, local, resets every T steps.
  • L2 = slow, global, updates once per T steps, then broadcasts a new context to L1.

A tiny data recipe
  • ~1,000 (input, output) pairs per task—no pre-training, no human scratchpads.

One cheap training trick
  • Back-propagate only through the last state of each loop → constant memory, no BPTT (see the sketch below).

Optional “think longer” switch
  • A Q-learning head decides when to stop (ACT), so harder puzzles get more L1–L2 cycles at inference.

Everything else (brain analogy, dimensionality stories, etc.) is supporting narrative.
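A minimal sketch of those two loops and the last-state-only backprop, assuming toy sizes and generic `nn.TransformerEncoderLayer` blocks standing in for the paper's H and L modules (the way states are combined by simple addition is also an assumption, not the released code):

```python
import torch
import torch.nn as nn

class TwoLoopCore(nn.Module):
    """Toy HRM-style core: a fast low-level (L) loop nested inside a slow
    high-level (H) loop. Block types, sizes, and state mixing are illustrative."""
    def __init__(self, d_model=256, n_heads=4, T=4, N=2):
        super().__init__()
        self.T, self.N = T, N  # T = L-updates per H-update, N = H-updates per segment
        self.l_block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.h_block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    def forward(self, x_emb, z_h, z_l):
        # Run all but the final H- and L-updates without building a graph:
        # this is the "backprop only through the last state" trick (no BPTT, O(1) memory).
        with torch.no_grad():
            for i in range(self.N * self.T - 1):
                z_l = self.l_block(z_l + z_h + x_emb)   # fast loop, conditioned on H context + input
                if (i + 1) % self.T == 0:
                    z_h = self.h_block(z_h + z_l)       # slow loop broadcasts new context every T steps
        z_l = self.l_block(z_l + z_h + x_emb)           # only these last two updates carry gradients
        z_h = self.h_block(z_h + z_l)
        return z_h, z_l
```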

Five-Lens TL;DR

  • Expert: A 27 M-parameter recurrent architecture that replaces Chain-of-Thought with two nested recurrence loops (fast low-level + slow high-level). It eliminates BPTT, attains depth via hierarchical convergence, and solves ARC-AGI, Sudoku-Extreme, and 30×30 maze search at 40–80 % with only ~1,000 examples—no pre-training, no CoT.
  • Practitioner: Ships as a compact PyTorch module (1-step gradient, constant memory) that you can train from scratch on 1k examples in <2 GPU-days. Inference scales linearly with the compute budget; add more ‘thinking steps’ for harder puzzles without re-training.
  • General Public: Imagine a tiny AI that can finish expert-level Sudoku or draw the shortest path through a giant maze after seeing only 1,000 practice puzzles—no need for the huge models or ‘thinking out loud’ tricks used by ChatGPT.
  • Skeptic: Impressive, but only tested on synthetic puzzles. No evidence on open-ended language tasks; 1k examples may still be too many for real-world adoption; and the brain analogy is suggestive, not proven.
  • Decision-Maker: Low-data, low-parameter path to high-stakes planning tasks (logistics, chip layout, verification). If verified on industrial problems, could cut cloud-inference cost 10–100× versus frontier LLMs.

2. What Real-World Problem Is This Solving?

Current LLMs need:
  • Large pre-training corpora (terabytes)
  • Explicit Chain-of-Thought supervision (human-written scratchpads)
  • High token budgets (latency & cost)

HRM attacks the “data-hungry, token-hungry, compute-hungry” triad for algorithmic reasoning tasks where the answer must be exact (e.g., Sudoku grids, shortest path, ARC pattern completion).

3. Counterintuitive Take-Aways

  • Smaller is deeper: 27 M parameters beat LLMs with 175 M–1 T parameters on tree-search-style tasks.
  • CoT is optional, not mandatory—latent hidden-state reasoning suffices.
  • Biological inspiration (theta-gamma coupling) yields better engineering metrics than pure scaling.

4. Jargon-to-Plain Dictionary

  • Hierarchical Reasoning Model (HRM): “A mini-brain with two gears: a slow planner (high-level) and a fast executor (low-level) that talk every few milliseconds.”
  • Hierarchical convergence: “Let the fast gear finish its mini-task, then reset it with new instructions; repeat until the slow gear is satisfied.”
  • 1-step gradient approximation: “Learn by looking only at the final move, not replaying every step—like learning chess by checking the end position.”
  • Deep supervision: “Grade the student after every short exam, not only after the whole course.”
  • Adaptive Computation Time (ACT): “Hard puzzle? Think longer. Easy puzzle? Stop early—automatically.”

5. Methodology Highlights

Architecture
  • Two encoder-only Transformer blocks of identical size: the low-level (L) block updates every step; the high-level (H) block updates once every T steps (T ≈ 4–8).
  • Input/Output: sequence-to-sequence over flattened 2-D grids.
  • Post-Norm, RMSNorm, RoPE, GLU—modern LLM best practices.
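A rough sketch of one such block (Post-Norm residuals, RMSNorm, SwiGLU feed-forward) under assumed dimensions; RoPE is omitted for brevity and nothing here is the authors' exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PostNormEncoderBlock(nn.Module):
    """Illustrative encoder block: self-attention + SwiGLU feed-forward, with
    RMSNorm applied *after* each residual add (Post-Norm). RoPE would normally
    be applied to queries/keys inside the attention; it is left out for brevity."""
    def __init__(self, d_model=512, n_heads=8, d_ff=1536):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)  # GLU gate branch
        self.w_up = nn.Linear(d_model, d_ff, bias=False)    # GLU value branch
        self.w_down = nn.Linear(d_ff, d_model, bias=False)
        self.norm1 = nn.RMSNorm(d_model)  # torch >= 2.4; otherwise substitute a hand-rolled RMSNorm
        self.norm2 = nn.RMSNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + attn_out)                              # Post-Norm: normalize after the residual
        ff = self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))  # SwiGLU feed-forward
        return self.norm2(x + ff)
```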

Training
  • Random initialization; no pre-training.
  • 1k examples per task (ARC, Sudoku-Extreme, Maze-Hard).
  • Loss: cross-entropy on outputs + binary cross-entropy for the ACT Q-values.
  • Optimizer: Adam-atan2, constant LR, weight decay.
  • Memory: O(1) backprop via a 1-step gradient approximation inspired by Deep Equilibrium Models (implicit function theorem + Neumann-series truncation).
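A deep-supervision training step might look roughly like the sketch below, reusing the toy `TwoLoopCore` from earlier; `embed`, `lm_head`, and plain Adam are assumed stand-ins (the ACT Q-value BCE term is omitted):

```python
import torch
import torch.nn.functional as F

def train_step(core, embed, lm_head, opt, x_tokens, y_tokens, n_segments=4):
    """Deep-supervision sketch: one loss and one optimizer update per segment,
    with the carried state detached in between. `opt` is assumed to be a
    torch.optim.Adam over core/embed/lm_head parameters (standing in for Adam-atan2)."""
    x_emb = embed(x_tokens)                      # (B, S, d)
    z_h = torch.zeros_like(x_emb)
    z_l = torch.zeros_like(x_emb)
    last_loss = None
    for _ in range(n_segments):                  # "grade after every short exam"
        z_h, z_l = core(x_emb, z_h, z_l)         # one N*T-step HRM segment
        logits = lm_head(z_l)                    # (B, S, vocab)
        loss = F.cross_entropy(logits.flatten(0, 1), y_tokens.flatten())
        opt.zero_grad()
        loss.backward()                          # gradient stays within this segment
        opt.step()
        z_h, z_l = z_h.detach(), z_l.detach()    # no gradient flows across segments
        last_loss = loss.item()
    return last_loss
```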

Inference
  • A halting controller (Q-head) decides when to stop.
  • Extra compute budget can be added at test time without re-training.
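An ACT-style halting loop could be sketched as follows, with an assumed `q_head` producing halt/continue Q-values from the pooled H-state (per-example halting and the paper's exploration details are omitted):

```python
import torch

@torch.no_grad()
def solve(core, embed, lm_head, q_head, x_tokens, max_segments=16):
    """Halting sketch: run HRM segments until the (assumed) Q-head prefers 'halt'.
    Modules are expected to be in eval mode; names follow this note, not released code."""
    x_emb = embed(x_tokens)
    z_h = torch.zeros_like(x_emb)
    z_l = torch.zeros_like(x_emb)
    for _ in range(max_segments):
        z_h, z_l = core(x_emb, z_h, z_l)
        q = q_head(z_h.mean(dim=1))          # pooled H-state -> (halt, continue) Q-values
        if (q[:, 0] > q[:, 1]).all():        # batch-level halting, for simplicity
            break                            # harder inputs naturally run more segments
    return lm_head(z_l).argmax(dim=-1)       # predicted output-grid tokens
```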

6. Quantified Results (± values are 95 % bootstrap CIs where available)

| Task | Metric | HRM (27 M) | Best CoT LLM | Δ |
|---|---|---|---|---|
| ARC-AGI-1 | % solved | 40.3 ± 1.8 | o3-mini-high: 34.5 | +5.8 |
| ARC-AGI-2 | % solved | 50.0 ± 2.0 | Claude-3.7: 21.2 | +28.8 |
| Sudoku-Extreme | % exact | 74.5 ± 2.3 | 0 (CoT fails) | +74.5 |
| Maze-Hard (30×30) | % optimal | 55.0 ± 3.1 | <20 (175 M Transformer) | +35–55 |
  • Compute:
  • Training: ~8 A100-hours for 1k examples.
  • Inference: 4–16 internal “segments”, each ~2× cost of a 27 M Transformer forward pass.

7. Deployment Considerations

  • Integration Paths:
  • Drop-in replacement for Transformer policy heads in RL pipelines.
  • Edge deployment: 27 M params fit on mobile GPUs (<120 MB fp16).
  • API design: expose a “max_think_steps” knob to the client (see the sketch after this list).
  • Challenges:
  • Requires grid-structured or token-sequence I/O; not yet text-native.
  • Gradient approximation may diverge on very long sequences (>512 steps).
  • ACT Q-learning can be brittle without weight decay & post-norm.
  • User Experience:
  • Sub-second latency for Sudoku on RTX-4090 with 4 segments.
  • ARC tasks need 8-16 segments → ~2–4 s, still faster than multi-sample CoT.
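As a hypothetical illustration of that “max_think_steps” knob, a service-side request/response shape might look like this (all names and fields are invented for the example, not part of any released HRM API):

```python
from dataclasses import dataclass

@dataclass
class SolveRequest:
    grid_tokens: list[int]      # flattened 2-D puzzle grid
    max_think_steps: int = 8    # client-controlled upper bound on HRM segments

@dataclass
class SolveResponse:
    solution_tokens: list[int]
    segments_used: int          # how long the model actually "thought"

def handle(req: SolveRequest, model) -> SolveResponse:
    # `model.solve` is a hypothetical wrapper around the ACT loop sketched in §5,
    # returning the predicted tokens and the number of segments actually used.
    tokens, used = model.solve(req.grid_tokens, max_segments=req.max_think_steps)
    return SolveResponse(solution_tokens=tokens, segments_used=used)
```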

8. Limitations & Assumptions

  • Tasks are fully observable, deterministic, symbolic.
  • The 1k-example recipe holds only when the task distribution is narrow.
  • No natural-language reasoning tested; unclear how hierarchy scales to ambiguity.
  • The one-step gradient is an approximation; no theoretical error or convergence bound is provided.
  • The brain analogy is correlational (participation-ratio dimensionality), not causal.

9. Future Directions

  • Hybrid HRM-Transformer layers for language + logic (e.g., math word problems).
  • Incorporate hierarchical memory (Clockwork RNN style) for 100k-step horizons.
  • Meta-learning outer loop to set T, N, and curriculum automatically.
  • Formal verification of Turing-completeness under finite-precision.
  • Neuromorphic ASIC exploiting the low-memory, event-driven nature.

10. Conflicts & Biases

  • All authors are employees or affiliates of Sapient Intelligence (Singapore). No external funding or evaluation committee is disclosed. Benchmark selection (ARC, Sudoku, Maze) favors discrete symbolic tasks; broader NLP or safety-critical domains may reveal different relative strengths.