Note to self dyb:
Strip away the prose and HRM boils down to:
- Two reusable “loops” (L1 & L2):
  - L1 = fast, local, resets every T steps
  - L2 = slow, global, updates once per T steps, then broadcasts a new context to L1
- A tiny data recipe: ~1,000 (input, output) pairs per task; no pre-training, no human scratchpads.
- One cheap training trick: back-propagate only through the last state of each loop → constant memory, no BPTT.
- An optional “think longer” switch: a Q-learning head decides when to stop (ACT), so harder puzzles get more L1/L2 cycles at inference.
Everything else (brain analogy, dimensionality stories, etc.) is supporting narrative.
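Below is a minimal PyTorch sketch of the two-loop core. GRU cells stand in for the paper's Transformer blocks, and the names `f_L`, `f_H`, `T`, `N`, and `d_model` are illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

class TwoLoopCore(nn.Module):
    """Nested recurrence: a fast L1 loop inside a slow L2 loop."""

    def __init__(self, d_model=256, T=4, N=2):
        super().__init__()
        self.f_L = nn.GRUCell(d_model, d_model)  # L1: fast, local updates
        self.f_H = nn.GRUCell(d_model, d_model)  # L2: slow, global updates
        self.T, self.N = T, N

    def forward(self, x, z_L, z_H):
        for _ in range(self.N):              # N high-level cycles
            for _ in range(self.T):          # T fast steps under a fixed L2 context
                z_L = self.f_L(x + z_H, z_L)
            z_H = self.f_H(z_L, z_H)         # L2 updates once per T steps; its new
                                             # state re-contextualizes (“resets”) L1
        return z_L, z_H

# Toy usage: all states are (batch, d_model) tensors.
core = TwoLoopCore()
x = torch.randn(1, 256)
z_L, z_H = core(x, torch.zeros(1, 256), torch.zeros(1, 256))
```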
1. Five-Lens TL;DR
- Expert: A 27 M-parameter recurrent architecture that replaces Chain-of-Thought with two nested recurrence loops (fast low-level + slow high-level). It eliminates BPTT, attains depth via hierarchical convergence, and solves ARC-AGI, Sudoku-Extreme, and 30×30 maze search at 40–80% with only ~1,000 examples; no pre-training, no CoT.
- Practitioner: Ships as a compact PyTorch module (1-step gradient, constant memory) that you can train from scratch on 1k examples in <2 GPU-days. Inference scales linearly with compute budget; add more ‘thinking steps’ for harder puzzles without re-training.
- General Public: Imagine a tiny AI that can finish expert-level Sudoku or draw the shortest path through a giant maze after seeing only 1,000 practice puzzles—no need for the huge models or ‘thinking out loud’ tricks used by ChatGPT.
- Skeptic: Impressive, but only tested on synthetic puzzles. No evidence on open-ended language tasks; 1k examples may still be too many for real-world adoption; and the brain analogy is suggestive, not proven.
- Decision-Maker: Low-data, low-parameter path to high-stakes planning tasks (logistics, chip layout, verification). If verified on industrial problems, could cut cloud-inference cost 10–100× versus frontier LLMs.
2. What Real-World Problem Is This Solving?
Current LLMs need:
- Large pre-training corpora (terabytes)
- Explicit Chain-of-Thought supervision (human-written scratchpads)
- High token budgets (latency & cost)
HRM attacks the “data-hungry, token-hungry, compute-hungry” triad for algorithmic reasoning tasks where the answer must be exact (e.g., Sudoku grids, shortest path, ARC pattern completion).
3. Counterintuitive Take-Aways
- Smaller is deeper: a 27 M-parameter model beats 175 M–1 T-parameter LLMs on tree-search tasks.
- CoT is optional, not mandatory—latent hidden-state reasoning suffices.
- Biological inspiration (theta-gamma coupling) yields better engineering metrics than pure scaling.
4. Jargon-to-Plain Dictionary
- Hierarchical Reasoning Model (HRM): “A mini-brain with two gears: a slow planner (high-level) and a fast executor (low-level) that talk every few milliseconds.”
- Hierarchical convergence: “Let the fast gear finish its mini-task, then reset it with new instructions; repeat until the slow gear is satisfied.”
- 1-step gradient approximation: “Learn by looking only at the final move, not replaying every step—like learning chess by checking the end position.”
- Deep supervision: “Grade the student after every short exam, not only after the whole course.”
- Adaptive Computation Time (ACT): “Hard puzzle? Think longer; easy puzzle? stop early—automatically.”
5. Methodology Highlights
Architecture
- Two encoder-only Transformer blocks (identical size), arranged so that the low-level (L) module updates every step and the high-level (H) module updates every T steps (T ≈ 4–8).
- Input/Output: sequence-to-sequence over flattened 2-D grids.
- Post-Norm, RMSNorm, RoPE, GLU: modern LLM best practices.
Training
- Random initialization; no pre-training.
- 1k examples per task (ARC, Sudoku-Extreme, Maze-Hard).
- Loss: cross-entropy, plus binary cross-entropy for the ACT Q-values.
- Optimizer: Adam-atan2, constant LR, weight decay.
- Memory: O(1) backprop via the 1-step gradient approximation inspired by Deep Equilibrium Models (implicit function theorem + Neumann-series truncation); a minimal sketch follows.
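A minimal sketch of the O(1)-memory training step, reusing the `TwoLoopCore` above: every update except the final L and H steps runs without a graph, and deep supervision applies a loss after each segment. `out_head` (e.g., a linear decoder) and the exact step count are assumptions:

```python
import torch
import torch.nn.functional as F

def train_segment(core, out_head, optimizer, x, target, z_L, z_H):
    # Unroll all but the final L/H updates with no graph -> no BPTT, O(1) memory.
    with torch.no_grad():
        for _ in range(core.N):
            for _ in range(core.T):
                z_L = core.f_L(x + z_H, z_L)
            z_H = core.f_H(z_L, z_H)
    # 1-step gradient: backprop through only the last pair of updates.
    z_L = core.f_L(x + z_H, z_L)
    z_H = core.f_H(z_L, z_H)
    loss = F.cross_entropy(out_head(z_L), target)  # deep supervision: per-segment loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    # Detach so the next segment starts a fresh constant-size graph.
    return z_L.detach(), z_H.detach()
```

Real training would loop `train_segment` several times per example, feeding the detached states forward; the paper additionally trains a Q-head for halting, omitted here.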
Inference
- A halting controller (Q-head) chooses when to stop (sketched below).
- Extra compute budget can be added at test time without re-training.
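A sketch of that halting loop, again reusing the earlier `TwoLoopCore`. The 2-way Q-head (halt vs. continue values read off the high-level state) is a plausible stand-in for the paper's Q-learning head; its training is not shown:

```python
import torch

@torch.no_grad()
def infer_with_act(core, out_head, q_head, x, z_L, z_H, max_segments=16):
    for used in range(1, max_segments + 1):
        z_L, z_H = core(x, z_L, z_H)       # one full L1/L2 segment
        q = q_head(z_H)                    # (batch, 2): [q_halt, q_continue]
        if (q[:, 0] > q[:, 1]).all():      # learned stop (whole-batch here;
            break                          # per-example halting in practice)
    return out_head(z_L), used             # prediction + segments actually spent
```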
6. Quantified Results (± values are 95% bootstrap CIs where available)
| Task | Metric | HRM (27 M) | Best CoT LLM | Δ |
|---|---|---|---|---|
| ARC-AGI-1 | % solved | 40.3 ± 1.8 | 34.5 (o3-mini-high) | +5.8 |
| ARC-AGI-2 | % solved | 50.0 ± 2.0 | 21.2 (Claude-3.7) | +28.8 |
| Sudoku-Extreme | % exact | 74.5 ± 2.3 | 0 (CoT fails) | +74.5 |
| Maze-Hard (30×30) | % optimal | 55.0 ± 3.1 | <20 (175 M Transformer) | +35–55 |
- Compute:
  - Training: ~8 A100-hours for 1k examples.
  - Inference: 4–16 internal “segments”, each ~2× the cost of a 27 M-parameter Transformer forward pass (back-of-envelope estimate below).
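A back-of-envelope estimate of what that budget implies, using the common ~2 × params FLOPs-per-token rule of thumb; the sequence length and segment count are assumptions:

```python
params = 27e6        # model size
seq_len = 900        # e.g., a flattened 30x30 maze grid
segments = 8         # mid-range "thinking" budget
flops_per_forward = 2 * params * seq_len         # ~2 * params FLOPs per token
total_flops = segments * 2 * flops_per_forward   # each segment ~2x a forward pass
print(f"~{total_flops / 1e12:.1f} TFLOPs per puzzle")  # ~0.8 TFLOPs
```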
7. Deployment Considerations
- Integration Paths:
  - Drop-in replacement for Transformer policy heads in RL pipelines.
  - Edge deployment: 27 M params fit on mobile GPUs (<120 MB in fp16).
  - API design: expose a “max_think_steps” knob to clients (see the sketch after this list).
- Challenges:
  - Requires grid-structured or token-sequence I/O; not yet text-native.
  - The gradient approximation may diverge on very long sequences (>512 steps).
  - ACT Q-learning can be brittle without weight decay and Post-Norm.
- User Experience:
  - Sub-second latency for Sudoku on an RTX 4090 with 4 segments.
  - ARC tasks need 8–16 segments (~2–4 s), still faster than multi-sample CoT.
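A hypothetical wrapper showing how that knob might surface to clients; `HRMSolver` is illustrative (it reuses `infer_with_act` from the inference sketch in §5) and is not a shipped API:

```python
import torch

class HRMSolver:
    """Hypothetical client-facing wrapper, not a shipped API."""

    def __init__(self, core, out_head, q_head, d_model=256):
        self.core, self.out_head, self.q_head = core, out_head, q_head
        self.d_model = d_model

    def solve(self, x, max_think_steps=8):
        """x: embedded input, shape (batch, d_model).

        Raising max_think_steps buys more L1/L2 segments at inference
        time; the budget is a pure test-time knob, no re-training needed.
        """
        z_L = torch.zeros(x.shape[0], self.d_model)
        z_H = torch.zeros(x.shape[0], self.d_model)
        return infer_with_act(self.core, self.out_head, self.q_head,
                              x, z_L, z_H, max_segments=max_think_steps)
```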
8. Limitations & Assumptions
- Tasks are fully observable, deterministic, symbolic.
- The 1k-example assumption holds only when the task distribution is narrow.
- No natural-language reasoning tested; unclear how hierarchy scales to ambiguity.
- The one-step gradient is an approximation; no theoretical regret bound is provided.
- The brain analogy is correlational (dimensionality via participation ratio), not causal.
9. Future Directions
- Hybrid HRM-Transformer layers for language + logic (e.g., math word problems).
- Incorporate hierarchical memory (Clockwork RNN style) for 100k-step horizons.
- Meta-learning outer loop to set T, N, and curriculum automatically.
- Formal verification of Turing-completeness under finite-precision.
- Neuromorphic ASIC exploiting the low-memory, event-driven nature.
10. Conflicts & Biases
- All authors are employees or affiliates of Sapient Intelligence (Singapore). No external funding or evaluation committee is disclosed. Benchmark selection (ARC, Sudoku, Maze) favors discrete symbolic tasks; broader NLP or safety-critical domains may reveal different relative strengths.