dybilar

2506.21734 HIERARCHICAL REASONING MODEL

paper

hrm

Note to self dyb:

Strip away the prose and HRM boils down to:

Two reusable “loops” (L1 & L2)
  • L1 = fast, local, resets every T steps.
  • L2 = slow, global, updates once per T steps, then broadcasts a new context to L1.

A tiny data recipe
  • ~1,000 (input, output) pairs per task—no pre-training, no human scratchpads.

One cheap training trick
  • Back-propagate only through the last state of each loop → constant memory, no BPTT (see the sketch below).

Optional “think longer” switch
  • A Q-learning head decides when to stop (ACT), so harder puzzles get more L1–L2 cycles at inference.

Everything else (brain analogy, dimensionality stories, etc.) is supporting narrative.
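A minimal sketch of those two loops and the last-state-only backprop, assuming toy sizes and generic `nn.TransformerEncoderLayer` blocks standing in for the paper's H and L modules (the way states are combined by simple addition is also an assumption, not the released code):

```python
import torch
import torch.nn as nn

class TwoLoopCore(nn.Module):
    """Toy HRM-style core: a fast low-level (L) loop nested inside a slow
    high-level (H) loop. Block types, sizes, and state mixing are illustrative."""
    def __init__(self, d_model=256, n_heads=4, T=4, N=2):
        super().__init__()
        self.T, self.N = T, N  # T = L-updates per H-update, N = H-updates per segment
        self.l_block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.h_block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    def forward(self, x_emb, z_h, z_l):
        # Run all but the final H- and L-updates without building a graph:
        # this is the "backprop only through the last state" trick (no BPTT, O(1) memory).
        with torch.no_grad():
            for i in range(self.N * self.T - 1):
                z_l = self.l_block(z_l + z_h + x_emb)   # fast loop, conditioned on H context + input
                if (i + 1) % self.T == 0:
                    z_h = self.h_block(z_h + z_l)       # slow loop broadcasts new context every T steps
        z_l = self.l_block(z_l + z_h + x_emb)           # only these last two updates carry gradients
        z_h = self.h_block(z_h + z_l)
        return z_h, z_l
```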

Five-Lens TL;DR

  • Expert: A 27 M-parameter recurrent architecture that replaces Chain-of-Thought with two nested recurrence loops (fast low-level + slow high-level). It eliminates BPTT, attains depth via hierarchical convergence, and solves ARC-AGI, Sudoku-Extreme, and 30×30 maze search at 40–80 % with only ~1,000 examples—no pre-training, no CoT.
  • Practitioner: Ships as a compact PyTorch module (1-step gradient, constant memory) that you can train from scratch on 1k examples in <2 GPU-days. Inference scales linearly with the compute budget; add more ‘thinking steps’ for harder puzzles without re-training.
  • General Public: Imagine a tiny AI that can finish expert-level Sudoku or draw the shortest path through a giant maze after seeing only 1,000 practice puzzles—no need for the huge models or ‘thinking out loud’ tricks used by ChatGPT.
  • Skeptic: Impressive, but only tested on synthetic puzzles. No evidence on open-ended language tasks; 1k examples may still be too many for real-world adoption; and the brain analogy is suggestive, not proven.
  • Decision-Maker: Low-data, low-parameter path to high-stakes planning tasks (logistics, chip layout, verification). If verified on industrial problems, could cut cloud-inference cost 10–100× versus frontier LLMs.

2. What Real-World Problem Is This Solving?

Current LLMs need:
  • Large pre-training corpora (terabytes)
  • Explicit Chain-of-Thought supervision (human-written scratchpads)
  • High token budgets (latency & cost)

HRM attacks the “data-hungry, token-hungry, compute-hungry” triad for algorithmic reasoning tasks where the answer must be exact (e.g., Sudoku grids, shortest path, ARC pattern completion).

3. Counterintuitive Take-Aways

  • Smaller is deeper: 27 M parameters beat LLMs with 175 M–1 T parameters on tree-search-style tasks.
  • CoT is optional, not mandatory—latent hidden-state reasoning suffices.
  • Biological inspiration (theta-gamma coupling) yields better engineering metrics than pure scaling.

4. Jargon-to-Plain Dictionary

  • Hierarchical Reasoning Model (HRM): “A mini-brain with two gears: a slow planner (high-level) and a fast executor (low-level) that talk every few milliseconds.”
  • Hierarchical convergence: “Let the fast gear finish its mini-task, then reset it with new instructions; repeat until the slow gear is satisfied.”
  • 1-step gradient approximation: “Learn by looking only at the final move, not replaying every step—like learning chess by checking the end position.”
  • Deep supervision: “Grade the student after every short exam, not only after the whole course.”
  • Adaptive Computation Time (ACT): “Hard puzzle? Think longer. Easy puzzle? Stop early—automatically.”

5. Methodology Highlights

Architecture
  • Two encoder-only Transformer blocks of identical size: the low-level (L) block updates every step; the high-level (H) block updates once every T steps (T ≈ 4–8).
  • Input/Output: sequence-to-sequence over flattened 2-D grids.
  • Post-Norm, RMSNorm, RoPE, GLU—modern LLM best practices.
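A rough sketch of one such block (Post-Norm residuals, RMSNorm, SwiGLU feed-forward) under assumed dimensions; RoPE is omitted for brevity and nothing here is the authors' exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PostNormEncoderBlock(nn.Module):
    """Illustrative encoder block: self-attention + SwiGLU feed-forward, with
    RMSNorm applied *after* each residual add (Post-Norm). RoPE would normally
    be applied to queries/keys inside the attention; it is left out for brevity."""
    def __init__(self, d_model=512, n_heads=8, d_ff=1536):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)  # GLU gate branch
        self.w_up = nn.Linear(d_model, d_ff, bias=False)    # GLU value branch
        self.w_down = nn.Linear(d_ff, d_model, bias=False)
        self.norm1 = nn.RMSNorm(d_model)  # torch >= 2.4; otherwise substitute a hand-rolled RMSNorm
        self.norm2 = nn.RMSNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + attn_out)                              # Post-Norm: normalize after the residual
        ff = self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))  # SwiGLU feed-forward
        return self.norm2(x + ff)
```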

Training
  • Random initialization; no pre-training.
  • 1k examples per task (ARC, Sudoku-Extreme, Maze-Hard).
  • Loss: cross-entropy on outputs + binary cross-entropy for the ACT Q-values.
  • Optimizer: Adam-atan2, constant LR, weight decay.
  • Memory: O(1) backprop via a 1-step gradient approximation inspired by Deep Equilibrium Models (implicit function theorem + Neumann-series truncation).
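A deep-supervision training step might look roughly like the sketch below, reusing the toy `TwoLoopCore` from earlier; `embed`, `lm_head`, and plain Adam are assumed stand-ins (the ACT Q-value BCE term is omitted):

```python
import torch
import torch.nn.functional as F

def train_step(core, embed, lm_head, opt, x_tokens, y_tokens, n_segments=4):
    """Deep-supervision sketch: one loss and one optimizer update per segment,
    with the carried state detached in between. `opt` is assumed to be a
    torch.optim.Adam over core/embed/lm_head parameters (standing in for Adam-atan2)."""
    x_emb = embed(x_tokens)                      # (B, S, d)
    z_h = torch.zeros_like(x_emb)
    z_l = torch.zeros_like(x_emb)
    last_loss = None
    for _ in range(n_segments):                  # "grade after every short exam"
        z_h, z_l = core(x_emb, z_h, z_l)         # one N*T-step HRM segment
        logits = lm_head(z_l)                    # (B, S, vocab)
        loss = F.cross_entropy(logits.flatten(0, 1), y_tokens.flatten())
        opt.zero_grad()
        loss.backward()                          # gradient stays within this segment
        opt.step()
        z_h, z_l = z_h.detach(), z_l.detach()    # no gradient flows across segments
        last_loss = loss.item()
    return last_loss
```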

Inference
  • A halting controller (Q-head) decides when to stop.
  • Extra compute budget can be added at test time without re-training.
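An ACT-style halting loop could be sketched as follows, with an assumed `q_head` producing halt/continue Q-values from the pooled H-state (per-example halting and the paper's exploration details are omitted):

```python
import torch

@torch.no_grad()
def solve(core, embed, lm_head, q_head, x_tokens, max_segments=16):
    """Halting sketch: run HRM segments until the (assumed) Q-head prefers 'halt'.
    Modules are expected to be in eval mode; names follow this note, not released code."""
    x_emb = embed(x_tokens)
    z_h = torch.zeros_like(x_emb)
    z_l = torch.zeros_like(x_emb)
    for _ in range(max_segments):
        z_h, z_l = core(x_emb, z_h, z_l)
        q = q_head(z_h.mean(dim=1))          # pooled H-state -> (halt, continue) Q-values
        if (q[:, 0] > q[:, 1]).all():        # batch-level halting, for simplicity
            break                            # harder inputs naturally run more segments
    return lm_head(z_l).argmax(dim=-1)       # predicted output-grid tokens
```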

6. Quantified Results (± values are 95 % bootstrap CIs where available)

| Task | Metric | HRM (27 M) | Best CoT LLM | Δ |
|---|---|---|---|---|
| ARC-AGI-1 | % solved | 40.3 ± 1.8 | o3-mini-high: 34.5 | +5.8 |
| ARC-AGI-2 | % solved | 50.0 ± 2.0 | Claude-3.7: 21.2 | +28.8 |
| Sudoku-Extreme | % exact | 74.5 ± 2.3 | 0 (CoT fails) | +74.5 |
| Maze-Hard (30×30) | % optimal | 55.0 ± 3.1 | <20 (175 M Transformer) | +35–55 |
  • Compute:
  • Training: ~8 A100-hours for 1k examples.
  • Inference: 4–16 internal “segments”, each ~2× cost of a 27 M Transformer forward pass.

7. Deployment Considerations

  • Integration Paths:
  • Drop-in replacement for Transformer policy heads in RL pipelines.
  • Edge deployment: 27 M params fit on mobile GPUs (<120 MB fp16).
  • API design: expose a “max_think_steps” knob to the client (see the sketch after this list).
  • Challenges:
  • Requires grid-structured or token-sequence I/O; not yet text-native.
  • Gradient approximation may diverge on very long sequences (>512 steps).
  • ACT Q-learning can be brittle without weight decay & post-norm.
  • User Experience:
  • Sub-second latency for Sudoku on RTX-4090 with 4 segments.
  • ARC tasks need 8-16 segments → ~2–4 s, still faster than multi-sample CoT.
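As a hypothetical illustration of that “max_think_steps” knob, a service-side request/response shape might look like this (all names and fields are invented for the example, not part of any released HRM API):

```python
from dataclasses import dataclass

@dataclass
class SolveRequest:
    grid_tokens: list[int]      # flattened 2-D puzzle grid
    max_think_steps: int = 8    # client-controlled upper bound on HRM segments

@dataclass
class SolveResponse:
    solution_tokens: list[int]
    segments_used: int          # how long the model actually "thought"

def handle(req: SolveRequest, model) -> SolveResponse:
    # `model.solve` is a hypothetical wrapper around the ACT loop sketched in §5,
    # returning the predicted tokens and the number of segments actually used.
    tokens, used = model.solve(req.grid_tokens, max_segments=req.max_think_steps)
    return SolveResponse(solution_tokens=tokens, segments_used=used)
```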

8. Limitations & Assumptions

  • Tasks are fully observable, deterministic, symbolic.
  • The 1k-example recipe holds only when the task distribution is narrow.
  • No natural-language reasoning tested; unclear how hierarchy scales to ambiguity.
  • The one-step gradient is an approximation; no theoretical error or convergence bound is provided.
  • The brain analogy is correlational (participation-ratio dimensionality), not causal.

9. Future Directions

  • Hybrid HRM-Transformer layers for language + logic (e.g., math word problems).
  • Incorporate hierarchical memory (Clockwork RNN style) for 100k-step horizons.
  • Meta-learning outer loop to set T, N, and curriculum automatically.
  • Formal verification of Turing-completeness under finite-precision.
  • Neuromorphic ASIC exploiting the low-memory, event-driven nature.

10. Conflicts & Biases

  • All authors are employees or affiliates of Sapient Intelligence (Singapore). No external funding or evaluation committee is disclosed. Benchmark selection (ARC, Sudoku, Maze) favors discrete symbolic tasks; broader NLP or safety-critical domains may reveal different relative strengths.