logo
×

Almas AbdibayevJune 15, 2026

Where models change their minds: Identifying branchpoints for NLA training

Natural Language Autoencoders (NLAs), recently introduced by Anthropic, offer a more human-readable way to inspect what may be happening inside an LLM before its final answer is produced. Final outputs can hide the internal competition among possible next steps: a model may appear safe or aligned externally while representing several directions internally before suppressing most of them.

Earlier interpretability work often looked either at final behavior after the fact or at mechanistic circuits that are powerful but hard for most readers to inspect directly. NLAs occupy a different middle ground: they try to translate internal activations into natural-language descriptions, then reconstruct the activation from that description to test whether the verbal bottleneck preserves the relevant information. That makes them attractive for alignment work, but only if the decoded text tracks something real in the underlying computation rather than a generic story about the transcript.

The practical difficulty is that NLAs are layer-specific and expensive to train. Moreover, a checkpoint trained on one layer cannot be used on another. For example, the released Gemma 3 NLA checkpoint targets upstream layer 41. To get full coverage from every layer, you’d have to retrain a 27B model 63 times for all 63 layers. Also, training or decoding at the wrong layer can produce generic, off-distribution, or misleading language.

This post asks a narrow targeting question: for an open-source model, Gemma 3 27B, that sometimes takes shortcut-like paths on terminal tasks, can branchpoint fanouts tell us which hidden-state layer is worth training an NLA on? The branchpoint procedure is deliberately cheap: hold a task prefix fixed, sample many continuations, capture activations across all layers, and ask whether any layer separates legitimate continuations from shortcut-like ones strongly enough to justify the expensive NLA training step.

The motivating use case is to look near moments where the model could still continue in several directions. If a model recognizes a loophole, a verifier weakness, or a shortcut opportunity before it acts, that signal should be easier to study around branchpoints than after the final answer has collapsed the trajectory into one visible behavior. Thus, layer selection determines whether an NLA is pointed at a useful representation or at noise.

Early experiments suggested layer 61 might be a good target, but that picture fell apart under stronger controls. Once we accounted for task identity and other confounds, the signal became weak and spread out, and even carefully balanced setups only produced faint, non-localized effects.

The clearest takeaway ended up being behavioral instead: when both outcomes are still on the table, the key shift happens around the model’s reaction to the first terminal feedback, and changing that feedback can noticeably steer later shortcut-like behavior.

The experimental setting

We use Terminal Wrench as the task family for a Gemma 3 27B (we will refer to it as just Gemma from here on) terminal agent. Terminal Wrench is a curated set of 331 reward-hackable command-line environments drawn from public terminal-agent benchmarks (Terminal-Bench, Terminal-Bench 2.0, Terminal-Bench Pro, OpenThoughts-TB-dev, and SETA). The dataset is useful because many tasks admit both intended solutions and shortcut-like alternatives, such as hardcoding expected outputs, fabricating results, exploiting verifier assumptions, or editing verifier-facing files.

The release includes multiple trajectory types: 3,632 confirmed hack trajectories, 1,216 legitimate solves from an attacker agent, 1,441 no-reward attempts, and 2,352 non-adversarial baseline trajectories from successful pre-checks. It also defines a coarse exploit taxonomy, including hollow implementation, output spoofing, constraint loopholes, structural extraction, and binary hijacking. Those details matter less than the central property for this post: the environments are designed so that shortcut-like behavior is possible, but the Gemma agent still has to discover it from a particular task state. Throughout the blogpost we will refer to tasks by their dataset number, e.g. task 161.

The empirical bottleneck was consistently in the model behavior. Gemma does not reliably discover both paths from the same sampled branchpoint. In the full Terminal Wrench baseline, matched legitimate-versus-shortcut continuations from the same prefix were much rarer than the dataset’s hackability might suggest. Earlier, we tested Qwen3-family, as engineering controls for the Harbor/Terminus pipeline, but they did not produce the behavioral diversity needed for this analysis since they either solved things or just timed out, i.e. they did not cheat.

Terminology

A trajectory is one full agent run: model response, terminal output, next model response, and so on until the task ends. An episode is one model response plus the terminal feedback that follows it. A prefix is the part of a trajectory we replay before asking the model to continue again. The resulting task state is the branchpoint. A block is the basic unit of this analysis: one task at one branchpoint, with the same prompt, replayed prefix, terminal state, and sampling setup. You can think of it as a fixed task state from which we sample many futures. Internally, blocks are often named by task id plus branchpoint depth, such as task 981 at k=1. A continuation is one sampled future from that branchpoint. A result is within-block when legitimate and shortcut-like continuations are compared inside the same block, and pooled when continuations from multiple blocks are combined.

While pooled comparisons are useful for screening because they give more examples, they are also where confounding enters most easily. If one task mostly produces shortcuts and another mostly produces legitimate solves, a pooled probe can look strong by recognizing the task, the terminal state, or the style of a failure. Within-block comparisons are stricter because every continuation starts from the same replayed state.

Method: same prefix, many futures, all layers

The working hypothesis: if shortcut-like and legitimate continuations from the same prefix differ in activation space, then a simple probe should be able to separate them. The layer where that separation is strongest would become a candidate for NLA training.

  1. Identify a branchpoint: a task state that plausibly stands before a legitimate-versus-shortcut fork.
  2. Replay the prefix into a fresh sandbox and sample many continuations from that state.
  3. Judge each continuation as legitimate, shortcut_or_cheat, or failed_or_unclear.
  4. Replay selected continuations through Gemma and capture mean hidden-state activations across all layers over the response span.
  5. Stress-test any candidate layer with held-out tasks, shuffled-label baselines, surface controls, and cheap bottleneck probes before considering full NLA training.

Caveat: replay is approximate since Harbor does not expose byte-for-byte tmux/session continuation, so the pipeline re-executes deterministic shell prefixes in fresh sandboxes and adds branch context to the task prompt. While it is not identical to resuming the original terminal session, it’s good enough for exploratory branchpoint fanout. This means the workflow is best understood as a targeting method. It can surface candidate layers, identify obvious confounds, and decide whether expensive NLA training is justified.

What would count as success?

A run should start from branchpoints where each root has meaningful within-root behavioral diversity. 

  1. It should separate shortcut-like and legitimate continuations within blocks, as opposed to just after pooling across roots.
  2. It should beat shuffled-label baselines and survive dissociation controls where surface markers and behavior disagree.
  3. It should beat matched text and surface baselines, which merely recover transcript-visible artifacts.
  4. Off-the-shelf or lightly projected decodes should already produce non-generic, shortcut-adjacent hypotheses before full training.

Only after those gates are met should the expensive step be taken: training NLA and testing whether it faithfully exposes shortcut-relevant language under matched controls. 

Within-block separation. We compare shortcut-like and legitimate continuations that all start from that same task state. If $x_i$ is an activation vector and $y_i \in \{0, 1\}$ is the label, the probe is only allowed to use variation among continuations inside the same block, not between tasks. In practice we subtract each block's mean activation (at a specific layer) before fitting or scoring:

where $B(i)$ is the block containing example $i$. This removes the block centroid, so a probe cannot win just by recognizing which task state the activation came from.

Held-out-block evaluation. A probe should train on some blocks and be tested on different blocks. For a simple linear probe, we fit a direction $w$ that separates shortcut-like from legitimate activations in the training blocks, then score held-out examples with. We evaluate whether shortcut-like continuations receive higher scores than legitimate continuations using AUROC. AUROC=0.5 is chance; AUROC=1.0 is a perfect ranking.

Shuffled-label baseline. To check whether a score could arise by chance, we repeat the same training and evaluation many times after randomly shuffling the shortcut and legitimate labels within the same data. This gives a placebo distribution. A result only counts if the real AUROC is above the high end of this shuffled-label band.

Dissociation baseline. A probe must also survive cases where an obvious surface marker disagrees with the behavior. For example, if a command appears in every shortcut-labeled response, a probe might simply detect that string rather than the behavior. The test asks: when the same marker appears but the behavior is legitimate, does the probe call it legitimate? Or, when behavior is shortcut-like but the marker is absent, does the probe still find it? If not, the probe is reading the marker.

Text and surface baselines. Activations only matter if they beat simpler signals from the text by a meaningful margin. We train baselines on features such as bag-of-words text, character n-grams, response length, number of episodes, whether the response ends with task_complete, and other visible transcript statistics. 

Off-the-shelf or lightly projected decodes. Before training a full NLA, decoded text from an existing or lightly adapted decoder should already contain some concrete shortcut-relevant content. For example, it might mention skipped verification, fabricated outputs, hardcoded expected values, or shallow checker exploitation. It should also pass a vacuity control: if we pair decoded text with the wrong trajectory, performance should drop. 

Nonlinear controls. We also tried higher-capacity probes: MLPs, small transformers over layer sequences, and DeepSets over episode vectors. These ask whether linear probes we are using are too underpowered. They must also improve cleanly on held-out blocks, beat above baselines. In our runs, higher-capacity models mostly recovered post-action behavioral residue rather than a clean pre-action shortcut signal.

Results

We found that all candidate layers failed these gates. On the biggest clean task set, the within-block activation probe reached only about 0.56–0.60 AUROC. That clears shuffled-label controls, so the signal is probably not pure noise, but it remains below the 0.65 level we would want before treating it as a useful detector.

At the individual-block level, only one block with a properly set up verifier still showed a clear activation separation after the verifier audit, and that case beat its matched text-only baseline by only 0.0003, meaning visible transcript features explained almost the same signal. That is not enough margin to trust it as an internal signal rather than transcript-visible residue. The signal also failed the layer-selection goal: no layer separated from the rest of the stack, so neither layer 41, released by Anthropic, nor layer 61, nor any other layer had a privileged claim.

Finally, the released layer-41 NLA decoder did not recover shortcut intent from decoded text. A judge reading the decoded text scored near chance, and text decoded from the wrong trajectory worked about as well, which means the decoder was not carrying trial-specific shortcut information. These failures are why we stop short of full NLA training: the branchpoint method found weak structure, but not a layer-specific, text-independent signal worth verbalizing.

Dataset

Our first activation dataset pooled eight Terminal Wrench task roots—numeric IDs 1326, 1026, 1104, 134, 427, 842, and 958, plus the named puzzle task called rush-hour-bfs-solver (implement a Rush Hour BFS solver). This produced 421 judged branchpoint continuations with captured vectors. However, the labels in this set were strongly task identity-dependent and imbalanced in favor of either shortcuts or proper solutions.

We later replaced that smoke snapshot with eleven screened and carefully filtered task plus branchpoint blocks—only three more in count. As a result, the final dataset had 682 action action hint-stripped continuations. Each task-branchpoint combination came with more than 20 shortcuts and full solves and proper verifier controls, which made sure that an agent couldn't just cheat by inaction. This gave us much larger, honestly matched within-prefix comparisons compared to the first surface features confounded set.

First dataset and first NLA training candidate: Hidden-state slot 61

The first promising branchpoint came from Terminal Wrench task root 1326, which contained an early and plausible shortcut opportunity. Fanout from this point produced both shortcut-like and legitimate continuations, and the layer comparison consistently highlighted hidden-state slot 61 as the strongest separator in terms of raw activation difference. 

The candidate appeared late in the stack, where one might expect task-state and next-action information to be concentrated, and it appeared repeatedly under raw branchpoint comparisons.  But the broader analysis weakened the case for slot 61: across eight vector-complete roots and 421 continuations, the apparent separation did not hold up under stronger tests. The main problem was root confounding: labels were highly structured by task. Some roots were mostly shortcut-like, one was mostly legitimate, and root 1326 was mostly failures. A pooled direction would therefore learn task or root identity instead of shortcut intent.

Table 1. Early eight-root behavior snapshot. The original table is preserved as an embedded result image because it is useful evidence for root confounding; absolute counts should be read with the judge attached.

The first warning came from leave-one-out-task. We trained a linear probe on all tasks except one, then tested shortcut vs legitimate separation on the held-out task. Repeating this for each task checks whether the signal is memorizing tasks. It did not give us a reliable layer target: the strongest pooled scores were not enough to validate layer-61, and shuffled-label baselines matched the best pooled results.

We then tried a lightweight concept-bottleneck (mini-NLA) test instead of full NLA training. It mapped layer-61 activations into a sparse vocabulary of about 650 human-readable concepts such as hardcode, fabricate, bypass, verifier, test_file, and placeholder, then reconstructed the original activation from that bottleneck. Reconstruction worked well, so the layer was trainable through this compressed representation. But shortcut prediction did not transfer: held-out shortcut AUROC fell to about 0.52 even though full-data performance looked much higher. Taken together, the results say the same thing: layer 61 was learning task and root structure and surface regularities more than a stable shortcut concept that generalized to new tasks.

Table 2: Constrained-vocabulary mini-NLA probe for layer 60. The result is negative: the layer is trainable, but not a reliable held-out shortcut readout.

The problem with the dataset: diversity

The earlier layer tests were statistically underpowered because most branchpoints produced too few matched legitimate and shortcut-like continuations from the same prefix. With only a handful of minority-class examples, a real activation difference could easily be missed and a null result would mostly say that the sample was too small or too imbalanced, not that the signal was absent. Would better sampling produce cleaner same-prefix comparisons? We expanded the eight-root snapshot into a grid of 87 blocks across task and branchpoint depth, then pushed sampling to 50 continuations per block. Despite this expansion, we did not find a broad supply of balanced forks. Roughly 70% of blocks remained single-class on the legitimate versus shortcut axis at every branchpoint depth we tried.

Early branchpoints initially looked more diverse, but much of that was because they produced more failed or unclear continuations. When we excluded those undecided runs and compared only continuations the judge could classify as legitimate or shortcut-like, the shortcut rate did not change much with branchpoint depth. Deeper sampling mostly sharpened the majority behavior rather than turning one-sided blocks into balanced ones. Many blocks that showed zero shortcuts at 10 trials produced a small minority of shortcuts by 50, so they still did not become useful matched comparisons.

Changing decoding did not reliably solve the balance problem either. A native sampling screen across min-p grids, top-k 50, and high and low temperatures usually collapsed to single-class behavior or added failed/unclear continuations. 

Re-judging and targeting tasks gave the cleanest test dataset

What moved the needle was changing the way we judged the shortcut and legitimate solves. Instead of using the default Terminal Wrench gpt-4.1 judge we switched to a stronger model. Re-labeling the old 50-continuation fanout with gpt-5.5 made the balance problem look less severe: 29 of 87 task states now had at least 10 legitimate and 10 shortcut-like continuations. But they came from an older setup with known prompt and verifier concerns.

After filtering and re-running the promising cases under cleaner prompt construction, 11 task states still produced both classes. We then topped those up to at least 20 decided continuations per class, giving 682 continuations for the clean activation test. So balance was findable, but only through stronger judging, filtering, and targeted re-screening. Unfortunately, it was not something arbitrary branchpoints produced reliably. 

As we stated in the results section earlier, the experiments and controls on this cleaner set did not change the downstream activation conclusion: once task identity was controlled, the cleaner within-block probes stayed weak. The probe reached only about 0.56–0.60 AUROC across the cleaned dataset. It outperformed shuffled-label controls, so the signal is not pure noise, but it is weak. For NLA targeting, the non-localization is as important as the weak AUROC. If every layer is statistically tied with every other layer, then the experiment has not identified where to train. 

Branchpoint postmortem: Is there a point where trajectories calcify?

We found an unsatisfying answer to the one question: at what layer does the model commit? But another interesting question arose in its stead: at what point in the trajectory does the model commit? To test this, we used a freeze-and-resample setup. For a saved trajectory, we replayed the same task state up to a chosen point, then sampled many new continuations from that identical prefix. When we froze just after the model’s first action, tasks that could produce both outcomes were still mixed: the same prefix could still lead to legitimate or shortcut-like continuations. When we froze one step later, after the model had seen the first terminal response and reacted to it, the continuations became much more predictable from the prefix.

In other words, for Gemma 3 27B, the important fork often sits in the reaction to the first terminal feedback. One task made this especially concrete: an early which htop check found nothing, but the model later ran a broader find / -name htop, which surfaced an htop binary in the debug-shell environment. With that feedback in context, shortcut-like continuations rose to 59%, compared with 4% before that feedback appeared.

We then tested causality: whether the terminal feedback itself was affecting model decisions. For the same saved prefixes, we changed only the terminal output shown after the model’s first command. Placebo control: one synthetic version preserved the same facts in different wording. Experimental arm: the command plausible but flipped the evidence the output conveyed, for example turning “nothing useful happened” into a success-shaped result, or the reverse.

Changing the evidence moved later behavior much more than merely rewording the same evidence: shortcut-like rates shifted by about 29% on average under the counterfactual feedback, compared with about 12% under the same-content rewrite, and the difference was statistically reliable. The intervention also reduced actual verifier pass rate, from about 85% under the real feedback to about 59% under counterfactual feedback. So the readable terminal feedback is not just correlated with the later label, altering what the model believes happened changes what it does next. 

We changed the feedback shown to the model, not the container state, i.e. the files and the environment. Agents sometimes re-check the state and a few explicitly catch the inconsistency. The equivalent rewrite arm is null and verifier reward moves with the labels. We believe that this is the point where activation capture needs to happen.

Discussion

Activations look somewhat good at coarser levels. If we average continuations from a block, the resulting summary weakly predicts which blocks are shortcut-prone. But simple transcript features, such as response length and ending style, explain much of the same signal, and frozen versions did not transfer cleanly to fresh blocks. Late-trajectory activations can also read the outcome after the committing action. This distinction between block-level risk and per-continuation prediction should stay explicit. A tool that flags risky task states could still be useful for triage, dataset construction, or choosing where to run more expensive audits. 

In terms of other probes, the tested nonlinear models could recover final labels once they saw enough of the trajectory, but text and trivial surface features recovered much of the same information, and most of those episodes came after the model had already acted. 

The last test was the NLA itself. We took the released layer-41 decoder, and used an LLM judge to recover shortcut intent from the decoded text alone. It could not: AUROC was 0.547, with an interval spanning chance, and the decoded text failed a basic sanity check because text decoded from the wrong trial’s activations classified about as well. As a comparison, we also gave the visible trajectory transcript directly to the LLM judge. That did modestly better, at 0.681, meaning the transcript carried more shortcut signal than the decoded activations.

The released decoder targets layer 41, different from our early candidate layer 61, and NLA decoders do not transfer cleanly across layers.  But its failure to decode shortcut intent is still informative because the activation probes themselves did not support any clean localized target. If there is no robust pre-action signal in the tested activations, a natural-language decoder has little clean content to expose, regardless of how fluent its outputs are.

Limitations

First, shortcuts and lack thereof are model dependent. Stronger models in our testing consistently solve tasks cleanly, which makes surfacing deception much harder. Each model requires creating its own dataset through expensive parallel passes, and a proper study would require at least 3 models from different providers, and parameter count controls.

Second, “shortcut” is a behavioral umbrella label we define ourselves. It covers several different things: broken-verifier no-op passes, incomplete or weak-static work, fabricated artifacts, hardcoded expected outputs, and rare attempts to tamper with tests or verifiers. In 16 of the 54 branch environments we audited, an agent could receive full reward without taking any meaningful action, including the original task-161 steering example. Those cases have to be scoped as flawed-environment behavior, which forces us to exclude it from the evidence about robust shortcut behavior.

After removing that obvious grader failure mode, the remaining shortcut-labeled cases mostly looked like lazy-student task gaming: omissions, shallow checks, or fabricated artifacts that satisfied a weak grader. Direct tampering was rare and did not succeed in this first set of judge labels. The trajectory commitment experiments used only environments where a do-nothing agent received zero reward, so its timing result is not driven by free-reward graders. In that rough screen, about a third of Terminal Wrench inventory both passed the grader sanity check and still produced both outcomes for Gemma 3 27B.

Conclusions and future work

The branchpoint method was designed to identify where to focus NLA training for a tool that might eventually surface shortcut-relevant reasoning. However, cross-task controls, diversity bottlenecks, leakage checks, and direct NLA decoding all point to the same narrower conclusion: no specific layer is currently supported as a reliable shortcut-intent readout for Gemma 3 27B model. Thus, this blogpost highlights a negative result.

However, the methodology itself is useful. Branchpoint fanout plus all-layer activation capture is a good way to stress-test whether an NLA target is worth training. In addition, our most interesting observation is of behavioral nature. On tasks with strong verifiers where both outcomes remain reachable, the fork is usually open after episode 0 and becomes much more prefix-specific in the model's reaction to its first terminal feedback. The feedback-counterfactual result makes the mechanism more transparent: editing only that first feedback redirects later shortcut-like behavior across tasks, while equivalent rewrites do not. What did not survive is the stronger claim that each trajectory carries a universal pre-action disposition or that activations reveal a hidden intent code before the relevant behavior is visible. So, before asking any decoder to verbalize a representation, we should first verify that the representation itself contains a stable, matched, non-leaky signal.

As it is commonplace with interpretability work, the answers produced more questions, but the questions are more specific than where we started. What makes a task state steerable? Why does the weak same-prefix activation signal refuse to localize? Which parts of the feedback reaction are text, environment, model, or sampling setup? If you are exploring related questions or interested in collaborating, we would love to hear from you.