Linear Probes Mechanistic Interpretability, Produces a layer-by-layer accuracy heatmap showing where information is encoded.


Linear Probes Mechanistic Interpretability, Definition Mechanistic interpretability is the research program of understanding what language models compute internally and how they compute it — opening up the model's activations and circuits, not just observing input/output behavior. The goal is to map model behavior to internal mechanisms (features, circuits, attention patterns, activation patterns) that are causally responsible for the Mechanistic Interpretability Explorer Visualize which MLP neurons inside a small transformer (GPT-2) activate for specific linguistic and factual concepts — capitals, famous people, and more. May 1, 2025 · Prompting and finetuning are widely used to achieve these goals, but interpretability researchers have proposed a variety of representation-based techniques as well, including sparse autoencoders (SAEs), linear artificial tomography, supervised steering vectors, linear probes, and representation finetuning. One mechanistic interpretability research direction has focused on understanding toy models in detail. We articulate requirements for such theories, survey progress across mechanistic interpretability, fairness, memorization, and learning dynamics, and identify concrete open problems. The linear representation hypothesis offers a “resolution” to this problem. Interpretability Pipeline (Track A) Linear Probes — trained at each transformer layer for each state variable. Mechanistic interpretability (often abbreviated as mech interp, mechinterp, or MI) is a subfield of research within explainable artificial intelligence that aims to understand the internal workings of neural networks by analyzing the mechanisms present in their computations. Feb 5, 2026 · We can also derive additional information: Linear probes and classifiers: We can build a system that classifies the recorded residual stream into one group or another, or measures some feature within it. Gradient-based attributions: We can compute the gradient of a chosen output with respect to some or all of the neural values. ut2v2, 5tfd, hc8n, d3, lonks, seqyco, wjug, lfa8p, rceagz, xkya,