How do transformers perform computation? While the architecture is known, the mechanistic internal logic remains opaque. This essay explores the hypothesis that individual attention heads act as discrete gates in a larger computational circuit.
The Circuit Hypothesis
We treat the transformer’s residual stream as a communication bus: each attention head reads from this bus, performs a linear transformation, and adds its result back. Viewing the residual stream as a shared “vector-space bus” is a key insight from Anthropic’s transformer-circuits work.
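The bus picture can be made concrete with a toy sketch. The shapes and weights below are illustrative stand-ins, not real model parameters; the point is only that every head reads the same (seq_len, d_model) stream and writes an additive contribution back into it.

```python
# Toy sketch of the "residual stream as a bus" picture (illustrative
# shapes only; weights are random stand-ins, not learned parameters).
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 8, 16, 4

# The residual stream: one d_model-dimensional vector per token position.
resid = rng.normal(size=(seq_len, d_model))

def head_contribution(resid, W_read, W_write, attn):
    """One head reads from the bus, mixes across positions, writes back."""
    read = resid @ W_read          # project into the head's small subspace
    mixed = attn @ read            # move information between positions
    return mixed @ W_write         # project back into the residual stream

for _ in range(4):  # four heads, each writing additively
    W_read = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)
    W_write = rng.normal(size=(d_head, d_model)) / np.sqrt(d_head)
    attn = np.tril(np.ones((seq_len, seq_len)))   # causal attention mask
    attn /= attn.sum(axis=-1, keepdims=True)      # row-normalize
    resid = resid + head_contribution(resid, W_read, W_write, attn)

print(resid.shape)  # → (8, 16): the bus keeps its shape as heads write to it
```

Because each head only adds to the stream, later components can read earlier heads’ outputs without any head needing to know the others exist.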
“Individual attention heads are not monolithic feature detectors, but rather modular gates that selectively route information across the residual stream.”
An induction head implements a specific pattern: the head attends to the token that followed a previous occurrence of the current token, and boosts the prediction that the same token will follow again. These heads have been linked to in-context learning, rather than to the separate “grokking” phenomenon.
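The attention pattern itself is easy to state as pure pattern-matching, with no learned weights. The sketch below computes, for each position, which earlier position an idealized induction head would attend to.

```python
# Toy illustration of the induction pattern (pure pattern-matching, no
# learned weights): for each position, attend to the token that followed
# the most recent earlier occurrence of the current token.
def induction_targets(tokens):
    targets = []
    for i, tok in enumerate(tokens):
        match = None
        for j in range(i - 1, -1, -1):   # scan backwards over the prefix
            if tokens[j] == tok:
                match = j + 1            # attend to the token AFTER the match
                break
        targets.append(match)
    return targets

# On the repeated sequence "A B C A B", the second "A" (index 3) attends
# to index 1 ("B"), predicting that "B" follows again.
print(induction_targets(["A", "B", "C", "A", "B"]))
# → [None, None, None, 1, 2]
```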
Mathematical Formalism
The attention score between a destination token x_i and a source token x_j can be written as a bilinear form over the embedding space:

s_ij = x_i^T W_Q^T W_K x_j

By decomposing the combined query-key matrix W_QK = W_Q^T W_K into its singular value decomposition (SVD), we can identify the “feature directions” the head is specifically looking for: the top right singular vectors describe what the head reads from source tokens, and the top left singular vectors describe what it matches against at the destination.
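A small numerical sketch shows why SVD recovers these directions. Here W_Q and W_K are synthetic stand-ins with a single planted key-side feature direction; on a real model the same decomposition would be run on the trained weights.

```python
# Sketch: recover a planted "feature direction" from the QK bilinear form.
# W_Q and W_K are synthetic rank-1 stand-ins, not real model weights.
import numpy as np

rng = np.random.default_rng(1)
d_model, d_head = 64, 1

# Plant a single direction the head "looks for" on the key side.
key_feature = rng.normal(size=d_model)
key_feature /= np.linalg.norm(key_feature)

W_Q = rng.normal(size=(d_head, d_model))
W_K = key_feature[None, :]        # rank-1 key projection

W_QK = W_Q.T @ W_K                # combined bilinear form, (d_model, d_model)
U, S, Vt = np.linalg.svd(W_QK)

# The top right singular vector recovers the planted key-side feature
# (up to sign), because W_QK here is exactly rank 1.
recovered = Vt[0]
print(round(abs(recovered @ key_feature), 3))  # → 1.0
```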
The Emergence of Logic
By analyzing the eigenvalues of these weight products across layers over the course of training, we can watch specific grammatical logic “emerge.” Induction heads typically appear abruptly, in a phase change relatively early in training, rather than developing gradually.