Nation-scale reading club
2024
2024-10-07
Poly- and monosemanticity
Polysemantic neurons
- respond to mixtures of seemingly unrelated inputs
- how to interpret?
- need monosemantic units of analysis (e.g. features that respond to related inputs)
Superposition
- Neural network represents more features of the data than it has neurons
- Each feature assigned its own linear combination of neurons
- “Cause” of polysemanticity
- Hypothesized phenomenon (with increasing evidence)
Key question of paper
Can dictionary learning extract features that are significantly more monosemantic than neurons?
Superposition
Approach
- Paper presents:
- a sparse autoencoder
- a weak dictionary learning algorithm
- generates learned features from a trained one-layer transformer model
- yields a more monosemantic unit of analysis than the model’s neurons
- monosemantic features are the hidden layer of the autoencoder
- the MLP activations of the transformer are taken as input and reconstructed by the autoencoder
- sparsity is induced by L1 regularization on the feature activations
Key idea
- Hypothesis of superposition: NNs represent more features than they have neurons
- Can view each feature as a linear combination of neurons
- More features than neurons implies that the features form an overcomplete linear basis for the neuron activations
- I.e. activations \(\mathbf{x}^{j}\) (from Transformer MLP) can be decomposed as:
\[\begin{align}
\mathbf{x}^{j} \approx \mathbf{b} + \sum_{i}
f_{i}(\mathbf{x}^{j})\mathbf{d}_{i}
\end{align}\]
- \(\mathbf{x}^{j} \in \mathbb{R}^{d_{\text{MLP}}}\) is the activation vector for datapoint \(j\), taken from the MLP layer
- \(f_{i}(\mathbf{x}^{j})\) is the activation of feature \(i\)
- \(\mathbf{d}_{i}\) is a unit vector in activation space (the direction of feature \(i\)) and \(\mathbf{b}\) is a bias term
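- A minimal numpy sketch of this decomposition (all shapes, the sparsity level, and random values are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d_mlp, m = 512, 4096                    # neurons vs. features, m > d_mlp (overcomplete)

D = rng.normal(size=(m, d_mlp))         # row i is d_i, the direction of feature i
D /= np.linalg.norm(D, axis=1, keepdims=True)   # unit vectors in activation space
b = rng.normal(size=d_mlp)              # bias term b

# Sparse, non-negative feature activations f_i(x^j) for one datapoint j
f = np.maximum(rng.normal(size=m), 0) * (rng.random(m) < 0.01)

x_j = b + f @ D                         # x^j ≈ b + sum_i f_i(x^j) d_i
```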
Autoencoder
- The features are constructed by a sparse autoencoder trained on the MLP activations
- Concretely, the autoencoder consists of:
\[\begin{align}
\bar{\mathbf{x}} &= \mathbf{x} - \mathbf{b}_{d} \\
\mathbf{f} &= \mathrm{ReLU}(W_{e}\bar{\mathbf{x}} + \mathbf{b}_{e}) \\
\hat{\mathbf{x}} &= W_{d}\mathbf{f} + \mathbf{b}_{d} \\
\mathcal{L} &= |X|^{-1}\sum_{\mathbf{x} \in X} \left( \Vert \mathbf{x} - \hat{\mathbf{x}} \Vert^{2}_{2}
+ \lambda \Vert \mathbf{f} \Vert_{1} \right)
\end{align}\]
- \(\mathbf{x} \in \mathbb{R}^{n}\) are the MLP activations; \(\mathbf{f} \in \mathbb{R}^{m}\) are the features
- Encoder and decoder weights \(W_{e} \in \mathbb{R}^{m \times n}\), \(W_{d} \in
\mathbb{R}^{n \times m}\); biases \(\mathbf{b}_{e} \in \mathbb{R}^{m}\), \(\mathbf{b}_{d} \in
\mathbb{R}^{n}\)
- \(n\) is the input and output dimension; \(m\) is the autoencoder hidden (feature) dimension
- The case \(m = 4096\) is the focus of the paper
Autoencoder code
- Features are the hidden layer of the autoencoder
- Inputs are the MLP activations from the one-layer transformer
- See further reading and links for a Colab notebook with code replicating parts of the paper
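- A minimal PyTorch sketch of the autoencoder equations above (dimensions and \(\lambda\) are illustrative; training details such as the paper’s decoder-weight normalization are omitted):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Sketch of the sparse autoencoder described above."""
    def __init__(self, n: int, m: int):
        super().__init__()
        self.W_e = nn.Linear(n, m)   # encoder: weights W_e, bias b_e
        self.W_d = nn.Linear(m, n)   # decoder: weights W_d, bias b_d

    def forward(self, x):
        x_bar = x - self.W_d.bias            # x̄ = x − b_d
        f = torch.relu(self.W_e(x_bar))      # f = ReLU(W_e x̄ + b_e)
        x_hat = self.W_d(f)                  # x̂ = W_d f + b_d
        return x_hat, f

def loss_fn(x, x_hat, f, lam: float):
    # Mean over the batch of (reconstruction error + λ · L1 sparsity penalty on f)
    return ((x - x_hat).pow(2).sum(-1) + lam * f.abs().sum(-1)).mean()

# Usage on a batch of (stand-in) MLP activations; m = 4096 as in the A/1 run
sae = SparseAutoencoder(n=512, m=4096)
x = torch.randn(64, 512)
x_hat, f = sae(x)
loss = loss_fn(x, x_hat, f, lam=1e-3)
```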
Broad overview
- One-layer transformer trained on The Pile dataset
- 800GB Dataset of Diverse Text
- Autoencoder trained on MLP activations
- Focus on 4,096 features (A/1 run)
- 90 learned dictionaries in total
- The learned dictionary gives a mapping from tokens to feature space (A/1 run example): \[\begin{align}
\texttt{token} &\mapsto \texttt{MLPActivations}(\texttt{embed}(\texttt{token})) =: \mathbf{x} \\
&\mapsto \texttt{AutoEncoderHiddenLayer}(\mathbf{x}) \in \mathbb{R}^{4096}
\end{align}\]
- Do tokens grouped in feature space make sense?
- E.g. map a batch of tokens into feature space, take dimension \(7\) of the feature (dictionary) matrix, and sort the tokens by their activation scores
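- A sketch of this check, assuming a trained `sae` like the one above and precomputed MLP activations for a batch of tokens (the names and the choice of feature 7 are illustrative):

```python
import torch

def top_activating_tokens(tokens, mlp_acts, sae, feature_idx=7, k=10):
    """Sort a batch of tokens by their activation on one learned feature.

    tokens:   (B,) token ids
    mlp_acts: (B, d_MLP) MLP activations of those tokens from the one-layer transformer
    sae:      trained sparse autoencoder returning (reconstruction, features)
    """
    _, f = sae(mlp_acts)                          # (B, m) feature activations
    scores = f[:, feature_idx]                    # activation on the chosen feature
    order = torch.argsort(scores, descending=True)
    return tokens[order][:k], scores[order][:k]   # do the top tokens look related?
```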
Results
- Features from the autoencoder are more monosemantic than the neurons
- Web interface showing all features
- Feature strings like A/1/3450 indicate
- the model (“A” or “B”)
- the dictionary learning run (e.g., “1”; runs vary the L1 coefficient and \(m\))
- the specific feature within that run (“3450”)
Web interface
Evaluating the autoencoder
- Manual inspection
- do features seem interpretable? (explore in web interface)
- Feature density
- Reconstruction loss
- Autoencoder reconstructs the MLP activations
- How well does the autoencoder do this?
- Compute the model loss with the MLP activations replaced by their reconstructions
- Toy models
Feature density
- Each feature only activates on a small number of tokens
- Feature density := fraction of tokens on which the feature has a nonzero value
- Concretely, pass batches of tokens through the pipeline, obtaining a matrix \(\Phi\) of size \(b \cdot B \times m\), where \(b\) is the number of batches, \(B\) the batch size and \(m\) the encoder hidden layer dimension
- Count the non-zero entries in each column, i.e. an \(m \times 1\) vector of counts, and plot as a log histogram
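- A sketch of the density computation, with \(\Phi\) stored as a numpy array of feature activations (the binning and plotting choices here are mine, not the paper’s):

```python
import numpy as np
import matplotlib.pyplot as plt

def feature_density(Phi: np.ndarray) -> np.ndarray:
    """Phi: (b*B, m) feature activations for all sampled tokens.
    Returns, per feature, the fraction of tokens on which it is nonzero."""
    return (Phi > 0).mean(axis=0)                 # length-m vector of densities

def plot_log_density_histogram(Phi: np.ndarray, eps: float = 1e-10) -> None:
    densities = feature_density(Phi)
    plt.hist(np.log10(densities + eps), bins=50)  # log histogram of feature density
    plt.xlabel("log10(feature density)")
    plt.ylabel("number of features")
    plt.show()
```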
Detailed investigation of features
- Features studied in depth, e.g. A/1/3450 (an Arabic-script feature)
- Goal is to establish the following claims for each feature:
- The learned feature activates with high specificity for the hypothesized context
- The learned feature activates with high sensitivity for the hypothesized context
- The learned feature causes appropriate downstream behavior
- The learned feature does not correspond to any neuron
- The learned feature is universal – a similar feature is found by dictionary learning applied to a different model
Log-likelihood proxy
- Construct a proxy for each feature
- Proxy is the log-likelihood ratio of a string under the feature hypothesis versus the full empirical distribution: \[\log\frac{P(s \mid \text{context})}{P(s)}\] for \(s\) a string
- Intuition for choice of proxy:
- Since features interact linearly with the logits, they are incentivized to track log-likelihoods
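- One way to sketch the proxy is with unigram token counts for the hypothesized context versus the full distribution; this estimator is my simplification, the paper’s per-hypothesis estimators are more involved:

```python
import math
from collections import Counter

def log_likelihood_proxy(token: str, context_counts: Counter, full_counts: Counter,
                         eps: float = 1e-9) -> float:
    """Estimate log P(token | hypothesized context) / P(token) from unigram counts.

    context_counts: token counts from text matching the hypothesis (e.g. Arabic script)
    full_counts:    token counts from the full training distribution
    """
    p_context = (context_counts[token] + eps) / (sum(context_counts.values()) + eps)
    p_full = (full_counts[token] + eps) / (sum(full_counts.values()) + eps)
    return math.log(p_context / p_full)
```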
Feature activation distribution
Feature downstream effects
- Learned features have interpretable causal effects on model outputs
- Each feature when active makes some output tokens more likely (and some less)
- Logit weights computed using the path expansion trick
- Plot suggests the feature makes it more likely that Arabic tokens are predicted next
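- A sketch of the logit weights, assuming the usual path expansion for a one-layer transformer: the feature’s decoder direction pushed through the MLP output projection and the unembedding (matrix names are my notation; layer norm and indirect paths are ignored):

```python
import numpy as np

def feature_logit_weights(W_U: np.ndarray, W_out: np.ndarray, d_i: np.ndarray) -> np.ndarray:
    """Direct effect of one unit of feature i on each output logit.

    W_U:   (vocab, d_model) unembedding matrix
    W_out: (d_model, d_mlp) MLP output projection
    d_i:   (d_mlp,)         decoder direction of feature i (column i of W_d)
    """
    # Sorting the result should surface e.g. Arabic tokens for feature A/1/3450.
    return W_U @ W_out @ d_i                      # (vocab,) logit weights
```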
Feature ablations
- Run the context (subsample interval) through the model until the MLP layer
- Decode the activations into features
- Subtract off the activation of A/1/3450, artificially setting it to zero over the whole context
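- A sketch of this intervention, reusing the `SparseAutoencoder` sketch from above (so the `W_d` attribute is an assumption, not the paper’s code); `value=0.0` reproduces the ablation, while `value` set to the feature’s maximum gives the pinned experiment on the next slide:

```python
import torch

def intervene_on_feature(mlp_acts, sae, feature_idx, value=0.0):
    """Edit the MLP activations of a context so one learned feature takes `value`.

    mlp_acts: (seq, d_mlp) activations of the context at the MLP layer
    sae:      trained sparse autoencoder returning (reconstruction, features)
    """
    _, f = sae(mlp_acts)                              # (seq, m) decode into features
    delta = value - f[:, feature_idx]                 # change applied at every position
    d_i = sae.W_d.weight[:, feature_idx]              # (d_mlp,) direction of the feature
    # Features act linearly through the decoder, so the edit is a rank-one update along d_i
    return mlp_acts + delta.unsqueeze(-1) * d_i
```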
Feature downstream effect
- Sample from the model starting with a given context
- Set feature A/1/3450 to its maximum observed value and observe the change
Feature is not a neuron & universality
- Feature A/1/3450 is not a neuron
- Only one neuron has Arabic script in its top 20 dataset examples
- The feature’s coefficients in the neuron basis are spread across many neurons
- The neuron most correlated with the feature responds to a mixture of non-English languages (see the correlation sketch below)
- Inspect activation and logit weight distributions
- Universality: A/1/3450 is a universal feature that forms in other models
- Can be consistently discovered in other models
- Multiple runs lead to similar features
- Feature B/1/1334 is strikingly similar (correlation 0.91)
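- A sketch of the correlation check used for both claims: correlate one feature’s activations over a set of tokens with each neuron of the same model (most correlated neuron), or with each feature of the other model’s dictionary (the 0.91 correlation with B/1/1334). Plain Pearson correlation is an assumption on my part:

```python
import numpy as np

def most_correlated_column(feature_acts: np.ndarray, other_acts: np.ndarray):
    """feature_acts: (T,) activations of one feature over T tokens.
    other_acts:      (T, k) activations of k neurons, or of another dictionary's features.
    Returns the index of the best-matching column and its Pearson correlation."""
    a = feature_acts - feature_acts.mean()
    B = other_acts - other_acts.mean(axis=0)
    corrs = (B.T @ a) / (np.linalg.norm(B, axis=0) * np.linalg.norm(a) + 1e-12)
    best = int(np.argmax(corrs))
    return best, float(corrs[best])
```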
Interpretability
- Manual Interpretability
- Automated interpretability using Claude
- Intervene on features and generate text
Phenomenology
- Feature motifs
- context features (e.g. DNA, base64)
- token-in-context features (e.g. “the” in mathematics, “<” in HTML)
- Feature splitting
- Features appear in clusters
- Increasing dictionary size leads to more, finer-grained features: feature splitting
Feature splitting
Finite state automata
- Formed by one feature increasing the probability of tokens, which in turn cause another feature to fire on the next step, and so on
- Simplest example is a feature that excites itself on the next token: a single-node loop
- Two features interact in a loop to model “All Caps Snake Case” variable names
- More complex automata shown in paper
Further reading and links