Nation-scale reading club
2024
2024-10-07
Poly- and monosemanticity
Polysemantic neurons
- respond to mixtures of seemingly unrelated inputs
- how to interpret?
- need monosemantic units of analysis (e.g. features that respond to related inputs)
Superposition
- Neural network represents more features of the data than it has neurons
- Each feature assigned its own linear combination of neurons
- “Cause” of polysemanticity
- Hypothesized phenomenon (with increasing evidence)
Key question of paper
Can dictionary learning extract features that are significantly more monosemantic than neurons?
Superposition
Approach
- Paper presents:
- a sparse autoencoder
- a weak dictionary learning algorithm
- generates learned features from a trained one-layer transformer model
- yields a more monosemantic unit of analysis than the model’s neurons
- monosemantic features are the hidden layer of the autoencoder
- the MLP activations of the transformer are taken as input and reconstructed by the autoencoder
- sparsity is induced by L1 regularization on the feature activations
Key idea
- Hypothesis of superposition: NNs represent more features than they have neurons
- Can view each feature as a linear combination of neurons
- More features than neurons implies that the features form an overcomplete linear basis for the neuron activations
- I.e. activations \(\mathbf{x}^{j}\) (from Transformer MLP) can be decomposed as:
\[\begin{align}
\mathbf{x}^{j} \approx \mathbf{b} + \sum_{i}
f_{i}(\mathbf{x}^{j})\mathbf{d}_{i}
\end{align}\]
- \(\mathbf{x}^{j} \in \mathbb{R}^{d_{\text{MLP}}}\) is the activation vector for datapoint \(j\), taken from the MLP layer
- \(f_{i}(\mathbf{x}^{j})\) is the activation of feature \(i\)
- \(\mathbf{d}_{i}\) is a unit vector in activation space (the direction of feature \(i\)) and \(\mathbf{b}\) is a bias term
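- A minimal numpy sketch of this decomposition (all shapes, the sparsity level, and random values are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d_mlp, m = 512, 4096                    # neurons vs. features, m > d_mlp (overcomplete)

D = rng.normal(size=(m, d_mlp))         # row i is d_i, the direction of feature i
D /= np.linalg.norm(D, axis=1, keepdims=True)   # unit vectors in activation space
b = rng.normal(size=d_mlp)              # bias term b

# Sparse, non-negative feature activations f_i(x^j) for one datapoint j
f = np.maximum(rng.normal(size=m), 0) * (rng.random(m) < 0.01)

x_j = b + f @ D                         # x^j ≈ b + sum_i f_i(x^j) d_i
```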
Autoencoder
- The features are constructed by a sparse autoencoder trained on the MLP activations
- Concretely, the autoencoder consists of:
\[\begin{align}
\bar{\mathbf{x}} &= \mathbf{x} - \mathbf{b}_{d} \\
\mathbf{f} &= \mathrm{ReLU}(W_{e}\bar{\mathbf{x}} + \mathbf{b}_{e}) \\
\hat{\mathbf{x}} &= W_{d}\mathbf{f} + \mathbf{b}_{d} \\
\mathcal{L} &= |X|^{-1}\sum_{\mathbf{x} \in X} \left( \Vert \mathbf{x} - \hat{\mathbf{x}} \Vert^{2}_{2}
+ \lambda \Vert \mathbf{f} \Vert_{1} \right)
\end{align}\]
- \(\mathbf{x} \in \mathbb{R}^{n}\) are the MLP activations; \(\mathbf{f} \in \mathbb{R}^{m}\) are the features
- Encoder and decoder weights \(W_{e} \in \mathbb{R}^{m \times n}\), \(W_{d} \in
\mathbb{R}^{n \times m}\); biases \(\mathbf{b}_{e} \in \mathbb{R}^{m}\), \(\mathbf{b}_{d} \in
\mathbb{R}^{n}\)
- \(n\) is the input and output dimension; \(m\) is the autoencoder hidden (feature) dimension
- The case \(m = 4096\) is the focus of the paper
Autoencoder code
- Features are the hidden layer of the autoencoder
- Inputs are the MLP activations from the one-layer transformer
- See further reading and links for a Colab notebook with code replicating parts of the paper
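- A minimal PyTorch sketch of the autoencoder equations above (dimensions and \(\lambda\) are illustrative; training details such as the paper’s decoder-weight normalization are omitted):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Sketch of the sparse autoencoder described above."""
    def __init__(self, n: int, m: int):
        super().__init__()
        self.W_e = nn.Linear(n, m)   # encoder: weights W_e, bias b_e
        self.W_d = nn.Linear(m, n)   # decoder: weights W_d, bias b_d

    def forward(self, x):
        x_bar = x - self.W_d.bias            # x̄ = x − b_d
        f = torch.relu(self.W_e(x_bar))      # f = ReLU(W_e x̄ + b_e)
        x_hat = self.W_d(f)                  # x̂ = W_d f + b_d
        return x_hat, f

def loss_fn(x, x_hat, f, lam: float):
    # Mean over the batch of (reconstruction error + λ · L1 sparsity penalty on f)
    return ((x - x_hat).pow(2).sum(-1) + lam * f.abs().sum(-1)).mean()

# Usage on a batch of (stand-in) MLP activations; m = 4096 as in the A/1 run
sae = SparseAutoencoder(n=512, m=4096)
x = torch.randn(64, 512)
x_hat, f = sae(x)
loss = loss_fn(x, x_hat, f, lam=1e-3)
```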
Broad overview
- One-layer transformer trained on The Pile dataset
- 800GB Dataset of Diverse Text
- Autoencoder trained on MLP activations
- Focus on 4,096 features (A/1 run)
- 90 learned dictionaries in total
- The learned dictionary gives a mapping from tokens to feature space (A/1 run example): \[\begin{align}
\texttt{token} &\mapsto \texttt{MLPActivations}(\texttt{embed}(\texttt{token})) =: \mathbf{x} \\
&\mapsto \texttt{AutoEncoderHiddenLayer}(\mathbf{x}) \in \mathbb{R}^{4096}
\end{align}\]
- Do tokens grouped in feature space make sense?
- E.g. map a batch of tokens into feature space, take dimension \(7\) of the feature (dictionary) matrix, and sort the tokens by their activation scores
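- A sketch of this check, assuming a trained `sae` like the one above and precomputed MLP activations for a batch of tokens (the names and the choice of feature 7 are illustrative):

```python
import torch

def top_activating_tokens(tokens, mlp_acts, sae, feature_idx=7, k=10):
    """Sort a batch of tokens by their activation on one learned feature.

    tokens:   (B,) token ids
    mlp_acts: (B, d_MLP) MLP activations of those tokens from the one-layer transformer
    sae:      trained sparse autoencoder returning (reconstruction, features)
    """
    _, f = sae(mlp_acts)                          # (B, m) feature activations
    scores = f[:, feature_idx]                    # activation on the chosen feature
    order = torch.argsort(scores, descending=True)
    return tokens[order][:k], scores[order][:k]   # do the top tokens look related?
```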
Results
- Features from the autoencoder are more monosemantic than the neurons
- Web interface showing all features
- Feature strings like A/1/3450 indicate
- the model (“A” or “B”)
- the dictionary learning run (e.g., “1”; runs vary the L1 coefficient and \(m\))
- the specific feature within that run (“3450”)
Web interface
Evaluating the autoencoder
- Manual inspection
- do features seem interpretable? (explore in web interface)
- Feature density
- Reconstruction loss
- Autoencoder reconstructs the MLP activations
- How well does the autoencoder do this?
- Compute the model loss with the MLP activations replaced by their reconstructions
- Toy models
Feature density
- Each feature only activates on a small number of tokens
- Feature density := fraction of tokens on which the feature has a nonzero value
- Concretely, pass batches of tokens through the pipeline, obtaining a matrix \(\Phi\) of size \(b \cdot B \times m\), where \(b\) is the number of batches, \(B\) the batch size and \(m\) the encoder hidden layer dimension
- Count the non-zero entries in each column, i.e. an \(m \times 1\) vector of counts, and plot as a log histogram
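- A sketch of the density computation, with \(\Phi\) stored as a numpy array of feature activations (the binning and plotting choices here are mine, not the paper’s):

```python
import numpy as np
import matplotlib.pyplot as plt

def feature_density(Phi: np.ndarray) -> np.ndarray:
    """Phi: (b*B, m) feature activations for all sampled tokens.
    Returns, per feature, the fraction of tokens on which it is nonzero."""
    return (Phi > 0).mean(axis=0)                 # length-m vector of densities

def plot_log_density_histogram(Phi: np.ndarray, eps: float = 1e-10) -> None:
    densities = feature_density(Phi)
    plt.hist(np.log10(densities + eps), bins=50)  # log histogram of feature density
    plt.xlabel("log10(feature density)")
    plt.ylabel("number of features")
    plt.show()
```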
Detailed investigation of features
- Features studied in depth, e.g. A/1/3450 (an Arabic-script feature)
- Goal is to establish the following claims for each feature:
- The learned feature activates with high specificity for the hypothesized context
- The learned feature activates with high sensitivity for the hypothesized context
- The learned feature causes appropriate downstream behavior
- The learned feature does not correspond to any neuron
- The learned feature is universal – a similar feature is found by dictionary learning applied to a different model
Log-likelihood proxy
- Construct a proxy for each feature
- Proxy is the log-likelihood ratio of a string under the feature hypothesis versus the full empirical distribution: \[\log\frac{P(s \mid \text{context})}{P(s)}\] for \(s\) a string
- Intuition for choice of proxy:
- Since features interact linearly with the logits, they are incentivized to track log-likelihoods
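- One way to sketch the proxy is with unigram token counts for the hypothesized context versus the full distribution; this estimator is my simplification, the paper’s per-hypothesis estimators are more involved:

```python
import math
from collections import Counter

def log_likelihood_proxy(token: str, context_counts: Counter, full_counts: Counter,
                         eps: float = 1e-9) -> float:
    """Estimate log P(token | hypothesized context) / P(token) from unigram counts.

    context_counts: token counts from text matching the hypothesis (e.g. Arabic script)
    full_counts:    token counts from the full training distribution
    """
    p_context = (context_counts[token] + eps) / (sum(context_counts.values()) + eps)
    p_full = (full_counts[token] + eps) / (sum(full_counts.values()) + eps)
    return math.log(p_context / p_full)
```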
Feature activation distribution
Feature downstream effects
- Learned features have interpretable causal effects on model outputs
- Each feature when active makes some output tokens more likely (and some less)
- Logit weights computed using the path expansion trick
- Plot suggests the feature makes it more likely that Arabic tokens are predicted next
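- A sketch of the logit weights, assuming the usual path expansion for a one-layer transformer: the feature’s decoder direction pushed through the MLP output projection and the unembedding (matrix names are my notation; layer norm and indirect paths are ignored):

```python
import numpy as np

def feature_logit_weights(W_U: np.ndarray, W_out: np.ndarray, d_i: np.ndarray) -> np.ndarray:
    """Direct effect of one unit of feature i on each output logit.

    W_U:   (vocab, d_model) unembedding matrix
    W_out: (d_model, d_mlp) MLP output projection
    d_i:   (d_mlp,)         decoder direction of feature i (column i of W_d)
    """
    # Sorting the result should surface e.g. Arabic tokens for feature A/1/3450.
    return W_U @ W_out @ d_i                      # (vocab,) logit weights
```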
Feature ablations
- Run the context (subsample interval) through the model until the MLP layer
- Decode the activations into features
- Subtract off the activation of A/1/3450, artificially setting it to zero over the whole context
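- A sketch of this intervention, reusing the `SparseAutoencoder` sketch from above (so the `W_d` attribute is an assumption, not the paper’s code); `value=0.0` reproduces the ablation, while `value` set to the feature’s maximum gives the pinned experiment on the next slide:

```python
import torch

def intervene_on_feature(mlp_acts, sae, feature_idx, value=0.0):
    """Edit the MLP activations of a context so one learned feature takes `value`.

    mlp_acts: (seq, d_mlp) activations of the context at the MLP layer
    sae:      trained sparse autoencoder returning (reconstruction, features)
    """
    _, f = sae(mlp_acts)                              # (seq, m) decode into features
    delta = value - f[:, feature_idx]                 # change applied at every position
    d_i = sae.W_d.weight[:, feature_idx]              # (d_mlp,) direction of the feature
    # Features act linearly through the decoder, so the edit is a rank-one update along d_i
    return mlp_acts + delta.unsqueeze(-1) * d_i
```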
Feature downstream effect
- Sample from the model starting with a given context
- Set feature A/1/3450 to its maximum observed value and observe the change
Feature is not a neuron & universality
- Feature A/1/3450 is not a neuron
- Only one neuron has Arabic script in its top 20 dataset examples
- The feature’s coefficients in the neuron basis are spread across many neurons
- The neuron most correlated with the feature responds to a mixture of non-English languages (see the correlation sketch below)
- Inspect activation and logit weight distributions
- Universality: A/1/3450 is a universal feature that forms in other models
- Can be consistently discovered in other models
- Multiple runs lead to similar features
- Feature B/1/1334 is strikingly similar (correlation 0.91)
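- A sketch of the correlation check used for both claims: correlate one feature’s activations over a set of tokens with each neuron of the same model (most correlated neuron), or with each feature of the other model’s dictionary (the 0.91 correlation with B/1/1334). Plain Pearson correlation is an assumption on my part:

```python
import numpy as np

def most_correlated_column(feature_acts: np.ndarray, other_acts: np.ndarray):
    """feature_acts: (T,) activations of one feature over T tokens.
    other_acts:      (T, k) activations of k neurons, or of another dictionary's features.
    Returns the index of the best-matching column and its Pearson correlation."""
    a = feature_acts - feature_acts.mean()
    B = other_acts - other_acts.mean(axis=0)
    corrs = (B.T @ a) / (np.linalg.norm(B, axis=0) * np.linalg.norm(a) + 1e-12)
    best = int(np.argmax(corrs))
    return best, float(corrs[best])
```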
Interpretability
- Manual Interpretability
- Automated interpretability using Claude
- Intervene on features and generate text
Phenomenology
- Feature motifs
- context features (e.g. DNA, base64)
- token-in-context features (e.g. “the” in mathematics, “<” in HTML)
- Feature splitting
- Features appear in clusters
- Increasing dictionary size leads to more, finer-grained features: feature splitting
Feature splitting
Finite state automata
- Formed by one feature increasing the probability of tokens, which in turn cause another feature to fire on the next step, and so on
- Simplest example is a feature that excites itself on the next token: a single-node loop
- Two features interact in a loop to model “All Caps Snake Case” variable names
- More complex automata shown in paper
Further reading and links