Nation-scale reading club

2024

Jonas Raaschou-Pedersen

2024-10-07

Paper

Poly- and monosemanticity

  • Polysemantic neurons

    • respond to mixtures of seemingly unrelated inputs
    • how to interpret?
    • need monosemantic units of analysis (e.g. features that respond to related inputs)
  • Superposition

    • Neural network represents more features of the data than it has neurons
    • Each feature assigned its own linear combination of neurons
    • “Cause” of polysemanticity
    • Hypothesized phenomenon (with increasing evidence)
  • Key question of paper

    Can dictionary learning extract features that are significantly more monosemantic than neurons?

Superposition

Approach

  • Paper presents:
    • a sparse autoencoder (a weak dictionary learning algorithm)
    • generates learned features from a trained one-layer transformer model
    • yields a more monosemantic unit of analysis than the model’s neurons
    • the monosemantic features are the hidden layer of the autoencoder
    • the transformer’s MLP activations are picked out and reconstructed by the autoencoder
    • sparsity induced by L1 regularization

Transformer / Autoencoder

Key idea

  • Hypothesis of superposition: NNs represent more features than they have neurons
  • Can view each feature as a linear combination of neurons
    • More features than neurons implies that the features form an overcomplete linear basis for the neuron activations
  • I.e. activations \(\mathbf{x}^{j}\) (from Transformer MLP) can be decomposed as:

\[\begin{align} \mathbf{x}^{j} \approx \mathbf{b} + \sum_{i} f_{i}(\mathbf{x}^{j})\mathbf{d}_{i} \end{align}\]

  • \(\mathbf{x}^{j} \in \mathbb{R}^{d_{\text{MLP}}}\) is the activation vector for datapoint \(j\) taken from the MLP layer

  • \(f_{i}(\mathbf{x}^{j})\) the activation of feature \(i\)

  • \(\mathbf{d}_{i}\) a unit vector in activation space (the direction of feature \(i\)) and \(\mathbf{b}\) a bias term
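
A minimal numpy sketch of this decomposition, using the dimensions of the model discussed below (512 MLP neurons, 4096 features); the variable names are illustrative, not taken from the paper’s code.

```python
import numpy as np

d_mlp, n_features = 512, 4096             # MLP width and dictionary size (A/1-style run)
rng = np.random.default_rng(0)

b = rng.normal(size=d_mlp)                # bias term b
D = rng.normal(size=(n_features, d_mlp))  # rows are feature directions d_i
D /= np.linalg.norm(D, axis=1, keepdims=True)  # normalize each d_i to a unit vector

# Feature activations f_i(x^j): sparse, only a few are non-zero for a given datapoint
f = np.zeros(n_features)
f[[7, 3450]] = [1.3, 0.8]

x_hat = b + f @ D                         # x^j ≈ b + Σ_i f_i(x^j) d_i, shape (512,)
```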

One-layer transformer

  • The one-layer transformer used is a decoder-only language model
    • “the simplest language model we profoundly don’t understand”
    • Residual stream dimension of 128
    • 512 neuron MLP layer
    • ReLU activation function
    • Trained on 100 billion tokens using the Adam optimizer
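
For reference, a hypothetical config dict summarizing the hyperparameters listed above; the key names are mine, not from the paper’s training code.

```python
# Hypothetical summary of the one-layer transformer described above.
transformer_config = {
    "n_layers": 1,
    "d_model": 128,                       # residual stream dimension
    "d_mlp": 512,                         # MLP neurons
    "activation": "relu",
    "training_tokens": 100_000_000_000,   # 100 billion tokens
    "optimizer": "adam",
}
```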

Autoencoder

  • The features are constructed by a sparse autoencoder trained on the MLP activations
  • Concretely, the autoencoder consists of:

\[\begin{align} \bar{\mathbf{x}} &= \mathbf{x} - \mathbf{b}_{d} \\ \mathbf{f} &= \mathrm{ReLU}(W_{e}\bar{\mathbf{x}} + \mathbf{b}_{e}) \\ \hat{\mathbf{x}} &= W_{d}\mathbf{f} + \mathbf{b}_{d} \\ \mathcal{L} &= |X|^{-1}\sum_{\mathbf{x} \in X} \Vert \mathbf{x} - \hat{\mathbf{x}} \Vert^{2}_{2} + \lambda \Vert \mathbf{f} \Vert_{1} \end{align}\]

  • \(\mathbf{x} \in \mathbb{R}^{n}\) are the MLP activations; \(\mathbf{f} \in \mathbb{R}^{m}\) are the features
  • Encoder and decoder weights \(W_{e} \in \mathbb{R}^{m \times n}\), \(W_{d} \in \mathbb{R}^{n \times m}\); biases \(\mathbf{b}_{e} \in \mathbb{R}^{m}\), \(\mathbf{b}_{d} \in \mathbb{R}^{n}\)
  • \(n\) input and output dimension; \(m\) the autoencoder hidden dimension
    • Case of \(m = 4096\) is the focus in the paper

Autoencoder code

  • Features are the hidden layer of the autoencoder
  • Inputs are the MLP activations from one-layer transformer
  • See further reading and links for colab notebook with code replicating parts of the paper
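
A minimal PyTorch sketch of the autoencoder defined by the equations above; module and variable names are mine, not the paper’s or the notebook’s (the paper additionally constrains the decoder columns to unit norm, omitted here).

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, n: int = 512, m: int = 4096):
        super().__init__()
        self.W_e = nn.Parameter(torch.randn(m, n) / n**0.5)  # encoder weights (m x n)
        self.W_d = nn.Parameter(torch.randn(n, m) / m**0.5)  # decoder weights (n x m)
        self.b_e = nn.Parameter(torch.zeros(m))               # encoder bias
        self.b_d = nn.Parameter(torch.zeros(n))               # decoder bias, also subtracted pre-encoding

    def forward(self, x: torch.Tensor):
        x_bar = x - self.b_d                            # x̄ = x - b_d
        f = torch.relu(x_bar @ self.W_e.T + self.b_e)   # f = ReLU(W_e x̄ + b_e)
        x_hat = f @ self.W_d.T + self.b_d               # x̂ = W_d f + b_d
        return x_hat, f

def sae_loss(x, x_hat, f, lam: float = 1e-3):
    # L = mean ||x - x̂||² + λ ||f||₁ over the batch
    return ((x - x_hat) ** 2).sum(-1).mean() + lam * f.abs().sum(-1).mean()
```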

Broad overview

  • One-layer transformer trained on The Pile dataset
    • 800GB Dataset of Diverse Text
  • Autoencoder trained on MLP activations
    • Focus on 4,096 features (A/1 run)
    • 90 learned dictionaries in total
  • Dictionary is a mapping from tokens to feature space (A/1 run example): \[\begin{align} \texttt{token} &\mapsto \texttt{MLPActivations}(\texttt{embed}(\texttt{token})) =: \mathbf{x} \\ &\mapsto \texttt{AutoEncoderHiddenLayer}(\mathbf{x}) \in \mathbb{R}^{4096} \end{align}\]
  • Do tokens grouped in feature space make sense?
    • E.g. map a batch of tokens into feature space, pick one dimension of the feature (dictionary) matrix (say feature \(7\)) and sort the tokens by activation score
      • What do we see?
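
A sketch of that grouping check, assuming a trained `sae` like the one above and a hypothetical helper `mlp_activations(tokens)` that returns the transformer’s MLP activations for a batch of tokens.

```python
import torch

def top_tokens_for_feature(tokens, sae, feature_idx: int = 7, k: int = 20):
    """Return the k tokens on which feature `feature_idx` activates most strongly."""
    x = mlp_activations(tokens)        # hypothetical helper: (num_tokens, 512)
    _, f = sae(x)                      # feature activations: (num_tokens, 4096)
    scores = f[:, feature_idx]
    top = torch.topk(scores, k)
    return [(tokens[i], scores[i].item()) for i in top.indices]
```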

Results

  • Features from the autoencoder are more monosemantic than the neurons
  • Web interface showing all features
  • Feature labels like A/1/3450 indicate
    • the model (“A” or “B”)
    • the dictionary learning run (e.g., “1”; runs vary the L1 coefficient and dictionary size \(m\))
    • the specific feature within that run (“3450”)

Web interface

Evaluating AutoEncoder

  • Manual inspection
    • do features seem interpretable? (explore in web interface)
  • Feature density
    • See next slide
  • Reconstruction loss
    • The autoencoder reconstructs the MLP activations
    • How well does it do this?
    • Compute the model’s loss using the reconstructed MLP activations in place of the originals (see the sketch after this list)
  • Toy models
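
A sketch of the reconstruction-loss check, assuming a hypothetical `model(tokens, mlp_override=...)` hook for substituting the MLP activations; a real implementation would patch the transformer’s forward pass.

```python
import torch

@torch.no_grad()
def reconstruction_loss_gap(model, sae, tokens):
    """Compare the LM loss with true MLP activations vs. the SAE reconstruction."""
    loss_true = model(tokens).loss                        # ordinary forward pass
    x = mlp_activations(tokens)                           # hypothetical helper, as before
    x_hat, _ = sae(x)
    loss_recon = model(tokens, mlp_override=x_hat).loss   # hypothetical patching hook
    return loss_true.item(), loss_recon.item()
```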

Feature density

  • Each feature only activates on a small number of tokens
  • Feature density := fraction of tokens on which the feature has a nonzero value
  • Concretely, pass batches of tokens through the pipeline to get a matrix \(\Phi\) of size \(b \cdot B \times m\), where \(b\) is the number of batches, \(B\) the batch size and \(m\) the autoencoder hidden layer dimension
  • Count the non-zero entries in each column, i.e. an \(m \times 1\) vector of counts, and plot as a log histogram
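
A sketch of the density computation; thresholding at exactly zero works because the ReLU features are exactly zero when inactive.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_feature_density(Phi: np.ndarray):
    """Phi: (b*B, m) matrix of feature activations for b batches of B tokens."""
    num_tokens = Phi.shape[0]
    counts = (Phi > 0).sum(axis=0)            # non-zero count per feature: length-m vector
    density = counts / num_tokens             # fraction of tokens each feature fires on
    plt.hist(np.log10(density[density > 0]), bins=50)
    plt.xlabel("log10(feature density)")
    plt.ylabel("number of features")
    plt.show()
```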

Detailed investigation of features

  • Features studied:
    • Arabic
    • DNA
    • Base64
    • Hebrew
  • Goal is to establish claims for features:
    • The learned feature activates with high specificity for the hypothesized context
    • The learned feature activates with high sensitivity for the hypothesized context
    • The learned feature causes appropriate downstream behavior
    • The learned feature does not correspond to any neuron
    • The learned feature is universal – a similar feature is found by dictionary learning applied to a different model

Log-likelihood proxy

  • Construct a proxy for each feature
  • The proxy is the log-likelihood ratio of a string under the hypothesized context versus the full empirical distribution: \[\log\frac{P(s \mid \text{context})}{P(s)}\] for \(s\) a string
  • Intuition for the choice of proxy:
    • Since features interact linearly with the logits, they are incentivized to track log-likelihoods
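
A sketch of the proxy for a single token, assuming unigram token counts gathered from a context-specific corpus (e.g. Arabic-script text) and from the full dataset; the smoothing constant is made up.

```python
import math
from collections import Counter

def log_likelihood_proxy(token: str, context_counts: Counter, full_counts: Counter,
                         eps: float = 1e-9) -> float:
    """Estimate log P(token | context) / P(token) with smoothed unigram frequencies."""
    p_context = (context_counts[token] + eps) / (sum(context_counts.values()) + eps)
    p_full = (full_counts[token] + eps) / (sum(full_counts.values()) + eps)
    return math.log(p_context / p_full)
```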

Feature activation distribution

Feature downstream effects

  • Learned features have interpretable causal effects on model outputs
  • Each feature when active makes some output tokens more likely (and some less)
  • Logit weights computed using the path expansion trick
  • The plot suggests the feature makes it more likely that Arabic-script tokens are predicted next
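
A sketch of the logit-weight computation for a single feature direction via the path MLP output → residual stream → unembedding (ignoring layer norm and other paths); shapes are assumed from the architecture above, and the names are illustrative.

```python
import torch

def feature_logit_weights(d_i: torch.Tensor, W_out: torch.Tensor, W_U: torch.Tensor):
    """d_i: (512,) feature direction in MLP space
       W_out: (128, 512) MLP output projection
       W_U: (vocab, 128) unembedding
       Returns (vocab,) logit weights: how activating the feature shifts each logit."""
    residual_dir = W_out @ d_i   # the feature's write into the residual stream
    return W_U @ residual_dir    # its contribution to every vocabulary logit
```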

Feature ablations

  • Run the context (subsample interval) through the model until the MLP layer
  • Decode the activations into features
  • Subtract off the activation of A/1/3450; artificially setting it to zero on the whole context

Feature downstream effect

  • Sample from model starting with given context
  • Set feature A/1/3450 to its maximum observed value and observe change
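
A sketch covering both interventions (ablating to zero, or pinning to the maximum observed value), again using the hypothetical `mlp_activations` helper and `mlp_override` hook from the earlier sketches.

```python
import torch

@torch.no_grad()
def intervene_on_feature(model, sae, tokens, feature_idx: int = 3450, value: float = 0.0):
    """Run the model with feature `feature_idx` set to `value` over the whole context.

    value=0.0 reproduces the ablation; value=max observed activation pins the feature on."""
    x = mlp_activations(tokens)                 # hypothetical helper: (seq, 512)
    _, f = sae(x)                               # feature activations: (seq, 4096)
    f[:, feature_idx] = value                   # edit the single feature at every position
    x_edit = f @ sae.W_d.T + sae.b_d            # decode edited features back to MLP space
    return model(tokens, mlp_override=x_edit)   # hypothetical patching hook
```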

Feature is not a neuron & universality

  • Feature A/1/3450 is not a neuron
    • One neuron has Arabic script in its top 20 dataset examples
    • The feature’s coefficients in the neuron basis are spread across many neurons
    • The neuron most correlated with the feature responds to a mixture of non-English languages
      • Inspect its activation and logit weight distributions
  • Universality
    • A/1/3450 is a universal feature that forms in other models
    • It can be consistently discovered in other models
    • Multiple runs lead to similar features
    • Feature B/1/1334 (correlation 0.91) is strikingly similar

Interpretability

  • Manual Interpretability
  • Automated interpretability using Claude
  • Intervene on features and generate text

Phenomenology

  • Feature motifs
    • context features (e.g. DNA, base64)
    • token-in-context features 
  • Feature splitting
    • Features appear in clusters
    • Increasing the dictionary size yields more, finer-grained features (“feature splitting”)

Feature splitting

Finite state automata

  • Formed when one feature increases the probability of certain tokens, which in turn cause another feature to fire on the next step, and so on
  • The simplest example is a feature that excites itself on the next token: a single-node loop
  • Two features interact in a loop to model “All Caps Snake Case” variable names
  • More complex automata are shown in the paper