SAELens v6
SAELens 6.0.0 is live with changes to SAE training and loading. Check out the migration guide.
SAELens
The SAELens training codebase exists to help researchers:
- Train sparse autoencoders.
- Analyse sparse autoencoders and neural network internals.
- Generate insights which make it easier to create safe and aligned AI systems.
Quick Start
Installation
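SAELens is distributed on PyPI and can be installed with pip:

pip install sae-lens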
Loading Sparse Autoencoders from Huggingface
To load a pretrained sparse autoencoder, you can use SAE.from_pretrained() as below:
from sae_lens import SAE
sae = SAE.from_pretrained(
release = "gemma-scope-2b-pt-res-canonical",
sae_id = "layer_12/width_16k/canonical",
device = "cuda"
)
You can see other importable SAEs on this page.
Any SAE on Huggingface that was trained using SAELens can also be loaded using SAE.from_pretrained(). In this case, release is the name of the Huggingface repo, and sae_id is the path to the SAE within the repo. You can browse all such SAEs on Huggingface via the saelens tag. A hypothetical example of this mapping is shown below.
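The repo name and path here are hypothetical placeholders, shown only to illustrate how release and sae_id map onto a Huggingface repo:

from sae_lens import SAE

# Hypothetical repo id and path, for illustration only:
# release = the Huggingface repo id, sae_id = the SAE's folder within the repo.
sae = SAE.from_pretrained(
    release = "your-username/your-sae-repo",
    sae_id = "blocks.8.hook_resid_pre",
    device = "cuda"
)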
Loading Sparse Autoencoders from Disk
To load a pretrained sparse autoencoder from disk that you've trained yourself, you can use SAE.load_from_disk() as below.
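A minimal sketch, where the path below is a placeholder for whichever local directory holds your saved SAE weights and config:

from sae_lens import SAE

# Point this at the directory your SAE was saved to.
sae = SAE.load_from_disk("/path/to/your/sae", device = "cuda")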
Importing SAEs from other libraries
You can import an SAE created with another library by writing a custom PretrainedSaeHuggingfaceLoader or PretrainedSaeDiskLoader for use with SAE.from_pretrained() or SAE.load_from_disk(), respectively. See the pretrained_sae_loaders.py file for more details, or ask on the Open Source Mechanistic Interpretability Slack. If you write a good custom loader for another library, please consider contributing it back to SAELens!
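As a rough sketch of the shape such a loader takes: a disk loader is, roughly, a callable that turns a saved directory into a config dict and a state dict that SAELens can build an SAE from. The signature, field names, and return types below are assumptions for illustration; pretrained_sae_loaders.py is the source of truth for the exact protocol.

import json
import torch
from pathlib import Path

# Hypothetical custom disk loader. Check pretrained_sae_loaders.py for
# the exact signature and config fields SAELens expects.
def my_library_disk_loader(path, device = "cpu", cfg_overrides = None):
    path = Path(path)
    # Read the other library's config and translate its field names
    # (hypothetical here) into the fields SAELens expects.
    with open(path / "config.json") as f:
        other_cfg = json.load(f)
    cfg_dict = {
        "d_in": other_cfg["input_dim"],
        "d_sae": other_cfg["latent_dim"],
        **(cfg_overrides or {}),
    }
    # Load the weights and rename them to SAELens conventions
    # (e.g. W_enc, W_dec, b_enc, b_dec) as needed.
    state_dict = torch.load(path / "weights.pt", map_location = device)
    return cfg_dict, state_dict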
Model Compatibility
SAELens SAEs are standard PyTorch modules and work with any framework or model architecture, not just TransformerLens. While SAELens provides deep integration with TransformerLens through HookedSAETransformer, you can use SAEs with:
- Hugging Face Transformers: Extract activations using PyTorch hooks and pass them to sae.encode()/sae.decode()
- nnsight: Use nnsight's tracing API for clean intervention patterns with SAE features
- Any PyTorch model: SAEs work on any tensor with the correct input dimension
The core SAE methods (encode, decode, forward) accept standard PyTorch tensors, making them compatible with any activation extraction method.
import torch
from sae_lens import SAE
sae = SAE.from_pretrained(
release="gemma-scope-2b-pt-res-canonical",
sae_id="layer_12/width_16k/canonical",
device="cuda"
)
# SAEs work with any activation tensor of the right shape
# Gemma 2 2B has d_model=2304
activations = torch.randn(1, 128, 2304, device="cuda") # From any source
features = sae.encode(activations)
reconstructed = sae.decode(features)
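As a minimal sketch of the Hugging Face route, the example below captures the layer-12 residual stream with a standard PyTorch forward hook and runs it through the SAE. The module path follows the Hugging Face Gemma 2 architecture; the exact hook point that matches a given SAE's training site should be checked against the SAE's config.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sae_lens import SAE

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b", device_map="cuda")
sae = SAE.from_pretrained(
    release="gemma-scope-2b-pt-res-canonical",
    sae_id="layer_12/width_16k/canonical",
    device="cuda"
)

captured = {}

def hook(module, inputs, outputs):
    # HF decoder layers return a tuple; the hidden states come first.
    captured["resid"] = outputs[0]

# Capture the output of decoder layer 12 during a forward pass.
handle = model.model.layers[12].register_forward_hook(hook)
inputs = tokenizer("Sparse autoencoders are", return_tensors="pt").to("cuda")
with torch.no_grad():
    model(**inputs)
handle.remove()

features = sae.encode(captured["resid"])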
For detailed examples including Hugging Face and nnsight integration, see the Usage Guide.
Background and Further Reading
We highly recommend this tutorial.
For recent progress in SAEs, we recommend the LessWrong forum's Sparse Autoencoder tag.
Tutorials
Below are some tutorials that show how to do some basic exploration of SAEs:
- Loading and Analysing Pre-Trained Sparse Autoencoders
- Understanding SAE Features with the Logit Lens
- Training a Sparse Autoencoder
Example WandB Dashboard
WandB Dashboards provide lots of useful insights while training SAEs. Here's a screenshot from one training run.
