Pico

A Lightweight Framework for Studying Learning Dynamics

Pico is a lightweight research framework that demystifies how language models learn. Built with simplicity in mind, it provides an efficient way to train and study models of different sizes.

Our Philosophy

Pico makes the process of designing model architectures and pretraining paradigms less of an art and more of a science. By understanding how models learn, we can inform and improve the way we build and train them.

Pico is a family of decoder-only language models, all trained identically with scale as the only difference, accompanied by rich training checkpoints containing activations, gradients, and more for interpretability research. It also provides a streamlined codebase for training and analyzing your own model suite.
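
As an illustration, a trained model from the suite can be pulled straight from Hugging Face. A minimal sketch, assuming the checkpoints are published under repository names like "pico-lm/pico-decoder-tiny" (an assumption; check the hub for the actual identifiers):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Hypothetical repository name; the actual Pico model identifiers
    # on the Hugging Face hub may differ.
    repo = "pico-lm/pico-decoder-tiny"

    tokenizer = AutoTokenizer.from_pretrained(repo)
    model = AutoModelForCausalLM.from_pretrained(repo)

    # Intermediate training checkpoints are often exposed as hub revisions,
    # e.g. one branch per training step; this is also an assumption here.
    # model = AutoModelForCausalLM.from_pretrained(repo, revision="step-1000")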

Small-Scale Focus

Train and study models from 1M to 1B parameters, making experimentation with training paradigms practical and accessible.

Advanced Checkpointing

Access model activations, gradients, and other rich information throughout training for mechanistic interpretability research.
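
As a rough sketch of how such stored training state might be consumed (the file layout and key names below are assumptions for illustration, not the framework's documented format):

    import torch

    # Hypothetical path and keys; the real checkpoint layout may differ.
    state = torch.load(
        "checkpoints/step_1000/learning_dynamics.pt", map_location="cpu"
    )

    activations = state["activations"]  # e.g. {layer_name: Tensor}
    gradients = state["gradients"]      # e.g. {param_name: Tensor}

    for layer_name, act in activations.items():
        print(layer_name, tuple(act.shape))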

Easy Retraining

Simple, modular codebase designed for researchers to modify and retrain the entire model suite with custom training paradigms.

PyTorch Lightning

Built on PyTorch Lightning for efficient, scalable training with minimal boilerplate code.
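
Concretely, training follows Lightning's standard module/trainer pattern. The sketch below illustrates that pattern; it is not Pico's actual training module:

    import torch
    import pytorch_lightning as pl

    class TinyDecoderLM(pl.LightningModule):
        """Illustrative causal LM wrapper; Pico's real module differs."""

        def __init__(self, vocab_size: int = 50_304, d_model: int = 128):
            super().__init__()
            self.embed = torch.nn.Embedding(vocab_size, d_model)
            self.head = torch.nn.Linear(d_model, vocab_size)

        def training_step(self, batch, batch_idx):
            input_ids = batch["input_ids"]
            logits = self.head(self.embed(input_ids))
            # Next-token prediction: targets are inputs shifted by one.
            loss = torch.nn.functional.cross_entropy(
                logits[:, :-1].reshape(-1, logits.size(-1)),
                input_ids[:, 1:].reshape(-1),
            )
            self.log("train_loss", loss)
            return loss

        def configure_optimizers(self):
            return torch.optim.AdamW(self.parameters(), lr=3e-4)

    # trainer = pl.Trainer(max_steps=1000)
    # trainer.fit(TinyDecoderLM(), train_dataloader)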

Minimal Dependencies

Lightweight framework with only essential dependencies, making it easy to install and modify.

Research Ready

Designed with researchers in mind, providing the tools and flexibility needed for academic exploration.

Model Suite

Our model suite provides a controlled environment where the only variable is model size. This allows researchers to isolate the effects of scale on learning dynamics, offering insight into how the same architecture behaves at different sizes under identical training conditions.

* All models are trained on 420B tokens of the preprocessed Dolma dataset (the OLMo training corpus), which has been pre-tokenized, pre-shuffled, and uploaded to Hugging Face.
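
The pre-tokenized data can be streamed from the hub rather than downloaded in full. A sketch, assuming the dataset is published under an identifier such as "pico-lm/pretokenized-dolma" (an assumption; check the hub for the actual name):

    from datasets import load_dataset

    # Hypothetical dataset identifier; streaming avoids materializing
    # all 420B tokens on disk.
    ds = load_dataset(
        "pico-lm/pretokenized-dolma", split="train", streaming=True
    )

    example = next(iter(ds))
    print(example.keys())  # expecting pre-tokenized "input_ids" (assumption)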

Learning Dynamics

The Pico codebase provides comprehensive tooling to capture and analyze training metrics, enabling researchers to understand how models learn across different scales.

  • Convergence Rates

    Compute layer convergence rates across model sizes using automatically stored activation checkpoints.

  • Effective Dimensions

    Analyze the effective rank of layers throughout training by leveraging pre-computed residual stream activations (a worked sketch follows this list).

  • Gradient Magnitude

    Track how gradient magnitudes evolve during training to understand optimization dynamics and identify potential training instabilities.
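
For instance, the effective-rank analysis can be computed directly from saved residual-stream activations. A minimal sketch using the entropy-of-singular-values definition of effective rank; the activation shape is an assumption:

    import torch

    def effective_rank(acts: torch.Tensor, eps: float = 1e-12) -> float:
        """Effective rank of a (num_tokens, hidden_dim) activation matrix:
        the exponential of the entropy of its normalized singular values."""
        s = torch.linalg.svdvals(acts.float())
        p = s / s.sum()
        entropy = -(p * (p + eps).log()).sum()
        return entropy.exp().item()

    # Demo on random data; in practice, load activations from a Pico
    # checkpoint instead.
    acts = torch.randn(4096, 768)
    print(effective_rank(acts))  # high (near 768) for i.i.d. Gaussian data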

Built with ❤ by the Pico team