This guide will help you get started with pico-train, a framework for training language models with a focus on simplicity and reproducibility.
Overview
Introduction
Pico is defined as something "very small." Scientifically, a picometer is a trillionth of a meter.
As language models have grown in size, the benchmark for what counts as a small language model has continually increased. With large language models now containing trillions of parameters, a model with "only" a few billion parameters is often considered small. This trend toward ever-larger models does not appear to be going away anytime soon. One major difficulty is that the larger the model, the harder it becomes to determine what works well and what doesn't; it is challenging to be rigorous about model design and training paradigms. Because re-training models is expensive, we often cannot afford comprehensive ablation studies.
One promising step toward solving this problem is releasing language models along with all of their training checkpoints. Rather than merely sharing the final model weights, some model suites, like Olmo and Pythia, have begun providing comprehensive intermediate checkpoints. These frameworks, however, still do not go far enough. What we need is a more complete solution that builds on this concept.
If we want to understand how language models learn—and how language models of different sizes learn—we need a system for training them ourselves. We need a framework—almost a laboratory—that lets us tweak a language model, apply that change to models of various sizes, and observe how it affects learning outcomes.
Doing this kind of work is inherently difficult if a "small" model still contains several billion parameters: training models at that scale across multiple configurations is simply not feasible or cost-effective. This is why Pico aims to facilitate experimentation at more manageable scales, allowing us to systematically study language model behavior and, ultimately, improve the design, training, and optimization of larger models.
Design Philosophy
pico-train is a package for training language models. Our goal is not to develop a novel model architecture or a highly optimized training stack, but rather to offer a simple, educational way of training language models cost-effectively and reproducibly. Ultimately, we believe it should feel trivial to build your own model suite like Pythia or Olmo using pico-train.
Many language models are written and trained using libraries that obfuscate the model definition and training code; just finding where optimizer.step() happens can be difficult. Our guiding principle with pico-train was to let users, especially those earlier in their AI careers, understand the model architecture and training pipeline as quickly as possible. We offer a full money-back guarantee if you are not able to get up and running with pico-train in less than 15 minutes (an easy guarantee to uphold, since Pico is entirely open-source and free).
pico-train is meant to be a 'maximally forkable' project. While other libraries try to do everything, we implement only the essentials in a straightforward, unopinionated manner. We believe that by providing a solid foundation for experimentation, we make it easy for users to fork our project and quickly begin tinkering on their own.
Code Structure
Code Walkthrough
The code for pico-train lives largely under the src directory. Let's walk through its main components.
Pico Model
Out of the box, pico-train implements a Llama-style decoder, which we call pico-decoder. In the future we hope to expand the set of models packaged with pico-train beyond a decoder; pico-diffusion and pico-statespace models are in the works. For something to be called a pico-* model, it must be a simple, well-documented, and universal base version of the given model.
The pico-decoder model architecture implements many of the expected features of modern-day performant auto-regressive language models. It includes:
- Stacked layers of Grouped Query Attention (GQA): GQA layers are self-attention layers in which groups of query heads share a smaller number of key and value heads; this reduces the number of parameters and the amount of computation.
- Rotary Position Embeddings (RoPE): RoPE embeddings are a method to encode positional information in transformers by rotating query and key vectors in the complex plane — this is particularly helpful for longer sequences.
- Efficient (Flash) Attention: FlashAttention is an optimized algorithm for performing the self-attention mechanism efficiently on GPUs.
- SwiGLU non-linearities: SwiGLU is an activation function that combines the Swish non-linearity with a Gated Linear Unit (GLU); it acts as both a gating mechanism and a non-linearity.
- Root Mean Square Layer Pre-Normalization (RMSNorm): RMSNorm is a variant of layer normalization that normalizes inputs using only their root mean square, skipping the mean-centering performed by standard layer normalization.
We re-implement all of these features ourselves in pure PyTorch (with the exception of flash attention, where we simply use PyTorch's native implementation). We re-invented the wheel for two reasons: first, finding a simple, well-documented implementation of a state-of-the-art model is not trivial; second, we want to encourage users to experiment and develop new architectures. Our model has minimal dependencies and extensive documentation, making it very easy to tinker with. We believe that the pico-decoder model is about as minimal an implementation of an auto-regressive transformer as is possible without compromising on any of the key features of state-of-the-art models.
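To give a sense of how compact these components are, here is a minimal PyTorch sketch of RMSNorm and a SwiGLU feed-forward block in the spirit of the pico-decoder implementation. This is an illustrative sketch rather than the exact pico-train code; only the w_2 name is taken from the document (it matches the layer suffix used later for learning dynamics), while the other names are assumptions chosen for clarity.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Normalize inputs by their root mean square (no mean-centering)."""
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """Gated feed-forward block: w_2(silu(w_1(x)) * w_3(x))."""
    def __init__(self, d_model: int, hidden_dim: int):
        super().__init__()
        self.w_1 = nn.Linear(d_model, hidden_dim, bias=False)  # "up" projection (assumed name)
        self.w_3 = nn.Linear(d_model, hidden_dim, bias=False)  # gate projection (assumed name)
        self.w_2 = nn.Linear(hidden_dim, d_model, bias=False)  # final projection back to d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_2(F.silu(self.w_1(x)) * self.w_3(x))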
Data Loading
Data loading in the pico-train codebase is simple: we strongly recommend using the existing pico-lm/pretokenized-dolma dataset and loading it in streaming mode. This means that rather than downloading the entire dataset up front, only a small shard (less than 100MB) of data is downloaded at a time. Note that all of the data in pico-lm/pretokenized-dolma has already been tokenized, specifically with the allenai/OLMo-7B-0724-hf BPE-based tokenizer available on HuggingFace. Therefore, no additional preprocessing is needed when using this dataset; each sample contains an input_ids entry corresponding to the tokenized text, which can be sent to the model as is. Note also that we have pre-processed the data into chunks of 2049 tokens, so the length of input_ids in each sample is 2049. During training, we pass only the first 2048 tokens to the model (the model has a sequence length of 2048) and use the same tokens shifted over by one position as labels. This is why we preprocess the tokens into chunks of 2049 tokens rather than 2048.
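As a rough illustration of what this looks like in practice, the snippet below streams the dataset with the Hugging Face datasets library and splits one sample into inputs and labels. This is a sketch of the idea, not pico-train's actual dataloader code, and the "train" split name is an assumption.

import torch
from datasets import load_dataset

# Stream the pre-tokenized dataset: shards are downloaded lazily as they are consumed.
dataset = load_dataset("pico-lm/pretokenized-dolma", split="train", streaming=True)

sample = next(iter(dataset))
token_ids = sample["input_ids"]           # 2049 pre-tokenized ids per sample
input_ids = torch.tensor(token_ids[:-1])  # the first 2048 tokens are fed to the model
labels = torch.tensor(token_ids[1:])      # the same tokens shifted by one serve as labels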
Batching Strategy
By default, we train our models using a batch size of 1024. Since each sample contains 2048 tokens, this means that at each learning step the model is trained on over 2 million tokens (1024 × 2048 = 2,097,152). That's pretty big. Even with a small model, it is unlikely that all of this data will fit onto a single GPU. One common strategy to deal with this is gradient accumulation.
Gradient accumulation is a common strategy in which the model is fed multiple batches of data before performing a gradient update step. This has the effect of increasing the total amount of data the model sees per update without having to fit all of it on the device at once. By default, we implement this strategy in the training loop. There are some important things to keep in mind when doing this. In the config, when we specify a batch size (e.g. 1024) and a number of gradient accumulation steps (e.g. 8), the actual 'sub-batch' size observed by the model in each loop iteration is the batch_size divided by the accumulation steps (e.g. 1024/8 = 128). The difference between batch size and 'sub-batch' size is a bit subtle and can lead to bugs. We have tried to keep the 'batch' vs. 'sub-batch' nomenclature consistent in the training loop. Thus, while the main trainer loops over sub-batch steps, within each iteration we check whether we have seen enough sub-batches to constitute a full batch step.
Additionally, keep in mind that if you are training with both multiple GPUs and gradient accumulation, both will affect the size of the sub-batch. Going back to our previous example, say we specify a batch size of 1024, set gradient accumulation steps to 8, and train using 4 GPUs. In this case, because we use data parallelism across GPUs, each GPU will observe a sub-batch size of 1024/8/4 = 32. The logic for computing the sub-batch size dynamically is part of the trainer and happens before the start of training.
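The arithmetic is simple but worth making explicit. Below is an illustrative sketch (not the exact trainer code) of how the per-device sub-batch size falls out of the config values used in the example above, together with a schematic of the accumulation loop.

# Illustrative sketch: deriving the per-device sub-batch size from the config values.
batch_size = 1024                 # examples per gradient update (from the config)
gradient_accumulation_steps = 8   # sub-batches accumulated before each optimizer step
num_devices = 4                   # GPUs used for data parallelism

sub_batch_size = batch_size // (gradient_accumulation_steps * num_devices)
print(sub_batch_size)  # -> 32 examples per device per forward/backward pass

# Schematic of the accumulation logic inside the training loop:
# for step, sub_batch in enumerate(dataloader):
#     loss = compute_loss(model, sub_batch) / gradient_accumulation_steps
#     loss.backward()
#     if (step + 1) % gradient_accumulation_steps == 0:   # a full batch has been seen
#         optimizer.step()
#         optimizer.zero_grad()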
Distributed Training
By default, we suggest you train distributed models with DeepSpeed, a library that provides memory and performance optimizations for multi-GPU training. DeepSpeed support is built into Fabric; in fact, if you specify a config that uses multiple GPUs, we automatically configure Fabric to use DeepSpeed.
DeepSpeed has multiple stages of parallelism: the first two shard the optimizer states and gradients across GPUs, while the third additionally shards the model parameters. One important caveat is that we do not support this third stage, i.e. model-parallel training. The reason is that it complicates the logic for extracting activations and gradients from the model. We might consider adding this functionality in the future, but given that the goal of Pico is to study small language models, implementing model parallelism is overkill. By default, we set up DeepSpeed to use stage 2 parallelism.
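For reference, configuring Lightning Fabric to use DeepSpeed's stage-2 (ZeRO-2) sharding looks roughly like the following. This is a sketch under the assumption that you are wiring up the lightning package's Fabric API yourself; in pico-train the equivalent setup is driven by the fabric section of the config.

from lightning.fabric import Fabric
from lightning.fabric.strategies import DeepSpeedStrategy

# ZeRO stage 2: shard optimizer states and gradients across GPUs,
# while every device keeps a full copy of the model parameters.
strategy = DeepSpeedStrategy(stage=2)
fabric = Fabric(
    accelerator="cuda",
    devices=4,          # GPUs per node
    num_nodes=1,
    precision="bf16-mixed",
    strategy=strategy,
)
fabric.launch()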
Checkpointing and Version Control
One of the distinctive features of Pico is the version-control scheme used in the checkpointing process. To support reproducibility and experimentation, pico-train includes built-in 'advanced' checkpointing that automatically tracks key aspects of the training process. We use Weights & Biases to log the training process, and HuggingFace to store the model checkpoints.
Locally, pico-train saves checkpoint information in a structured directory for each run.
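A plausible layout, inferred from the default directory names in the checkpointing configuration shown later in this guide (the exact nesting and step-directory naming may differ), looks like this:

runs/
└── <run_name>/
    ├── checkpoints/
    │   └── <step>/
    │       ├── fabric_state/          # training state, e.g. checkpoint.pt
    │       └── learning_dynamics/     # stored activations and gradients
    ├── eval_results/
    └── logs/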
In addition to storing these checkpoints locally, we also automatically upload them to a HuggingFace repository that you can specify in the configs. For version-controlling checkpoints, we use the following schema: as we train the model, we treat each set of checkpoints as a new 'commit' to the repository. Thus, over the course of training, even though no new files accumulate in the repo, the history of checkpoints is tracked in the form of commits and can be recovered by 'checking out' a given commit. Moreover, for each type of checkpoint (e.g. learning dynamics, model weights) we make a separate commit whose message records the training step and the checkpoint type.
(Example: the commit history of a pico model repository on HuggingFace, showing a series of commits by rdiehlmartinez, one per checkpoint type and training step.)
Let's take a look at the above example. Each commit message specifies the training step and the type of data being version-controlled. On the HuggingFace repository you will only see a single copy of each version-controlled file, corresponding to the latest version of that file (just as you would on GitHub). However, accessing an earlier version of a file is easy: find the commit id corresponding to the version of the file at a given step and run git checkout. For instance, if our model has been trained to step 50,000 but we want to access the learning dynamics data at step 9,000, all we need to do is run git checkout eb16af3 to return to that earlier stage of the training process.
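In practice you can do this either with git directly (after cloning the checkpoint repository) or through the huggingface_hub library by passing the commit id as a revision. Here is a sketch of the latter; the repo id below is a placeholder for the hf_checkpoint.repo_id of your own run.

from huggingface_hub import snapshot_download

# Download the repository exactly as it existed at a given commit.
# The repo_id below is a placeholder; use the hf_checkpoint.repo_id of your own run.
local_path = snapshot_download(
    repo_id="your-username/your-pico-run",
    revision="eb16af3",  # commit id taken from the repo's commit history
)
print(local_path)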
Learning Dynamics
Arguably the real 'secret sauce' of the pico-train framework is the advanced checkpointing we use to store checkpoint information, which can later be used to compute the learning dynamics of the trained models. A key feature of the checkpoints is that they include activations and gradients of the model over the course of training. This lets users easily analyze the learning dynamics of the model using the pico-analyze package.
As previously indicated, the information we extract from the model is its activations and gradients with respect to two sets of data: the current training data and a batch of evaluation data. At each checkpointing step, learning dynamics metrics are computed with respect to the current state of the model. In other words, when we store the learning dynamics at checkpoint step N, this reflects the state of the model after N gradient steps have been performed. The training data we use to compute gradients and activations is the data that a model which has observed N training updates will see at its next gradient step. Since this data naturally changes between checkpoint steps, we also want to extract activations and gradients on a set of data that does not change between checkpoints: this is why we additionally compute metrics with respect to a static batch of evaluation data.
In the config, we need to specify which layers we would like to compute activations and gradients for. We could do so for all layers, but the checkpoint files would end up being massive (~100 GB per checkpoint). In practice, we need to pick and choose the layers we find most interesting for learning dynamics. One of the reasons we developed pico-train is that we don't necessarily know yet which layers these are; we want users to be able to easily pick and experiment with different layers. For the initial release of our models we had to make an initial choice: in line with previous interpretability research, we focus on the layers that (in the jargon of mechanistic interpretability) 'write into the residual stream', that is, the layers that project from a model's internal state back into the embedding space. In practical terms, we select the final projection matrix of the SwiGLU layer (layers with names ending in swiglu.w_2) and the value and output matrices of the self-attention layer (layers with names ending in attention.v_proj and attention.o_proj).
Here's how you can configure learning dynamics in your YAML config file:
# Learning dynamics configuration
checkpointing:
  learning_dynamics:
    layer_suffixes:
      - "attention.v_proj"
      - "attention.o_proj"
      - "swiglu.w_2"
    sequence_idx: -1  # Extract from the last token in the sequence
    batch_size: 8     # Batch size for extracting learning dynamics
    eval_data: "pico-lm/pretokenized-paloma-tinsy"  # Static evaluation dataset
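To make the layer_suffixes idea concrete, here is a generic PyTorch sketch of how sub-modules can be selected by name suffix and their activations captured with forward hooks. This is not pico-train's actual implementation, just an illustration of the mechanism; the function name is hypothetical.

import torch.nn as nn

def register_activation_hooks(model: nn.Module, layer_suffixes: list[str], store: dict) -> None:
    """Attach a forward hook to every sub-module whose name ends with one of the suffixes."""
    for name, module in model.named_modules():
        if any(name.endswith(suffix) for suffix in layer_suffixes):
            def hook(mod, inputs, output, name=name):
                store[name] = output.detach().cpu()  # keep a CPU copy of the activation
            module.register_forward_hook(hook)

# Usage (schematic):
#   activations = {}
#   register_activation_hooks(model, ["attention.v_proj", "attention.o_proj", "swiglu.w_2"], activations)
#   model(input_ids)   # the store is populated during the forward pass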
Setup
Clone Repository
Getting started with pico-train is straightforward. First, clone the repository from GitHub:
git clone https://github.com/pico-lm/pico-train.git
cd pico-train
This will create a local copy of the pico-train codebase on your machine. The repository contains all the code you need to train language models and produce the checkpoints described above.
Environment Variables
pico-train integrates with Weights & Biases (wandb) for visualization and experiment tracking, and optionally with Hugging Face for storing model checkpoints. Setting up the appropriate environment variables will enable seamless logging.
Create a .env file at the root of your pico-train directory:
# .env
export WANDB_API_KEY=your_wandb_key
export HF_TOKEN=your_huggingface_token
To obtain your wandb access token, go to https://wandb.ai/authorize. For your Hugging Face token, visit https://huggingface.co/docs/hub/en/security-tokens.
Installing Dependencies
pico-train uses Poetry for dependency management, which ensures consistent environments across different machines. The simplest way to set up pico-train is to run the setup script:
source setup.sh
This script will check your environment, install necessary tools, and set up a Poetry virtual environment with all dependencies.
Getting Started
Training Configuration Overview
To get started using pico-train, you need to define a configuration file. pico-train uses simple YAML files to override the defaults set in the dataclasses defined in src/config.
Once you have specified a config file, e.g. my_config.yaml, you can launch the pico-train process by calling:
poetry run train --config_path my_config.yaml
Below we provide example configurations for different aspects of the training process. Click on each tab to see the relevant configuration options.
Model Configuration
Configure the architecture and parameters of the Pico Decoder model.
model:
  model_type: "pico_decoder"
  # Pico Decoder Model; NOTE: the defaults are set to those for the large pico-decoder model
  d_model: 768
  n_layers: 12
  vocab_size: 50304
  batch_size: 1024
  max_seq_len: 2048
  attention_n_heads: 12
  attention_n_kv_heads: 4
  activation_hidden_dim: 3072
  norm_eps: 1e-6
  position_emb_theta: 10000.0
Key parameters:
- d_model: Embedding dimension (768 for most sizes)
- n_layers: Number of transformer layers (varies by model size)
- attention_n_heads: Number of attention heads
- attention_n_kv_heads: Number of key/value heads (for GQA)
Training Configuration
Configure training hyperparameters, optimization, and hardware settings.
training:
  fabric:
    # NOTE: The total number of GPUs used is num_nodes * num_devices
    num_nodes: 1    # The total number of nodes you would like to use
    num_devices: 1  # The number of devices PER node
    precision: "bf16-mixed"
    accelerator: "cuda"
  optimization:
    optimizer: "adamw"
    lr: 3e-4
    lr_scheduler: "linear_with_warmup"
    lr_warmup_steps: 2500
    # gradient accumulation steps allow us to use a large batch_size
    gradient_accumulation_steps: 128
  max_steps: 200000
Key parameters:
- num_nodes & num_devices: Control distributed training
- lr: Learning rate (default: 3e-4)
- max_steps: Total training steps
- gradient_accumulation_steps: Number of steps to accumulate gradients
Data Configuration
Configure training data sources and tokenization.
data:
  dataset:
    name: "pico-lm/pretokenized-dolma"
  dataloader:
    # This is the number of examples observed by the model before taking a gradient update step
    # NOTE: In practice, if gradient_accumulation_steps > 1, only batch_size // gradient_accumulation_steps examples are fed to the model at each step
    batch_size: 1024
  tokenizer:
    name: "allenai/OLMo-7B-0724-hf"
    vocab_size: 50304
Key parameters:
- dataset.name: HuggingFace dataset ID (recommended: pico-lm/pretokenized-dolma)
- batch_size: Number of examples per gradient update
- tokenizer.name: HuggingFace tokenizer ID
Evaluation Configuration
Configure model evaluation frequency and metrics.
evaluation:
  tasks:
    - name: "paloma"
      enabled: true
      batch_size: 32
      compute_perplexity: true
  checkpoint_frequency: 1000
Key parameters:
- enabled: Turn evaluation on/off
- batch_size: Batch size for evaluation (typically smaller than for training)
- compute_perplexity: Calculate the perplexity metric
- checkpoint_frequency: Steps between evaluations
Checkpointing Configuration
Configure checkpoint saving and learning dynamics tracking.
checkpointing:
  run_name: None  # name of the run
  # Directory configuration
  runs_dir: "runs"
  checkpoints_dir: "checkpoints"
  logs_dir: "logs"
  fabric_checkpoint_dir: "fabric_state"
  fabric_checkpoint_filename: "checkpoint.pt"
  learning_dynamics_dir: "learning_dynamics"
  save_every_n_steps: 1000
  save_to_hf: False
  hf_checkpoint:
    repo_id: ""  # specify the name of your huggingface repo id
    collection_slug: null
  training:
    # used to specify whether to automatically resume training from the last checkpoint
    auto_resume: True
  evaluation:
    eval_results_dir: "eval_results"
  learning_dynamics:
    layer_suffixes: ["attention.v_proj", "attention.o_proj", "swiglu.w_2"]
    sequence_idx: -1  # i.e., last token of the sequence
    # The batch size used while extracting gradients and activations
    batch_size: 8
    # The batch of evaluation data to use to extract activations and gradients from
    eval_data: "pico-lm/pretokenized-paloma-tinsy"
Key parameters:
- save_every_n_steps: Checkpoint frequency
- save_to_hf: Enable HuggingFace upload
- hf_checkpoint.repo_id: HuggingFace repository ID
- learning_dynamics.layer_suffixes: Layers to track for analysis
Monitoring Configuration
Configure training monitoring and logging with Weights & Biases.
monitoring:
  logging:
    log_level: "INFO"
    log_every_n_steps: 100
  save_to_wandb: False  # set to True to log out to wandb
  wandb:
    project: ""  # name of the wandb project
    entity: ""   # your wandb entity
Key parameters:
- log_level: Verbosity of logging (INFO, DEBUG, etc.)
- save_to_wandb: Enable Weights & Biases integration
- wandb.project: W&B project name
- wandb.entity: W&B username or organization
Creating a Configuration File
Combine sections from the tabs above into a single YAML file. You only need to override the specific values you want to change from the defaults.
For example, we provide the following minimal configuration file in configs/demo.yaml that you can use as a starting point:
# demo.yaml

data:
  dataloader:
    batch_size: 32

checkpointing:
  run_name: "name-of-your-run"
  save_every_n_steps: 50
  save_to_hf: true
  hf_checkpoint:
    repo_id: "name-of-your-huggingface-repo-id"
  learning_dynamics:
    batch_size: 16

model:
  d_model: 96
  activation_hidden_dim: 384

evaluation:
  paloma:
    batch_size: 32

monitoring:
  save_to_wandb: true
  wandb:
    project: "name-of-your-wandb-project"
    entity: "name-of-your-wandb-entity"
  logging:
    log_every_n_steps: 10

training:
  max_steps: 100
  optimization:
    lr: 0.001
    lr_warmup_steps: 30
    gradient_accumulation_steps: 2
  fabric:
    num_devices: 1
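Assuming you save this file as configs/demo.yaml, you can then launch a short demo run with the same command shown earlier:

poetry run train --config_path configs/demo.yaml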
Configuration Tips
- For multi-GPU training, adjust num_nodes and num_devices based on your hardware
- Set save_to_hf to True and provide your repo_id to save checkpoints to HuggingFace
- Set save_to_wandb to True to track your training with Weights & Biases
- Adjust gradient_accumulation_steps to effectively increase the batch size on limited hardware
Picking Hyperparameters
The configuration above defines all the hyperparameters needed to train your model. Some of the most important to consider are shown below.
Model Architecture
- d_model: Embedding dimension
- n_layers: Number of transformer layers
- attention_n_heads: Number of attention heads
- attention_n_kv_heads: Number of key/value heads
Training & Optimization
- batch_size: Batch size (default: 1024)
- lr: Learning rate (default: 3e-4)
- max_steps: Maximum training steps (default: 200000; note that the current pico-decoder models were paused at 50,000 training steps)
- lr_scheduler: Learning rate schedule (default: "linear_with_warmup")