This guide will help you get started with pico-train, a framework for training language models with a focus on simplicity and reproducibility.

Overview

Introduction

Pico is defined as something "very small." Scientifically, a picometer is a trillionth of a meter.

As language models have grown in size, the benchmark for what a small language model is has continually increased. With large language models now containing trillions of parameters, a model with "only" a few billion parameters is often considered small. This trend toward larger models does not appear to be going away anytime soon. One major difficulty is that the larger the model, the harder it becomes to determine what works well and what doesn't—it is challenging to be rigorous in model design and training paradigms. Because re-training models is expensive, we often cannot afford comprehensive ablation studies.

One promising step toward solving this problem is releasing language models along with all of their training checkpoints. Rather than merely sharing the final model weights, some model suites, like OLMo and Pythia, have begun providing comprehensive intermediate checkpoints. These suites, however, still do not go far enough; what we need is a more complete solution that builds on this idea.

If we want to understand how language models learn—and how language models of different sizes learn—we need a system for training them ourselves. We need a framework—almost a laboratory—that lets us tweak a language model, apply that change to models of various sizes, and observe how it affects learning outcomes.

Doing this kind of work is inherently difficult if a "small" model still contains several billion parameters. Training models at that scale across multiple configurations is simply not feasible or cost-effective. This is why Pico aims to facilitate experimentation at more manageable scales, allowing us to systematically study language model behavior and, ultimately, improve the design, training, and optimization of larger models.

Design Philosophy

pico-train is a package for training language models. Our goal is not to develop a novel training architecture or a highly optimized training stack, but rather a simple, educational way of training language models cost-effectively and reproducibly. Ultimately, we believe it should feel trivial to build your own model suite like Pythia or OLMo using pico-train.

Many language models are written and trained using libraries that obfuscate the model definition and training code; just finding where optimizer.step() happens can be difficult. Our guiding principle with pico-train was to let users - especially early-career AI practitioners - understand the model architecture and training pipeline as quickly as possible. We offer a full money-back guarantee if you are not able to get up and running with pico-train in less than 15 minutes — an easy guarantee to uphold, since Pico is entirely open-source and free.

pico-train is meant to be a 'maximally forkable' project. While other libraries try to do everything for everyone, we implement only the essentials in a straightforward, unopinionated manner. We believe that by providing a solid foundation for experimentation, we make it easy for users to fork the project and begin tinkering on their own.

Code Structure

Code Walkthrough

The code for pico-train is largely under the src directory. Let's explore the structure together:

src/

Pico Model

Out of the box, pico-train implements a Llama-style decoder, which we call pico-decoder. In the future we hope to package more than just a decoder with pico-train; pico-diffusion and pico-statespace models are in the works. For something to be called a pico-* model, it must be a simple, well-documented, and universal base version of the given model.

The pico-decoder model architecture implements many of the expected features of modern-day performant auto-regressive language models. It includes:

  • Stacked layers of Grouped Query Attention (GQA): GQA layers are self-attention layers in which groups of query heads share a smaller set of key and value heads; this reduces the number of parameters and computation.
  • Rotary Position Embeddings (RoPE): RoPE encodes positional information in transformers by rotating query and key vectors in the complex plane — this is particularly helpful for longer sequences.
  • Efficient (Flash) Attention: FlashAttention is an optimized algorithm for computing self-attention efficiently on GPUs.
  • SwiGLU non-linearities: SwiGLU is an activation function that combines the Swish non-linearity with Gated Linear Units (GLU); it acts as both a gating mechanism and a non-linearity.
  • Root Mean Square Layer Pre-Normalization (RMSNorm): RMSNorm is a simplified layer normalization that rescales inputs by their root mean square, without centering them.

We re-implement all of these features ourselves in pure PyTorch (with the exception of Flash Attention, where we simply use PyTorch's native implementation). We re-invented the wheel for two reasons: first, finding a simple, well-documented implementation of a state-of-the-art model is surprisingly hard; second, we want to encourage users to experiment and develop new architectures. The model has minimal dependencies and maximal documentation, making it easy to tinker with. We believe the pico-decoder model is about as minimal as an auto-regressive transformer can be without compromising on any of the key features of state-of-the-art models.
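
To give a sense of what these components look like in code, here is a minimal PyTorch sketch of an RMSNorm layer and a SwiGLU feed-forward block. The class and attribute names are illustrative and may not match the pico-decoder source exactly (though the final SwiGLU projection is referred to as w_2 later in this guide).

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RMSNorm(nn.Module):
        """Rescales inputs by their root mean square and applies a learned gain."""
        def __init__(self, dim: int, eps: float = 1e-6):
            super().__init__()
            self.eps = eps
            self.weight = nn.Parameter(torch.ones(dim))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
            return (x / rms) * self.weight

    class SwiGLU(nn.Module):
        """Gated feed-forward block: swish(x W_0) gates (x W_1), then W_2 projects back."""
        def __init__(self, d_model: int, hidden_dim: int):
            super().__init__()
            self.w_0 = nn.Linear(d_model, hidden_dim, bias=False)  # gate projection
            self.w_1 = nn.Linear(d_model, hidden_dim, bias=False)  # up projection
            self.w_2 = nn.Linear(hidden_dim, d_model, bias=False)  # down/output projection

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.w_2(F.silu(self.w_0(x)) * self.w_1(x))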

Data Loading

Data loading in pico-train is simple — we strongly recommend using the existing pico-lm/pretokenized-dolma dataset and loading it in streaming mode. This means that rather than downloading the entire dataset, only a small shard (less than 100MB) of data is downloaded at a time. Note that all of the data in pico-lm/pretokenized-dolma has already been tokenized using the allenai/OLMo-7B-0724-hf BPE-based tokenizer available on HuggingFace. No additional preprocessing is therefore needed when using this dataset; each sample contains an input_ids entry holding the tokenized text, which can be passed to the model as is. Note also that we have pre-processed the data into chunks of 2049 tokens, so the length of input_ids in each sample is 2049. During training, we pass only the first 2048 tokens to the model (the model has a sequence length of 2048) and use these same tokens shifted over by one as labels. This is why we preprocess the data into chunks of 2049 tokens rather than 2048.
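
As a rough illustration of the above (not the exact dataloader used in pico-train), the following sketch streams the pretokenized dataset and splits one 2049-token sample into model inputs and shifted labels; it assumes a 'train' split and that the datasets and torch packages are installed.

    import torch
    from datasets import load_dataset

    # Stream the dataset so that only small shards are downloaded at a time.
    dataset = load_dataset("pico-lm/pretokenized-dolma", split="train", streaming=True)

    sample = next(iter(dataset))
    tokens = torch.tensor(sample["input_ids"])  # 2049 pre-tokenized ids per sample

    inputs = tokens[:-1]  # first 2048 tokens are fed to the model
    labels = tokens[1:]   # the same tokens shifted over by one serve as targets

    assert inputs.shape[0] == labels.shape[0] == 2048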

Batching Strategy

By default, we train our models using a batch size of 1024. Since each sample contains 2048 tokens, this means that at each learning step the model is trained on over 2 million tokens (1024*2048). That's pretty big. Even for a small model, it is unlikely that all of this data will fit on a single GPU. One common strategy to deal with this is gradient accumulation.

Gradient accumulation is a common strategy in which the model is fed multiple smaller batches of data before performing a gradient update step. This has the effect of increasing the total amount of data the model sees per batch step without having to fit all of it in memory at once. We implement this strategy in the training loop by default. There are some important things to keep in mind when doing this. When the config specifies a batch size (e.g. 1024) and gradient accumulation steps (e.g. 8), the actual 'sub-batch' size observed by the model in each loop iteration is the batch size divided by the accumulation steps (e.g. 1024/8 = 128). The difference between batch size and 'sub-batch' size is a bit subtle and can lead to bugs, so we have tried to keep the 'batch' vs. 'sub-batch' nomenclature consistent in the training loop. The main trainer loops over sub-batch steps and, within each iteration, checks whether enough sub-batches have been seen to constitute a full batch step.

Additionally, keep in mind that if you are training with both multiple GPUs and gradient accumulation, both will affect the size of the sub-batch. Going back to our previous example, say we specify a batch size of 1024, set gradient accumulation steps to 8, and train on 4 GPUs. Because we use data parallelism across GPUs, each GPU will then observe a sub-batch size of 1024/8/4 = 32. The logic for computing the sub-batch size dynamically is part of the trainer and happens before the start of training.
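
To make the arithmetic concrete, here is a small sketch of the kind of computation the trainer performs before training begins (the actual implementation may differ in detail):

    # Values taken from the example above.
    batch_size = 1024                 # examples per full gradient update (from the config)
    gradient_accumulation_steps = 8   # sub-batches accumulated before each optimizer step
    num_gpus = 4                      # data-parallel devices = num_nodes * num_devices

    # Examples each GPU sees per forward/backward pass.
    sub_batch_size = batch_size // (gradient_accumulation_steps * num_gpus)
    assert sub_batch_size == 32

    # Tokens processed per full batch step: 1024 * 2048 ≈ 2.1M tokens.
    tokens_per_batch_step = batch_size * 2048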

Distributed Training

For distributed training, we suggest using DeepSpeed - a framework for efficient multi-GPU training. DeepSpeed support is built into Lightning Fabric; in fact, if your config specifies multiple GPUs, we automatically configure Fabric to use DeepSpeed.

DeepSpeed has multiple stages of parallelism: the first two distribute the optimizer and gradient states across GPUs, while the third also distributes the model parameters themselves. One important caveat is that we do not support this third stage — that is, model-parallel training — because it complicates the logic for extracting activations and gradients from the model. We might add this functionality in the future, but given that Pico focuses on studying small language models, model parallelism is overkill. By default, we set up DeepSpeed to use stage 2.
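
pico-train wires this up for you from the YAML config, but for reference, configuring Lightning Fabric to use DeepSpeed stage 2 directly looks roughly like the following sketch (parameter values are illustrative):

    from lightning.fabric import Fabric
    from lightning.fabric.strategies import DeepSpeedStrategy

    # Stage 2 shards optimizer and gradient states across GPUs;
    # model parameters remain replicated on every device.
    fabric = Fabric(
        accelerator="cuda",
        devices=4,
        precision="bf16-mixed",
        strategy=DeepSpeedStrategy(stage=2),
    )
    fabric.launch()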

Checkpointing and Version Control

One of the distinctive features of Pico is the version-control scheme used in the checkpointing process. To support reproducibility and experimentation, pico-train includes built-in 'advanced' checkpointing that automatically tracks key aspects of the training process. We use Weights and Biases to log the training process, and HuggingFace to store the model checkpoints.

Locally, pico-train saves checkpoint information in the following structure:

./

In addition to storing these checkpoints locally, we also automatically upload them to a HuggingFace repository that you can specify in the configs. For version-controlling checkpoints, we use the following scheme: as we train, we treat each set of checkpoints as a new 'commit' to the model. Thus, over the course of training, no new checkpoint files pile up in the repo; instead, the history of checkpoints is tracked in the form of commits and can be recovered by 'checking out' a given commit. Moreover, for each type of checkpoint (e.g. learning dynamics, model weights), we create a separate commit whose message records the corresponding training step and checkpoint type.

"Saving Learning Dynamics Data (val) -- Step 9000" [e3eee44]
Committed by rdiehlmartinez 3 days ago
"Saving Learning Dynamics Data (train) -- Step 9000" [eb16af3]
Committed by rdiehlmartinez 3 days ago
"Saving Evaluation Results -- Step 9000" [82a6482]
Committed by rdiehlmartinez 3 days ago
"Saving Fabric Checkpoint -- Step 9000" [af97c43]
Committed by rdiehlmartinez 3 days ago
"Saving HF Model -- Step 9000" [be7293e]
Committed by rdiehlmartinez 3 days ago

Let's take a look at the example above. Each commit message specifies the training step and the type of data being version-controlled. On the HuggingFace repository you will only see a single copy of each file that we version-control, corresponding to its latest version (just as you would on GitHub). However, accessing an earlier version of a file is easy — all you have to do is find the commit id corresponding to the version of the file at a given step and run git checkout. For instance, if our model has been trained to step 50,000 but we want the learning dynamics data at step 9000, we simply run git checkout eb16af3 to return to this earlier stage of the training process.
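
If you prefer not to clone the repository, the same commits can be reached programmatically with the huggingface_hub library; the repo_id below is a placeholder for whatever repository you specified in your config:

    from huggingface_hub import snapshot_download

    # Download the repository exactly as it looked at the step-9000
    # learning dynamics commit (hash taken from the commit history above).
    local_path = snapshot_download(
        repo_id="your-username/your-pico-run",  # placeholder; use your own repo id
        revision="eb16af3",
    )
    print(local_path)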

Learning Dynamics

Arguably the real 'secret sauce' of the pico-train framework is its advanced checkpointing, which stores the information needed to later compute the learning dynamics of the trained models. A key feature of the checkpoints is that they include the model's activations and gradients over the course of training. This enables users to easily analyze the learning dynamics of the model using the pico-analyze package.

As previously indicated, the information we extract from the model consists of its activations and gradients with respect to two sets of data — the current training data and a batch of evaluation data. At each checkpointing step, learning dynamics metrics are computed with respect to the current state of the model; in other words, the learning dynamics stored at checkpoint step N reflect the state of the model after N gradient steps have been performed. The training data used to compute these gradients and activations is the data the model will observe at its next gradient step. Since this data naturally changes between checkpoint steps, we also extract activations and gradients on a set of data that stays fixed across checkpoints: a static batch of evaluation data on which we compute the same metrics.
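
pico-train performs this extraction internally, but as a rough sketch of the underlying idea, activations of selected layers can be captured in PyTorch with forward hooks (this illustrates the concept only and is not the trainer's exact implementation):

    import torch.nn as nn

    def register_activation_hooks(model: nn.Module, layer_suffixes: list[str]) -> dict:
        """Attach forward hooks to every module whose name ends with one of the suffixes."""
        activations = {}

        def make_hook(name: str):
            def hook(module, inputs, output):
                # Detach so stored activations don't keep the autograd graph alive.
                activations[name] = output.detach().cpu()
            return hook

        for name, module in model.named_modules():
            if any(name.endswith(suffix) for suffix in layer_suffixes):
                module.register_forward_hook(make_hook(name))
        return activations

    # Usage: activations = register_activation_hooks(model, ["attention.v_proj", "swiglu.w_2"])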

In the config, we specify which layers to compute activations and gradients for. We could do so for all layers, but the checkpoint files would become massive (~100 GB per checkpoint). In practice, we have to pick and choose which layers are most interesting for learning dynamics. One of the reasons we developed pico-train is that we don't necessarily know which layers these are yet — we want users to easily experiment with different layers. For the initial release of our models we had to make a choice: in line with previous interpretability research, we focus on the layers that (in the jargon of mechanistic interpretability) 'write into the residual stream', that is, the layers that project from the model's internal state back to the embedding space. In practical terms, we select the final projection matrix of the SwiGLU layer (layers with names ending in swiglu.w_2) and the value and output matrices of the self-attention layer (layers with names ending in attention.v_proj and attention.o_proj).

Here's how you can configure learning dynamics in your YAML config file:

    # Learning dynamics configuration
checkpointing:
  learning_dynamics:
    layer_suffixes:
      - "attention.v_proj"
      - "attention.o_proj"
      - "swiglu.w_2"
    sequence_idx: -1  # Extract from the last token in the sequence
    batch_size: 8     # Batch size for extracting learning dynamics
    eval_data: "pico-lm/pretokenized-paloma-tinsy"  # Static evaluation dataset
  

Setup

Clone Repository

Getting started with pico-train is straightforward. First, clone the repository from GitHub:

    git clone https://github.com/pico-lm/pico-train.git
cd pico-train
  

This will create a local copy of the pico-train codebase on your machine. The repository contains everything you need to train your own language models and save out the checkpoints described above.

Environment Variables

pico-train integrates with Weights & Biases (wandb) for visualization and experiment tracking, and optionally with Hugging Face for uploading and accessing model checkpoints. Setting up the appropriate environment variables enables seamless logging and checkpoint uploads.

Create a .env file at the root of your pico-train directory:

    # .env
export WANDB_API_KEY=your_wandb_key
export HF_TOKEN=your_huggingface_token
  

To obtain your wandb access token, go to https://wandb.ai/authorize. For your Hugging Face token, visit https://huggingface.co/docs/hub/en/security-tokens.

Installing Dependencies

pico-train uses Poetry for dependency management, which ensures consistent environments across different machines. The simplest way to set up pico-train is to run the setup script:

    source setup.sh
  

This script will check your environment, install necessary tools, and set up a Poetry virtual environment with all dependencies.

Getting Started

Training Configuration Overview

To get started using pico-train, you need to define a configuration file. pico-train uses simple YAML files to override the defaults set in the dataclasses defined in src/config.

Once you have specified a config file - e.g. my_config.yaml - you can launch the training process by calling:

    poetry run train --config_path my_config.yaml
  

Below we provide example configurations for different aspects of the training process. Click on each tab to see the relevant configuration options.

Model Configuration

Configure the architecture and parameters of the Pico Decoder model.

    model:
  model_type: "pico_decoder"
  # Pico Decoder Model; NOTE: the defaults are set to those for the large pico-decoder model
  d_model: 768
  n_layers: 12
  vocab_size: 50304
  batch_size: 1024
  max_seq_len: 2048
  attention_n_heads: 12
  attention_n_kv_heads: 4
  activation_hidden_dim: 3072
  norm_eps: 1e-6
  position_emb_theta: 10000.0
  

Key parameters:

  • d_model: Embedding dimension (768 for most sizes)
  • n_layers: Number of transformer layers (varies by model size)
  • attention_n_heads: Number of attention heads
  • attention_n_kv_heads: Number of key/value heads (for GQA)
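
For example, with the defaults above each attention head has dimension 768 / 12 = 64, and the 12 query heads share 4 key/value heads, so each key/value head serves a group of 3 query heads.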

Training Configuration

Configure training hyperparameters, optimization, and hardware settings.

    training:
  fabric:
    # NOTE: The total number of GPUs used is num_nodes*num_devices
    num_nodes: 1  # The total number of nodes you would like to use
    num_devices: 1  # The number of devices PER node
    precision: "bf16-mixed"
    accelerator: "cuda"
  optimization:
    optimizer: "adamw"
    lr: 3e-4
    lr_scheduler: "linear_with_warmup"
    lr_warmup_steps: 2500
    # gradient accumulation steps allow us to use a large batch_size
    gradient_accumulation_steps: 128
  max_steps: 200000
  

Key parameters:

  • num_nodes & num_devices: Control distributed training
  • lr: Learning rate (default: 3e-4)
  • max_steps: Total training steps
  • gradient_accumulation_steps: Number of steps to accumulate gradients
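
For a sense of scale: with the default batch size of 1024 sequences of 2048 tokens each, training for the default 200,000 steps corresponds to roughly 200,000 × 1024 × 2048 ≈ 420 billion tokens.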

Data Configuration

Configure training data sources and tokenization.

    data:
  dataset:
    name: "pico-lm/pretokenized-dolma"
  dataloader:
    # This is the number of examples observed by the model before taking a gradient update step
    # NOTE: if gradient_accumulation_steps > 1, only batch_size // gradient_accumulation_steps examples are fed to the model at each step
    batch_size: 1024
  tokenizer:
    name: "allenai/OLMo-7B-0724-hf"
    vocab_size: 50304
  

Key parameters:

  • dataset.name: HuggingFace dataset ID (recommended: pico-lm/pretokenized-dolma)
  • batch_size: Number of examples per gradient update
  • tokenizer.name: HuggingFace tokenizer ID

Evaluation Configuration

Configure model evaluation frequency and metrics.

    evaluation:
  tasks:
    - name: "paloma"
      enabled: true
      batch_size: 32
      compute_perplexity: true
  checkpoint_frequency: 1000
  

Key parameters:

  • enabled: Turn evaluation on/off
  • batch_size: Batch size for evaluation (typically smaller than training)
  • compute_perplexity: Calculate perplexity metric
  • checkpoint_frequency: Steps between evaluations

Checkpointing Configuration

Configure checkpoint saving and learning dynamics tracking.

    checkpointing:
  run_name: None  # name of the run
  # Directory configuration
  runs_dir: "runs"
  checkpoints_dir: "checkpoints"
  logs_dir: "logs"
  fabric_checkpoint_dir: "fabric_state"
  fabric_checkpoint_filename: "checkpoint.pt"
  learning_dynamics_dir: "learning_dynamics"
  save_every_n_steps: 1000
  save_to_hf: False
  hf_checkpoint:
    repo_id: ""  #specify the name of your huggingface repo id
    collection_slug: null
  training:
    # used to specify whether to automatically resume training from the last checkpoint
    auto_resume: True
  evaluation:
    eval_results_dir: "eval_results"
  learning_dynamics:
    layer_suffixes: ["attention.v_proj", "attention.o_proj", "swiglu.w_2"]
    sequence_idx: -1  # i.e., last token of the sequence
    # The batch size used while extracting gradients and activations
    batch_size: 8
    # The batch of evaluation data to use to extract activations and gradients from
    eval_data: "pico-lm/pretokenized-paloma-tinsy"
  

Key parameters:

  • save_every_n_steps: Checkpoint frequency
  • save_to_hf: Enable HuggingFace upload
  • hf_checkpoint.repo_id: HuggingFace repository ID
  • learning_dynamics.layer_suffixes: Layers to track for analysis

Monitoring Configuration

Configure training monitoring and logging with Weights & Biases.

    monitoring:
  logging:
    log_level: "INFO"
    log_every_n_step: 100
  save_to_wandb: False  #set to True to log out to wandb
  wandb:
    project: ""  # name of the wandb project
    entity: ""  # your wandb entity
  

Key parameters:

  • log_level: Verbosity of logging (INFO, DEBUG, etc.)
  • save_to_wandb: Enable Weights & Biases integration
  • wandb.project: W&B project name
  • wandb.entity: W&B username or organization

Creating a Configuration File

Combine sections from the tabs above into a single YAML file. You only need to override the specific values you want to change from the defaults.

For example, we provide the following minimal configuration file in configs/demo.yaml that you can use as a starting point:

    # demo.yaml
data:
  dataloader:
    batch_size: 32
  
checkpointing:
  run_name: "name-of-your-run"
  save_every_n_steps: 50

  save_to_hf: true
  hf_checkpoint:
    repo_id: "name-of-your-huggingface-repo-id"

  learning_dynamics:
    batch_size: 16

model:
  d_model: 96
  activation_hidden_dim: 384

evaluation: 
  paloma:
    batch_size: 32

monitoring:

  save_to_wandb: true
  wandb:
    project: "name-of-your-wandb-project"
    entity: "name-of-your-wandb-entity"

  logging:
    log_every_n_steps: 10

training:
  max_steps: 100

  optimization:
    lr: 0.001
    lr_warmup_steps: 30

    gradient_accumulation_steps: 2
  
  fabric:
    num_devices: 1
  

Configuration Tips

  • For multi-GPU training, adjust num_nodes and num_devices based on your hardware
  • Set save_to_hf to True and provide your repo_id to save checkpoints to HuggingFace
  • Set save_to_wandb to True to track your training with Weights & Biases
  • Adjust gradient_accumulation_steps to effectively increase batch size on limited hardware

Picking Hyperparameters

The configuration above defines all the hyperparameters needed to train your model. Some of the most important to consider are shown below.

Model Architecture

  • d_model: Embedding dimension
  • n_layers: Number of transformer layers
  • attention_n_heads: Number of attention heads
  • attention_n_kv_heads: Number of key/value heads

Training & Optimization

  • batch_size: Batch size (default: 1024)
  • lr: Learning rate (default: 3e-4)
  • max_steps: Maximum training steps (default: 200000; note that the current pico-decoder models were paused at 50,000 training steps)
  • lr_scheduler: Learning rate schedule (default: "linear_with_warmup")