# Megatron-Bridge Pretraining Context

Use this pack when configuring `pretrain/megatron_bridge`.

## Product Contract

- This step runs pretraining or continued pretraining (CPT) with
  Megatron-Bridge and produces a Megatron distributed checkpoint.
- It consumes `binidx` data plus a `blend.json` from
  `data_prep/pretrain_prep`. It does not consume SFT packed Parquet.
- Prefer YAML overrides on the existing step config. Do not write custom
  training loops.

## Required Inputs

- `dataset.data_paths`: path to the emitted `blend.json`.
- `seq_length`: the training sequence length. It must match the sequence
  length assumed by the tokenizer and by data prep.
- Checkpoint mode:
  - `load_hf_weights=true` for continued pretraining from an HF base.
  - `checkpoint.pretrained_checkpoint` or equivalent recipe checkpoint field
    when resuming from a Megatron checkpoint.
- An output checkpoint directory that is distinct from the input data and the
  source checkpoints.
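
The inputs above can be sanity-checked before launch. The following is a minimal sketch, not part of Megatron-Bridge: it assumes the overrides are available as a flat dict keyed by dotted names, and `output_dir` is a hypothetical key standing in for whatever field the recipe uses for the output checkpoint directory. Treating "both checkpoint modes set" as a problem is also an assumption.

```python
from pathlib import Path


def check_required_inputs(cfg: dict) -> list[str]:
    """Return a list of problems found in a flat dict of step overrides."""
    problems = []

    # dataset.data_paths should point at the blend.json emitted by data prep.
    blend = cfg.get("dataset.data_paths")
    if not blend:
        problems.append("dataset.data_paths is not set")
    elif Path(blend).name != "blend.json":
        problems.append("dataset.data_paths should point at blend.json")

    seq_length = cfg.get("seq_length")
    if not isinstance(seq_length, int) or seq_length <= 0:
        problems.append("seq_length must be a positive integer")

    # Assumption: setting both checkpoint modes at once is ambiguous.
    from_hf = bool(cfg.get("load_hf_weights"))
    megatron_ckpt = cfg.get("checkpoint.pretrained_checkpoint")
    if from_hf and megatron_ckpt:
        problems.append("pick one checkpoint mode, not both")

    # "output_dir" is a hypothetical key used only for this illustration.
    out = cfg.get("output_dir")
    if not out:
        problems.append("output_dir is not set")
    else:
        out_path = Path(out).resolve()
        sources = []
        if blend:
            sources.append(Path(blend).resolve().parent)
        if megatron_ckpt:
            sources.append(Path(megatron_ckpt).resolve())
        if out_path in sources:
            problems.append("output_dir must differ from data and source checkpoints")

    return problems
```

An empty list means the checks passed. The function only compares paths and does not verify that the files exist or that `blend.json` is well-formed.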

## When To Pick This Step

| Requirement | Decision |
|---|---|
| Megatron checkpoint output | Use `pretrain/megatron_bridge` |
| Very large distributed training | Use `pretrain/megatron_bridge` |
| Small HF-native smoke test or simple CPT | Consider `pretrain/automodel` |
| Raw text input | Run `data_prep/pretrain_prep` first |

## Configuration Rules

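- Keep global batch size divisible by the product of micro batch size and
  data-parallel size. With micro batch size 1, this reduces to the
  data-parallel size.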
- Start with micro batch size 1 for new hardware/model shapes.
- For domain CPT on small corpora, use a lower learning rate and a shorter run
  than for from-scratch pretraining.
- Take the train, valid, and test split paths from the same data-prep output.
- Do not mix SFT packed Parquet data with pretraining `binidx` data.
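
The batch-size rule above can be checked with a few lines of arithmetic. This sketch assumes the usual Megatron-style accounting, where the data-parallel size is the world size divided by the model-parallel degrees and each optimizer step consumes `micro_batch_size * data_parallel_size` samples per accumulation step. The function and parameter names are illustrative, not Megatron-Bridge config fields.

```python
def data_parallel_size(
    world_size: int,
    tensor_parallel: int = 1,
    pipeline_parallel: int = 1,
    context_parallel: int = 1,
) -> int:
    """GPUs left for data parallelism after model parallelism is assigned."""
    model_parallel = tensor_parallel * pipeline_parallel * context_parallel
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by "
            f"model-parallel size {model_parallel}"
        )
    return world_size // model_parallel


def grad_accum_steps(
    global_batch_size: int, micro_batch_size: int, dp_size: int
) -> int:
    """Micro-batches each data-parallel rank runs per optimizer step."""
    samples_per_pass = micro_batch_size * dp_size
    if global_batch_size % samples_per_pass != 0:
        raise ValueError(
            f"global_batch_size={global_batch_size} is not divisible by "
            f"micro_batch_size * dp_size = {samples_per_pass}"
        )
    return global_batch_size // samples_per_pass


# Example: 16 GPUs with TP=2 and PP=2 leave a data-parallel size of 4.
dp = data_parallel_size(16, tensor_parallel=2, pipeline_parallel=2)
steps = grad_accum_steps(global_batch_size=64, micro_batch_size=1, dp_size=dp)
```

Running the check before submitting a job surfaces an invalid batch configuration without waiting for the distributed launch to fail.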

## Failure Modes

- `missing_blend_json`: run `data_prep/pretrain_prep`.
- `sequence_length_mismatch`: align data prep, recipe, and model sequence
  length.
- `transformer_engine_userbuffer_failure`: set `UB_SKIPMC=1` in the runtime env
  when CUDA multicast is unavailable.
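
The userbuffer workaround above amounts to adding one variable to the environment of the training process. A minimal sketch, assuming the caller already knows whether CUDA multicast is available (the `multicast_available` flag and the helper name are hypothetical, and no detection is attempted here):

```python
import os


def build_runtime_env(base_env, multicast_available: bool) -> dict:
    """Copy base_env and set UB_SKIPMC=1 when multicast is unavailable."""
    env = dict(base_env)
    if not multicast_available:
        env["UB_SKIPMC"] = "1"
    return env


# Example: derive the launch environment from the current process environment.
launch_env = build_runtime_env(os.environ, multicast_available=False)
```

The resulting dict can be passed as the `env` argument of `subprocess.run` or exported through the job scheduler, depending on how the step is launched.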
