# Megatron-Bridge Parallelism And Performance Context

Use this pack only after the selected Megatron-Bridge step and config are known.
It helps tune distributed shape; it does not replace the step config.

## Product Contract

- Start from the checked-in tiny/default config and change only the parallelism
  knobs required by the user's hardware and model.
- Use the same tokenizer for data preparation and training, and keep the packed
  sequence length, model sequence length, and dataset sequence length equal.
- Do not tune for throughput before the job launches, loads data, and saves a
  small checkpoint successfully.

## Core Knobs

| Knob | Use |
|---|---|
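| tensor parallelism (TP) | shard large matrix ops within each layer; world size must be divisible by TP × PP × CP |
| pipeline parallelism (PP) | split layers into sequential stages across GPUs; useful for very deep models |
| context parallelism (CP) | activation-memory relief for long sequences by splitting the sequence dimension across GPUs |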
| expert parallelism (EP) | MoE expert sharding when the recipe supports it |
| sequence parallelism (SP) | activation-memory reduction within the TP group; only takes effect when TP > 1 |
| activation recomputation | memory relief at compute cost |
| distributed optimizer/FSDP | optimizer-state and gradient memory relief; FSDP also shards parameters |
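
The divisibility rule behind these knobs can be checked before launch. The sketch below is illustrative only: `ParallelShape` and `data_parallel_size` are hypothetical helpers, not Megatron-Bridge APIs. It assumes the data-parallel size is `world_size / (TP * PP * CP)` and that EP must divide the data-parallel size; the exact EP rule depends on the Megatron-Core version, so confirm it against the recipe in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelShape:
    """Hypothetical container for the parallelism knobs in the table above."""

    tp: int = 1  # tensor parallel size
    pp: int = 1  # pipeline parallel size
    cp: int = 1  # context parallel size
    ep: int = 1  # expert parallel size (MoE only)


def data_parallel_size(world_size: int, shape: ParallelShape) -> int:
    """Return the DP size, or raise ValueError if the shape is not legal.

    Assumption: DP = world_size / (TP * PP * CP), and EP must divide DP.
    """
    model_parallel = shape.tp * shape.pp * shape.cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by "
            f"TP*PP*CP = {model_parallel}"
        )
    dp = world_size // model_parallel
    if dp % shape.ep != 0:
        raise ValueError(f"EP = {shape.ep} does not divide DP = {dp}")
    return dp


if __name__ == "__main__":
    # 2 nodes x 8 GPUs with TP=4 and PP=2 leaves 2 data-parallel replicas.
    print(data_parallel_size(16, ParallelShape(tp=4, pp=2)))
```

A shape that raises here would be expected to surface as `world_size_not_divisible` at launch.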

## Tuning Order

1. Validate the artifact path and tiny data first.
2. Set TP/PP/CP/EP to a legal shape for the model and GPU count.
3. Keep micro batch size at 1 until memory is proven.
4. Enable activation recomputation before reducing sequence length.
5. Increase global batch size only after the data-parallel size is known; it
   must be divisible by micro batch size × data-parallel size.
6. Add communication overlap only after correctness and checkpointing work.
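
The batch-size arithmetic in step 5 can be sketched as follows. `num_microbatches` is a hypothetical helper, not a Megatron-Bridge function; it assumes a fixed global batch size (no batch-size ramp-up) and returns the number of gradient-accumulation steps each data-parallel rank runs per optimizer step.

```python
def num_microbatches(
    global_batch_size: int, micro_batch_size: int, dp_size: int
) -> int:
    """Return micro-batches per optimizer step on each data-parallel rank.

    Assumption: global batch size must be an exact multiple of
    micro_batch_size * dp_size.
    """
    samples_per_pass = micro_batch_size * dp_size
    if global_batch_size % samples_per_pass != 0:
        raise ValueError(
            f"global batch size {global_batch_size} is not divisible by "
            f"micro batch size * DP = {samples_per_pass}"
        )
    return global_batch_size // samples_per_pass


if __name__ == "__main__":
    # Micro batch 1 on 2 data-parallel replicas: 32 / (1 * 2) = 16 steps.
    print(num_microbatches(global_batch_size=32, micro_batch_size=1, dp_size=2))
```

Raising the global batch size with micro batch size held at 1 only increases accumulation steps, so it does not change peak activation memory.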

## SFT/PEFT Notes

- Megatron SFT/PEFT consumes packed Parquet from `data_prep/sft_packing`.
- `seq_length` must match the packing `pack_size`.
- For adapter jobs, keep base checkpoint path and adapter output path distinct.
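
A minimal preflight sketch for the two checks above. `check_sft_inputs` and its parameter names are hypothetical and only mirror the terms used in these notes; they are not part of Megatron-Bridge.

```python
from pathlib import Path
from typing import List


def check_sft_inputs(
    seq_length: int,
    pack_size: int,
    base_checkpoint: str,
    adapter_output: str,
) -> List[str]:
    """Return a list of human-readable problems; empty means the checks pass."""
    problems = []
    if seq_length != pack_size:
        problems.append(
            f"seq_length {seq_length} does not match pack_size {pack_size}"
        )
    # Resolve both paths so trailing slashes and relative segments compare equal.
    if Path(base_checkpoint).resolve() == Path(adapter_output).resolve():
        problems.append("base checkpoint path and adapter output path are the same")
    return problems


if __name__ == "__main__":
    for problem in check_sft_inputs(4096, 2048, "/ckpt/base", "/ckpt/base/"):
        print(problem)
```

Running a check like this before launch is cheaper than discovering a `sequence_length_mismatch` after the job has been scheduled.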

## Failure Modes

- `world_size_not_divisible`: adjust nodes, GPUs per node, TP, PP, CP, or EP
  until the parallel shape divides the total GPU count evenly.
- `sequence_length_mismatch`: repack data or align model/dataset sequence
  length.
- `oom`: lower micro batch size, enable activation recomputation, increase
  parallelism, or switch to PEFT.
