# Model Optimization Context

Use this pack when generating stage code for any of:

- `optimize/modelopt/quantize`: PTQ quantization (FP8/NVFP4/INT)
- `optimize/modelopt/prune`: structured pruning
- `optimize/modelopt/distill`: teacher-student distillation

These wrap NVIDIA Model Optimizer through Megatron-Bridge. The wrappers stay generic; algorithm-specific knobs go in YAML under `args:`. **Don't fork upstream scripts into generated projects** unless the user explicitly needs custom algorithm code.

## Live Repo Verification

Read `../CATALOG.md`, `../ARTIFACTS.md`, and `../COMMANDS.md` before this pack. After the bundled references select a ModelOpt step, verify:

- Manifests: `src/nemotron/steps/optimize/modelopt//step.toml`
- Per-step README: `src/nemotron/steps/optimize/modelopt//README.md`
- Category README: `src/nemotron/steps/optimize/modelopt/README.md`
- Shared runner: `src/nemotron/steps/_runners/modelopt.py`
- Configs: `src/nemotron/steps/optimize/modelopt//config/default.yaml` plus `tiny.yaml`; quantize also ships `fp8.yaml` and `nvfp4.yaml`.

## Folder choice

Use `optimize/modelopt` as the umbrella: it is broader than `quantize` because distillation can be a quality-recovery or transfer stage, not only a compression stage. `compression` would be too narrow.

## Shared wrapper pattern

All three steps drive `torchrun` against an upstream script with three YAML sections:

```yaml
script:
  path: null            # null = use container default
  flag_style: hyphen    # quantize uses hyphen; prune & distill use underscore
args:
  # Upstream script args go here. Forwarded as -- .
torchrun:
  nproc_per_node: 8
extra_args: []          # literal escape hatch for new upstream flags
```

Generated `run.py` should:

1. Load YAML.
2. Resolve hydra-style CLI overrides.
3. Build a `torchrun ... ...` command.
4. Print the command.
5. `os.execvp` it.

Don't hardcode model-specific config in Python. Put ModelOpt controls under `args:`; keep Python a launcher only.
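The five launcher steps above can be sketched as follows. This is a minimal illustration, not the shared runner in `src/nemotron/steps/_runners/modelopt.py`: the helper names (`apply_override`, `build_command`, `main`), the `--key value` flag layout, the bare-flag handling of booleans, the top-level placement of `extra_args`, and the use of PyYAML are all assumptions. Plain PyYAML also does not resolve the `${oc.env:...}` interpolations that appear in the configs; those are OmegaConf-style, so a real launcher would likely load the file through OmegaConf instead.

```python
import json
import os
import shlex
import sys


def apply_override(cfg: dict, override: str) -> None:
    """Apply one dotted `key.path=value` override in place (hypothetical helper)."""
    dotted, _, raw = override.partition("=")
    *parents, leaf = dotted.split(".")
    node = cfg
    for key in parents:
        node = node.setdefault(key, {})
    try:
        node[leaf] = json.loads(raw)  # handles 512, true, null, [1, 2]
    except json.JSONDecodeError:
        node[leaf] = raw  # bare strings such as fp8


def build_command(cfg: dict, default_script: str) -> list[str]:
    """Turn the script/args/torchrun sections into an argv list."""
    script_cfg = cfg.get("script") or {}
    script = script_cfg.get("path") or default_script  # null = container default
    hyphen = script_cfg.get("flag_style", "hyphen") == "hyphen"

    nproc = (cfg.get("torchrun") or {}).get("nproc_per_node", 1)
    cmd = ["torchrun", f"--nproc_per_node={nproc}", script]

    for key, value in (cfg.get("args") or {}).items():
        if value is None or value is False:
            continue  # assumption: unset and false values emit no flag
        cmd.append("--" + (key.replace("_", "-") if hyphen else key))
        if value is True:
            continue  # assumption: true becomes a bare flag
        values = value if isinstance(value, list) else [value]
        cmd.extend(str(v) for v in values)

    # Assumption: extra_args is a top-level list appended verbatim.
    cmd.extend(str(a) for a in cfg.get("extra_args") or [])
    return cmd


def main(argv: list[str]) -> None:
    import yaml  # PyYAML, assumed to be available in the container

    config_path, *overrides = argv
    with open(config_path) as fh:
        cfg = yaml.safe_load(fh)
    for item in overrides:
        apply_override(cfg, item)
    cmd = build_command(
        cfg, default_script="/opt/Megatron-Bridge/examples/quantization/quantize.py"
    )
    print(shlex.join(cmd))
    os.execvp(cmd[0], cmd)  # replaces this process with torchrun


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1:])
```

Under these assumptions, an invocation such as `python run.py config/fp8.yaml args.calib_size=32` would print the resolved `torchrun` command and then replace the Python process with it. Keeping `build_command` a pure function over a dict leaves all model-specific values in YAML.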
## Quantization (`optimize/modelopt/quantize`)

Step.toml contract:

- Consumes: `checkpoint_hf` (required).
- Produces: `checkpoint_megatron` (export to HF afterward if needed).

Manifest defaults:

- `args.export_quant_cfg = "fp8"` (also: `int8_sq`, `fp8_blockwise`, `int4_awq`, `w4a8_awq`, `nvfp4` per the manifest description).
- `args.calib_size = 512`.
- `extra_args = []`.

Strategies (from step.toml):

- Hopper / H100 → start with `config/fp8.yaml`, `args.export_quant_cfg=fp8`.
- Blackwell / B200 → start with `config/nvfp4.yaml`, `args.export_quant_cfg=nvfp4`.
- Need HF output → run `/opt/Megatron-Bridge/examples/quantization/export.py` after.

Default config (`default.yaml`) shape (truncated):

```yaml
script:
  path: null
  flag_style: hyphen
args:
  hf_model_id: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16
  trust_remote_code: true
  export_quant_cfg: fp8
  megatron_save_path: ${oc.env:NEMO_RUN_DIR,${oc.env:PWD}/output}/optimize/quantize/
  calib_size: 512
torchrun:
  nproc_per_node: 8
extra_args: []
```

The pre-built `fp8.yaml` and `nvfp4.yaml` set `export_quant_cfg` and `calib_size` (NVFP4 typically uses ~2000 calibration samples vs ~512 for FP8).

Calibration guidance:

- Smoke tests → lower `calib_size`, keep the same command shape.
- FP8 PTQ flows commonly use ≈256–512 calibration samples.
- NVFP4 / QAD-oriented flows commonly use ~2000 calibration samples and longer context.

Output: Megatron distributed checkpoint. If the next stage needs HF format, add an explicit conversion via the upstream `export.py`.

## Pruning (`optimize/modelopt/prune`)

Step.toml contract:

- Consumes: `checkpoint_hf` (required).
- Produces: `checkpoint_hf` (pruned).

Manifest defaults:

- `args.prune_target_params = 6e9`.
- `args.prune_export_config` (manual architecture dict; leave unset to use search).
- `args.hparams_to_skip` (e.g. `num_attention_heads`).
- `extra_args = []`.

Strategies:

- Target search: set `args.prune_target_params`, leave `args.prune_export_config: null`.
- Fixed architecture: set `args.prune_export_config`, set `args.prune_target_params: null`.
- Layer count not divisible by PP size: use `args.num_layers_in_first_pipeline_stage` / `args.num_layers_in_last_pipeline_stage` for uneven PP.

Default config shape:

```yaml
script:
  path: null
  flag_style: underscore
args:
  hf_model_name_or_path: Qwen/Qwen3-8B
  output_hf_path: ${oc.env:NEMO_RUN_DIR,${oc.env:PWD}/output}/optimize/prune/pruned-hf
  pp_size: 2
  prune_target_params: 6e9
  prune_export_config: null
  hparams_to_skip: null
torchrun:
  nproc_per_node: 2
extra_args: []
```

Common fields for `prune_export_config`: `hidden_size`, `ffn_hidden_size`, `num_layers`, `num_attention_heads`, `num_query_groups`.

Output: pruned HF checkpoint. Distillation usually follows when quality needs recovery: chain `optimize/modelopt/distill` with the original BF16 as teacher and the pruned checkpoint as student.

## Distillation (`optimize/modelopt/distill`)

Step.toml contract:

- Consumes: `checkpoint_hf` (required, teacher + student) + `binidx` (optional, real-data runs).
- Produces: `checkpoint_megatron`.

Manifest defaults:

- `args.teacher_hf_path` / `args.student_hf_path`: required.
- `args.data_paths`: Megatron blend `[weight, prefix, ...]`.
- `args.use_mock_data = false`.
- `extra_args = []`.

Strategies:

- Quality recovery after pruning/quantization → teacher = original BF16/HF checkpoint, student = optimized checkpoint.
- Smoke test → `args.use_mock_data=true`, `args.seq_length=512`, `args.train_iters=100`, small `args.eval_iters`.
- Need HF output → set `args.hf_export_path` and `args.student_hf_model`, or convert a saved Megatron iteration later.
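The distillation contract and strategies above imply a few consistency rules that a generated launcher could check before starting `torchrun`. The sketch below is one hedged reading of those rules, not behavior of the upstream script: `check_distill_args` is a hypothetical helper, and treating `hf_export_path` and `student_hf_model` as an all-or-nothing pair is an interpretation of the "Need HF output" strategy.

```python
def check_distill_args(args: dict) -> list[str]:
    """Return readable problems; an empty list means the args look consistent."""
    problems = []

    # Teacher and student checkpoints are both required by the manifest.
    for key in ("teacher_hf_path", "student_hf_path"):
        if not args.get(key):
            problems.append(f"args.{key} is required")

    # Either mock data is enabled or a Megatron blend is supplied.
    data_paths = args.get("data_paths") or []
    if not args.get("use_mock_data") and not data_paths:
        problems.append("set args.data_paths or args.use_mock_data=true")
    elif len(data_paths) % 2 != 0:
        problems.append("args.data_paths must alternate weight, prefix")
    else:
        for weight, prefix in zip(data_paths[0::2], data_paths[1::2]):
            numeric = isinstance(weight, (int, float)) and not isinstance(weight, bool)
            if not numeric or not isinstance(prefix, str):
                problems.append(f"bad blend entry: {weight!r}, {prefix!r}")

    # Assumption: inline HF export needs both keys set together.
    export_keys = ("hf_export_path", "student_hf_model")
    if sum(bool(args.get(k)) for k in export_keys) == 1:
        problems.append(
            "HF export needs both args.hf_export_path and args.student_hf_model"
        )
    return problems


# Smoke-test settings taken from the strategy list; model IDs are placeholders.
smoke_args = {
    "teacher_hf_path": "Qwen/Qwen3-8B",
    "student_hf_path": "Qwen/Qwen3-4B",
    "use_mock_data": True,
    "seq_length": 512,
    "train_iters": 100,
}
print(check_distill_args(smoke_args))
```

Returning a list of messages instead of raising on the first problem lets the launcher print every issue at once before any GPU time is spent.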
Default config shape:

```yaml
script:
  path: null
  flag_style: underscore
args:
  teacher_hf_path: Qwen/Qwen3-8B
  student_hf_path: Qwen/Qwen3-4B
  tp_size: 8
  data_paths: null
  use_mock_data: false
  seq_length: 8192
  mbs: 1
  gbs: 768
  train_iters: 15000
  output_dir: ${oc.env:NEMO_RUN_DIR,${oc.env:PWD}/output}/optimize/distill/run
torchrun:
  nproc_per_node: 8
extra_args: []
```

Real-data runs expect Megatron bin/idx prefixes:

```yaml
args:
  data_paths:
    - 1.0
    - /data/tokenized/domain_text_document
```

Use `data_prep/pretrain_prep` first when data starts as HF/local text. `use_mock_data: true` is **plumbing only**, not a quality signal.

Distillation patterns:

- Pruned student: teacher = original BF16/HF, student = pruned HF.
- Quantized recovery: teacher = original BF16/HF, student = optimized checkpoint.
- Standalone small model: teacher = larger model, student = smaller HF model.

Output: Megatron checkpoint under `output_dir` (or HF if inline export configured).

## Pipeline placement

Common chains:

```
sft/automodel → optimize/modelopt/quantize → eval/model_eval
sft/automodel → optimize/modelopt/prune → optimize/modelopt/distill → eval/model_eval
data_prep/pretrain_prep → optimize/modelopt/distill → eval/model_eval
```

Artifact rules:

- Quantize: `checkpoint_hf` → `checkpoint_megatron`.
- Prune: `checkpoint_hf` → `checkpoint_hf`.
- Distill: `checkpoint_hf` (+ optional `binidx`) → `checkpoint_megatron`.
- Insert `convert/*` whenever crossing HF / Megatron format boundaries.

## Patterns to cite

- `convert-checkpoint-safety` in `../PATTERNS.md`: quantize / prune / distill from a clean checkpoint, not from training-state files.
- `eval-before-and-after-training` in `../PATTERNS.md`: measure quantized / pruned / distilled quality against the unoptimized baseline.
- `byob-benchmark-design` in `../PATTERNS.md`: calibration and quality claims should be scored on a representative held-out benchmark, not on calibration loss alone.
- `peft-adapter-merge-discipline` in `../PATTERNS.md`: when the optimization input is a LoRA-adapter checkpoint, merge first.

## Staleness checks

When this pack drifts:

- Refresh defaults from each algo's `step.toml` and `config/default.yaml`.
- Verify the upstream scripts still exist in the container image (`/opt/Megatron-Bridge/examples/quantization/quantize.py`, `/opt/Model-Optimizer/examples/megatron_bridge/prune_minitron.py`, `/opt/Model-Optimizer/examples/megatron_bridge/distill.py`).
- Check `flag_style` per algo: quantize uses **hyphen**, prune and distill examples use **underscore**.
- Ensure manifest `[reference]` URLs are intact.
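The script-existence and `flag_style` checks above are mechanical enough to script. The following is a minimal sketch meant to run inside the container image; the function names and the `UPSTREAM_SCRIPTS` table are hypothetical, and pairing each path with an algo is inferred from the file names listed above.

```python
import os

# algo -> (upstream script path, expected flag_style), as listed in this pack.
UPSTREAM_SCRIPTS = {
    "quantize": ("/opt/Megatron-Bridge/examples/quantization/quantize.py", "hyphen"),
    "prune": (
        "/opt/Model-Optimizer/examples/megatron_bridge/prune_minitron.py",
        "underscore",
    ),
    "distill": ("/opt/Model-Optimizer/examples/megatron_bridge/distill.py", "underscore"),
}


def find_missing_scripts(scripts=UPSTREAM_SCRIPTS, exists=os.path.isfile):
    """Return the algos whose upstream script is absent from the image."""
    return sorted(algo for algo, (path, _style) in scripts.items() if not exists(path))


def flag_style_mismatches(configs, scripts=UPSTREAM_SCRIPTS):
    """Return algos whose loaded config disagrees with the expected flag_style.

    `configs` maps algo name -> already-parsed YAML dict.
    """
    wrong = []
    for algo, cfg in configs.items():
        expected = scripts[algo][1]
        actual = (cfg.get("script") or {}).get("flag_style")
        if actual != expected:
            wrong.append(algo)
    return sorted(wrong)


if __name__ == "__main__":
    print("missing upstream scripts:", find_missing_scripts())
```

The `exists` parameter is injectable so the check can be exercised outside the container with a stub instead of the real filesystem.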