# NeMo-RL Alignment Context

Use this pack when generating stage code for any of the alignment steps:

- `rl/nemo_rl/dpo`
- `rl/nemo_rl/rlvr`
- `rl/nemo_rl/rlhf`

Wrappers are deliberately generic. Algorithm-specific behavior lives in YAML. Generated projects expose clear config files, not Python switches.

## Live Repo Verification

Read `../CATALOG.md`, `../ARTIFACTS.md`, and `../COMMANDS.md` before this pack. After the bundled references select an RL step, verify:

- Manifests: `src/nemotron/steps/rl/nemo_rl//step.toml`
- Per-step README: `src/nemotron/steps/rl/nemo_rl//README.md`
- Category README: `src/nemotron/steps/rl/nemo_rl/README.md`
- Shared runner: `src/nemotron/steps/_runners/nemo_rl.py`
- NeMo-Gym GRPO runner (when `env.should_use_nemo_gym=true`): `src/nemotron/steps/_runners/nemo_rl_grpo_nemo_gym.py`

## Shared runner shape

`exec_nemo_rl_example(default_config, upstream_script, description)` is the default — it forwards `--config` and Hydra-style overrides to a NeMo-RL example script via `os.execvp` (so the runner is replaced, not subprocessed).

For GRPO (RLVR/RLHF), use `exec_or_run_nemo_rl_grpo(...)`. It checks `env.should_use_nemo_gym` in the loaded config:

- `false` → exec the upstream NeMo-RL example script.
- `true` → call `nemo_rl_grpo_nemo_gym.run_nemo_gym_grpo(config_path, overrides)` directly.

The local config loader in `nemo_rl.py` supports a tiny `defaults: ` / `defaults: [a.yaml, b.yaml]` form for layering — it is **not** a full Hydra composition engine.

Generated stage code should:

1. Resolve config path and Hydra overrides.
2. Use `exec_nemo_rl_example` for DPO; `exec_or_run_nemo_rl_grpo` for RLVR/RLHF.
3. Let NeMo-RL own the training loop. Don't reimplement.

## DPO (`rl/nemo_rl/dpo`)

Use when training data is static preference pairs and there's no executable reward function.

Step.toml contract:

- Consumes: `training_jsonl` (prompt + chosen + rejected) + `checkpoint_megatron` (SFT policy).
- Produces: `checkpoint_megatron` (DPO-aligned).

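
For a DPO stage, steps 1 and 2 of the shared-runner list (resolve the config path and overrides, then hand off to the upstream script) can be sketched as below. This is only an illustration: `split_cli` and `build_upstream_argv` are hypothetical helpers, the config paths are made up, and launching the script through `sys.executable` is an assumption. The pack does not show how `exec_nemo_rl_example` builds its command internally.

```python
import sys
from pathlib import Path


def split_cli(argv, default_config):
    """Split CLI args into a config path and Hydra-style overrides (hypothetical helper)."""
    config = Path(default_config)
    overrides = []
    args = iter(argv)
    for arg in args:
        if arg == "--config":
            value = next(args, None)
            if value is None:
                raise ValueError("--config needs a path")
            config = Path(value)
        elif arg.startswith("--config="):
            config = Path(arg.split("=", 1)[1])
        else:
            # Anything else is passed through as an override,
            # e.g. dpo.reference_policy_kl_penalty=0.1
            overrides.append(arg)
    return config, overrides


def build_upstream_argv(upstream_script, config, overrides):
    """Build an argv that a runner could pass to os.execvp(argv[0], argv).

    Assumption: the upstream script is run with the current interpreter.
    """
    return [sys.executable, str(upstream_script), "--config", str(config), *overrides]


# Illustrative invocation; both YAML paths are placeholders.
config, overrides = split_cli(
    ["--config", "config/dpo.yaml", "dpo.reference_policy_kl_penalty=0.1"],
    default_config="config/default.yaml",
)
print(build_upstream_argv("examples/run_dpo.py", config, overrides)[1:])
```

Keeping the wrapper to argument plumbing like this leaves the training loop entirely to NeMo-RL, which is the intent of step 3.
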
Required JSONL shape:

```json
{"prompt": "...", "chosen": "...", "rejected": "..."}
```

Manifest defaults / key knobs:

- `dpo.reference_policy_kl_penalty = 0.05` (β).
- Policy checkpoint path, preference dataset path, learning rate, global batch size.

Strategy: when KL collapses or loss diverges, raise the KL penalty (0.1–0.3) or lower the learning rate.

Upstream entry: `https://github.com/NVIDIA-NeMo/RL/blob/main/examples/run_dpo.py`

## RLVR / GRPO (`rl/nemo_rl/rlvr`)

Use when reward can be verified programmatically (math final-answer matching, unit tests, exact/normalized matching, env success).

Step.toml contract:

- Consumes: `training_jsonl` (prompt + verifiable answer) + `checkpoint_megatron` (SFT policy).
- Produces: `checkpoint_megatron` (RLVR-aligned).

Manifest defaults / key knobs:

- `grpo.num_generations_per_prompt = 8` (group size).
- `grpo.normalize_rewards = true` (normalize within group before computing advantages).
- `env.should_use_nemo_gym = false` (set true to switch from upstream GRPO example to the in-repo NeMo-Gym runner).

Strategies:

- Low reward variance → raise `num_generations_per_prompt`, use leave-one-out baselines.
- For Super3-style data or resource-server rewards: start from `config/nemo_gym.yaml` and set `data.train.data_path`, `data.validation.data_path`, and `env.nemo_gym.config_paths`.

Upstream entry: `https://github.com/NVIDIA-NeMo/RL/blob/main/examples/run_grpo.py`

## RLHF (`rl/nemo_rl/rlhf`)

Use when reward is learned from a reward model (RLHF / GenRM-style judge), not directly verifiable. The current step uses NeMo-Gym for GenRM comparison rewards by default.

Step.toml contract:

- Consumes: `training_jsonl` (prompts) + `checkpoint_megatron` (SFT policy) + `checkpoint_hf` (reward / classifier model).
- Produces: `checkpoint_megatron` (RLHF-aligned policy).

Manifest defaults / key knobs:

- `grpo.num_generations_per_prompt = 8`.
- `env.nemo_gym.genrm_model.responses_api_models.vllm_model.model` — HF path or local path of the GenRM judge served by NeMo-Gym.

Strategies:

- Reward model saturates / reward hacking → increase KL penalty, lower learning rate, add reward clipping.
- For Super3-style data: keep `env.should_use_nemo_gym=true` and point `data.train.data_path` / `data.validation.data_path` at prepared NeMo-Gym JSONL.

In-repo entry path: `src/nemotron/recipes/super3/stage2_rl/stage3_rlhf`.

## Data prep (use `data_prep/rl_prep` upstream)

- For DPO, preserve `prompt`, `chosen`, `rejected`; validate non-empty chosen/rejected; keep train/validation splits deterministic.
- For RLVR, preserve verifier fields (`answer`, `tests`, `expected_output`, env metadata); materialize remote resources before cluster training.
- For RLHF, preserve prompt metadata required by the reward model.

## Pipeline placement

```
sdg/data_designer → data_prep/rl_prep → rl/nemo_rl/dpo
                    data_prep/rl_prep → rl/nemo_rl/rlvr
                    data_prep/rl_prep → rl/nemo_rl/rlhf
```

## Artifact rules

- All three RL steps consume `training_jsonl`.
- DPO and RLVR consume an SFT `checkpoint_megatron` policy.
- RLHF additionally consumes a reward model in `checkpoint_hf` format.
- All three produce `checkpoint_megatron`. Add `convert/megatron_to_hf` if the next consumer expects HF.

## Patterns to cite

- `rl-validate-rewards-before-scale` in `../PATTERNS.md` — sanity-check rewards before scaling rollouts.

## Staleness checks

When this pack drifts:

- Refresh the GRPO example URL in NeMo-RL upstream.
- Re-verify config fields against the installed NeMo-RL release/container.
- Confirm the NeMo-Gym field names (`env.should_use_nemo_gym`, `env.nemo_gym.config_paths`, the `genrm_model` path) still match the runner.
- Refresh manifest defaults from each algo's `step.toml`.
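
One way to make the field-name check in the list above repeatable is a small helper that reports which dotted keys no longer resolve in a loaded config. The sketch below is illustrative only: `missing_keys` is a hypothetical helper, the `config` dict is a made-up fragment, and loading the real YAML into a nested dict is assumed to happen elsewhere.

```python
def missing_keys(config, dotted_paths):
    """Return the dotted paths that do not resolve in a nested dict."""
    missing = []
    for path in dotted_paths:
        node = config
        for part in path.split("."):
            if not isinstance(node, dict) or part not in node:
                missing.append(path)
                break
            node = node[part]
    return missing


# Field names taken from this pack; extend the list per step as needed.
EXPECTED = [
    "env.should_use_nemo_gym",
    "env.nemo_gym.config_paths",
]

# Made-up config fragment in which `config_paths` has gone missing.
config = {"env": {"should_use_nemo_gym": True, "nemo_gym": {}}}
print(missing_keys(config, EXPECTED))  # → ['env.nemo_gym.config_paths']
```

A non-empty result is a signal to re-read the runner and the step's config before trusting the field names documented here.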