# NeMo Data Designer Synthetic Data Context

Use this pack when generating stage code for `sdg/data_designer`.

The step builds a Data Designer pipeline from declarative YAML. Customization belongs in config (model alias, columns, seed data, output projection); keep `step.py` generic.

## Live Repo Verification

Read `../CATALOG.md`, `../ARTIFACTS.md`, and `../COMMANDS.md` before this pack. After the bundled references select `sdg/data_designer`, verify:

- Step manifest: `src/nemotron/steps/sdg/data_designer/step.toml`
- Per-step README: `src/nemotron/steps/sdg/data_designer/README.md`
- Default cfg (SFT): `src/nemotron/steps/sdg/data_designer/config/default.yaml`
- Smoke cfg: `src/nemotron/steps/sdg/data_designer/config/tiny.yaml`
- Preference cfg (DPO): `src/nemotron/steps/sdg/data_designer/config/rl_pref.yaml`
- Tool-call SFT cfg: `src/nemotron/steps/sdg/data_designer/config/customer_support_tools.yaml`
- Seed data: `src/nemotron/steps/sdg/data_designer/data/<*>.jsonl`

## Step.toml contract

- Consumes: `training_jsonl` (optional; high-quality seed records that anchor generation).
- Produces: `synthetic_jsonl` (chat or preference, depending on the chosen pipeline).

Manifest defaults:

- `num_records = 1000`.
- `seed_dataset.path`: path to the seed JSONL referenced by `seed`-typed columns.

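
Before pointing `seed_dataset.path` at a new file, it can help to confirm which fields every seed record actually carries. The following is a minimal sketch, assuming the seed file is plain JSONL with one JSON object per line; `seed_columns` is a hypothetical helper written for illustration and is not part of the step or the Data Designer SDK.

```python
from __future__ import annotations

import json
from pathlib import Path


def seed_columns(path: str | Path) -> set[str]:
    """Return the keys shared by every record in a seed JSONL file.

    Hypothetical helper: assumes one JSON object per non-blank line.
    """
    shared: set[str] | None = None
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines between records
            record = json.loads(line)
            if not isinstance(record, dict):
                raise ValueError(f"line {lineno}: expected a JSON object")
            keys = set(record)
            # Keep only the keys that every record seen so far has in common.
            shared = keys if shared is None else shared & keys
    if shared is None:
        raise ValueError("seed file contains no records")
    return shared
```

A key missing from the returned set is absent from at least one record, so a prompt that references it may not render for every seed row.
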
## Config shape

The default SFT config uses env-var defaulting for the output dir:

```yaml
output_dir: ${oc.env:SDG_OUTPUT_DIR,${oc.env:NEMO_RUN_DIR,${oc.env:PWD}/output}/sdg}
output_path: ${output_dir}/sft.jsonl
num_records: 1000

seed_dataset:
  path: ${oc.env:PWD}/src/nemotron/steps/sdg/data_designer/data/sft_topic_seeds.jsonl
  strategy: shuffle  # shuffle | ordered
  sampler_with_replacement: false

models:
  - alias: nvidia-text
    model: nvidia/nemotron-3-nano-30b-a3b
    provider: nvidia  # cloud (NVIDIA_API_KEY) | openai (vLLM/OpenAI-compatible local)
    skip_health_check: true
    inference_parameters:
      temperature: 0.8
      top_p: 1.0
      max_tokens: 1024

columns: []

output_projection:
  type: openai_messages
```

Use `--preview` for prompt/column iteration before generating at scale via `client.create()`.

Seed columns (e.g. `topic`) are added automatically when `seed_dataset` is set; reference them in prompts as `{{ topic }}` without declaring them explicitly.

## Output projections

The repo supports three projection patterns, each shipped as its own config:

| Projection | Config | Output JSONL |
|---|---|---|
| `openai_messages` | `default.yaml` | `{"messages": [{"role": ..., "content": ...}]}` |
| `dpo_preference` | `rl_pref.yaml` | `{"prompt": "...", "chosen": "...", "rejected": "..."}` |
| `structured_messages` | `customer_support_tools.yaml` | `{"messages": [...], "tools": [...]}` |

Use `structured_messages` for tool-calling SFT data.

## Customer-support tool calls (`customer_support_tools.yaml`)

Generates multi-turn ecommerce support conversations with:

- OpenAI-style `messages`.
- A `tools` array.
- Exactly one assistant `tool_calls` message.
- Exactly one matching `tool` response.
- Final assistant answer grounded in the tool result.

Validation checks for tool-call data (run before SFT):

- Every assistant tool call has a `tool` response with a matching `tool_call_id`.
- Tool arguments are JSON strings, **not** nested objects (OpenAI compatibility).
- The final assistant answer reflects the tool response and any policy hint.
- No markdown in fields meant as customer-support chat content.

## Preference data (`rl_pref.yaml`)

Emits `{"prompt", "chosen", "rejected"}`. Flows into:

```
sdg/data_designer → data_prep/rl_prep → rl/nemo_rl/dpo
```

## SFT data (`default.yaml`)

Emits `{"messages": [...]}`. Flows into:

```
sdg/data_designer → data_prep/sft_packing → sft/megatron_bridge  (Megatron-Bridge SFT)
sdg/data_designer → sft/automodel  (AutoModel SFT, no packing)
```

Use `data_prep/sft_packing` only for Megatron-Bridge SFT. AutoModel reads JSONL directly.

## Patterns to cite

- `sdg-pipeline-versioning` in `../PATTERNS.md`: version SDG configs alongside the data they produce.
- `data-quality-before-quantity` in `../PATTERNS.md`: small + good beats large + noisy for synthetic data.

## Staleness checks

When this pack drifts:

- Verify projection names (`openai_messages`, `dpo_preference`, `structured_messages`) still match the upstream Data Designer SDK.
- Refresh manifest defaults (`num_records`, `seed_dataset.path`) from `step.toml`.
- Refresh model alias / provider examples from the `models:` block in the shipped configs.
- Keep `step.py` free of customer-support-specific logic.
- Add a smoke/preview config for any new synthetic recipe.
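
## Sketch: checking tool-call records before SFT

The structural validation checks listed for `customer_support_tools.yaml` output could be automated roughly as follows. This is a minimal sketch, not the repo's validator: `validate_tool_call_record` is a hypothetical name, and the field layout (`tool_calls[].id`, `tool_calls[].function.arguments`, `tool_call_id` on `tool` messages) is assumed to follow the OpenAI chat format. The grounding and no-markdown checks need content-level review and are not covered here.

```python
from __future__ import annotations

import json


def validate_tool_call_record(record: dict) -> list[str]:
    """Return a list of structural problems found in one record (empty if none).

    Hypothetical helper; assumes OpenAI-style `messages` and `tools` fields.
    """
    problems: list[str] = []
    messages = record.get("messages") or []
    if not record.get("tools"):
        problems.append("missing or empty `tools` array")

    call_ids: list[str | None] = []
    tool_call_messages = 0
    for msg in messages:
        calls = msg.get("tool_calls") or []
        if msg.get("role") != "assistant" or not calls:
            continue
        tool_call_messages += 1
        for call in calls:
            call_id = call.get("id")
            call_ids.append(call_id)
            args = (call.get("function") or {}).get("arguments")
            # Arguments must be a JSON-encoded string, not a nested object.
            if not isinstance(args, str):
                problems.append(f"tool call {call_id}: arguments must be a JSON string")
                continue
            try:
                json.loads(args)
            except json.JSONDecodeError:
                problems.append(f"tool call {call_id}: arguments are not valid JSON")

    if tool_call_messages != 1:
        problems.append(
            f"expected exactly one assistant `tool_calls` message, found {tool_call_messages}"
        )

    response_ids = [m.get("tool_call_id") for m in messages if m.get("role") == "tool"]
    for call_id in call_ids:
        if call_id not in response_ids:
            problems.append(f"tool call {call_id}: no matching `tool` response")
    for response_id in response_ids:
        if response_id not in call_ids:
            problems.append(f"tool response {response_id}: no matching tool call")

    # The conversation should end with a plain assistant answer.
    last = messages[-1] if messages else {}
    if last.get("role") != "assistant" or last.get("tool_calls"):
        problems.append("last message must be a plain assistant answer")
    return problems
```

Running this over each line of the generated JSONL and failing on any non-empty result keeps malformed records out of the downstream SFT step.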