# Evaluation Context: Container-Backed Benchmarks

Use this pack when configuring `eval/model_eval` for NeMo Evaluator Launcher tasks that are owned by an evaluator container, including sovereign, multilingual, custom-language, standard English, tool, or agent benchmarks.

## Launcher Contract

Evaluator Launcher task entries can include:

- `name`: exact task id from `nemo-evaluator-launcher ls tasks` or `nemo-evaluator-launcher ls task `.
- `container`: evaluation image that owns the task metadata.
- `endpoint_type`: `chat`, `completions`, or logprob-compatible completions.

The task container is the source of truth for benchmark metadata. Do not duplicate every task definition in Nemotron code. Do not construct task names as `.` unless the launcher or task container lists that exact dotted id.

## Endpoint Selection

Ask for model id, endpoint URL, API key environment variable name, endpoint capability, target language, benchmark container image, and smoke versus full run. Use `deployment.type=none` for hosted endpoints.

## Benchmark Selection

Pick tasks by target language and endpoint capability, not by model origin. A sovereign or region-specific model can still run standard English benchmarks when the user wants English capability measurement.

Standard English smoke task ids:

- `adlr_mmlu` with a completions endpoint.
- `hellaswag` with a completions endpoint that supports logprobs, plus the evaluated model tokenizer.
- `mmlu_instruct` with a chat endpoint.

Sovereign/Indic examples:

- `sovereign.gsm8k_indic_hi`
- `sovereign.mmlu_indic_hi`
- `sovereign.indicgenbench_flores_in_hi`

Indic language codes include `hi`, `bn`, `gu`, `kn`, `mr`, `ml`, `or`, `pa`, `ta`, and `te`. Use `_completions` variants for completions-only endpoints and `_logprob` variants only after verifying logprob support.
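
The capability-based choice among the standard English smoke task ids can be sketched as a small lookup. This is a minimal illustration, not part of the launcher: the capability labels (`chat`, `completions`, `completions_logprobs`) and the helper names are hypothetical, and only the task ids and language codes come from the text above. Any selected id should still be checked against `nemo-evaluator-launcher ls tasks`.

```python
# Hypothetical capability labels mapped to the smoke task ids named above.
SMOKE_TASK_BY_CAPABILITY = {
    "chat": "mmlu_instruct",
    "completions": "adlr_mmlu",
    # Requires logprob support and the evaluated model's tokenizer.
    "completions_logprobs": "hellaswag",
}

# Indic language codes listed in this pack.
INDIC_LANGUAGE_CODES = frozenset(
    {"hi", "bn", "gu", "kn", "mr", "ml", "or", "pa", "ta", "te"}
)


def pick_english_smoke_task(capability: str) -> str:
    """Return the standard English smoke task id for an endpoint capability."""
    try:
        return SMOKE_TASK_BY_CAPABILITY[capability]
    except KeyError:
        known = ", ".join(sorted(SMOKE_TASK_BY_CAPABILITY))
        raise ValueError(
            f"unknown capability {capability!r}; expected one of: {known}"
        ) from None


def is_listed_indic_code(code: str) -> bool:
    """Check a language code against the Indic codes listed in this pack."""
    return code.lower() in INDIC_LANGUAGE_CODES
```

Keeping the mapping keyed on endpoint capability, rather than on the model's origin, mirrors the selection rule stated above.
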
## Metrics

- GSM8K and chat MCQ tasks: pass-at-1 correctness metric (`pass` at sample 1)
- MMLU-style logprob tasks: `acc`
- ARC/BoolQ logprob tasks: `acc_norm`
- FLORES translation: `chrf`
- CrossSum summarization: `rouge_l`
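
When checking results, the metric list above can be kept as a lookup so that a report reads the expected score for each task family. This is a sketch under assumptions: the family labels used as keys and the helper name are hypothetical, and the exact metric keys in a given container's output may differ from the names listed here.

```python
# Keys are hypothetical family labels; values are the metric names listed above.
METRIC_BY_TASK_FAMILY = {
    "gsm8k_or_chat_mcq": "pass",  # pass-at-1: `pass` at sample 1
    "mmlu_style_logprob": "acc",
    "arc_boolq_logprob": "acc_norm",
    "flores_translation": "chrf",
    "crosssum_summarization": "rouge_l",
}


def expected_metric(task_family: str) -> str:
    """Return the metric name to read for a task family label."""
    if task_family not in METRIC_BY_TASK_FAMILY:
        raise ValueError(f"no metric recorded for task family {task_family!r}")
    return METRIC_BY_TASK_FAMILY[task_family]
```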