# Standard Evaluation Context

Use this pack when configuring `eval/model_eval` benchmark selection and endpoint behavior.

## Product Contract

- Evaluation consumes an already trained/deployed checkpoint or endpoint and produces benchmark results.
- Do not treat tiny or limited-sample runs as quality evidence. They only prove wiring.
- Use checked-in step configs and the evaluator runner before inventing a new launcher.

## Benchmark Selection

| Goal | Benchmark shape |
|---|---|
| Instruction following | chat endpoint, deterministic generation, tasks like IFEval |
| Knowledge/reasoning | chat endpoint, larger `max_new_tokens`, model-card decoding defaults |
| Multiple-choice logprob | completions endpoint with logprobs and tokenizer |
| Regression smoke | tiny subset or small benchmark list, explicitly marked non-comparable |

## Required Runtime Inputs

- `evaluation.tasks`: concrete NeMo Evaluator Launcher task entries.
- `target.api_endpoint.url`: OpenAI-compatible endpoint URL when `deployment.type=none`.
- `target.api_endpoint.type`: `chat` or `completions`.
- Tokenizer path/model handle for logprob benchmarks.
- API key environment variable when the endpoint requires auth.

## Rules

- Match endpoint type to benchmark type. Chat tasks should not be forced through logprob completions, and logprob tasks need completions/logprobs support.
- Keep generation budgets explicit for reasoning tasks.
- Preserve result directories per run so before/after comparisons do not overwrite each other.

## Failure Modes

- `wrong_endpoint_type`: switch chat/completions to match the task.
- `missing_tokenizer_for_logprobs`: provide tokenizer path or choose chat tasks.
- `no_endpoint`: deploy first or provide an existing endpoint URL.
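
The rules and failure modes in this pack can be sketched as a small pre-flight check that runs before launching an evaluation. This is a minimal illustration, not part of the NeMo Evaluator Launcher: the function name `check_eval_config`, the nested-dict layout, the `name` field on task entries, the top-level `tokenizer` key, and the caller-supplied `logprob_tasks` set are all assumptions made for the example. Only the dotted config keys and the failure-mode names come from this pack.

```python
from typing import Any


def check_eval_config(cfg: dict[str, Any], logprob_tasks: set[str]) -> list[str]:
    """Return the failure-mode names from this pack that apply to cfg.

    cfg is assumed to be a nested dict mirroring the dotted keys above.
    logprob_tasks is a hypothetical, caller-maintained set of task names
    that need a completions endpoint with logprobs.
    """
    failures: list[str] = []
    endpoint = cfg.get("target", {}).get("api_endpoint", {})
    deployment_type = cfg.get("deployment", {}).get("type")
    # Assumption: each task entry is a dict carrying a "name" field.
    tasks = [t["name"] for t in cfg.get("evaluation", {}).get("tasks", [])]

    # With deployment.type=none, an existing endpoint URL must be supplied.
    if deployment_type == "none" and not endpoint.get("url"):
        failures.append("no_endpoint")

    needs_logprobs = [t for t in tasks if t in logprob_tasks]
    chat_tasks = [t for t in tasks if t not in logprob_tasks]
    endpoint_type = endpoint.get("type")

    # Logprob tasks need a completions endpoint; everything else is
    # treated as a chat task in this sketch.
    if (needs_logprobs and endpoint_type != "completions") or (
        chat_tasks and endpoint_type != "chat"
    ):
        failures.append("wrong_endpoint_type")

    # Hypothetical "tokenizer" key standing in for the tokenizer path/handle.
    if needs_logprobs and not cfg.get("tokenizer"):
        failures.append("missing_tokenizer_for_logprobs")

    return failures
```

Calling the function with a config that lists a logprob task against a `chat` endpoint, with no URL and no tokenizer, would report all three failure modes. An empty list means only that these wiring checks passed; it says nothing about benchmark quality.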