---
name: nemo-evaluator-plugin
description: Use when working on the Evaluator plugin CLI, jobs, SDK-backed specs, metric types, or plugin-owned Evaluator skills.
metadata:
  owner: nemo-platform
  maturity: active
license: Apache-2.0
---

# Evaluator Plugin

Use this skill for evaluation tasks against a running NeMo Platform server. The plugin-backed CLI interface is `nemo evaluator`; the legacy generated `nemo evaluation` API command group is not the target surface for new guidance.

## CLI Interface

### Prerequisites

- All commands in this file assume that the shell's working directory is the root of the Nvidia-NeMo/nemo-platform repo.
- Activate the Python virtual environment before invoking the `nemo` CLI: `source .venv/bin/activate`

Check plugin status from the CLI:

```bash
nemo evaluator info
```

## Metric Types

### Explore Available Metrics

To view available metric names, run:

```bash
nemo evaluator metric-types
```

To view a specific metric schema, pass a metric name from the `metric_types` list above:

```bash
nemo evaluator metric-types <metric-name>
```

Inspect all the registered metric schema contracts:

```bash
nemo evaluator evaluate explain
```

> Note: use `nemo evaluator evaluate explain` as the source of truth for the current plugin input schema. It returns a large JSON schema response, so strongly prefer `nemo evaluator metric-types` when you only need metric names and their corresponding schemas.

## Evaluation Spec

An evaluation spec is the payload passed to the CLI as input to run an evaluation.

At a high level, a spec describes:

- `metrics`: bundled Evaluator SDK metric configurations
- `dataset`: inline rows to evaluate, or a platform `FilesetRef` that contains the dataset
- `params`: optional Evaluator SDK execution parameters
- `target`: optional model or agent target for online evaluation

See the LLM-judge spec example at [assets/specs/llm_as_judge.json](./assets/specs/llm_as_judge.json).
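
As a quick sanity check before running a spec, the top-level keys listed above can be inspected with a few lines of Python. This is a minimal sketch, not part of the plugin: the helper names are hypothetical, and treating `metrics` and `dataset` as required is an assumption based on the other two keys being described as optional. Use `nemo evaluator evaluate explain` for the authoritative schema.

```python
import json
from pathlib import Path

# Top-level keys named in this section. Treating "metrics" and "dataset" as
# required is an assumption of this sketch, not a documented rule.
REQUIRED_KEYS = ("metrics", "dataset")
OPTIONAL_KEYS = ("params", "target")


def summarize_spec(spec: dict) -> dict:
    """Report which of the documented top-level keys a spec contains."""
    known = set(REQUIRED_KEYS) | set(OPTIONAL_KEYS)
    return {
        "missing": [key for key in REQUIRED_KEYS if key not in spec],
        "optional_present": [key for key in OPTIONAL_KEYS if key in spec],
        "unrecognized": sorted(set(spec) - known),
    }


def summarize_spec_file(path: str) -> dict:
    """Load a JSON spec file from disk and summarize its top-level keys."""
    return summarize_spec(json.loads(Path(path).read_text(encoding="utf-8")))
```

For example, `summarize_spec_file("skills/nemo-evaluator-plugin/assets/specs/llm_as_judge.json")` would show whether that spec sets the optional `params` or `target` keys.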

### Metric Bundle Payloads

The checked-in [spec examples](./assets/specs) use bundled SDK metrics. The fields under `metrics[*].payload` are generated by `bundle_metric(metric, CloudpickleMetricBundlePackager())`.

To see the pattern for configuring a pre-defined SDK metric, for example `ExactMatchMetric`, and converting it into bundled metric JSON, inspect `build_metric_bundle_example()` in [generate_example_specs.py](./scripts/generate_example_specs.py) and run:

```bash
uv run --frozen python skills/nemo-evaluator-plugin/scripts/generate_example_specs.py
```

## Run Evaluations

### Run Using File Spec Reference

When you use the `nemo evaluator evaluate run` command, results are saved to local temporary directories and a link to them is printed to stdout.
Prefer the `--spec-file` named argument over inline shell JSON because metric bundles include serialized payloads.
Examples of various specs are provided in the [assets/specs](./assets/specs/) directory.
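
When a spec is assembled in Python rather than checked in, one way to follow the `--spec-file` guidance is to serialize it to a JSON file and pass the path as a separate argument, which avoids shell quoting of the serialized payloads. The sketch below is illustrative only: `write_spec_file` and `run_command` are hypothetical helper names, and the only CLI invocation it builds is the `nemo evaluator evaluate run --spec-file` form shown in this section.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Optional


def write_spec_file(spec: dict, directory: Optional[str] = None) -> Path:
    """Serialize a spec to a JSON file and return the file's path."""
    # delete=False keeps the file on disk after the handle is closed,
    # so the caller is responsible for removing it later.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".json", dir=directory, delete=False, encoding="utf-8"
    ) as handle:
        json.dump(spec, handle)
    return Path(handle.name)


def run_command(spec_path: Path) -> List[str]:
    """Build the argv for `nemo evaluator evaluate run --spec-file <path>`."""
    # An argv list passed to subprocess.run() bypasses the shell entirely,
    # so nothing in the spec needs to be escaped.
    return ["nemo", "evaluator", "evaluate", "run", "--spec-file", str(spec_path)]
```

The returned list can be handed to `subprocess.run(..., check=True)` once the virtual environment from the prerequisites is active.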

#### Evaluate using `exact-match` metric

See the spec example at [assets/specs/exact_match_metric.json](./assets/specs/exact_match_metric.json).

```bash
nemo evaluator evaluate run --spec-file skills/nemo-evaluator-plugin/assets/specs/exact_match_metric.json
```

#### Evaluate using a benchmark metric set

```bash
nemo evaluator evaluate run --spec-file skills/nemo-evaluator-plugin/assets/specs/exact_match_benchmark.json
```

#### Evaluate using `LLM-Judge` metric

Uses an LLM to score responses. See the spec example at [assets/specs/llm_as_judge.json](./assets/specs/llm_as_judge.json).

```bash
nemo evaluator evaluate run --spec-file skills/nemo-evaluator-plugin/assets/specs/llm_as_judge.json
```

### Run Evaluation As A Durable Job

Use the `nemo evaluator evaluate submit` command to create a durable evaluation job. The command returns a job handler object instead of the evaluation result.

```bash
nemo evaluator evaluate submit \
  --spec-file skills/nemo-evaluator-plugin/assets/specs/exact_match_metric.json
```

The submit response includes the generated job's `name` field, for example `nemo-evaluator-zlhn1ecd`. Wait for the job to complete, then list and download the job results.

```bash
nemo jobs get-status <job-name>
nemo jobs get <job-name>
nemo jobs results list <job-name>
nemo jobs results download aggregate-scores --job <job-name> --output-file aggregate-scores.json
nemo jobs results download row-scores --job <job-name> --output-file row-scores.jsonl
```
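
If you script this sequence, the commands above can be generated from the job name instead of being retyped. The following is a rough sketch under stated assumptions: `job_commands` and `run_job_commands` are hypothetical helpers, the output file names are the ones used above, and the script does not itself wait for completion, so call it only after the job has finished.

```python
import subprocess
from typing import List


def job_commands(job_name: str) -> List[List[str]]:
    """Return argv lists for the job commands shown above, in order."""
    return [
        ["nemo", "jobs", "get-status", job_name],
        ["nemo", "jobs", "get", job_name],
        ["nemo", "jobs", "results", "list", job_name],
        ["nemo", "jobs", "results", "download", "aggregate-scores",
         "--job", job_name, "--output-file", "aggregate-scores.json"],
        ["nemo", "jobs", "results", "download", "row-scores",
         "--job", job_name, "--output-file", "row-scores.jsonl"],
    ]


def run_job_commands(job_name: str) -> None:
    """Run each command in order; check=True raises on a non-zero exit code."""
    for argv in job_commands(job_name):
        subprocess.run(argv, check=True)
```

Keeping the argv construction separate from execution makes it easy to print or review the commands before running them.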

## Python SDK Interface

The Evaluator Python SDK client is exposed as the `evaluator` attribute on a `NeMoPlatform` instance:

```python
from nemo_platform import NeMoPlatform

platform_client = NeMoPlatform(base_url="http://localhost:8080")
status = platform_client.evaluator.plugin_status()
```

See examples of using the plugin SDK interface in [plugin_sdk_examples.py](./assets/examples/plugin_sdk_examples.py).

## Security

Do not print secrets to stdout, because stdout can be collected as logs.
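
One way to reduce the risk when debugging specs or responses is to mask likely-secret fields before printing. This is a minimal sketch, assuming secrets live under dictionary keys whose names contain hints such as `key` or `token`; the hint list and the `redact` helper are hypothetical and should be adapted to the fields your environment actually uses.

```python
# Hypothetical substrings that mark a field name as sensitive.
SENSITIVE_HINTS = ("key", "token", "secret", "password")
MASK = "***REDACTED***"


def _is_sensitive(name) -> bool:
    """Return True when a field name contains one of the sensitive hints."""
    lowered = str(name).lower()
    return any(hint in lowered for hint in SENSITIVE_HINTS)


def redact(value):
    """Return a copy of value with likely-secret dictionary fields masked."""
    if isinstance(value, dict):
        return {
            key: MASK if _is_sensitive(key) else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    # Scalars are returned unchanged; only keyed fields are masked.
    return value
```

Substring matching deliberately over-redacts (for example, any field whose name contains `key`), which is usually the safer failure mode for log output. It does not catch secrets embedded in free-form strings.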

## Additional Resources

For LLM-judge setup notes, see [LLM Judge Notes](references/llm-judge.md).

For evaluator API key auth, see [Evaluator API Auth](references/api-auth.md).

For local and cluster troubleshooting, see [Evaluation Troubleshooting](references/troubleshooting.md).