# Skills Eval Benchmark

Generated: 2026-05-21 03:35:00 UTC
Specs: 2

---

## Skill Eval — `skills/vss-deploy-detection-tracking-2d/evals/deploy-evals.json`

Head: `6b5178ca` · 1 platform · spec `2ad18b80`
First started: `2026-05-21T02:29:39Z` · Last finished: `2026-05-21T03:00:48Z` · Total: `31m 9s`

| Platform | Step | Query | Result | Reward | Duration | Turns | Prompt tok | Cached tok | Trace |
|---|---|---|---|---|---|---|---|---|---|
| L40S | step-1 | Deploy rtvi-cv. | ✅ 1.0 (3/3) | 1.0 | 27m 13s | 200 | 12.1M | 11.9M | trace |
| L40S | step-2 | Stop rtvi-cv and clean up the deployment. | ✅ 1.0 (4/4) | 1.0 | 2m 43s | 25 | 612.9k | 597.6k | trace |

This spec exercises the deploy and teardown flow of the `vss-deploy-detection-tracking-2d` skill directly. There is no `/vss-deploy-profile` prerequisite: the skill launches its own `rtvicv-perception-docker` container via `docker run`. In step 1 (deploy), the agent acknowledged the use-case dimension and deployed without fabricating unsupported use-case names. The deploy took 200 turns and 27m 13s on a cold box, including NGC authentication and the image pull. In step 2 (teardown), the agent confirmed that the container had stopped, and the NGC credentials were preserved after teardown.
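
The timing figures for this spec can be cross-checked with a few lines of Python. The sketch below recomputes the total from the two timestamps in the header and compares it with the sum of the per-step durations from the table. The helper names (`wall_clock`, `fmt`) are illustrative and are not part of the skills-eval tooling.

```python
from datetime import datetime, timedelta

# Timestamp layout used in the report header (UTC, trailing "Z").
TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def wall_clock(first_started: str, last_finished: str) -> timedelta:
    """Elapsed time between the two header timestamps."""
    start = datetime.strptime(first_started, TS_FORMAT)
    end = datetime.strptime(last_finished, TS_FORMAT)
    return end - start


def fmt(delta: timedelta) -> str:
    """Render a timedelta in the report's 'Xm Ys' style."""
    minutes, seconds = divmod(int(delta.total_seconds()), 60)
    return f"{minutes}m {seconds}s"


# Values copied from the header line and the table above.
total = wall_clock("2026-05-21T02:29:39Z", "2026-05-21T03:00:48Z")
steps = timedelta(minutes=27, seconds=13) + timedelta(minutes=2, seconds=43)
unattributed = total - steps

print(fmt(total))         # 31m 9s
print(fmt(steps))         # 29m 56s
print(fmt(unattributed))  # 1m 13s
```

The recomputed total matches the reported `31m 9s`. The remaining 1m 13s is time not attributed to either step in the table; the report does not say what it covers.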

<sub>Generated by the skills-eval agent. Trial datasets/results live in the workflow artifact at `skills-eval-results-pr-511-26201715368.tar.gz`.</sub>

---

## Skill Eval — `skills/vss-deploy-detection-tracking-2d/evals/usage-evals.json`

Head: `6b5178ca` · 1 platform · spec `4af9517e`
First started: `2026-05-21T03:01:16Z` · Last finished: `2026-05-21T03:35:00Z` · Total: `33m 44s`

| Platform | Step | Query | Result | Reward | Duration | Turns | Prompt tok | Cached tok | Trace |
|---|---|---|---|---|---|---|---|---|---|
| L40S | step-1 | Add a stream `file:///...sample_1080p_h264.mp4` with id `cam_entrance` to rtvi-cv. | ✅ 1.0 (4/4) | 1.0 | 7m 46s | 82 | 3.3M | 3.2M | trace |
| L40S | step-2 | Run a full health check on rtvi-cv — verify liveness, readiness, and startup probes. | ✅ 1.0 (4/4) | 1.0 | 2m 42s | 15 | 297.2k | 262.4k | trace |
| L40S | step-3 | What is the FPS on all streams? Get rtvi-cv metrics. | ✅ 1.0 (3/3) | 1.0 | 4m 5s | 21 | 488.5k | 471.1k | trace |
| L40S | step-4 | List all active streams in rtvi-cv. | ✅ 1.0 (3/3) | 1.0 | 2m 40s | 18 | 369.6k | 359.3k | trace |
| L40S | step-5 | Remove a stream from rtvi-cv. | ✅ 1.0 (4/4) | 1.0 | 5m 29s | 19 | 405.8k | 395.3k | trace |

All 5 usage steps passed (18/18) against a live `rtvicv-perception-docker` container at `http://localhost:9000/api/v1`: stream add, health probes (`/live`, `/ready`, `/startup`), metrics, stream list, and stream remove. The agent followed the spec's mandatory container-alive precheck, verifying container health before executing each query. No response contained fabricated credentials.
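
For readers who want to repeat the health-probe step by hand, here is a minimal sketch using only the Python standard library. It makes two assumptions the report does not confirm: that the three probes are served directly under the base URL (for example `http://localhost:9000/api/v1/live`), and that a healthy probe answers with HTTP 200. The function names are hypothetical and are not part of the skill.

```python
from typing import Callable, Dict
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

BASE_URL = "http://localhost:9000/api/v1"  # base URL as given in the report
PROBES = ("live", "ready", "startup")      # probe names as given in the report


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET request, or 0 if no response arrived."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:
        # The server answered, but with an error status (4xx/5xx).
        return err.code
    except (URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return 0


def check_probes(
    base_url: str = BASE_URL,
    fetch: Callable[[str], int] = http_status,
) -> Dict[str, bool]:
    """Map each probe name to True when it returns HTTP 200.

    ASSUMPTION: probes live at <base_url>/<name> and 200 means healthy.
    """
    base = base_url.rstrip("/")
    return {name: fetch(f"{base}/{name}") == 200 for name in PROBES}
```

Calling `check_probes()` with no arguments queries the live container, and `all(check_probes().values())` is true only when every probe is healthy. The `fetch` parameter is injectable so the logic can be exercised without a running container.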


---
