# Curator Data Acquisition Context

Use this pack for `curate/nemo_curator` when the user needs to materialize raw
text before downstream curation, translation, pretraining prep, or SFT prep.

## Product Contract

- The current step is a lightweight text curation wrapper. It reads local JSONL
  or an optional Hugging Face snapshot, applies configured filters, and writes
  JSONL.
- Do not implement a full Common Crawl downloader unless the repo step cannot
  satisfy the user request and the user approves Explorer-mode code.
- Keep Curator reader/writer stages as the default I/O path.

## Local JSONL Path

Use this when the user already has files:

- Set `dataset=null`.
- Set `input_glob` to the JSONL file or shard glob visible inside the runtime.
- Set `output_dir` to a new directory.
- Start permissive: `language_codes=[]`, `domains=[]`, `quality_filters={}`.
- Add filters only after reader/writer output is verified.
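
The settings above can be sketched as a plain Python mapping plus a preflight check that the glob resolves in the current runtime. This is a minimal sketch: the flat key layout is an assumption (the step's actual YAML schema may nest or name these fields differently), and the helper names `permissive_local_config` and `matched_inputs` are hypothetical.

```python
import glob


def permissive_local_config(input_glob: str, output_dir: str) -> dict:
    """Build a filter-free config for a first reader/writer smoke run.

    The key names mirror the settings listed above; the flat layout is an
    assumption, not the step's confirmed schema.
    """
    return {
        "dataset": None,        # no Hugging Face snapshot; read local files only
        "input_glob": input_glob,
        "output_dir": output_dir,
        "language_codes": [],   # permissive: no language gate
        "domains": [],          # permissive: no domain gate
        "quality_filters": {},  # permissive: no quality gates
    }


def matched_inputs(config: dict) -> list[str]:
    """Return the sorted list of files the glob resolves to in this runtime."""
    return sorted(glob.glob(config["input_glob"]))
```

Calling `matched_inputs(permissive_local_config("/data/raw/*.jsonl", "/data/curated/run_001"))` (both paths are placeholders) inside the runtime and checking that the result is non-empty catches a bad glob before any Curator stage starts.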

## Hugging Face Snapshot Path

Use this when the user names a dataset:

- Set `dataset.repo_id`, `dataset.repo_type`, `dataset.local_dir`, and
  `allow_patterns` as needed.
- Point `input_glob` inside `dataset.local_dir`.
- Use only approved `dataset.repo_id` values and pinned revisions when
  production reproducibility or supply-chain risk matters.
- Validate checksums or snapshot metadata when available, scan downloaded
  content before downstream processing, and restrict production outbound network
  access to approved Hugging Face domains.
- Ensure `HF_TOKEN` and `HF_HOME` are available in the runtime when needed.
  Treat `HF_TOKEN` as a secret bearer token: inject it through a secrets manager
  or environment vault; never hardcode, print, echo, or log it; scope it to the
  minimum required permissions; and rotate it after shared-environment use.
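
For reference, a pinned snapshot download with token hygiene might look like the sketch below. It assumes a recent `huggingface_hub` release, where `snapshot_download` accepts `repo_id`, `repo_type`, `revision`, `local_dir`, `allow_patterns`, and `token`. Whether the repo step calls the library this way is an assumption, and the helper names are hypothetical.

```python
import os


def snapshot_kwargs(repo_id, revision, local_dir,
                    allow_patterns=None, repo_type="dataset"):
    """Assemble keyword arguments for huggingface_hub.snapshot_download.

    The revision should be a pinned commit hash or tag for reproducibility.
    The token is read from the environment and never hardcoded.
    """
    kwargs = {
        "repo_id": repo_id,
        "repo_type": repo_type,
        "revision": revision,
        "local_dir": local_dir,
        "allow_patterns": allow_patterns,
    }
    token = os.environ.get("HF_TOKEN")
    if token:
        kwargs["token"] = token
    return kwargs


def redacted(kwargs):
    """Return a copy that is safe to log: the token value is masked."""
    return {k: ("***" if k == "token" else v) for k, v in kwargs.items()}


def download(kwargs):
    """Run the download; imported lazily so the helpers work without the library."""
    from huggingface_hub import snapshot_download
    return snapshot_download(**kwargs)
```

Log `redacted(kwargs)` rather than `kwargs`, and point `input_glob` inside the directory passed as `local_dir`.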

## Operational Rules

- If memory pressure is expected, split a single large JSONL file into shards
  before Curator reads it.
- For Lepton or other remote runs, make sure input/output paths live on a
  mounted shared filesystem.
- Set `ray.num_cpus` in YAML or via env profile when the default CPU count is
  too low for the workload.
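
The sharding rule can be sketched with the standard library alone. The function below streams the source file line by line, so memory use stays flat regardless of input size. The shard naming scheme and the `lines_per_shard` default are illustrative assumptions, not values taken from the repo step.

```python
from pathlib import Path


def shard_jsonl(src, out_dir, lines_per_shard=100_000):
    """Stream one JSONL file into numbered shards and return the shard paths.

    Blank lines are skipped; each record stays on a single line.
    """
    if lines_per_shard < 1:
        raise ValueError("lines_per_shard must be at least 1")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    shards = []
    writer = None
    count = 0
    try:
        with open(src, "r", encoding="utf-8") as reader:
            for line in reader:
                if not line.strip():
                    continue
                # Open a new shard at the start and whenever the current one is full.
                if writer is None or count >= lines_per_shard:
                    if writer is not None:
                        writer.close()
                    path = out / f"shard_{len(shards):05d}.jsonl"
                    writer = open(path, "w", encoding="utf-8")
                    shards.append(path)
                    count = 0
                writer.write(line if line.endswith("\n") else line + "\n")
                count += 1
    finally:
        if writer is not None:
            writer.close()
    return shards
```

After sharding, set `input_glob` to a pattern that matches the shard files, for example `<out_dir>/shard_*.jsonl` under this sketch's naming scheme.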

## Failure Modes

- `input_glob_no_matches`: verify the path inside the container, not only on
  the submit host.
- `large_file_oom`: shard input before retrying.
- `empty_or_tiny_output`: disable filters first, then re-enable one gate at a
  time.
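
The `empty_or_tiny_output` procedure can be sketched as a generator of trial configs: a baseline with every filter off, then one trial per gate. This variant enables each gate in isolation, so a collapse in output size points at a single gate; a cumulative variant is equally reasonable. The flat key layout is an assumption about the config schema, and the helper name is hypothetical.

```python
import copy

FILTER_KEYS = ("language_codes", "domains", "quality_filters")


def one_gate_at_a_time(full_config):
    """Yield (label, config) pairs for isolating an over-aggressive filter.

    The first pair disables every filter. Each later pair re-enables exactly
    one gate from full_config. The input mapping is never mutated.
    """
    baseline = copy.deepcopy(full_config)
    baseline.update({"language_codes": [], "domains": [], "quality_filters": {}})
    yield "no_filters", baseline

    for key in FILTER_KEYS:
        value = full_config.get(key)
        if not value:
            continue  # gate was not configured, nothing to re-enable
        if key == "quality_filters":
            # Treat each quality filter entry as its own gate.
            for name, params in value.items():
                trial = copy.deepcopy(baseline)
                trial["quality_filters"] = {name: copy.deepcopy(params)}
                yield f"quality_filters.{name}", trial
        else:
            trial = copy.deepcopy(baseline)
            trial[key] = copy.deepcopy(value)
            yield key, trial
```

Run the step once per yielded config, each with its own `output_dir`, and compare record counts against the `no_filters` baseline.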
