# Curator Processing, Language, And Quality Context

Use this pack for `curate/nemo_curator` when configuring filtering after input
loading has been verified.

## Product Contract

- Keep this step simple: read JSONL, optionally apply language/domain/word-count
  gates, write filtered JSONL.
- Do not add dedup, custom classifiers, or heavy processing unless the current
  step exposes it or the user approves a new catalog step.

## Filter Controls

| Need | Config |
|---|---|
| Preserve all records for smoke test | `language_codes=[]`, `domains=[]`, `quality_filters={}` |
| Language gating | `language_codes=[...]`, `models.fasttext_langid`, optional `quality_filters.min_langid_score` |
| Word-count gate | `quality_filters.min_words` and `quality_filters.max_words` (set both) |
| Domain gate | `domains=[...]`, optional `models.hf_cache_dir` |
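
As a rough illustration of the table above, the two configs below contrast a smoke-test setup with a fully gated one. The nesting is inferred from the dotted key names, and the example values (`"en"`, the `"Science"` domain label, both paths, and all thresholds) are placeholders, not defaults of the step. `active_filter_families` is a hypothetical helper, not part of the step.

```python
# Illustrative configs only; nesting and example values are assumptions.
SMOKE_TEST = {
    "text_field": "text",
    "language_codes": [],
    "domains": [],
    "quality_filters": {},
    "models": {},
}

GATED = {
    "text_field": "text",
    "language_codes": ["en"],
    "domains": ["Science"],  # hypothetical domain label
    "quality_filters": {
        "min_langid_score": 0.5,  # placeholder threshold
        "min_words": 20,          # placeholder threshold
        "max_words": 2000,        # placeholder threshold
    },
    "models": {
        "fasttext_langid": "/models/lid.176.bin",  # hypothetical path
        "hf_cache_dir": "/cache/hf",               # hypothetical path
    },
}


def active_filter_families(cfg):
    """Return the filter families a config turns on, in a fixed order."""
    quality = cfg.get("quality_filters") or {}
    families = []
    if cfg.get("language_codes"):
        families.append("language")
    if "min_words" in quality or "max_words" in quality:
        families.append("word_count")
    if cfg.get("domains"):
        families.append("domain")
    return families


print(active_filter_families(SMOKE_TEST))  # []
print(active_filter_families(GATED))       # ['language', 'word_count', 'domain']
```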

## Practical Defaults

- Start with a tiny sample and permissive filters.
- Add one filter family at a time so failures are attributable.
- Keep `text_field` aligned with the input schema.
- Record filter thresholds in the generated project config; they materially
  affect downstream data quality.

## Remote Runtime Notes

- Language and domain models may need cache directories available on the remote
  filesystem.
- For CPU-only curation profiles, set an explicit Ray CPU count rather than
  letting Ray claim every CPU on the machine.
- If output is unexpectedly empty, inspect the intermediate record counts before
  changing downstream training configs.
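
One way to inspect record counts is to count JSONL records at each stage. This is a minimal sketch that assumes every stage's output is available as a JSONL file; the stage names and file layout are hypothetical, and the step may instead report counts in its logs.

```python
import json
import tempfile
from pathlib import Path


def count_jsonl_records(path):
    """Count non-blank lines that parse as JSON objects; malformed lines raise."""
    count = 0
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            if isinstance(json.loads(line), dict):
                count += 1
    return count


def stage_counts(stage_paths):
    """Map each stage name to its record count, preserving the given order."""
    return {name: count_jsonl_records(path) for name, path in stage_paths.items()}


# Demo with throwaway files standing in for real stage outputs.
with tempfile.TemporaryDirectory() as tmp:
    input_path = Path(tmp) / "input.jsonl"
    output_path = Path(tmp) / "filtered.jsonl"
    input_path.write_text('{"text": "a"}\n{"text": "b"}\n\n{"text": "c"}\n', encoding="utf-8")
    output_path.write_text("", encoding="utf-8")
    demo_counts = stage_counts({"input": input_path, "output": output_path})

print(demo_counts)  # {'input': 3, 'output': 0}
```

A drop from a non-zero input count to zero at a particular stage points at that stage's filter rather than at anything downstream.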

## Failure Modes

- `missing_language_model`: disable language filtering or provide the FastText
  model path.
- `incomplete_word_filter`: provide both min and max word thresholds or remove
  both.
- `empty_or_tiny_output`: relax filters and inspect a few rejected examples.
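
The failure modes above can be checked before and after a run with a small pre-flight function. This is a sketch, not the step's own validation: `diagnose` and `tiny_threshold` are hypothetical, and the trigger conditions are inferred from the remedies listed for each failure mode.

```python
def diagnose(cfg, output_count=None, tiny_threshold=10):
    """Return the failure-mode names that appear to apply to a config and run."""
    problems = []
    models = cfg.get("models") or {}
    quality = cfg.get("quality_filters") or {}

    # Language gating requested but no FastText model path supplied.
    if cfg.get("language_codes") and not models.get("fasttext_langid"):
        problems.append("missing_language_model")

    # Exactly one of the two word thresholds is set.
    has_min = quality.get("min_words") is not None
    has_max = quality.get("max_words") is not None
    if has_min != has_max:
        problems.append("incomplete_word_filter")

    # Output is smaller than an (assumed) minimum useful size.
    if output_count is not None and output_count < tiny_threshold:
        problems.append("empty_or_tiny_output")

    return problems


bad_cfg = {"language_codes": ["en"], "quality_filters": {"min_words": 5}}
print(diagnose(bad_cfg, output_count=0))
# ['missing_language_model', 'incomplete_word_filter', 'empty_or_tiny_output']

smoke_cfg = {"language_codes": [], "domains": [], "quality_filters": {}}
print(diagnose(smoke_cfg))  # []
```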
