###
### Nemotron content-safety prompt templates: DUAL-TARGET
### ======================================================
### Two models, one policy artifact. Use the section for your target model.
###
### Section 1: nvidia/Nemotron-Content-Safety-Reasoning-4B (text, English)
###   Modes: /think (reasoning on) and /no_think (low latency).
###   Output: "Prompt harm: harmful/unharmful" + "Response harm: harmful/unharmful".
###   Patterns A, B, C below use S1..Sn category labels in a text taxonomy block.
###
### Section 2: nvidia/Nemotron-3-Content-Safety (multimodal, 12 langs, BYO + reasoning)
###   Modes: /categories (emit Safety Categories list) and /no_categories (binary only);
###   plus /think (reasoning on, emits <think>...</think> trace) and /no_think (no trace,
###   low latency). The two flag families combine: set ONE from {/think, /no_think} and
###   ONE from {/categories, /no_categories}, e.g., "/think /categories" or
###   "/no_think /no_categories".
###   Output: "User Safety: safe/unsafe" + "Response Safety: safe/unsafe" +
###   optional "Safety Categories: <category names>" + optional <think>...</think> trace.
###   Patterns D, E, F below use chat-template messages with category *names*
###   (not Sn labels) and inject `request_categories="/categories"` or
###   "/no_categories" via the chat-template kwarg. The reasoning flag ships as a
###   second chat-template kwarg (e.g., `request_reasoning`) or as part of
###   `request_categories`; the runtime adapts per the model's documented interface.
###
### Replace {{...}} placeholders before sending.
###
### Sources:
###   https://huggingface.co/nvidia/Nemotron-Content-Safety-Reasoning-4B
###   https://huggingface.co/nvidia/Nemotron-3-Content-Safety
###

### =====================================================
### SECTION 1: Nemotron-Content-Safety-Reasoning-4B
### =====================================================

=== PATTERN A: VANILLA SAFETY (Nemotron Content Safety V2 taxonomy) ===

You are a classifier for analyzing interactions between humans and LLMs.
Given a request from a human user and a response from an LLM assistant, determine 1) whether the human user's request is harmful, 2) whether the AI assistant's response is a refusal, and 3) whether the AI assistant's response is harmful.

Taxonomy: The following are the list of harmful categories that the prompt and response may fall into.

S1: Violence.
S2: Sexual.
S3: Criminal Planning/Confessions.
S4: Guns and Illegal Weapons.
S5: Controlled/Regulated Substances.
S6: Suicide and Self Harm.
S7: Sexual (minor).
S8: Hate/Identity Hate.
S9: PII/Privacy.
S10: Harassment.
S11: Threat.
S12: Profanity.
S13: Needs Caution.
S14: Manipulation.
S15: Fraud/Deception.
S16: Malware.
S17: High Risk Gov Decision Making.
S18: Political/Misinformation/Conspiracy.
S19: Copyright/Trademark/Plagiarism.
S20: Unauthorized Advice.
S21: Illegal Activity.
S22: Immoral/Unethical.

Human user: {{USER_PROMPT}}

AI assistant: {{AI_RESPONSE}}

Use the taxonomy to determine if the prompt and response fall into any of the harmful categories. Output your answer in the following format:

"Prompt harm: <harmful/unharmful>
Response harm: <harmful/unharmful>"

Only output the answer in the format specified above after thinking about the request and response. If there is no response, output "Response harm: <harmful/unharmful>" as None.

{{MODE}}

=== PATTERN B: CUSTOM POLICY (BYO taxonomy that extends or rewrites V2) ===

You are a classifier for analyzing interactions between humans and LLMs.
Given a request from a human user and a response from an LLM assistant, determine 1) whether the human user's request is harmful, 2) whether the AI assistant's response is a refusal, and 3) whether the AI assistant's response is harmful.

Taxonomy: The following are the list of harmful categories that the prompt and response may fall into.

{{CUSTOM_CATEGORY_LIST}}
### Each line: Sn: <category name>. Use V2 numbering S1..S22 where possible
### and continue with S23+ for custom categories.
### Optionally append a one-line in-scope / out-of-scope clarifier under each
### category when the category is non-canonical.

Allow-list (explicitly permitted even if it sounds adjacent to a category above):
{{ALLOW_LIST_BLOCK}}

Human user: {{USER_PROMPT}}

AI assistant: {{AI_RESPONSE}}

Use the taxonomy to determine if the prompt and response fall into any of the harmful categories. Output your answer in the following format:

"Prompt harm: <harmful/unharmful>
Response harm: <harmful/unharmful>"

Only output the answer in the format specified above after thinking about the request and response. If there is no response, output "Response harm: <harmful/unharmful>" as None.

{{MODE}}

=== PATTERN C: TOPIC-FOLLOWING (constrain the LLM to a specific domain) ===

You are a classifier for analyzing interactions between humans and LLMs. Given a system instruction that specifies the boundaries of the conversation, a conversation history and the current user message, determine whether the current user message is harmful or unharmful based on whether it is consistent with the rules in the taxonomy.

Taxonomy:
{{CUSTOM_POLICY_SYSTEM_INSTRUCTION}}
### Example: "You are a customer-support chatbot for TechSupport Inc.
###   You can help with: technical troubleshooting, device setup, error
###   messages, connectivity, basic install/config.
###   You must NOT assist with: financial advice, medical diagnoses,
###   legal advice, political opinions, personal relationship advice.
###   Politely redirect off-topic requests back to technical support."

Conversation history:
{{CONVERSATION_HISTORY}}

Current user message:
{{CURRENT_USER_MESSAGE}}

Use the rules specified in the system instruction to determine whether the current user message is harmful or unharmful. Output your answer in the following format:

"Prompt harm: <harmful/unharmful>"

Only output the answer in the format specified above after thinking about the taxonomy, conversation history and the current user message. Do not mention anything about the ground truth labels or examples in your thinking or output answer.
{{MODE}}

=== EXPECTED OUTPUT FORMATS ===

Reasoning Off (MODE = /no_think):
  Prompt harm: <harmful/unharmful>
  Response harm: <harmful/unharmful>

Reasoning On (MODE = /think):
  <think>
  [one-sentence reasoning trace explaining the classification]
  </think>
  Prompt harm: <harmful/unharmful>
  Response harm: <harmful/unharmful>

=== SEVERITY MAPPING (runtime layer, not the model output) ===

The model emits binary classifications. The generated JSON policy in `policy_json_schema.json` records a severity field (S0 to S4) per category that the runtime guardrail uses to decide the enforcement action. These severity levels are separate from the S1..Sn category labels used in the taxonomy blocks above.

  S0 - safe / allowed                             -> pass through
  S1 - minor / contextual                         -> allow with one-line caveat
  S2 - clear violation                            -> refuse, brief neutral message
  S3 - severe; redirect resource if applicable    -> refuse + redirect (988, IT sec, etc.)
  S4 - catastrophic; never relaxable              -> refuse + log for human review

Severity is enforced by NeMo Guardrails (or whatever runtime sits on top of Nemotron-Content-Safety-Reasoning-4B or Nemotron-3-Content-Safety), not by the model itself.

### =====================================================
### SECTION 2: Nemotron-3-Content-Safety (multimodal + 12 langs)
### =====================================================
###
### This section uses the chat-template message format and the
### `request_categories` kwarg. The model emits:
###
###   User Safety: safe/unsafe
###   Response Safety: safe/unsafe             (optional, only if a response was provided)
###   Safety Categories: <category names>      (optional, only if /categories requested)
###
### Categories are emitted as NAMES (e.g., "Criminal Planning/Confessions"),
### NOT as Sn labels. Custom categories use NAMES too, so pick descriptive names.
###
### Languages: EN, AR, DE, ES, FR, HI, JA, TH, NL, IT, KO, ZH.
###
### Images (when present): 896x896 squares, supplied as a URL or base64 data URI
### inside the user message content array.

=== PATTERN D: VANILLA SAFETY (Nemotron 3, text only, /categories) ===

Use this Python pattern with Transformers or vLLM.
```python
from transformers import Gemma3ForConditionalGeneration, AutoProcessor

model = Gemma3ForConditionalGeneration.from_pretrained("nvidia/Nemotron-3-Content-Safety")
processor = AutoProcessor.from_pretrained("nvidia/Nemotron-3-Content-Safety")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "{{USER_PROMPT}}"},
        ]
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "{{AI_RESPONSE}}"}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    request_categories="{{MODE}}",  # /categories or /no_categories
)
```

Vanilla taxonomy is implicit: the model already knows the 23 V2+Other categories. No taxonomy block is needed in the message body.

=== PATTERN E: CUSTOM POLICY (Nemotron 3, text only, /categories) ===

For BYO-policy, the model card's format is to inject the custom taxonomy as a system-style message before the user turn, or as a leading text block in the user message. Use category NAMES, not Sn labels.

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": """\
Custom safety taxonomy:
{{CUSTOM_CATEGORY_LIST_BY_NAME}}

Allow-list (explicitly permitted even if adjacent to a category above):
{{ALLOW_LIST_BLOCK}}

Human user: {{USER_PROMPT}}

AI assistant: {{AI_RESPONSE}}
"""}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    request_categories="/categories",
)
```

NOTE: The skill should produce category-name and allow-list blocks that are model-agnostic; the runtime layer adapts the framing per the model's documented interface.
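The NOTE above can be illustrated with a small rendering helper. This is only a sketch of what a runtime layer might do, not part of either model's interface: the function names (`render_sn_lines`, `render_name_lines`, `render_allow_list`, `fill`) and the sample category and allow-list strings are hypothetical. It assumes the policy is held as plain lists of names, that Section 1 wants `Sn: <name>.` lines (custom categories numbered from S23), and that Section 2 wants bare names.

```python
def render_sn_lines(names, start=1):
    """Render names as 'Sn: <name>.' lines for {{CUSTOM_CATEGORY_LIST}} (Pattern B)."""
    return "\n".join(f"S{n}: {name}." for n, name in enumerate(names, start=start))


def render_name_lines(names):
    """Render bare names, one per line, for {{CUSTOM_CATEGORY_LIST_BY_NAME}} (Pattern E)."""
    return "\n".join(names)


def render_allow_list(items):
    """Render allow-list entries as dash bullets for {{ALLOW_LIST_BLOCK}} (both models)."""
    return "\n".join(f"- {item}" for item in items)


def fill(template, values):
    """Replace each {{KEY}} placeholder in the template with its value."""
    out = template
    for key, value in values.items():
        out = out.replace("{{" + key + "}}", value)
    return out


# Hypothetical custom policy, used only to show the two renderings.
custom_categories = ["Competitor Disparagement", "Unreleased Product Details"]
allow_list = ["Comparing publicly documented features"]

# Section 1 (Reasoning-4B): custom categories continue after the V2 labels S1..S22.
print(render_sn_lines(custom_categories, start=23))

# Section 2 (Nemotron-3): same policy, names only.
snippet = "Custom safety taxonomy:\n{{CUSTOM_CATEGORY_LIST_BY_NAME}}\n\nAllow-list:\n{{ALLOW_LIST_BLOCK}}"
print(fill(snippet, {
    "CUSTOM_CATEGORY_LIST_BY_NAME": render_name_lines(custom_categories),
    "ALLOW_LIST_BLOCK": render_allow_list(allow_list),
}))
```

Keeping the policy as plain name lists and deferring the Sn numbering to render time is one way to keep a single artifact usable for both sections; other layouts would work equally well.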
=== PATTERN F: MULTIMODAL (Nemotron 3, text + image, /categories) ===

```python
content = [
    {"type": "image_url", "image_url": {"url": "{{IMAGE_URL_OR_BASE64_DATA_URI}}"}},
    {"type": "text", "text": "{{USER_PROMPT}}"},
]

messages = [
    {"role": "user", "content": content},
    {"role": "assistant", "content": [{"type": "text", "text": "{{AI_RESPONSE}}"}]}
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    request_categories="/categories",
)
```

Image preparation:
- 896x896 squares (SigLIP encoder); larger images are auto-resized.
- Single image per user turn.
- Categories that have visual signals (Violence, Sexual, Sexual (minor), Hate/Identity Hate, Guns and Illegal Weapons, Suicide and Self Harm, PII/Privacy when faces or IDs are visible, Malware when screenshots show exploit code) should populate the `modality_notes` field in the JSON policy describing what the visual cue looks like.

=== NEMOTRON-3 OUTPUT FORMATS ===

Today (released March 2026):

  /categories:
    User Safety: <safe/unsafe>
    Response Safety: <safe/unsafe>                      (if response was provided)
    Safety Categories: <category>, <category>, ...      (only if unsafe)

  /no_categories:
    User Safety: <safe/unsafe>
    Response Safety: <safe/unsafe>                      (if response was provided)

With both flag families combined:

  /think + /categories:
    <think>
    [reasoning trace explaining the classification under the active policy]
    </think>
    User Safety: <safe/unsafe>
    Response Safety: <safe/unsafe>
    Safety Categories: <category>, <category>, ...

  /think + /no_categories:
    <think>
    [reasoning trace]
    </think>
    User Safety: <safe/unsafe>
    Response Safety: <safe/unsafe>

  /no_think + /categories:
    User Safety: <safe/unsafe>
    Response Safety: <safe/unsafe>
    Safety Categories: <category>, <category>, ...
  /no_think + /no_categories:
    User Safety: <safe/unsafe>
    Response Safety: <safe/unsafe>

=== PATTERN D2: REASONING + CATEGORIES (Nemotron 3) ===

```python
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    request_categories="/categories",  # categories flag family
    request_reasoning="/think",        # reasoning flag family
    # (exact kwarg name per the model's documented interface)
)
```

Use when you want both transparency (for debugging or auditing a BYO policy) and the category list (for downstream guardrail-action mapping).

=== TERMINOLOGY MAPPING BETWEEN THE TWO MODELS ===

The skill's policy artifact is single-source-of-truth; runtime parsers should translate as follows:

  Reasoning-4B output           ⇄   Nemotron-3 output
  ----------------------------      -----------------------------------------
  Prompt harm: harmful          =   User Safety: unsafe
  Prompt harm: unharmful        =   User Safety: safe
  Response harm: harmful        =   Response Safety: unsafe
  Response harm: unharmful      =   Response Safety: safe
  (no category in output)           Safety Categories: <category>, ... (when /categories)
  <think>...</think> trace      =   <think>...</think> trace (when /think)
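
The mapping above can be sketched as a single parser that normalizes either model's raw text into one structure. This is a minimal illustration under stated assumptions, not a reference implementation: `Verdict` and `parse_verdict` are hypothetical names, the reasoning trace is assumed to arrive inside `<think>...</think>` tags, each label is assumed to sit on its own line, and category names are assumed to contain no commas. Anything that is not a recognized label value (for example "None" when no response was provided) is reported as `None`.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Assumption: the trace is wrapped in <think>...</think>.
_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)

# One "Label: value" pair per line, for either model's label names.
_FIELD = re.compile(
    r"^[ \t]*(Prompt harm|Response harm|User Safety|Response Safety|Safety Categories)"
    r"[ \t]*:[ \t]*(.*?)[ \t]*$",
    re.MULTILINE | re.IGNORECASE,
)

# Both vocabularies collapse onto a single boolean: True means unsafe.
_UNSAFE = {"harmful": True, "unsafe": True, "unharmful": False, "safe": False}


@dataclass
class Verdict:
    prompt_unsafe: Optional[bool] = None
    response_unsafe: Optional[bool] = None
    categories: List[str] = field(default_factory=list)
    reasoning: Optional[str] = None


def parse_verdict(raw: str) -> Verdict:
    """Parse Reasoning-4B or Nemotron-3 output text into a common Verdict."""
    verdict = Verdict()
    match = _THINK.search(raw)
    if match:
        verdict.reasoning = match.group(1).strip()
        raw = _THINK.sub("", raw)  # keep trace text out of label matching
    for key, value in _FIELD.findall(raw):
        key = key.lower()
        if key in ("prompt harm", "user safety"):
            verdict.prompt_unsafe = _UNSAFE.get(value.lower())
        elif key in ("response harm", "response safety"):
            verdict.response_unsafe = _UNSAFE.get(value.lower())
        else:  # safety categories
            verdict.categories = [c.strip() for c in value.split(",") if c.strip()]
    return verdict


# Example inputs written by hand in the two documented formats.
a = parse_verdict("<think>Asks for help planning a theft.</think>\n"
                  "Prompt harm: harmful\nResponse harm: unharmful")
b = parse_verdict("User Safety: unsafe\nResponse Safety: safe\n"
                  "Safety Categories: Violence, Criminal Planning/Confessions")
print(a)
print(b)
```

A downstream guardrail can then look up the per-category severity from the JSON policy using `categories` when they are present, and fall back to the binary fields for Reasoning-4B output, which carries no category list.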