Prompt Embeddings with NVIDIA NIM for LLMs#
NVIDIA NIM for Large Language Models (LLMs) supports prompt embeddings (also known as prompt embeds) as a secure alternative input method to traditional text prompts. By allowing applications to use pre-computed embeddings for inference, this feature not only offers greater flexibility in prompt engineering but also significantly enhances privacy and data security. With prompt embeddings, sensitive user data can be transformed into embeddings before ever reaching the inference server, reducing the risk of exposing confidential information during the AI workflow.
Important
Prompt embedding is experimental and subject to change in future releases.
This feature is particularly useful for:
Privacy-Preserving AI: Transform sensitive prompts into embeddings before sending them to the inference server, protecting confidential information while maintaining model performance.
Custom Embedding Models: Use specialized embedding models that may be better suited for your domain or use case.
Embedding Caching: Pre-compute and cache embeddings for frequently used prompts to improve efficiency.
Advanced Prompt Engineering: Implement sophisticated prompt preprocessing pipelines that modify embeddings before inference.
Multi-Stage Pipelines: Integrate with proxy services or transformation layers that operate on embeddings rather than text.
For background details on prompt embedding, refer to the vLLM Prompt Embeds documentation.
Note
Prompt embedding is supported only through the Completions API (/v1/completions).
Architecture#
This architecture enables privacy-preserving inference by ensuring that sensitive prompts are transformed into embeddings before they reach the inference server (Client → Proxy → NIM → Proxy → Client).
Three main components work together to process prompt embeddings securely:
Client: Sends OpenAI API-compliant requests containing text prompts through the Completions API (/v1/completions) to the Transform Proxy.
Transform Proxy: Acts as an intermediary that:
Validates incoming requests
Transforms text prompts into embeddings
Forwards the transformed embeddings to the upstream inference server
Processes and streams responses back to the client in the standard OpenAI API format.
Upstream Inference Server (NVIDIA LLM NIM): Receives inference requests with prompt embeddings and returns generated responses to the Transform Proxy.
The following diagram illustrates the flow of prompt embeddings through a typical deployment with a transform proxy:
Prompt Embeds Deployment Architecture with Transform Proxy#
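The transform proxy is not part of NIM itself; any service that accepts OpenAI-style completion requests, converts the text prompt into embeddings from the served model, and forwards them to NIM can play this role. The following minimal sketch illustrates one possible proxy built with FastAPI and httpx; the framework choice, endpoint URL, and helper function are illustrative assumptions rather than part of the NIM product.
# Minimal transform-proxy sketch (illustrative only; not part of NIM).
# Assumes FastAPI, httpx, torch, and transformers are installed, NIM is
# reachable at NIM_URL, and the proxy loads the same model that NIM serves.
import base64
import io

import httpx
import torch
import transformers
from fastapi import FastAPI, Request

NIM_URL = "http://localhost:8000/v1/completions"  # assumed NIM endpoint
MODEL_NAME = "meta-llama/Meta-Llama-3.1-8B-Instruct"

app = FastAPI()
tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_NAME)
model = transformers.AutoModelForCausalLM.from_pretrained(MODEL_NAME)
embedding_layer = model.get_input_embeddings()


def text_to_prompt_embeds(prompt: str) -> str:
    """Tokenize the prompt, look up input embeddings, and base64-encode them."""
    token_ids = tokenizer(prompt, return_tensors="pt").input_ids
    embeds = embedding_layer(token_ids).squeeze(0)
    buffer = io.BytesIO()
    torch.save(embeds, buffer)
    return base64.b64encode(buffer.getvalue()).decode("utf-8")


@app.post("/v1/completions")
async def proxy_completion(request: Request):
    body = await request.json()
    # Replace the text prompt with prompt embeddings before forwarding to NIM.
    body["prompt_embeds"] = text_to_prompt_embeds(body.get("prompt", ""))
    body["prompt"] = ""
    async with httpx.AsyncClient(timeout=120.0) as client:
        response = await client.post(NIM_URL, json=body)
    return response.json()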
Usage#
Prerequisites#
Before using prompt embedding, ensure you have the following:
A compatible NIM container and backend
Access to the model’s embedding layer or a compatible embedding model
Python environment with the following packages:
torch (for tensor operations)
transformers (if generating embeddings from Hugging Face models)
openai (for API client)
Note
Prompt embeddings are currently supported only with the vLLM backend (using the V0 engine). vLLM v1 support will be available in a future release. Dynamo mode, TRT-LLM, and SGLang are not supported.
Running NIM with Prompt Embeddings#
Set NIM_ENABLE_PROMPT_EMBEDS=1 to enable prompt embeddings support in NIM. When enabled, NIM automatically uses the vLLM V0 engine and disables incompatible features (for example, chunked prefill).
Launch the NIM container with the required environment variable:
docker run -it --rm --gpus all -p 8000:8000 \
-e NGC_API_KEY=$NGC_API_KEY \
-e NIM_ENABLE_PROMPT_EMBEDS=1 \
-e NIM_MODEL_PROFILE=<profile tags or profile hash> \
-v ~/.cache:/opt/nim/.cache \
--shm-size=16GB \
$IMG_NAME
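After the container starts, confirm the server is ready before sending requests. The following sketch polls the NIM health endpoint; the host and port are assumptions that match the docker run command above.
# Readiness-check sketch: poll NIM until it reports ready.
# Assumes NIM is mapped to localhost:8000 as in the docker run command above.
import time

import requests

READY_URL = "http://localhost:8000/v1/health/ready"

for _ in range(60):
    try:
        if requests.get(READY_URL, timeout=5).status_code == 200:
            print("NIM is ready")
            break
    except requests.ConnectionError:
        pass  # server is not accepting connections yet
    time.sleep(10)
else:
    raise RuntimeError("NIM did not become ready in time")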
Embedding Limitations#
Same model: Embeddings must come from the exact same model that NIM is serving (for example, via model.get_input_embeddings()).
Format: Provide PyTorch tensors serialized with torch.save() and base64-encoded in prompt_embeds.
Dimensions: The tensor’s last dimension must match the model’s hidden size (for example, 4096 for Llama-3.1-8B).
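A quick client-side check can catch most of these issues before a request is sent. The sketch below validates that a tensor’s last dimension matches an expected hidden size before serializing it; the hidden-size constant and function name are illustrative assumptions.
# Sketch: validate and serialize prompt embeddings before sending them to NIM.
import base64
import io

import torch

EXPECTED_HIDDEN_SIZE = 4096  # e.g., Llama-3.1-8B; check your model's config


def encode_prompt_embeds(prompt_embeds: torch.Tensor) -> str:
    if prompt_embeds.shape[-1] != EXPECTED_HIDDEN_SIZE:
        raise ValueError(
            f"Last dimension {prompt_embeds.shape[-1]} does not match the "
            f"expected hidden size {EXPECTED_HIDDEN_SIZE}"
        )
    buffer = io.BytesIO()
    torch.save(prompt_embeds, buffer)  # serialize with torch.save
    return base64.b64encode(buffer.getvalue()).decode("utf-8")  # base64-encode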
API Examples#
Request Format#
Prompt embeddings are sent through the /v1/completions endpoint using the prompt_embeds field:
{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"prompt": "",
"prompt_embeds": "my-base64-encoded-tensor-string",
"max_tokens": 100,
"temperature": 0.1,
"stream": false
}
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | The model identifier to use for inference |
| prompt | string | No* | Text prompt. Use an empty string ("") when providing prompt_embeds |
| prompt_embeds | string | No* | Base64-encoded PyTorch tensor containing pre-computed embeddings |
| max_tokens | integer | No | Maximum number of tokens to generate |
| temperature | float | No | Sampling temperature (0.0 to 2.0) |
| stream | boolean | No | Whether to stream the response |
Note
*Either prompt or prompt_embeds must be provided; they cannot be used together.
Refer to Prompt Embedding for more details.
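As a concrete illustration of these parameters, the following sketch sends the request body above as a raw HTTP request using Python’s requests library; the placeholder embedding string and local endpoint are assumptions, and in practice you would substitute a real base64-encoded tensor (see the examples below).
# Sketch: send the request body above as a raw HTTP request.
import requests

payload = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "prompt": "",  # must be an empty string when prompt_embeds is provided
    "prompt_embeds": "my-base64-encoded-tensor-string",  # placeholder value
    "max_tokens": 100,
    "temperature": 0.1,
    "stream": False,
}
response = requests.post("http://localhost:8000/v1/completions", json=payload)
print(response.json())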
Response Format#
The response format is identical to standard completion requests:
{
"id": "cmpl-...",
"object": "text_completion",
"created": 1234567890,
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"choices": [
{
"text": "Generated text...",
"index": 0,
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 0,
"completion_tokens": 42,
"total_tokens": 42
}
}
Note
When using prompt embeddings, prompt_tokens in the usage statistics can be 0, as the token count is not computed from the embeddings.
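If your application needs accurate prompt token counts (for example, for logging or context-length budgeting), compute them client-side from the token IDs used to build the embeddings, as in the following sketch. The tokenizer and chat template usage mirror the client example later in this document.
# Sketch: count prompt tokens client-side, since the server reports 0
# prompt_tokens when prompt embeddings are used.
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct"
)
chat = [{"role": "user", "content": "Please tell me about the capital of France."}]
token_ids = tokenizer.apply_chat_template(
    chat, add_generation_prompt=True, return_tensors="pt"
)
print("prompt_tokens:", token_ids.shape[-1])  # number of tokens that were embedded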
Request Example#
You can also use curl to send prompt embeddings directly:
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"prompt": "",
"prompt_embeds": "'"$ENCODED_EMBEDDINGS"'",
"max_tokens": 100,
"temperature": 0.1
}'
Where $ENCODED_EMBEDDINGS is a base64-encoded PyTorch tensor.
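One way to produce a value for $ENCODED_EMBEDDINGS is the following sketch, which writes the base64 string to a file that you can load into a shell variable (for example, ENCODED_EMBEDDINGS=$(cat embeds.b64)); the file name is an illustrative assumption.
# Sketch: generate a base64-encoded prompt-embedding string for use with curl.
import base64
import io

import torch
import transformers

model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = transformers.AutoModelForCausalLM.from_pretrained(model_name)

chat = [{"role": "user", "content": "Please tell me about the capital of France."}]
token_ids = tokenizer.apply_chat_template(
    chat, add_generation_prompt=True, return_tensors="pt"
)
prompt_embeds = model.get_input_embeddings()(token_ids).squeeze(0)

buffer = io.BytesIO()
torch.save(prompt_embeds, buffer)
# Load in the shell with: ENCODED_EMBEDDINGS=$(cat embeds.b64)
with open("embeds.b64", "w") as f:
    f.write(base64.b64encode(buffer.getvalue()).decode("utf-8"))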
Python Examples#
Privacy-Preserving Example#
Prompt embeddings support privacy-preserving AI by converting sensitive prompts into embeddings before sending them to the inference server. For example, with frameworks like Stained Glass Transform Proxy, the prompt text is transformed into embeddings, which are then forwarded to NIM for inference. To use prompt embeds with the OpenAI client, include the base64-encoded embeddings in the prompt_embeds field using the extra_body argument, as shown in the following code snippet.
# Send prompt embeds to NIM
completion = client.completions.create(
model=model_name,
max_tokens=100,
temperature=0.1,
prompt="",
extra_body={"prompt_embeds": transformed_embeds},
)
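The snippet above assumes that client is an OpenAI client pointed at your NIM endpoint and that transformed_embeds is the base64-encoded output of your transform step (for example, a Stained Glass Transform Proxy, or the encoding approach shown in the client example below). A minimal setup might look like the following; the endpoint URL and placeholder API key are assumptions that match the client example below.
# Sketch of the setup assumed by the snippet above.
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",  # placeholder key, as in the client example below
    base_url="http://localhost:8000/v1",  # assumed local NIM endpoint
)
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
# transformed_embeds: base64-encoded PyTorch tensor produced by your
# transform step (see the client example below for one way to create it).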
Client Example#
The following sample script, test_prompt_embeds.py, demonstrates the end-to-end process for using prompt embeds with NIM. The script shows how to:
Generate prompt embeddings using HuggingFace Transformers
Encode those embeddings in base64
Send the encoded embeddings to an LLM NIM instance (with a vLLM backend) using the OpenAI-compatible Completions API
To run this example, first launch the NIM container with prompt embeddings enabled (as described in Running NIM with Prompt Embeddings), and then execute the script:
# Run the NIM container with prompt embeddings enabled
docker run -it --rm --gpus all -p 8000:8000 \
-e NGC_API_KEY=$NGC_API_KEY \
-e NIM_ENABLE_PROMPT_EMBEDS=1 \
-e NIM_MODEL_PROFILE=<profile tags or profile hash> \
-v ~/.cache:/opt/nim/.cache \
--shm-size=16GB \
$IMG_NAME
# Run the test:
python test_prompt_embeds.py
Refer to the following code snippet for the content of test_prompt_embeds.py:
import base64
import io
import torch
import transformers
from openai import OpenAI
def main():
client = OpenAI(
api_key="EMPTY",
base_url="http://localhost:8000/v1",
)
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
# Transformers
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
transformers_model = transformers.AutoModelForCausalLM.from_pretrained(model_name)
# Refer to the HuggingFace repo for the correct format to use
chat = [{"role": "user", "content": "Please tell me about the capital of France."}]
token_ids = tokenizer.apply_chat_template(chat, add_generation_prompt=True, return_tensors="pt")
embedding_layer = transformers_model.get_input_embeddings()
prompt_embeds = embedding_layer(token_ids).squeeze(0)
# Prompt embeds
buffer = io.BytesIO()
torch.save(prompt_embeds, buffer)
buffer.seek(0)
binary_data = buffer.read()
encoded_embeds = base64.b64encode(binary_data).decode("utf-8")
completion = client.completions.create(
model=model_name,
# NOTE: The OpenAI client does not allow `None` as an input to
# `prompt`. Use an empty string if you have no text prompts.
prompt="",
max_tokens=10,
temperature=0.0,
# NOTE: The OpenAI client allows passing in extra JSON body via the
# `extra_body` argument.
extra_body={"prompt_embeds": encoded_embeds},
)
print("-" * 30)
print(completion.choices[0].text)
print("-" * 30)
if __name__ == "__main__":
main()
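Prompt embeddings can also be combined with streaming responses. The following variation of the request in the script sets stream=True and prints tokens as they arrive; it assumes the same client, model_name, and encoded_embeds defined above.
# Streaming variant: same request, but tokens are printed as they arrive.
stream = client.completions.create(
    model=model_name,
    prompt="",
    max_tokens=100,
    temperature=0.0,
    stream=True,
    extra_body={"prompt_embeds": encoded_embeds},
)
for chunk in stream:
    print(chunk.choices[0].text, end="", flush=True)
print()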