Prompt Embeddings with NVIDIA NIM for LLMs#

NVIDIA NIM for Large Language Models (LLMs) supports prompt embeddings (also known as prompt embeds) as a secure alternative to traditional text prompts. Instead of sending raw text, applications send pre-computed embeddings for inference, which offers greater flexibility in prompt engineering and strengthens privacy and data security: sensitive user data can be transformed into embeddings before it ever reaches the inference server, reducing the risk of exposing confidential information during the AI workflow.

Important

Prompt embedding is experimental and subject to change in future releases.

This feature is particularly useful for:

  • Privacy-Preserving AI: Transform sensitive prompts into embeddings before sending them to the inference server, protecting confidential information while maintaining model performance.

  • Custom Embedding Models: Use specialized embedding models that may be better suited for your domain or use case.

  • Embedding Caching: Pre-compute and cache embeddings for frequently used prompts to improve efficiency.

  • Advanced Prompt Engineering: Implement sophisticated prompt preprocessing pipelines that modify embeddings before inference.

  • Multi-Stage Pipelines: Integrate with proxy services or transformation layers that operate on embeddings rather than text.

For background details on prompt embedding, refer to the vLLM Prompt Embeds documentation.

Note

Prompt embedding is supported only through the Completions API (/v1/completions).

Architecture#

This architecture enables privacy-preserving inference by ensuring that sensitive prompts are transformed into embeddings before they reach the inference server (Client → Proxy → NIM → Proxy → Client).

Three main components work together to process prompt embeddings securely:

  1. Client: Sends OpenAI API-compliant requests containing text prompts through the Completions API (/v1/completions) to the Transform Proxy.

  2. Transform Proxy: Acts as an intermediary that:

    • Validates incoming requests

    • Transforms text prompts into embeddings

    • Forwards the transformed embeddings to the upstream inference server

    • Processes and streams responses back to the client in the standard OpenAI API format.

  3. Upstream Inference Server (NVIDIA LLM NIM): Receives inference requests with prompt embeddings and returns generated responses to the Transform Proxy.

The following diagram illustrates the flow of prompt embeddings through a typical deployment with a transform proxy. A minimal code sketch of the proxy's core transform step follows the diagram.

Figure: Prompt Embeds Deployment Architecture with Transform Proxy
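The core of the transform step is independent of any particular proxy framework: tokenize the incoming text, pass the token IDs through the served model's input embedding layer, serialize the tensor with torch.save(), base64-encode it, and forward it to NIM in the prompt_embeds field. The following is a minimal sketch of that step, assuming the proxy has local access to the same model's tokenizer and embedding weights and that NIM is reachable at http://localhost:8000 (both assumptions are for illustration; a production proxy would also validate requests and stream responses).

import base64
import io

import requests
import torch
import transformers

MODEL_NAME = "meta-llama/Meta-Llama-3.1-8B-Instruct"   # must match the model NIM serves
NIM_URL = "http://localhost:8000/v1/completions"       # assumed NIM endpoint

tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_NAME)
model = transformers.AutoModelForCausalLM.from_pretrained(MODEL_NAME)
embedding_layer = model.get_input_embeddings()


def transform_and_forward(text: str, max_tokens: int = 100) -> dict:
    """Turn a text prompt into prompt embeds and forward them to NIM."""
    token_ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Shape: [seq_len, hidden_size]
        embeds = embedding_layer(token_ids).squeeze(0)

    # Serialize with torch.save() and base64-encode, as NIM expects
    buffer = io.BytesIO()
    torch.save(embeds, buffer)
    encoded = base64.b64encode(buffer.getvalue()).decode("utf-8")

    payload = {
        "model": MODEL_NAME,
        "prompt": "",              # empty text prompt; the embeddings carry the content
        "prompt_embeds": encoded,
        "max_tokens": max_tokens,
    }
    return requests.post(NIM_URL, json=payload, timeout=120).json()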

Usage#

Prerequisites#

Before using prompt embedding, ensure you have the following:

  • A compatible NIM container and backend

  • Access to the model’s embedding layer or a compatible embedding model

  • Python environment with the following packages:

    • torch (for tensor operations)

    • transformers (if generating embeddings from Hugging Face models)

    • openai (for API client)

Note

Prompt embeddings are currently supported only with the vLLM backend (using the V0 engine). vLLM v1 support will be available in a future release. Dynamo mode, TRT-LLM, and SGLang are not supported.

Running NIM with Prompt Embeddings#

Set NIM_ENABLE_PROMPT_EMBEDS=1 to enable prompt embeddings support in NIM. When enabled, NIM automatically uses the vLLM V0 engine and disables incompatible features (for example, chunked prefill).

Launch the NIM container with the required environment variable:

docker run -it --rm --gpus all -p 8000:8000 \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e NIM_ENABLE_PROMPT_EMBEDS=1 \
  -e NIM_MODEL_PROFILE=<profile tags or profile hash> \
  -v ~/.cache:/opt/nim/.cache \
  --shm-size=16GB \
  $IMG_NAME
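After the container starts, you can confirm that the OpenAI-compatible endpoint is reachable and look up the served model name to use in the model field of subsequent requests. A quick check, assuming the server is listening on localhost:8000:

from openai import OpenAI

# Point the OpenAI client at the local NIM endpoint (assumed address)
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# List the models this NIM instance serves; use one of these IDs as the
# `model` field in completion requests
for model in client.models.list():
    print(model.id)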

Embedding Limitations#

  • Same model: Embeddings must come from the exact same model that NIM is serving (for example, via model.get_input_embeddings()).

  • Format: Provide PyTorch tensors serialized with torch.save() and base64-encoded in prompt_embeds.

  • Dimensions: The tensor’s last dimension must match the model’s hidden size (for example, 4096 for Llama-3.1-8B). See the validation sketch after this list.
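The following is a small validation sketch that checks an encoded payload against these constraints before it is sent to NIM. It assumes a single, unbatched [seq_len, hidden_size] tensor, as produced by the examples in this document; the default hidden size of 4096 corresponds to Llama-3.1-8B.

import base64
import io

import torch


def validate_prompt_embeds(encoded: str, expected_hidden_size: int = 4096) -> torch.Tensor:
    """Decode a base64 prompt_embeds payload and verify its shape."""
    # Embeddings are serialized with torch.save(), so decode and load them back
    tensor = torch.load(io.BytesIO(base64.b64decode(encoded)), weights_only=True)
    if tensor.dim() != 2 or tensor.shape[-1] != expected_hidden_size:
        raise ValueError(
            f"expected a [seq_len, {expected_hidden_size}] tensor, got {tuple(tensor.shape)}"
        )
    return tensor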

API Examples#

Request Format#

Prompt embeddings are sent through the /v1/completions endpoint using the prompt_embeds field:

{
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "prompt": "",
  "prompt_embeds": "my-base64-encoded-tensor-string",
  "max_tokens": 100,
  "temperature": 0.1,
  "stream": false
}

Request Parameters#

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Yes | The model identifier to use for inference |
| prompt | string | No* | Text prompt. Use an empty string ("") when providing only prompt_embeds. |
| prompt_embeds | string | No* | Base64-encoded PyTorch tensor containing pre-computed embeddings |
| max_tokens | integer | No | Maximum number of tokens to generate |
| temperature | float | No | Sampling temperature (0.0 to 2.0) |
| stream | boolean | No | Whether to stream the response |

Note

*Provide either prompt or prompt_embeds as the input; they cannot be combined in a single request. When sending prompt_embeds, pass an empty string ("") for prompt.

Refer to Prompt Embedding for more details.

Response Format#

The response format is identical to standard completion requests:

{
  "id": "cmpl-...",
  "object": "text_completion",
  "created": 1234567890,
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "choices": [
    {
      "text": "Generated text...",
      "index": 0,
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 0,
    "completion_tokens": 42,
    "total_tokens": 42
  }
}

Note

When using prompt embeddings, prompt_tokens in the usage statistics can be 0, as the token count is not computed from the embeddings.
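If you need token accounting on the client side, the number of prompt tokens equals the first dimension of the [seq_len, hidden_size] embeddings tensor you sent. The helper below is a hypothetical sketch of that bookkeeping; prompt_embeds is the tensor you encoded into the request, and completion is the response object returned by the Completions API.

def account_usage(prompt_embeds, completion) -> dict:
    """Client-side token accounting when the server reports prompt_tokens as 0."""
    prompt_tokens = prompt_embeds.shape[0]  # one embedding row per prompt token
    completion_tokens = completion.usage.completion_tokens
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }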

Request Example#

You can use curl to send prompt embeddings directly:

curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "prompt": "",
    "prompt_embeds": "'"$ENCODED_EMBEDDINGS"'",
    "max_tokens": 100,
    "temperature": 0.1
  }'

Where $ENCODED_EMBEDDINGS is a base64-encoded PyTorch tensor.
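One way to produce the value of $ENCODED_EMBEDDINGS is a short Python script that prints the base64 string to stdout so you can export it into your shell. The script below is a sketch under the same assumptions as the other examples in this section; the file name encode_prompt.py is illustrative.

# encode_prompt.py -- print a base64-encoded prompt-embeddings tensor to stdout
import base64
import io
import sys

import torch
import transformers

# Assumptions for illustration: the model below is the one NIM is serving,
# and the prompt text is passed as the first command-line argument.
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
text = sys.argv[1] if len(sys.argv) > 1 else "Please tell me about the capital of France."

tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = transformers.AutoModelForCausalLM.from_pretrained(model_name)

# Apply the model's chat template, then map token IDs through the embedding layer
chat = [{"role": "user", "content": text}]
token_ids = tokenizer.apply_chat_template(chat, add_generation_prompt=True, return_tensors="pt")
with torch.no_grad():
    prompt_embeds = model.get_input_embeddings()(token_ids).squeeze(0)

# Serialize with torch.save() and base64-encode for the prompt_embeds field
buffer = io.BytesIO()
torch.save(prompt_embeds, buffer)
print(base64.b64encode(buffer.getvalue()).decode("utf-8"))

You can then set the variable before issuing the curl request, for example with export ENCODED_EMBEDDINGS=$(python encode_prompt.py "Please tell me about the capital of France.").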

Python Examples#

Privacy-Preserving Example#

Prompt embeddings support privacy-preserving AI by converting sensitive prompts into embeddings before sending them to the inference server. For example, with frameworks like Stained Glass Transform Proxy, the prompt text is transformed into embeddings, which are then forwarded to NIM for inference. To use prompt embeds with the OpenAI client, include the base64-encoded embeddings in the prompt_embeds field using the extra_body argument, as shown in the following code snippet.

# Send prompt embeds to NIM
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")  # adjust for your deployment
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# `transformed_embeds` is the base64-encoded tensor from your transform step (for example, a transform proxy)
completion = client.completions.create(
    model=model_name,
    max_tokens=100,
    temperature=0.1,
    prompt="",
    extra_body={"prompt_embeds": transformed_embeds},
)

Client Example#

The following sample script, test_prompt_embeds.py, demonstrates the end-to-end process for using prompt embeds with NIM. The script shows how to:

  1. Generate prompt embeddings using HuggingFace Transformers

  2. Encode those embeddings in base64

  3. Send the encoded embeddings to an LLM NIM instance (with a vLLM backend) using the OpenAI-compatible Completions API

To run this example, first launch the NIM container with prompt embeddings enabled, as described in Running NIM with Prompt Embeddings, and then execute the script:

# Run the NIM container with prompt embeds enabled
docker run -it --rm --gpus all -p 8000:8000 \
  -e NGC_API_KEY=$NGC_API_KEY \
  -e NIM_ENABLE_PROMPT_EMBEDS=1 \
  -e NIM_MODEL_PROFILE=<profile tags or profile hash> \
  -v ~/.cache:/opt/nim/.cache \
  --shm-size=16GB \
  $IMG_NAME


# Run the test script
python test_prompt_embeds.py

Refer to the following code snippet for the content of test_prompt_embeds.py:

import base64
import io

import torch
import transformers
from openai import OpenAI


def main():
    client = OpenAI(
        api_key="EMPTY",
        base_url="http://localhost:8000/v1",
    )

    model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"

    # Load the tokenizer and model with Hugging Face Transformers
    tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
    transformers_model = transformers.AutoModelForCausalLM.from_pretrained(model_name)

    # Refer to the Hugging Face model card for the correct chat format to use
    chat = [{"role": "user", "content": "Please tell me about the capital of France."}]
    token_ids = tokenizer.apply_chat_template(chat, add_generation_prompt=True, return_tensors="pt")

    # Map token IDs to embeddings using the model's own input embedding layer
    embedding_layer = transformers_model.get_input_embeddings()
    prompt_embeds = embedding_layer(token_ids).squeeze(0)

    # Serialize the embeddings with torch.save() and base64-encode them
    buffer = io.BytesIO()
    torch.save(prompt_embeds, buffer)
    buffer.seek(0)
    binary_data = buffer.read()
    encoded_embeds = base64.b64encode(binary_data).decode("utf-8")

    completion = client.completions.create(
        model=model_name,
        # NOTE: The OpenAI client does not allow `None` as an input to
        # `prompt`. Use an empty string if you have no text prompts.
        prompt="",
        max_tokens=10,
        temperature=0.0,
        # NOTE: The OpenAI client allows passing in extra JSON body via the
        # `extra_body` argument.
        extra_body={"prompt_embeds": encoded_embeds},
    )

    print("-" * 30)
    print(completion.choices[0].text)
    print("-" * 30)


if __name__ == "__main__":
    main()