From c0ece2453d98b9604cd6f2e708d13b57c6ca9cfb Mon Sep 17 00:00:00 2001
From: Sean Sube
Date: Mon, 27 Mar 2023 17:14:10 -0500
Subject: [PATCH] fix(docs): add more runtimes to memory usage table

---
 docs/user-guide.md | 28 +++++++++++++++++++++-------
 1 file changed, 21 insertions(+), 7 deletions(-)

diff --git a/docs/user-guide.md b/docs/user-guide.md
index 3481ac0d..13f59f93 100644
--- a/docs/user-guide.md
+++ b/docs/user-guide.md
@@ -733,19 +733,33 @@ number of [server optimizations](server-admin.md#pipeline-optimizations) that yo
 - `onnx-low-memory`
 - `torch-fp16`
 
+You can enable optimizations using the `ONNX_WEB_OPTIMIZATIONS` environment variable:
+
+```shell
+# on linux:
+> export ONNX_WEB_OPTIMIZATIONS=diffusers-attention-slicing,onnx-fp16,onnx-low-memory
+
+# on windows:
+> set ONNX_WEB_OPTIMIZATIONS=diffusers-attention-slicing,onnx-fp16,onnx-low-memory
+```
+
 At least 12GB of VRAM is recommended for running all of the models in the extras file, but `onnx-web` should work
 on most 8GB cards and may work on some 6GB cards. 4GB is not supported yet, but [it should be
 possible](https://github.com/ssube/onnx-web/issues/241#issuecomment-1475341043).
 
 Based on somewhat limited testing, the model size memory usage for each optimization level is approximately:
 
-| Optimizations               | Disk Size | Memory Usage - 1 @ 512x512 | Supported Platforms |
-| --------------------------- | --------- | -------------------------- | ------------------- |
-| none                        | 4.0G      | 11.5G                      | all                 |
-| `onnx-fp16`                 | 2.2G      | 9.9G                       | all                 |
-| ORT script                  | 4.0G      | 6.6G                       | CUDA only           |
-| ORT script with `--float16` | 2.1G      | 5.8G                       | CUDA only           |
-| `torch-fp16`                | 2.0G      | 5.9G                       | CUDA only           |
+| Optimizations               | Disk Size | CUDA Memory Usage | DirectML Memory Usage | ROCm Memory Usage | Supported Platforms |
+| --------------------------- | --------- | ----------------- | --------------------- | ----------------- | ------------------- |
+| none                        | 4.0G      | 11.5G             | TODO                  | 8.5G              | all                 |
+| `onnx-fp16`                 | 2.2G      | 9.9G              | TODO                  | 4.5G              | all                 |
+| ORT script                  | 4.0G      | 6.6G              | -                     | -                 | CUDA only           |
+| ORT script with `--float16` | 2.1G      | 5.8G              | -                     | -                 | CUDA only           |
+| `torch-fp16`                | 2.0G      | 5.9G              | -                     | -                 | CUDA only           |
+
+All rows were measured at a resolution of 512x512 with a batch size of 1, on consecutive runs after the first load.
+The exact memory usage will depend on the model(s) you are using, the ONNX Runtime version, and the CUDA/ROCm drivers
+on your system. These are approximate values, measured during testing and rounded up to the nearest 100MB.
 
 - https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/python/tools/transformers/models/stable_diffusion#cuda-optimizations-for-stable-diffusion
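+
+The `ORT script` rows in the table above refer to the ONNX Runtime Stable Diffusion optimization script documented
+at the link above. As a rough sketch of an invocation along the lines of that README (the script name, paths, and
+flags here follow the linked README and should be verified against your ONNX Runtime version):
+
+```shell
+# convert a diffusers pipeline in ./model to an optimized ONNX pipeline in ./model-fp16
+# (script name and flags per the linked README; paths are placeholders)
+> python optimize_pipeline.py -i ./model -o ./model-fp16 --float16
+```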
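+
+If you are not sure which optimization level your card can handle, you can check its total and used VRAM with the
+standard vendor tools. This is only a sketch and assumes `nvidia-smi` (CUDA) or `rocm-smi` (ROCm) is on your path:
+
+```shell
+# on CUDA: report total and used VRAM for each GPU
+> nvidia-smi --query-gpu=memory.total,memory.used --format=csv
+
+# on ROCm: report VRAM usage for each GPU
+> rocm-smi --showmeminfo vram
+```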
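+
+The memory numbers in the table were taken on consecutive runs after the first load. One way to make a similar
+measurement on your own hardware (a sketch, not necessarily the exact method used for the table) is to poll the
+driver once per second while rendering a 512x512 image with a batch size of 1:
+
+```shell
+# on CUDA: log used VRAM every second; watch for the peak value during the run
+> nvidia-smi --query-gpu=memory.used --format=csv -l 1
+```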