fix(api): combine names for ONNX fp16 optimization
parent 73e9cf8b66
commit c2f8fb1d31
@@ -478,7 +478,7 @@ def main() -> int:
     logger.info("CLI arguments: %s", args)
 
     ctx = ConversionContext.from_environ()
-    ctx.half = args.half or "onnx-internal-fp16" in ctx.optimizations
+    ctx.half = args.half or "onnx-fp16" in ctx.optimizations
     ctx.opset = args.opset
     ctx.token = args.token
     logger.info("converting models in %s using %s", ctx.model_path, ctx.training_device)
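The API change above only renames the token that enables half precision: either `--half` on the CLI or the merged `onnx-fp16` server optimization now sets `ctx.half`, and the old `onnx-internal-fp16` name no longer has any effect here. A minimal sketch of that resolution logic, using hypothetical stand-in names rather than the real `ConversionContext` internals:

```python
# Illustrative sketch of the flag resolution above; everything except the
# "onnx-fp16" / "onnx-internal-fp16" token names is a hypothetical stand-in.
from argparse import Namespace

def resolve_half(args: Namespace, optimizations: set[str]) -> bool:
    # After this commit only the merged "onnx-fp16" token enables fp16;
    # "onnx-internal-fp16" is no longer checked.
    return bool(args.half or "onnx-fp16" in optimizations)

assert resolve_half(Namespace(half=False), {"onnx-fp16"}) is True
assert resolve_half(Namespace(half=False), {"onnx-internal-fp16"}) is False
assert resolve_half(Namespace(half=True), set()) is True
```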
@@ -102,9 +102,7 @@ Others:
 - `onnx-deterministic-compute`
   - enable ONNX deterministic compute
 - `onnx-fp16`
-  - force 16-bit floating point values when running pipelines
-  - use with https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/python/tools/transformers/models/stable_diffusion#optimize-onnx-pipeline
-    and the `--float16` flag
+  - convert model nodes to 16-bit floating point values internally while leaving 32-bit inputs
 - `onnx-graph-*`
   - `onnx-graph-disable`
     - disable all ONNX graph optimizations
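For context, the reworded `onnx-fp16` entry describes converting the graph's internal nodes to fp16 while keeping 32-bit inputs and outputs. A hedged sketch of that kind of conversion using `onnxconverter-common` (one common tool for this; the diff does not show which converter onnx-web itself calls), where `keep_io_types=True` is what preserves the 32-bit inputs:

```python
# Sketch of an internal fp16 conversion that keeps fp32 graph inputs/outputs.
# Assumes onnx and onnxconverter-common are installed; the paths are placeholders.
import onnx
from onnxconverter_common import float16

model = onnx.load("unet/model.onnx")
model_fp16 = float16.convert_float_to_float16(
    model,
    keep_io_types=True,  # leave graph inputs and outputs as float32
)
onnx.save(model_fp16, "unet/model_fp16.onnx")
```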
@@ -112,9 +110,6 @@ Others:
     - enable basic ONNX graph optimizations
   - `onnx-graph-all`
     - enable all ONNX graph optimizations
-- `onnx-internal-fp16`
-  - convert internal model nodes to 16-bit floating point values
-  - does not reduce disk space as much as `onnx-fp16` or `torch-fp16`, but does not incur as many extra conversions
 - `onnx-low-memory`
   - disable ONNX features that allocate more memory than is strictly required or keep memory after use
 - `torch-*`
@@ -725,20 +725,29 @@ Some common VAE models include:
 ### Optimizing models for lower memory usage
 
 Running Stable Diffusion with ONNX acceleration uses more memory by default than some other methods, but there are a
-number of optimizations that you can apply to reduce the memory usage.
+number of [server optimizations](server-admin.md#pipeline-optimizations) that you can apply to reduce the memory usage:
+
+- `diffusers-attention-slicing`
+- `onnx-fp16`
+- `onnx-graph-all`
+- `onnx-low-memory`
+- `torch-fp16`
 
 At least 12GB of VRAM is recommended for running all of the models in the extras file, but `onnx-web` should work on
 most 8GB cards and may work on some 6GB cards. 4GB is not supported yet, but [it should be
 possible](https://github.com/ssube/onnx-web/issues/241#issuecomment-1475341043).
 
-- `diffusers-attention-slicing`
-- `onnx-fp16`
-- `onnx-internal-fp16`
-- `onnx-graph-all`
-- `onnx-low-memory`
-- `torch-fp16`
+Based on somewhat limited testing, the model disk size and memory usage for each optimization level is approximately:
 
-TODO: memory at different optimization levels
+| Optimizations               | Disk Size | Memory Usage - 1 @ 512x512 | Supported Platforms |
+| --------------------------- | --------- | -------------------------- | ------------------- |
+| none                        | 4.0G      | 11.5G                      | all                 |
+| `onnx-fp16`                 | 2.2G      | 9.9G                       | all                 |
+| ORT script                  | 4.0G      | 6.6G                       | CUDA only           |
+| ORT script with `--float16` | 2.1G      | 5.8G                       | CUDA only           |
+| `torch-fp16`                | 2.0G      | 5.9G                       | CUDA only           |
 
 - https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/python/tools/transformers/models/stable_diffusion#cuda-optimizations-for-stable-diffusion
 
 ### Permanently blending additional networks
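As a rough, illustrative companion to the list and table added above (not part of onnx-web), the sketch below picks a reduced-memory optimization set from the documented options based on detected VRAM, following the 12GB/8GB guidance; the helper name, thresholds, and selection policy are assumptions for the example only:

```python
# Illustrative only: choose server optimizations from the documented list based
# on available VRAM. Thresholds follow the 12GB/8GB guidance above; the helper
# name and selection policy are assumptions, not onnx-web behavior.
import torch

def suggest_optimizations() -> list[str]:
    if not torch.cuda.is_available():
        # CPU or non-CUDA: stick to the options the table marks as "all" platforms.
        return ["onnx-fp16", "onnx-low-memory"]
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    opts = ["diffusers-attention-slicing", "onnx-fp16"]
    if vram_gb < 12:
        # Below the recommended 12GB, enable the full reduced-memory set.
        opts += ["onnx-graph-all", "onnx-low-memory", "torch-fp16"]
    return opts

print(",".join(suggest_optimizations()))
```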