fix(docs): explain how to optimize models
parent ae3bcf3b8b
commit 31e841ab4a

@@ -19,6 +19,10 @@ Textual Inversion embeddings.

- [LoRA weights from kohya-ss/sd-scripts](#lora-weights-from-kohya-sssd-scripts)
- [Converting Textual Inversion embeddings](#converting-textual-inversion-embeddings)
- [Figuring out how many layers are in a Textual Inversion](#figuring-out-how-many-layers-are-in-a-textual-inversion)
- [Optimizing diffusers models](#optimizing-diffusers-models)
  - [Converting to float16](#converting-to-float16)
  - [Optimizing with ONNX runtime](#optimizing-with-onnx-runtime)
  - [Optimizing with HuggingFace Optimum](#optimizing-with-huggingface-optimum)

## Conversion steps for each type of model

@@ -279,3 +283,63 @@ lin-7', 'goblin-8', 'goblin-9', 'goblin-10', 'goblin-11', 'goblin-12', 'goblin-1

You do not need to know how many layers a Textual Inversion has to use the base token, `goblin` or `goblin-all` in this
example, but it does allow you to control the layers individually.

## Optimizing diffusers models

The ONNX models often include redundant nodes, like division by 1, and are converted using 32-bit floating point
numbers by default. The models can be optimized to remove some of those nodes and reduce their size, both on disk and
in VRAM.

The highest levels of optimization will make the converted models platform-specific and must be done after blending
LoRAs and Textual Inversions, so you cannot select them in the prompt, but they reduce memory usage by 50-75%.

### Converting to float16

The size of a model can be roughly cut in half, on disk and in memory, by converting it from float32 to float16. There
are a few different levels of conversion, which become increasingly platform-specific.

1. Internal conversion (see the sketch after this list)
   - Converts graph nodes to float16 operations
   - Leaves inputs and outputs as float32
   - Initializer data can be converted to float16 or kept as float32
2. Full conversion
   - Can be done with ONNX runtime or Torch
   - Converts inputs, outputs, nodes, and initializer data to float16
   - Breaks runtime LoRA and Textual Inversion blending
   - Requires some additional data conversions at runtime, which may introduce subtle rounding errors
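
As a rough sketch of the internal conversion, the `onnxconverter-common` package can convert a single ONNX file. This
is not necessarily the exact code used by `onnx-web`, and the paths are placeholders:

```python
# Minimal sketch: internal float16 conversion with onnxconverter-common
# (pip install onnxconverter-common). Paths below are placeholders.
import onnx
from onnxconverter_common import float16

# Load one of the converted ONNX models, such as the text encoder.
# Models over 2GB, like the UNet, need ONNX's external data format to save.
model = onnx.load("./stable-diffusion-v1-5/text_encoder/model.onnx")

# keep_io_types=True leaves the graph inputs and outputs as float32,
# matching level 1 above; keep_io_types=False converts them as well,
# which is closer to the full conversion in level 2.
model_fp16 = float16.convert_float_to_float16(model, keep_io_types=True)

onnx.save(model_fp16, "./stable-diffusion-v1-5-fp16/text_encoder/model.onnx")
```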

Using Stable Diffusion v1.5 as an example, full conversion reduces the size of the model by about half:

```none
4.0G ./stable-diffusion-v1-5-fp32
4.0G ./stable-diffusion-v1-5-fp32-optimized
2.6G ./stable-diffusion-v1-5-fp16-internal
2.3G ./stable-diffusion-v1-5-fp16-optimized
2.0G ./stable-diffusion-v1-5-fp16-torch
```
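
Sizes like these can be checked for your own converted models with a standard disk usage tool, for example:

```shell
> du -sh ./stable-diffusion-v1-5-*
```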

Combined with [the other ONNX optimizations](server-admin.md#pipeline-optimizations), this can make the pipeline usable
on 4-6GB GPUs and allow much larger batch sizes on GPUs with more memory. The optimized float32 model uses somewhat
less VRAM than the original model, despite being the same size on disk.

### Optimizing with ONNX runtime

The ONNX runtime provides an optimization script for Stable Diffusion models in its git repository. You will need to
clone that repository, but you can use an existing virtual environment for `onnx-web` and should not need to install
any new packages.

```shell
> git clone https://github.com/microsoft/onnxruntime
> cd onnxruntime/onnxruntime/python/tools/transformers/models/stable_diffusion
> python3 optimize_pipeline.py -i /home/ssube/onnx-web/models/stable-diffusion
```

The `optimize_pipeline.py` script should work on any [diffusers directory with ONNX models](#converting-diffusers-models),
but you will need to use the `--use_external_data_format` option if you are not using `--float16`. See the `--help` for
more details.
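
For example, an invocation along these lines should produce an optimized float16 copy of the model; the `-o` output
flag and path are assumptions here, so confirm them against `--help`:

```shell
> python3 optimize_pipeline.py -i /home/ssube/onnx-web/models/stable-diffusion -o /home/ssube/onnx-web/models/stable-diffusion-optimized --float16
```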

- https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/python/tools/transformers/models/stable_diffusion

### Optimizing with HuggingFace Optimum

- https://huggingface.co/docs/optimum/v1.7.1/en/onnxruntime/usage_guides/optimization#optimizing-a-model-with-optimum-cli
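
Optimum includes an `optimum-cli` command that can apply ONNX runtime optimizations to an exported model. As a rough
sketch based on the linked guide, with placeholder paths and an assumed `-O2` optimization level; check the linked
documentation for the exact flags:

```shell
> optimum-cli onnxruntime optimize --onnx_model ./models/stable-diffusion -O2 -o ./models/stable-diffusion-optimized
```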