AUTOMATIC1111/stable-diffusion-webui

Training (Textual Inversion and Hypernetworks)

Active contributors: AUTOMATIC1111, DepFA, AngelBottomless, Kohaku-Blueleaf

Purpose

The repository ships two in-tree training pipelines: textual inversion embeddings and hypernetworks. Both let users adapt a frozen Stable Diffusion checkpoint without fine-tuning the full model. Lora training is not part of this codebase; users typically train with kohya_ss/sd-scripts and load the result through the Lora extension.

The training UI is a sub-tab of the Train top-level tab.

Code layout

modules/
├── textual_inversion/
│   ├── textual_inversion.py        # the trainer; ~770 lines
│   ├── dataset.py                  # PIL-based image+caption dataset
│   ├── learn_schedule.py           # learning rate scheduling DSL
│   ├── image_embedding.py          # encode/decode embeddings as PNG sidecars
│   ├── autocrop.py                 # face-aware preprocessing
│   ├── saving_settings.py          # per-embedding save metadata
│   └── ui.py                       # the Train tab create_embedding helper
├── hypernetworks/
│   ├── hypernetwork.py             # ~36 KB; trainer + module + UI registration
│   └── ui.py                       # create_hypernetwork helper
├── api/api.py                      # /sdapi/v1/train/* and /create/* endpoints
└── ui.py                           # the Train tab assembly
textual_inversion_templates/        # prompt templates used during training (e.g. "subject_filewords.txt")
embeddings/                         # output destination for trained embeddings
models/hypernetworks/               # output destination for hypernets

What can be trained

| Concept | What it actually is | File extension | Inference path |
| --- | --- | --- | --- |
| Textual inversion embedding | A single learned token vector (or a list of them, one per "vector per token") that maps to a specific concept. | `.pt`, `.safetensors`, `.bin` | Substituted into CLIP at encode time by `EmbeddingDatabase` (see systems/sd-hijack.md). |
| Hypernetwork | A small MLP whose output is added to the K and V projections in the UNet's cross-attention layers. | `.pt` | Patched in by `Hypernetwork.attach()` and applied via `extra_networks_hypernet`. |
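As a small illustration of the file-extension column, the sketch below collects candidate embedding files from a directory tree. It is not how EmbeddingDatabase is written: the helper name `find_embedding_files` is hypothetical, and only the three extensions listed in the table are considered.

```python
import tempfile
from pathlib import Path

# Extensions copied from the table above; the scan itself is illustrative.
EMBEDDING_EXTENSIONS = {".pt", ".safetensors", ".bin"}


def find_embedding_files(root: Path) -> list[str]:
    """Return the sorted names of files under root with a listed extension."""
    return sorted(
        p.name
        for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in EMBEDDING_EXTENSIONS
    )


# Demo on a throwaway directory so the example has no side effects.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    for name in ("cat.pt", "nested/dog.safetensors", "notes.txt"):
        (root / name).write_bytes(b"")
    found = find_embedding_files(root)
```

Here `found` contains the two embedding-style files and skips `notes.txt`.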

Lifecycle: textual inversion

sequenceDiagram
    participant User
    participant CreateUI as Train tab Create
    participant Trainer as textual_inversion.train_embedding
    participant Dataset
    participant CLIP
    participant UNet
    participant Optim as torch.optim

    User->>CreateUI: name, init text, vectors per token
    CreateUI->>Trainer: create_embedding(...)
    Trainer-->>CreateUI: empty .pt in embeddings/

    User->>CreateUI: select images dir, training settings, click Train
    CreateUI->>Trainer: train_embedding(args)
    Trainer->>Dataset: PersonalizedBase
    loop steps
        Dataset->>Trainer: image + caption template
        Trainer->>CLIP: encode prompt (with embedding token)
        Trainer->>UNet: noise prediction
        Trainer->>Optim: backprop only the embedding tensor
        Trainer->>Trainer: write_image_embedding (preview every N)
        Trainer->>Trainer: save .pt every N
    end

Key implementation points in modules/textual_inversion/textual_inversion.py:

Embedding is the in-memory record (vec, name, step, sd_checkpoint).
EmbeddingDatabase walks the configured directories at boot and on /sdapi/v1/refresh-embeddings.
train_embedding(...) is the trainer; ~330 lines. It supports gradient accumulation, EMA, learning rate scheduling, and cross-attention masking.
create_embedding(name, num_vectors_per_token, ...) initialises a new .pt from an init text or random vectors.
Image previews during training are saved into textual_inversion/<embedding>/; they share the embedded-as-PNG format implemented in image_embedding.py.
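The "backprop only the embedding tensor" step in the lifecycle above is the core of textual inversion: the model stays frozen and gradient updates touch only the embedding. The toy below illustrates that idea in plain Python with a hand-written gradient. The linear "model", the squared-error loss, and every name here are stand-ins chosen for illustration, not the trainer's code.

```python
def predict(weights: list[float], embedding: list[float]) -> float:
    """Frozen toy model: a dot product standing in for CLIP plus UNet."""
    return sum(w * e for w, e in zip(weights, embedding))


def train_embedding_toy(
    weights: list[float],
    embedding: list[float],
    target: float,
    lr: float = 0.05,
    steps: int = 200,
) -> list[float]:
    """Minimise (predict - target)^2 by updating the embedding only."""
    embedding = list(embedding)  # work on a copy of the initial vector
    for _ in range(steps):
        error = predict(weights, embedding) - target
        # d/de_i of error^2 is 2 * error * w_i; the weights get no update.
        embedding = [e - lr * 2.0 * error * w for e, w in zip(embedding, weights)]
    return embedding


frozen_weights = [0.5, -1.0, 2.0]
learned = train_embedding_toy(frozen_weights, [0.0, 0.0, 0.0], target=3.0)
```

After the loop the frozen weights are unchanged and only `learned` has moved toward a vector that makes the toy model hit the target.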

Lifecycle: hypernetworks

modules/hypernetworks/hypernetwork.py is one of the largest single files in the repo (~36 KB). It contains both the model definition (HypernetworkModule, Hypernetwork) and the training loop (train_hypernetwork()).

The HypernetworkModule is a small stack of linear layers inserted into each cross-attention K/V projection. It can use tanh activation, dropout, and layer normalisation; the user picks the architecture string at creation time. Compared with Lora, a hypernetwork tends to produce relatively heavy weights for fairly subtle effects.
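To make the size of a hypernetwork module concrete, the sketch below counts parameters for one linear stack. It assumes the architecture string is a comma-separated list of width multipliers (for example 1, 2, 1) applied to the context width, with one biased linear layer between consecutive entries. The function names are hypothetical, and activation, dropout, and normalisation layers are ignored.

```python
def layer_shapes(structure: str, dim: int) -> list[tuple[int, int]]:
    """Turn '1, 2, 1' and a base width into (in_features, out_features) pairs."""
    multipliers = [float(m) for m in structure.split(",")]
    widths = [int(dim * m) for m in multipliers]
    return list(zip(widths, widths[1:]))


def parameter_count(shapes: list[tuple[int, int]]) -> int:
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in shapes)


shapes = layer_shapes("1, 2, 1", 768)
total = parameter_count(shapes)
```

Under these assumptions a single 1, 2, 1 stack at width 768 holds roughly 2.4 million parameters, which gives a feel for why the resulting files are comparatively large.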

Hypernetworks are deprecated in favour of Lora, but the code still works. The last meaningful change in this file was a v1.7-era stability fix.

Dataset and templates

PersonalizedBase (modules/textual_inversion/dataset.py) reads images and (optional) per-image caption files from a directory. It supports cropping, mirroring, and on-the-fly tagging via deepdanbooru/BLIP if the image has no caption.
textual_inversion_templates/ holds prompt templates with placeholders: [name] is replaced with the embedding name, [filewords] with the per-image caption.
autocrop.py is a focal-point detector that can pick a crop centred on a face/feature. Used during preprocessing.
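The placeholder substitution described above can be sketched in a few lines. This is an illustration rather than the code in dataset.py: the helper names are hypothetical, the sidecar caption is assumed to be a .txt file with the same base name as the image, and the file-name fallback is likewise an assumption.

```python
from pathlib import Path


def caption_for(image_path: Path) -> str:
    """Use a sidecar .txt caption if present, else fall back to the file stem."""
    sidecar = image_path.with_suffix(".txt")
    if sidecar.exists():
        return sidecar.read_text(encoding="utf-8").strip()
    # Assumption: with no caption file, the file name stands in for the caption.
    return image_path.stem.replace("_", " ")


def fill_template(line: str, name: str, filewords: str) -> str:
    """Substitute the [name] and [filewords] placeholders in one template line."""
    return line.replace("[name]", name).replace("[filewords]", filewords)


# Illustrative template line; real templates live in textual_inversion_templates/.
prompt = fill_template("a photo of [name], [filewords]", "my-cat", "sitting on a sofa")
```

Each training step would pair one image with one filled-in line like `prompt`.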

The Preprocess tab

A separate sub-tab under Train preprocesses an image directory before training: resize/crop, mirror, autotag with BLIP (caption) or deepdanbooru (anime tags), split into training and validation. Implemented inline in modules/ui.py and modules/textual_inversion/preprocess.py (lazy-loaded).

API

| Endpoint | Action |
| --- | --- |
| `POST /sdapi/v1/create/embedding` | Create a new empty embedding file |
| `POST /sdapi/v1/create/hypernetwork` | Create a new empty hypernetwork |
| `POST /sdapi/v1/train/embedding` | Train an embedding |
| `POST /sdapi/v1/train/hypernetwork` | Train a hypernetwork |
| `GET /sdapi/v1/embeddings` | List embeddings (loaded + skipped) |
| `POST /sdapi/v1/refresh-embeddings` | Rescan the embeddings directories |
| `GET /sdapi/v1/hypernetworks` | List hypernetworks |
| `POST /sdapi/v1/preprocess` | Preprocess a folder (legacy; resizes/captions images) |

The training endpoints block until training finishes, so clients should issue the request from a background thread or task and poll /sdapi/v1/progress for status.
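One way to combine a blocking training request with progress polling is sketched below. The helper `run_with_progress` is hypothetical and deliberately transport-agnostic: against a real server, `blocking_call` would POST to a training endpoint and `fetch_progress` would GET /sdapi/v1/progress. The request body is not shown because its fields are not listed here, and the 0..1 `progress` field and the response shape in the demo are assumptions.

```python
import threading
import time
from typing import Callable


def run_with_progress(
    blocking_call: Callable[[], dict],
    fetch_progress: Callable[[], dict],
    on_progress: Callable[[float], None],
    interval: float = 1.0,
) -> dict:
    """Run a blocking request in a worker thread and poll progress meanwhile."""
    result: dict = {}

    def worker() -> None:
        try:
            result["response"] = blocking_call()
        except Exception as exc:  # surfaced to the caller after the loop
            result["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    while thread.is_alive():
        # Assumption: the progress payload carries a 0..1 'progress' field.
        on_progress(float(fetch_progress().get("progress", 0.0)))
        thread.join(timeout=interval)
    if "error" in result:
        raise result["error"]
    return result["response"]


def fake_train() -> dict:
    """Stand-in for the blocking HTTP call; the response shape is made up."""
    time.sleep(0.05)
    return {"info": "done"}


seen: list[float] = []
response = run_with_progress(
    fake_train, lambda: {"progress": 0.5}, seen.append, interval=0.01
)
```

Keeping the two calls as injected functions makes the loop easy to exercise without a running server, as the fake functions show.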

What this code does not do

No DreamBooth / full fine-tuning. This codebase only trains embeddings or hypernetworks. Use a separate trainer like kohya_ss/sd-scripts for Lora or DreamBooth.
No SDXL textual inversion training. The trainer was written for SD 1.x's CLIP encoder. SDXL has two text encoders, and the in-tree trainer doesn't address that — embeddings can still be loaded, just not trained here.
No multi-GPU. The training loop is single-process, single-device.

Integration points

script_callbacks.on_ui_train_tabs(callback) lets extensions add their own train sub-tabs. Used by some Lora-trainer extensions.
The EmbeddingDatabase is on model_hijack, so on_model_loaded is the right callback for "react to embeddings being available".

Entry points for modification

Embeddings as .safetensors: already supported on the load side; saving uses safetensors.torch.save_file. To change the save format, edit Embedding.save() in modules/textual_inversion/textual_inversion.py.
Better LR schedules: learn_schedule.py parses an Automatic1111-specific DSL such as 0.005:100, 0.001:500 (rate 0.005 up to step 100, then 0.001 up to step 500). Grammar extensions belong there.
Logging: TensorBoard logging is available behind an opt-in setting; the integration is minimal and routed through tensorboard_setup in the trainer.
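The learning-rate schedule string mentioned above can be read as a list of rate:last-step pairs. The parser below is a minimal sketch of that reading, not the code in learn_schedule.py. The names `parse_schedule` and `rate_at` are hypothetical, and it assumes that an entry without a step count applies until training ends.

```python
import math


def parse_schedule(text: str) -> list[tuple[float, float]]:
    """Parse 'rate:step, rate:step, rate' into (rate, last_step) pairs."""
    pairs = []
    for chunk in text.split(","):
        chunk = chunk.strip()
        if not chunk:
            continue
        rate, sep, step = chunk.partition(":")
        # Assumption: an entry with no ':step' part lasts until training ends.
        last_step = float(step) if sep else math.inf
        pairs.append((float(rate), last_step))
    return pairs


def rate_at(pairs: list[tuple[float, float]], step: int):
    """Return the rate for a step, or None once the schedule is exhausted."""
    for rate, last_step in pairs:
        if step <= last_step:
            return rate
    return None


schedule = parse_schedule("0.005:100, 0.001:500")
```

With this schedule, `rate_at(schedule, 100)` yields 0.005, `rate_at(schedule, 101)` yields 0.001, and any step past 500 yields `None`, which a caller could treat as the signal to stop.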
