Stable Diffusion Image Generation

Comprehensive guide to generating images with Stable Diffusion using the HuggingFace Diffusers library.

When to use Stable Diffusion

Use Stable Diffusion when:

Generating images from text descriptions
Performing image-to-image translation (style transfer, enhancement)
Inpainting (filling in masked regions)
Outpainting (extending images beyond boundaries)
Creating variations of existing images
Building custom image generation workflows

Key features:

Text-to-Image: Generate images from natural language prompts
Image-to-Image: Transform existing images with text guidance
Inpainting: Fill masked regions with context-aware content
ControlNet: Add spatial conditioning (edges, poses, depth)
LoRA Support: Efficient fine-tuning and style adaptation
Multiple Models: SD 1.5, SDXL, SD 3.0, Flux support

Use alternatives instead:

DALL-E 3: For API-based generation without GPU
Midjourney: For artistic, stylized outputs
Imagen: For Google Cloud integration
Leonardo.ai: For web-based creative workflows

Quick start

Installation

pip install diffusers transformers accelerate torch pip install xformers # Optional: memory-efficient attention

Basic text-to-image

from diffusers import DiffusionPipeline import torch

Load pipeline (auto-detects model type)

pipe = DiffusionPipeline.from_pretrained( "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16 ) pipe.to("cuda")

Generate image

image = pipe( "A serene mountain landscape at sunset, highly detailed", num_inference_steps=50, guidance_scale=7.5 ).images[0]

image.save("output.png")

Using SDXL (higher quality)

from diffusers import AutoPipelineForText2Image import torch

pipe = AutoPipelineForText2Image.from_pretrained( "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16" ) pipe.to("cuda")

Enable memory optimization

pipe.enable_model_cpu_offload()

image = pipe( prompt="A futuristic city with flying cars, cinematic lighting", height=1024, width=1024, num_inference_steps=30 ).images[0]

Architecture overview

Three-pillar design

Diffusers is built around three core components:

Pipeline (orchestration) ├── Model (neural networks) │ ├── UNet / Transformer (noise prediction) │ ├── VAE (latent encoding/decoding) │ └── Text Encoder (CLIP/T5) └── Scheduler (denoising algorithm)

Pipeline inference flow

Text Prompt → Text Encoder → Text Embeddings ↓ Random Noise → [Denoising Loop] ← Scheduler ↓ Predicted Noise ↓ VAE Decoder → Final Image

Core concepts

Pipelines

Pipelines orchestrate complete workflows:

Pipeline Purpose

StableDiffusionPipeline

Text-to-image (SD 1.x/2.x)

StableDiffusionXLPipeline

Text-to-image (SDXL)

StableDiffusion3Pipeline

Text-to-image (SD 3.0)

FluxPipeline

Text-to-image (Flux models)

StableDiffusionImg2ImgPipeline

Image-to-image

StableDiffusionInpaintPipeline

Inpainting

Schedulers

Schedulers control the denoising process:

Scheduler Steps Quality Use Case

EulerDiscreteScheduler

20-50 Good Default choice

EulerAncestralDiscreteScheduler

20-50 Good More variation

DPMSolverMultistepScheduler

15-25 Excellent Fast, high quality

DDIMScheduler

50-100 Good Deterministic

LCMScheduler

4-8 Good Very fast

UniPCMultistepScheduler

15-25 Excellent Fast convergence

Swapping schedulers

from diffusers import DPMSolverMultistepScheduler

Swap for faster generation

pipe.scheduler = DPMSolverMultistepScheduler.from_config( pipe.scheduler.config )

Now generate with fewer steps

image = pipe(prompt, num_inference_steps=20).images[0]

Generation parameters

Key parameters

Parameter Default Description

prompt

Required Text description of desired image

negative_prompt

None What to avoid in the image

num_inference_steps

50 Denoising steps (more = better quality)

guidance_scale

7.5 Prompt adherence (7-12 typical)

height , width

512/1024 Output dimensions (multiples of 8)

generator

None Torch generator for reproducibility

num_images_per_prompt

1 Batch size

Reproducible generation

import torch

generator = torch.Generator(device="cuda").manual_seed(42)

image = pipe( prompt="A cat wearing a top hat", generator=generator, num_inference_steps=50 ).images[0]

Negative prompts

image = pipe( prompt="Professional photo of a dog in a garden", negative_prompt="blurry, low quality, distorted, ugly, bad anatomy", guidance_scale=7.5 ).images[0]

Image-to-image

Transform existing images with text guidance:

from diffusers import AutoPipelineForImage2Image from PIL import Image

pipe = AutoPipelineForImage2Image.from_pretrained( "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16 ).to("cuda")

init_image = Image.open("input.jpg").resize((512, 512))

image = pipe( prompt="A watercolor painting of the scene", image=init_image, strength=0.75, # How much to transform (0-1) num_inference_steps=50 ).images[0]

Inpainting

Fill masked regions:

from diffusers import AutoPipelineForInpainting from PIL import Image

pipe = AutoPipelineForInpainting.from_pretrained( "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16 ).to("cuda")

image = Image.open("photo.jpg") mask = Image.open("mask.png") # White = inpaint region

result = pipe( prompt="A red car parked on the street", image=image, mask_image=mask, num_inference_steps=50 ).images[0]

ControlNet

Add spatial conditioning for precise control:

from diffusers import StableDiffusionControlNetPipeline, ControlNetModel import torch

Load ControlNet for edge conditioning

controlnet = ControlNetModel.from_pretrained( "lllyasviel/control_v11p_sd15_canny", torch_dtype=torch.float16 )

pipe = StableDiffusionControlNetPipeline.from_pretrained( "stable-diffusion-v1-5/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16 ).to("cuda")

Use Canny edge image as control

control_image = get_canny_image(input_image)

image = pipe( prompt="A beautiful house in the style of Van Gogh", image=control_image, num_inference_steps=30 ).images[0]

Available ControlNets

ControlNet Input Type Use Case

canny

Edge maps Preserve structure

openpose

Pose skeletons Human poses

depth

Depth maps 3D-aware generation

normal

Normal maps Surface details

mlsd

Line segments Architectural lines

scribble

Rough sketches Sketch-to-image

LoRA adapters

Load fine-tuned style adapters:

from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained( "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16 ).to("cuda")

Load LoRA weights

pipe.load_lora_weights("path/to/lora", weight_name="style.safetensors")

Generate with LoRA style

image = pipe("A portrait in the trained style").images[0]

Adjust LoRA strength

pipe.fuse_lora(lora_scale=0.8)

Unload LoRA

pipe.unload_lora_weights()

Multiple LoRAs

Load multiple LoRAs

pipe.load_lora_weights("lora1", adapter_name="style") pipe.load_lora_weights("lora2", adapter_name="character")

Set weights for each

pipe.set_adapters(["style", "character"], adapter_weights=[0.7, 0.5])

image = pipe("A portrait").images[0]

Memory optimization

Enable CPU offloading

Model CPU offload - moves models to CPU when not in use

pipe.enable_model_cpu_offload()

Sequential CPU offload - more aggressive, slower

pipe.enable_sequential_cpu_offload()

Attention slicing

Reduce memory by computing attention in chunks

pipe.enable_attention_slicing()

Or specific chunk size

pipe.enable_attention_slicing("max")

xFormers memory-efficient attention

Requires xformers package

pipe.enable_xformers_memory_efficient_attention()

VAE slicing for large images

Decode latents in tiles for large images

pipe.enable_vae_slicing() pipe.enable_vae_tiling()

Model variants

Loading different precisions

FP16 (recommended for GPU)

pipe = DiffusionPipeline.from_pretrained( "model-id", torch_dtype=torch.float16, variant="fp16" )

BF16 (better precision, requires Ampere+ GPU)

pipe = DiffusionPipeline.from_pretrained( "model-id", torch_dtype=torch.bfloat16 )

Loading specific components

from diffusers import UNet2DConditionModel, AutoencoderKL

Load custom VAE

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

Use with pipeline

pipe = DiffusionPipeline.from_pretrained( "stable-diffusion-v1-5/stable-diffusion-v1-5", vae=vae, torch_dtype=torch.float16 )

Batch generation

Generate multiple images efficiently:

Multiple prompts

prompts = [ "A cat playing piano", "A dog reading a book", "A bird painting a picture" ]

images = pipe(prompts, num_inference_steps=30).images

Multiple images per prompt

images = pipe( "A beautiful sunset", num_images_per_prompt=4, num_inference_steps=30 ).images

Common workflows

Workflow 1: High-quality generation

from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler import torch

1. Load SDXL with optimizations

pipe = StableDiffusionXLPipeline.from_pretrained( "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16" ) pipe.to("cuda") pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config) pipe.enable_model_cpu_offload()

2. Generate with quality settings

image = pipe( prompt="A majestic lion in the savanna, golden hour lighting, 8k, detailed fur", negative_prompt="blurry, low quality, cartoon, anime, sketch", num_inference_steps=30, guidance_scale=7.5, height=1024, width=1024 ).images[0]

Workflow 2: Fast prototyping

from diffusers import AutoPipelineForText2Image, LCMScheduler import torch

Use LCM for 4-8 step generation

pipe = AutoPipelineForText2Image.from_pretrained( "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16 ).to("cuda")

Load LCM LoRA for fast generation

pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl") pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) pipe.fuse_lora()

Generate in ~1 second

image = pipe( "A beautiful landscape", num_inference_steps=4, guidance_scale=1.0 ).images[0]

Common issues

CUDA out of memory:

Enable memory optimizations

pipe.enable_model_cpu_offload() pipe.enable_attention_slicing() pipe.enable_vae_slicing()

Or use lower precision

pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)

Black/noise images:

Check VAE configuration

Use safety checker bypass if needed

pipe.safety_checker = None

Ensure proper dtype consistency

pipe = pipe.to(dtype=torch.float16)

Slow generation:

Use faster scheduler

from diffusers import DPMSolverMultistepScheduler pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

Reduce steps