Stable Diffusion Image Generation
Comprehensive guide to generating images with Stable Diffusion using the HuggingFace Diffusers library.
When to use Stable Diffusion
Use Stable Diffusion when:
- Generating images from text descriptions
- Performing image-to-image translation (style transfer, enhancement)
- Inpainting (filling in masked regions)
- Outpainting (extending images beyond boundaries)
- Creating variations of existing images
- Building custom image generation workflows
Key features:
- Text-to-Image: Generate images from natural language prompts
- Image-to-Image: Transform existing images with text guidance
- Inpainting: Fill masked regions with context-aware content
- ControlNet: Add spatial conditioning (edges, poses, depth)
- LoRA Support: Efficient fine-tuning and style adaptation
- Multiple Models: SD 1.5, SDXL, SD 3.0, Flux support
Use alternatives instead:
- DALL-E 3: For API-based generation without GPU
- Midjourney: For artistic, stylized outputs
- Imagen: For Google Cloud integration
- Leonardo.ai: For web-based creative workflows
Quick start
Installation
```bash
pip install diffusers transformers accelerate torch
pip install xformers  # Optional: memory-efficient attention
```
Basic text-to-image
```python
from diffusers import DiffusionPipeline
import torch

# Load pipeline (auto-detects model type)
pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe.to("cuda")

# Generate image
image = pipe(
    "A serene mountain landscape at sunset, highly detailed",
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]

image.save("output.png")
```
Using SDXL (higher quality)
```python
from diffusers import AutoPipelineForText2Image
import torch

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
)

# Enable memory optimization. Offloading manages GPU placement itself,
# so don't also call pipe.to("cuda").
pipe.enable_model_cpu_offload()

image = pipe(
    prompt="A futuristic city with flying cars, cinematic lighting",
    height=1024,
    width=1024,
    num_inference_steps=30,
).images[0]
image.save("sdxl_output.png")
```
Architecture overview
Three-pillar design
Diffusers is built around three core components:
```text
Pipeline (orchestration)
├── Model (neural networks)
│   ├── UNet / Transformer (noise prediction)
│   ├── VAE (latent encoding/decoding)
│   └── Text Encoder (CLIP/T5)
└── Scheduler (denoising algorithm)
```
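Every loaded pipeline exposes these pieces through its `components` dict. The sketch below uses a hypothetical helper, `describe_components`, that only reads that dict and lists which class fills each slot. Empty optional slots are skipped.

```python
def describe_components(pipe):
    """Map each loaded component name to its class name, skipping empty (None) slots."""
    return {
        name: type(component).__name__
        for name, component in pipe.components.items()
        if component is not None
    }
```

On an SD 1.5 pipeline the result should include entries such as `unet`, `vae`, `text_encoder`, `tokenizer` and `scheduler`. The same dict can be unpacked into another pipeline class to reuse the already-loaded models, for example `StableDiffusionImg2ImgPipeline(**pipe.components)`.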
Pipeline inference flow
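```text
Text Prompt → Text Encoder → Text Embeddings
                                   ↓
Random Noise (latents) → [Denoising Loop] ← Scheduler
   (each step: UNet/Transformer predicts the noise,
    the scheduler uses it to update the latents)
                                   ↓
                           Denoised Latents
                                   ↓
                      VAE Decoder → Final Image
```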
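The loop operates on VAE latents, not pixels. For SD 1.x/2.x and SDXL, the VAE downsamples each spatial dimension by 8 and the latents have 4 channels. This is why `height` and `width` must be multiples of 8. The sketch below uses a hypothetical helper, `latent_shape`, to compute the shape of the initial noise tensor under those assumptions. Newer models such as SD 3 use more latent channels, which the `latent_channels` argument allows for.

```python
def latent_shape(height, width, batch_size=1, latent_channels=4, vae_scale_factor=8):
    """Shape (batch, channels, height // f, width // f) of the initial noise latents."""
    if height % vae_scale_factor or width % vae_scale_factor:
        raise ValueError(
            f"height and width must be multiples of {vae_scale_factor}, got {height}x{width}"
        )
    return (batch_size, latent_channels, height // vae_scale_factor, width // vae_scale_factor)


print(latent_shape(512, 512))    # (1, 4, 64, 64)
print(latent_shape(1024, 1024))  # (1, 4, 128, 128)
```

In a loaded SD pipeline, the corresponding values are available as `pipe.unet.config.in_channels` and `pipe.vae_scale_factor`.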
Core concepts
Pipelines
Pipelines orchestrate complete workflows:
| Pipeline | Purpose |
|---|---|
| `StableDiffusionPipeline` | Text-to-image (SD 1.x/2.x) |
| `StableDiffusionXLPipeline` | Text-to-image (SDXL) |
| `StableDiffusion3Pipeline` | Text-to-image (SD 3.0) |
| `FluxPipeline` | Text-to-image (Flux models) |
| `StableDiffusionImg2ImgPipeline` | Image-to-image |
| `StableDiffusionInpaintPipeline` | Inpainting |
Schedulers
Schedulers control the denoising process:
| Scheduler | Steps | Quality | Use Case |
|---|---|---|---|
| `EulerDiscreteScheduler` | 20-50 | Good | General-purpose choice |
| `EulerAncestralDiscreteScheduler` | 20-50 | Good | More variation |
| `DPMSolverMultistepScheduler` | 15-25 | Excellent | Fast, high quality |
| `DDIMScheduler` | 50-100 | Good | Deterministic |
| `LCMScheduler` | 4-8 | Good | Very fast (requires an LCM model or LCM-LoRA) |
| `UniPCMultistepScheduler` | 15-25 | Excellent | Fast convergence |
Swapping schedulers
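```python
from diffusers import DPMSolverMultistepScheduler

# Swap for faster generation (`pipe` is a pipeline loaded as shown above;
# the new scheduler reuses the current scheduler's config)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# Now generate with fewer steps
prompt = "A serene mountain landscape at sunset"
image = pipe(prompt, num_inference_steps=20).images[0]
```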
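`from_config` also accepts keyword overrides, which take precedence over the copied config values. This is the usual way to enable scheduler options such as Karras sigmas on `DPMSolverMultistepScheduler`. A small hypothetical wrapper, `swap_scheduler`, makes that pattern explicit:

```python
def swap_scheduler(pipe, scheduler_cls, **overrides):
    """Replace pipe.scheduler with scheduler_cls built from the current config.

    Keyword overrides (e.g. use_karras_sigmas=True) are forwarded to
    from_config, where they take precedence over the copied config values.
    """
    pipe.scheduler = scheduler_cls.from_config(pipe.scheduler.config, **overrides)
    return pipe
```

Usage: `swap_scheduler(pipe, DPMSolverMultistepScheduler, use_karras_sigmas=True)`. Because the helper only reassigns `pipe.scheduler`, the loaded models are left untouched.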
Generation parameters
Key parameters
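| Parameter | Default | Description |
|---|---|---|
| `prompt` | Required | Text description of desired image |
| `negative_prompt` | `None` | What to avoid in the image |
| `num_inference_steps` | 50 | Denoising steps (more steps are slower; quality usually improves up to a point) |
| `guidance_scale` | 7.5 (SDXL: 5.0) | Prompt adherence (7-12 typical) |
| `height`, `width` | Model's native size (512 for SD 1.5, 1024 for SDXL) | Output dimensions in pixels (multiples of 8) |
| `generator` | `None` | `torch.Generator` for reproducibility |
| `num_images_per_prompt` | 1 | Number of images generated per prompt |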
Reproducible generation
```python
import torch

generator = torch.Generator(device="cuda").manual_seed(42)

image = pipe(
    prompt="A cat wearing a top hat",
    generator=generator,
    num_inference_steps=50,
).images[0]
```
Negative prompts
```python
image = pipe(
    prompt="Professional photo of a dog in a garden",
    negative_prompt="blurry, low quality, distorted, ugly, bad anatomy",
    guidance_scale=7.5,
).images[0]
```
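`guidance_scale` and `negative_prompt` both feed classifier-free guidance. At each denoising step the model predicts the noise twice:

- once conditioned on the prompt;
- once on an "unconditional" input, which is the empty prompt or the negative prompt when one is given.

The two predictions are then combined. The sketch below shows that combination rule on plain Python lists. `apply_cfg` is a hypothetical helper; real pipelines apply the same formula to tensors inside the loop.

```python
def apply_cfg(noise_uncond, noise_text, guidance_scale):
    """Classifier-free guidance: start from the unconditional prediction and move
    toward (and, for guidance_scale > 1, past) the prompt-conditioned one."""
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


print(apply_cfg([0.0, 1.0], [1.0, 1.0], 7.5))  # [7.5, 1.0]
```

With `guidance_scale=1.0` the formula reduces to the conditioned prediction alone. As far as we know, the Stable Diffusion pipelines in Diffusers skip the unconditional pass entirely in that case, which is why fast LCM setups use 1.0.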
Image-to-image
Transform existing images with text guidance:
```python
import torch
from diffusers import AutoPipelineForImage2Image
from PIL import Image

pipe = AutoPipelineForImage2Image.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

init_image = Image.open("input.jpg").convert("RGB").resize((512, 512))

image = pipe(
    prompt="A watercolor painting of the scene",
    image=init_image,
    strength=0.75,  # How much to transform (0 = keep input, 1 = nearly ignore it)
    num_inference_steps=50,
).images[0]
```
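`strength` also determines how many denoising steps actually run. As far as we can tell from the Diffusers source, SD img2img pipelines noise the input image to an intermediate timestep and then denoise only the remaining `int(num_inference_steps * strength)` steps. The sketch below mirrors that arithmetic; the helper name is hypothetical.

```python
def img2img_denoising_steps(num_inference_steps: int, strength: float) -> int:
    """Number of denoising steps an img2img run performs for a given strength.

    The first part of the schedule is skipped (the input image stands in for it),
    so only the last `strength` fraction of the steps is executed.
    """
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - init_timestep, 0)
    return num_inference_steps - t_start


print(img2img_denoising_steps(50, 0.75))  # 37
```

So the example above runs 37 steps, not 50. Low strengths need a higher `num_inference_steps` to get the same number of actual denoising steps.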
Inpainting
Fill masked regions:
```python
import torch
from diffusers import AutoPipelineForInpainting
from PIL import Image

pipe = AutoPipelineForInpainting.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

# Image and mask must have the same size
image = Image.open("photo.jpg").convert("RGB").resize((512, 512))
mask = Image.open("mask.png").convert("L").resize((512, 512))  # White = inpaint, black = keep

result = pipe(
    prompt="A red car parked on the street",
    image=image,
    mask_image=mask,
    num_inference_steps=50,
).images[0]
```
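If you don't have a mask file, you can draw one with Pillow. The sketch below (the helper name `rectangle_mask` is hypothetical) builds a grayscale mask that is black everywhere except a white rectangle marking the region to inpaint.

```python
from PIL import Image, ImageDraw


def rectangle_mask(size, box):
    """Grayscale mask of `size` (width, height): black (keep) everywhere except
    a white (inpaint) rectangle `box` = (left, top, right, bottom)."""
    mask = Image.new("L", size, 0)
    ImageDraw.Draw(mask).rectangle(box, fill=255)
    return mask


mask = rectangle_mask((512, 512), (128, 256, 384, 448))
```

Pass the result as `mask_image=mask`. Its size should match the input image.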
ControlNet
Add spatial conditioning for precise control:
```python
import cv2  # requires opencv-python
import numpy as np
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image

# Load ControlNet for edge conditioning
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_canny",
    torch_dtype=torch.float16,
)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Use a Canny edge map of the input image as control
input_image = np.array(Image.open("input.jpg").convert("RGB"))
edges = cv2.Canny(input_image, 100, 200)  # single-channel edge map
control_image = Image.fromarray(np.stack([edges] * 3, axis=-1))  # 3-channel RGB

image = pipe(
    prompt="A beautiful house in the style of Van Gogh",
    image=control_image,
    num_inference_steps=30,
).images[0]
```
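How strongly the control image constrains the result is set by the pipeline's `controlnet_conditioning_scale` argument, which defaults to 1.0. Lower values give the prompt more freedom. The sketch below uses a hypothetical helper, `sweep_conditioning_scale`, to compare several scales. Its optional `generator_factory` is called once per image, so every run starts from the same seed.

```python
def sweep_conditioning_scale(pipe, prompt, control_image, scales=(0.5, 0.8, 1.0),
                             generator_factory=None, **kwargs):
    """Generate one image per ControlNet conditioning scale; returns {scale: image}.

    generator_factory, if given, is called before each run so every scale starts
    from freshly seeded noise. Extra keyword arguments are passed to each call.
    """
    results = {}
    for scale in scales:
        call_kwargs = dict(kwargs)
        if generator_factory is not None:
            call_kwargs["generator"] = generator_factory()
        output = pipe(
            prompt=prompt,
            image=control_image,
            controlnet_conditioning_scale=scale,
            **call_kwargs,
        )
        results[scale] = output.images[0]
    return results
```

Example: `sweep_conditioning_scale(pipe, "A beautiful house in the style of Van Gogh", control_image, generator_factory=lambda: torch.Generator(device="cuda").manual_seed(42), num_inference_steps=30)`.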
Available ControlNets
| ControlNet | Input Type | Use Case |
|---|---|---|
| canny | Edge maps | Preserve structure |
| openpose | Pose skeletons | Human poses |
| depth | Depth maps | 3D-aware generation |
| normal | Normal maps | Surface details |
| mlsd | Line segments | Architectural lines |
| scribble | Rough sketches | Sketch-to-image |
LoRA adapters
Load fine-tuned style adapters:
```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# Load LoRA weights
pipe.load_lora_weights("path/to/lora", weight_name="style.safetensors")

# Generate with LoRA style
image = pipe("A portrait in the trained style").images[0]

# Adjust LoRA strength by fusing it into the base weights at a chosen scale
pipe.fuse_lora(lora_scale=0.8)

# Undo the fusion, then unload the LoRA to restore the base model
pipe.unfuse_lora()
pipe.unload_lora_weights()
```
Multiple LoRAs
```python
# Load multiple LoRAs under distinct adapter names
pipe.load_lora_weights("lora1", adapter_name="style")
pipe.load_lora_weights("lora2", adapter_name="character")

# Set weights for each (multi-adapter support requires the peft package)
pipe.set_adapters(["style", "character"], adapter_weights=[0.7, 0.5])

image = pipe("A portrait").images[0]
```
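Conceptually, a LoRA adapter stores two small matrices per adapted layer:

- B, of shape out × rank;
- A, of shape rank × in.

Its effect is a low-rank update added to the frozen weight: W + weight · (alpha / rank) · B·A. Fusing bakes that update into W. Several adapters with `adapter_weights` roughly correspond to a weighted sum of their updates.

The pure-Python sketch below shows that arithmetic on small lists. It follows the original LoRA formulation; the helper names are hypothetical, and this is not the code Diffusers or PEFT actually run.

```python
def _matmul(a, b):
    """Plain-Python matrix product of a (m x r) and b (r x n)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def apply_lora(weight, adapters):
    """Return weight + sum(adapter_weight * (alpha / rank) * B @ A) over adapters.

    adapters: list of (B, A, alpha, adapter_weight) with B (out x rank), A (rank x in).
    The input weight is not modified.
    """
    result = [row[:] for row in weight]
    for B, A, alpha, adapter_weight in adapters:
        rank = len(A)
        scale = adapter_weight * alpha / rank
        for i, row in enumerate(_matmul(B, A)):
            for j, value in enumerate(row):
                result[i][j] += scale * value
    return result
```

Because B·A has rank at most `rank`, an adapter needs far fewer parameters than a full copy of W. This is what makes LoRA files small and cheap to swap.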
Memory optimization
Enable CPU offloading
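```python
# Model CPU offload: moves each whole model to the GPU only while it runs
# (don't combine with pipe.to("cuda"))
pipe.enable_model_cpu_offload()

# Sequential CPU offload: offloads submodules individually; saves more memory
# but is much slower (use one offload mode or the other, not both)
pipe.enable_sequential_cpu_offload()
```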
Attention slicing
```python
# Reduce memory by computing attention in slices
pipe.enable_attention_slicing()

# Or slice as aggressively as possible ("max"); an integer sets a specific slice size
pipe.enable_attention_slicing("max")
```
xFormers memory-efficient attention
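```python
# Requires the xformers package. With PyTorch 2.0+, Diffusers already uses
# scaled dot-product attention by default, so this is usually optional.
pipe.enable_xformers_memory_efficient_attention()
```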
VAE slicing for large images
```python
# VAE slicing: decode a batch of latents one image at a time
pipe.enable_vae_slicing()

# VAE tiling: decode large images in overlapping tiles
pipe.enable_vae_tiling()
```
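These options can be combined. The sketch below uses a hypothetical helper, `apply_memory_savings`, to bundle them into three presets. The grouping is an assumption for illustration, not a Diffusers feature. The helper only calls the pipeline methods shown in this section.

```python
def apply_memory_savings(pipe, level="medium"):
    """Enable a preset bundle of memory options (hypothetical presets).

    low:    attention slicing only (caller still moves the pipeline to the GPU)
    medium: model CPU offload + VAE slicing and tiling
    high:   sequential CPU offload + VAE slicing and tiling (lowest VRAM, slowest)
    For the offload presets, do not call pipe.to("cuda") afterwards.
    """
    if level == "low":
        pipe.enable_attention_slicing()
        return pipe
    if level == "medium":
        pipe.enable_model_cpu_offload()
    elif level == "high":
        pipe.enable_sequential_cpu_offload()
    else:
        raise ValueError(f"unknown level: {level!r}")
    pipe.enable_vae_slicing()
    pipe.enable_vae_tiling()
    return pipe
```

Start with "medium" and move to "high" only if you still run out of memory. Sequential offload can be several times slower.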
Model variants
Loading different precisions
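```python
import torch
from diffusers import DiffusionPipeline

# FP16 (recommended for GPU); variant="fp16" downloads half-precision
# weight files when the repository provides them
pipe = DiffusionPipeline.from_pretrained(
    "model-id",
    torch_dtype=torch.float16,
    variant="fp16",
)

# BF16: same memory as FP16, wider dynamic range but lower precision;
# best supported on Ampere or newer GPUs
pipe = DiffusionPipeline.from_pretrained(
    "model-id",
    torch_dtype=torch.bfloat16,
)
```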
Loading specific components
```python
import torch
from diffusers import AutoencoderKL, DiffusionPipeline

# Load custom VAE in the same dtype as the pipeline
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16)

# Use with pipeline
pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    vae=vae,
    torch_dtype=torch.float16,
)
```
Batch generation
Generate multiple images efficiently:
```python
# Multiple prompts
prompts = [
    "A cat playing piano",
    "A dog reading a book",
    "A bird painting a picture",
]
images = pipe(prompts, num_inference_steps=30).images

# Multiple images per prompt
images = pipe(
    "A beautiful sunset",
    num_images_per_prompt=4,
    num_inference_steps=30,
).images
```
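Peak memory grows with batch size, so a long prompt list is often better processed in chunks. The sketch below uses a hypothetical helper, `generate_in_batches`, that splits the list, passes the same generation arguments to every call, and returns the images in prompt order.

```python
def generate_in_batches(pipe, prompts, batch_size=4, **kwargs):
    """Run `prompts` through `pipe` in chunks of `batch_size` to cap peak memory.

    Extra keyword arguments (num_inference_steps, guidance_scale, ...) are
    passed unchanged to every call. Returns images in prompt order.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    images = []
    for start in range(0, len(prompts), batch_size):
        batch = prompts[start:start + batch_size]
        images.extend(pipe(batch, **kwargs).images)
    return images
```

Example: `generate_in_batches(pipe, prompts, batch_size=2, num_inference_steps=30)`. Pick the largest batch size that fits in memory, since larger batches use the GPU more efficiently.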
Common workflows
Workflow 1: High-quality generation
```python
from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler
import torch

# 1. Load SDXL with optimizations
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()  # handles GPU placement; no pipe.to("cuda") needed

# 2. Generate with quality settings
image = pipe(
    prompt="A majestic lion in the savanna, golden hour lighting, 8k, detailed fur",
    negative_prompt="blurry, low quality, cartoon, anime, sketch",
    num_inference_steps=30,
    guidance_scale=7.5,
    height=1024,
    width=1024,
).images[0]
image.save("lion.png")
```
Workflow 2: Fast prototyping
```python
from diffusers import AutoPipelineForText2Image, LCMScheduler
import torch

# Use LCM for 4-8 step generation
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# Load LCM LoRA for fast generation
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.fuse_lora()

# Generate in just 4 steps (wall-clock time depends on your GPU)
image = pipe(
    "A beautiful landscape",
    num_inference_steps=4,
    guidance_scale=1.0,
).images[0]
```
Common issues
CUDA out of memory:
```python
# Enable memory optimizations
pipe.enable_model_cpu_offload()
pipe.enable_attention_slicing()
pipe.enable_vae_slicing()

# Or use lower precision
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
```
Black/noise images:
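```python
# Check the VAE: SDXL's original VAE can overflow in float16 (NaNs, black images).
# Load an fp16-safe VAE and pass it to the pipeline's from_pretrained as vae=vae
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16)

# The SD 1.x safety checker returns black images for flagged outputs; bypass if needed
pipe.safety_checker = None

# Ensure proper dtype consistency
pipe = pipe.to(dtype=torch.float16)
```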
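When debugging, it helps to detect blank outputs automatically instead of inspecting files by eye. The sketch below uses a hypothetical helper, `is_blank`, built on Pillow's `getextrema`. It treats an image as blank when no channel rises above a threshold.

```python
from PIL import Image


def is_blank(image: Image.Image, threshold: int = 0) -> bool:
    """True if no channel's maximum pixel value exceeds `threshold`
    (threshold=0 means a completely black image)."""
    # Converting to RGB makes getextrema return one (min, max) pair per channel
    return all(band_max <= threshold for _, band_max in image.convert("RGB").getextrema())
```

For SD 1.x pipelines, the output object also carries `nsfw_content_detected`. This tells you whether the safety checker, rather than a precision problem, produced the black image.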
Slow generation:
```python
# Use a faster scheduler
from diffusers import DPMSolverMultistepScheduler

pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# Reduce steps
image = pipe(prompt, num_inference_steps=20).images[0]
```
References
- Advanced Usage - Custom pipelines, fine-tuning, deployment
- Troubleshooting - Common issues and solutions
Resources
- Documentation: https://huggingface.co/docs/diffusers
- Repository: https://github.com/huggingface/diffusers
- Discord: https://discord.gg/diffusers