Image Generation
This skill enables generating images using Google's Gemini API (specifically the "Nano Banana" models) or OpenRouter.
Models
Google Gemini API
-
Nano Banana (gemini-2.5-flash-image ): Designed for speed and efficiency. Default model.
-
Nano Banana Pro (gemini-3-pro-image-preview ): Built on Gemini 3, this is the most advanced image model. Key capabilities:
-
State-of-the-art text rendering in multiple languages
-
Real-world knowledge and deep reasoning for precise, detailed results
-
Advanced controls: up to 14 input images for composition
-
Studio-quality editing: lighting, camera settings, color grading
-
High-fidelity resolutions: 1K, 2K, and 4K
-
Brand consistency and character resemblance across edits
Prerequisites
You need an API key to use this skill. The recommended way is to create a .env file in this skill's directory (<your-skill-directory>/image-generation/ ), but placing it in the workspace root is also supported.
-
Create a .env file in the skill directory: touch <your-skill-directory>/image-generation/.env
-
Add your API key(s) to the .env file: GEMINI_API_KEY=AIzaSy...
OR
OPENROUTER_API_KEY=sk-or-v1-...
Usage
To generate an image, run the script located in this skill's directory.
Command
<your-skill-directory>/image-generation/scripts/generate_image.ts "Your prompt here"
Options
-
--output , -o : Specify output filename (default: image_TIMESTAMP.png )
-
--provider : Force openrouter or gemini . If not specified, auto-detects based on available API keys:
-
If only OPENROUTER_API_KEY is set: uses OpenRouter
-
If only GEMINI_API_KEY is set: uses Gemini
-
If both keys are set: defaults to Gemini
-
--model : Specify a custom model ID (default: google/gemini-3-pro-image-preview for OpenRouter, gemini-2.5-flash-image for Gemini)
-
--aspect-ratio : Aspect ratio (e.g., 1:1 , 16:9 , 4:3 , 3:4 , 5:4 , 9:16 ). Default: 1:1 .
-
--image-size : Resolution (1K , 2K , 4K ). Only supported by gemini-3-pro-image-preview .
-
--input-images : Paths to input images for editing or composition (supported by Gemini).
-
--api-key : Pass API key directly if not in env vars
Important: Do NOT specify --provider or --model unless explicitly requested by the user. The script auto-detects the provider based on available API keys and uses sensible defaults for the model. Important: Before specifying an output path with --output , ALWAYS check if a file already exists at that path. NEVER overwrite an existing image unless the user explicitly requests it. When in doubt, let the script generate a timestamped filename automatically.
Examples
Example 1: Basic Generation
User: "Generate an image of a futuristic city."
Action:
<your-skill-directory>/image-generation/scripts/generate_image.ts "A futuristic city with flying cars and neon lights"
Example 2: High Quality with 4K Resolution
User: "Create a detailed 4K infographic about photosynthesis"
Action:
<your-skill-directory>/image-generation/scripts/generate_image.ts "A vibrant infographic that explains photosynthesis" --model gemini-3-pro-image-preview --aspect-ratio 16:9 --image-size 4K
Example 3: Specific Aspect Ratio
User: "Draw a portrait of a cat in 3:4 ratio"
Action:
<your-skill-directory>/image-generation/scripts/generate_image.ts "A portrait of a fluffy cat" --aspect-ratio 3:4
Example 4: Image Editing (Add/Remove Elements)
User: "Add a wizard hat to this cat image"
Action:
<your-skill-directory>/image-generation/scripts/generate_image.ts "Add a small knitted wizard hat on the cat's head. Make it look natural." --input-images cat.png
Example 5: Style Transfer
User: "Make this city photo look like a Van Gogh painting"
Action:
<your-skill-directory>/image-generation/scripts/generate_image.ts "Transform this photograph into the artistic style of Vincent van Gogh's Starry Night. Preserve the composition but render with swirling brushstrokes and deep blues." --input-images city.png
Example 6: Combining Multiple Images
User: "Put this dress on this model"
Action:
<your-skill-directory>/image-generation/scripts/generate_image.ts "Create a professional fashion photo. The woman from the second image is wearing the dress from the first image." --input-images dress.png model.png
Image Editing Guide
Image editing (text-and-image-to-image) allows you to provide images alongside text prompts to add, remove, or modify elements, change style, or compose new scenes.
Input Image Limits
-
gemini-2.5-flash-image : Works best with up to 3 input images.
-
gemini-3-pro-image-preview : Supports up to 5 images with high fidelity, and up to 14 images total.
Editing Templates
- Adding and Removing Elements
Provide an image and describe your change. The model matches the original style, lighting, and perspective.
Template:
Using the provided image of [subject], please [add/remove/modify] [element] to/from the scene. Ensure the change [description of how it should integrate].
- Inpainting (Semantic Masking)
Conversationally define a "mask" to edit a specific part while leaving the rest untouched.
Template:
Using the provided image, change only the [specific element] to [new description]. Keep everything else exactly the same, preserving the original style, lighting, and composition.
- Style Transfer
Recreate an image's content in a different artistic style.
Template:
Transform the provided photograph of [subject] into the artistic style of [artist/style]. Preserve the original composition but render with [stylistic elements].
- Advanced Composition (Multiple Images)
Combine elements from multiple images into a new scene.
Template:
Create a new image by combining the elements from the provided images. Take the [element from image 1] and place it with/on the [element from image 2]. The final image should be [description].
- High-Fidelity Detail Preservation
To ensure critical details (face, logo) are preserved, describe them explicitly.
Template:
Using the provided images, place [element from image 2] onto [element from image 1]. Ensure that the features of [element] remain completely unchanged. The added element should [integration description].
- Sketch to Finished Image
Turn rough sketches into polished renders.
Template:
Turn this rough [medium] sketch of a [subject] into a [style description] photo. Keep the [specific features] from the sketch but add [new details].
- Character Consistency (360 View)
Generate different angles of a character by iteratively prompting.
Template:
A studio portrait of this [person/character] against [background], [pose/angle description].
Prompting Guide
Mastering image generation starts with one fundamental principle: Describe the scene, don't just list keywords. A narrative, descriptive paragraph will almost always produce a better, more coherent image than a list of disconnected words.
Important: Before crafting a prompt, first check image-generation/assets/ for existing curated prompt templates that match the user's request. Use or adapt these templates when available.
Establishing the Vision: Core Prompt Elements
To achieve the best results and have nuanced creative control, structure your prompts with these elements:
Element Description Example
Subject Who or what is in the image? Be specific. "A stoic robot barista with glowing blue optics" or "A fluffy calico cat wearing a tiny wizard hat"
Composition How is the shot framed? "Extreme close-up", "wide shot", "low angle shot", "portrait"
Action What is happening? "Brewing a cup of coffee", "casting a magical spell", "mid-stride running through a field"
Location Where does the scene take place? "A futuristic cafe on Mars", "a cluttered alchemist's library", "a sun-drenched meadow at golden hour"
Style What is the overall aesthetic? "3D animation", "film noir", "watercolor painting", "photorealistic", "1990s product photography"
Editing Instructions For modifying existing images, be direct and specific. "Change the man's tie to green", "remove the car in the background"
Refining the Details: Advanced Controls
While simple prompts work, achieving professional results requires more specific instructions:
-
Composition and aspect ratio: Define the canvas. (e.g., "A 9:16 vertical poster," "A cinematic 21:9 wide shot.")
-
Camera and lighting details: Direct the shot like a cinematographer. (e.g., "A low-angle shot with a shallow depth of field (f/1.8)," "Golden hour backlighting creating long shadows," "Cinematic color grading with muted teal tones.")
-
Specific text integration: Clearly state what text should appear and how. (e.g., "The headline 'URBAN EXPLORER' rendered in bold, white, sans-serif font at the top.")
-
Factual constraints (for diagrams): Specify accuracy needs and ensure inputs are factual. (e.g., "A scientifically accurate cross-section diagram," "Ensure historical accuracy for the Victorian era.")
-
Reference inputs: When using uploaded images, clearly define each role. (e.g., "Use Image A for the character's pose, Image B for the art style, and Image C for the background environment.")
- Photorealistic scenes
For realistic images, use photography terms. Mention camera angles, lens types, lighting, and fine details.
Template:
A photorealistic [shot type] of [subject], [action or expression], set in [environment]. The scene is illuminated by [lighting description], creating a [mood] atmosphere. Captured with a [camera/lens details], emphasizing [key textures and details].
- Stylized illustrations & stickers
To create stickers, icons, or assets, be explicit about the style and request a transparent background (if needed).
Template:
A [style] sticker of a [subject], featuring [key characteristics] and a [color palette]. The design should have [line style] and [shading style]. The background must be transparent.
- Accurate text in images
Gemini excels at rendering text in multiple languages. Be clear about the text, the font style (descriptively), and the overall design. Use Gemini 3 Pro Image Preview for professional asset production.
Template:
Create a [image type] for [brand/concept] with the text "[text to render]" in a [font style]. The design should be [style description], with a [color scheme].
Example (wordplay):
Create an image showing the phrase "How much wood would a woodchuck chuck if a woodchuck could chuck wood" made out of wood chucked by a woodchuck.
- Product mockups & commercial photography
Perfect for creating clean, professional product shots for ecommerce, advertising, or branding.
Template:
A high-resolution, studio-lit product photograph of a [product description] on a [background surface/description]. The lighting is a [lighting setup] to [lighting purpose]. The camera angle is a [angle type] to showcase [specific feature]. Ultra-realistic, with sharp focus on [key detail].
- Minimalist & negative space design
Excellent for creating backgrounds for websites, presentations, or marketing materials where text will be overlaid.
Template:
A minimalist composition featuring a single [subject] positioned in the [bottom-right/top-left/etc.] of the frame. The background is a vast, empty [color] canvas, creating significant negative space. Soft, subtle lighting.
- Sequential art (Comic panel / Storyboard)
Builds on character consistency and scene description to create panels for visual storytelling.
Template:
Make a 3 panel comic in a [style]. Put the character in a [type of scene].
Example:
Create a storyboard for this scene (with input image)
- Translation and Localization
Generate localized text, or translate text inside images for international markets.
Template:
Translate all the [source language] text on [object description] into [target language], while keeping everything else the same.
Example:
Translate all the English text on the three yellow and blue cans into Korean, while keeping everything else the same.
- Infographics and Diagrams
Create data-driven visuals with real-world knowledge. Always verify factual accuracy of generated diagrams.
Template:
Create an infographic that shows [topic/process]. Include [specific elements]. Use a [style] design with [color scheme].
Example:
Create an infographic that shows how to make elaichi chai.
- Studio-Quality Lighting and Camera Control
Directly influence lighting and camera settings for professional results.
Template:
[Scene description]. Use [lighting type] lighting. Camera settings: [angle], [focus], [color grading].
Examples:
Turn this scene into nighttime (with input image) Focus on the flowers (with input image)
- Brand Identity Systems
Render and apply designs with consistent brand styling. Create logo variations and mockups.
Template (Logo):
Create a smooth logo in a [style] based on [concept]. The letters should be [letter style description]. Use [color palette].
Template (Identity System):
Now create identity system one by one, use [number] high quality mockups with variety of relevant products, ads, billboards, bus stop, etc. Generate one at a time, 16:9 each.
- Image Blending and Character Consistency
Combine multiple images while maintaining character resemblance. Supports 6-14 input images depending on surface.
Template:
Combine these images into one appropriately arranged [style] image in [aspect ratio] format. [Specific instructions for each element].
Example:
Combine these images into one appropriately arranged cinematic image in 16:9 format and change the dress on the mannequin to the dress in the image.
Best Practices
-
Be Hyper-Specific: Instead of "fantasy armor," describe: "ornate elven plate armor, etched with silver leaf patterns, with a high collar and pauldrons shaped like falcon wings."
-
Provide Context and Intent: Explain the purpose. "Create a logo for a high-end, minimalist skincare brand" yields better results than just "Create a logo."
-
Iterate and Refine: Use follow-up prompts like "That's great, but make the lighting warmer" or "Keep everything the same, but change the expression."
-
Use Step-by-Step Instructions: For complex scenes, break into steps. "First, create a background of a misty forest. Then, add a stone altar. Finally, place a glowing sword on the altar."
-
Use Semantic Negative Prompts: Instead of "no cars," describe positively: "an empty, deserted street with no signs of traffic."
-
Control the Camera: Use photography terms like wide-angle shot , macro shot , low-angle perspective , 85mm portrait lens , shallow depth of field (f/1.8) .
-
Image Order Matters: When using input images with text, the text prompt typically comes after the images.
-
Leverage Real-World Knowledge: Nano Banana Pro uses Gemini 3's reasoning capabilities—ask for historically accurate, scientifically precise, or culturally specific content.
-
Experiment with Aspect Ratios: Different ratios suit different purposes. Generate crisp visuals at 1K, 2K, or 4K across 9:16 (vertical poster), 16:9 (cinematic), 1:1 (social), etc.
-
Define Reference Image Roles: When using multiple images, explicitly state each image's purpose (e.g., "Image A for pose, Image B for style, Image C for background").
Limitations
General Limits
-
Languages: Best performance in EN, ar-EG, de-DE, es-MX, fr-FR, hi-IN, id-ID, it-IT, ja-JP, ko-KR, pt-BR, ru-RU, ua-UA, vi-VN, zh-CN.
-
No Audio/Video Input: Image generation does not support audio or video inputs.
-
Image Count: May not always follow the exact number of requested output images.
-
Input Images: gemini-2.5-flash-image works best with up to 3 images; gemini-3-pro-image-preview supports 5 high-fidelity images (up to 14 total).
-
Watermarks: All generated images include a SynthID watermark.
Known Quality Limitations
These are areas where the model may produce imperfect results:
-
Visual and text fidelity: Rendering small text, fine details, and producing accurate spellings may not work perfectly. For best results, generate text first, then ask for an image containing that text.
-
Data and factual accuracy: Always verify the factual accuracy of data-driven visuals like diagrams and infographics.
-
Translation and localization: Multilingual text generation may make grammar mistakes or miss specific cultural nuances.
-
Complex edits and image blending: Advanced editing tasks like blending or lighting changes can sometimes produce unnatural artifacts.
-
Character features: While usually reliable, character consistency across edits may vary.
Troubleshooting
-
No API key found: Ensure you have a .env file in your workspace root or skill directory with GEMINI_API_KEY .
-
Model errors: Ensure you are using the correct model ID (gemini-2.5-flash-image or gemini-3-pro-image-preview ).
-
Resolution errors: 1K , 2K , 4K must be uppercase. Only supported on Gemini 3 Pro.