ComfyUI Text to Image: Complete Guide & Troubleshooting
Step-by-step guide to generating AI images from text prompts in ComfyUI — with prompt tips, parameter tuning, and fixes for common issues.
What is Text to Image?
Text to Image is the most fundamental AI art workflow — you describe what you want in words, and the AI model generates a matching image. In ComfyUI, this is built as a node graph where each node handles one step of the generation pipeline.
The process involves three core elements:
- A generation model — the neural network that creates the image (e.g. Stable Diffusion 1.5)
- Latent space — the compressed mathematical space where the image gradually takes shape
- Prompts — your text descriptions split into positive (desired elements) and negative (things to avoid)
Prerequisites
Before starting, make sure you have:
- ComfyUI installed and running (installation guide)
- At least one checkpoint model in your ComfyUI/models/checkpoints folder
For this tutorial we'll use the SD1.5 model. You can download v1-5-pruned-emaonly-fp16.safetensors from HuggingFace.
If you installed ComfyUI Desktop, you can download models directly through the interface without manual file management.
Building the Workflow
The default text-to-image workflow uses six types of nodes (with the CLIP Text Encode appearing twice — once for the positive prompt and once for the negative). Here's what each one does:
Load Checkpoint
Loads your AI model. A checkpoint typically bundles three components:
| Component | Role |
|---|---|
| MODEL (UNet) | Predicts and removes noise during the diffusion process |
| CLIP | Converts your text prompts into numerical vectors the model understands |
| VAE | Translates between latent space (where the model works) and pixel space (what you see) |
Empty Latent Image
Sets the canvas size. This node creates an empty latent tensor, which the KSampler later fills with seeded random noise as the starting point for generation. The width and height here determine your final image dimensions.
For SD1.5, stick to 512×512 for best results. The model was trained at this resolution.
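To make the relationship between canvas size and latent size concrete, here is a minimal sketch. It assumes the usual SD1.5 layout of 4 latent channels and an 8x spatial downscale by the VAE; the helper name `latent_shape` is hypothetical and not part of ComfyUI.

```python
def latent_shape(width: int, height: int, batch_size: int = 1) -> tuple:
    """Return (batch, channels, latent_height, latent_width) for an SD1.5-style latent.

    Assumes 4 latent channels and an 8x downscale in each spatial dimension.
    """
    if width % 8 or height % 8:
        # This sketch rejects sizes that do not divide evenly by 8.
        raise ValueError("width and height should be multiples of 8")
    return (batch_size, 4, height // 8, width // 8)


print(latent_shape(512, 512))  # (1, 4, 64, 64)
```

A 512×512 canvas therefore corresponds to a 64×64 latent grid, which is what the sampler actually works on.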
CLIP Text Encode (x2)
You need two of these — one for your positive prompt (what you want) and one for your negative prompt (what to avoid). The CLIP encoder converts your text into semantic vectors that guide the denoising process.
KSampler
This is the heart of the workflow. It takes the latent, the model, and your prompt conditioning, adds noise derived from the seed, then iteratively denoises the image over multiple steps.
Key parameters:
| Parameter | What it controls |
|---|---|
| seed | Randomization — same seed + same settings = same image |
| steps | Number of denoising iterations. More steps can add detail but slow generation, with diminishing returns at higher counts |
| cfg | How strictly the model follows your prompt. Too low = ignores prompt. Too high = artifacts |
| denoise | Noise strength. Keep at 1.0 for text-to-image (full generation from noise) |
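The cfg value can be read through the standard classifier-free guidance formula, sketched below on plain Python lists. This illustrates the idea rather than ComfyUI's actual code: `apply_cfg` is a hypothetical helper, and the two inputs stand in for the model's noise predictions under the positive and negative prompts.

```python
def apply_cfg(cond, uncond, cfg):
    """Blend two noise predictions using classifier-free guidance.

    cond   -- prediction conditioned on the positive prompt
    uncond -- prediction conditioned on the negative (or empty) prompt
    cfg    -- guidance scale; 1.0 returns the positive prediction unchanged
    """
    return [u + cfg * (c - u) for c, u in zip(cond, uncond)]


# Higher cfg pushes the result further away from the negative prediction.
print(apply_cfg([1.0, 2.0], [0.0, 1.0], 2.0))  # [2.0, 3.0]
```

Large scales exaggerate the difference between the two predictions, which is one way to understand why very high cfg values tend to produce artifacts.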
VAE Decode
Converts the denoised latent back into a viewable image.
Save Image
Displays and saves your result to the ComfyUI/output folder.
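The node graph described above can also be written down as data. The sketch below builds a Python dict in the shape of ComfyUI's API-format workflow, as far as I understand that format: each node has a class_type and inputs, and links are written as [source_node_id, output_index]. The function name `build_workflow`, the node ids, and the sampler and scheduler choices are illustrative assumptions.

```python
import json


def build_workflow(positive, negative,
                   ckpt_name="v1-5-pruned-emaonly-fp16.safetensors",
                   seed=0, steps=20, cfg=8.0, width=512, height=512):
    """Return a text-to-image graph as a dict (node ids are arbitrary strings)."""
    return {
        # Load Checkpoint: outputs are MODEL (0), CLIP (1), VAE (2)
        "4": {"class_type": "CheckpointLoaderSimple",
              "inputs": {"ckpt_name": ckpt_name}},
        # Empty Latent Image: sets the canvas size
        "5": {"class_type": "EmptyLatentImage",
              "inputs": {"width": width, "height": height, "batch_size": 1}},
        # CLIP Text Encode, once per prompt
        "6": {"class_type": "CLIPTextEncode",
              "inputs": {"text": positive, "clip": ["4", 1]}},
        "7": {"class_type": "CLIPTextEncode",
              "inputs": {"text": negative, "clip": ["4", 1]}},
        # KSampler: denoise stays at 1.0 for text-to-image
        "3": {"class_type": "KSampler",
              "inputs": {"seed": seed, "steps": steps, "cfg": cfg,
                         "sampler_name": "euler", "scheduler": "normal",
                         "denoise": 1.0, "model": ["4", 0],
                         "positive": ["6", 0], "negative": ["7", 0],
                         "latent_image": ["5", 0]}},
        # VAE Decode: latent back to pixels
        "8": {"class_type": "VAEDecode",
              "inputs": {"samples": ["3", 0], "vae": ["4", 2]}},
        # Save Image
        "9": {"class_type": "SaveImage",
              "inputs": {"filename_prefix": "ComfyUI", "images": ["8", 0]}},
    }


workflow = build_workflow("golden sunset over calm ocean", "low quality, blurry")
print(json.dumps(workflow["3"], indent=2))
```

Reading the links makes the data flow explicit: the checkpoint feeds the sampler, both text encoders, and the VAE decoder, while the sampler's output is the only input the decoder needs besides the VAE.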
Writing Effective Prompts
Good prompts make a huge difference in output quality. Here are practical tips for SD1.5:
Do:
- Write in English for best results
- Use comma-separated phrases, not full sentences
- Be specific: "golden sunset over calm ocean" beats "nice landscape"
- Add quality boosters: masterpiece, best quality, highly detailed
- Use weights for emphasis: (golden hour:1.2) makes that concept stronger
Don't:
- Write long paragraphs — the model responds better to concise keywords
- Forget negative prompts — they're essential for avoiding common artifacts
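To show how the comma-separated phrases and the (phrase:weight) syntax fit together, here is a deliberately simplified parser. It is not ComfyUI's own tokenizer, which handles more cases such as nested parentheses. The name `parse_prompt` is hypothetical, and the sketch assumes each weighted phrase contains no commas.

```python
import re

# Matches a whole phrase of the form "(some text:1.2)"
_WEIGHTED = re.compile(r"^\((.+):([0-9]*\.?[0-9]+)\)$")


def parse_prompt(prompt: str) -> list:
    """Split a prompt into (phrase, weight) pairs; unweighted phrases get 1.0."""
    parts = []
    for raw in prompt.split(","):
        phrase = raw.strip()
        if not phrase:
            continue  # skip empty pieces left by trailing commas
        match = _WEIGHTED.match(phrase)
        if match:
            parts.append((match.group(1).strip(), float(match.group(2))))
        else:
            parts.append((phrase, 1.0))
    return parts


print(parse_prompt("(golden hour:1.2), soft cinematic lighting"))
# [('golden hour', 1.2), ('soft cinematic lighting', 1.0)]
```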
Example: Anime Style
Positive:
anime style, 1girl, long pink hair, cherry blossom background,
soft lighting, intricate details, masterpiece, best quality
Negative:
low quality, blurry, deformed hands, extra fingers
Example: Photorealistic Portrait
Positive:
(ultra realistic portrait:1.3), elegant woman,
soft cinematic lighting, (golden hour:1.2),
shallow depth of field, (skin texture:1.3),
warm color grading
Negative:
deformed, cartoon, anime, plastic skin, overexposed,
blurry, extra fingers
How It Works Under the Hood
Text-to-image is a reverse diffusion process:
- Start with pure random noise in latent space
- The model predicts what noise to remove at each step
- Your text prompts (encoded as vectors) steer the denoising direction
- After all steps complete, the VAE decodes the result into pixels
The latent space is a compressed mathematical representation — much smaller than the actual image. This is why diffusion models can run on consumer hardware. Think of it like working with a sketch (latent) before painting the final piece (pixels).
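The size difference is easy to quantify. The arithmetic below assumes an RGB image (3 values per pixel) and an SD1.5-style latent with 4 channels at one eighth of the width and height; `element_counts` is a hypothetical helper used only for this illustration.

```python
def element_counts(width: int, height: int) -> tuple:
    """Return (pixel_values, latent_values) for one image of the given size."""
    pixel_values = width * height * 3                  # RGB pixels
    latent_values = (width // 8) * (height // 8) * 4   # 4-channel latent grid
    return pixel_values, latent_values


pixels, latents = element_counts(512, 512)
print(pixels, latents, pixels // latents)  # 786432 16384 48
```

Under these assumptions the sampler operates on roughly 48 times fewer values than the final image contains.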
About SD1.5
Stable Diffusion 1.5 is one of the most widely used open-source image generation models:
- Size: ~4 GB for the full-precision checkpoint (the fp16 file used in this tutorial is roughly half that); runs on GPUs with 6 GB+ VRAM
- Sweet spot: 512×512 resolution
- Ecosystem: Massive library of LoRAs, ControlNets, and community fine-tunes
- Trade-offs: Can struggle with hands, complex lighting, and resolutions above 512px
Despite newer models like SDXL and Flux, SD1.5 remains an excellent starting point for learning ComfyUI because of its speed and hardware accessibility.
Common Issues and Fixes
Output is blurry or low quality
- Increase steps — try 25–30 instead of the default 20
- Raise cfg — try 7–9 for sharper prompt adherence
- Add quality keywords — masterpiece, best quality, highly detailed, 4k in your positive prompt
- Check resolution — SD1.5 works best at 512×512. Going higher without upscaling often degrades quality
Hands and fingers look deformed
This is a known limitation of SD1.5. Mitigations:
- Add deformed hands, extra fingers, bad anatomy to your negative prompt
- Use a hand-fixing LoRA (e.g. "detail tweaker" or "hand fix" LoRAs from Civitai)
- Generate at 512×512 and upscale afterward
Output ignores my prompt
- cfg too low — increase to 7–12 for stronger prompt following
- Too many concepts — simplify your prompt. Fewer, more specific keywords work better than long descriptions
- Wrong model — some checkpoints are fine-tuned for specific styles. Anime models won't produce photorealism well
"Load Checkpoint" shows null or empty
- Verify your .safetensors file is in ComfyUI/models/checkpoints/
- Refresh ComfyUI (F5) or restart it after adding new models
- Check the file isn't corrupted (incomplete download)
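The first and last checks above can be scripted. The sketch below lists model files in a folder and flags any that look too small to be a complete download. The function name, the suffix list, and the 1 MB threshold are assumptions for illustration; adjust the path to match your install.

```python
from pathlib import Path

# Assumed set of checkpoint extensions for this sketch
CHECKPOINT_SUFFIXES = {".safetensors", ".ckpt"}


def list_checkpoints(folder, min_bytes=1_000_000):
    """Return (name, size_in_bytes, looks_truncated) for each checkpoint file."""
    folder = Path(folder)
    if not folder.is_dir():
        return []  # folder missing: nothing for Load Checkpoint to show
    report = []
    for path in sorted(folder.iterdir()):
        if path.is_file() and path.suffix.lower() in CHECKPOINT_SUFFIXES:
            size = path.stat().st_size
            report.append((path.name, size, size < min_bytes))
    return report


if __name__ == "__main__":
    for name, size, truncated in list_checkpoints("ComfyUI/models/checkpoints"):
        note = "  <- suspiciously small, re-download?" if truncated else ""
        print(f"{name}: {size / 1e9:.2f} GB{note}")
```

An empty result means the folder is missing or holds no recognized model files, which matches the empty dropdown symptom.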
Generation is very slow
- VRAM insufficient — try launching with the --lowvram flag
- Too many steps — 20 steps is fine for quick iterations; use 30+ only for final renders
- Large resolution — generate at 512×512 and upscale instead of generating at 1024×1024
Next Steps
Now that you can generate images from text, explore these workflows:
- Image to Image — Use a reference image to guide generation
- LoRA Guide — Fine-tune your outputs with lightweight model adapters
- Upscale Guide — Increase resolution with AI upscaling