ComfyUI Text to Image: Complete Guide & Troubleshooting

Step-by-step guide to generating AI images from text prompts in ComfyUI — with prompt tips, parameter tuning, and fixes for common issues.

What is Text to Image?

Text to Image is the most fundamental AI art workflow — you describe what you want in words, and the AI model generates a matching image. In ComfyUI, this is built as a node graph where each node handles one step of the generation pipeline.

The process involves three core elements:

  • A generation model — the neural network that creates the image (e.g. Stable Diffusion 1.5)
  • Latent space — the compressed mathematical space where the image gradually takes shape
  • Prompts — your text descriptions split into positive (desired elements) and negative (things to avoid)

Prerequisites

Before starting, make sure you have:

  1. ComfyUI installed and running (installation guide)
  2. At least one checkpoint model in your ComfyUI/models/checkpoints folder

For this tutorial we'll use the SD1.5 model. You can download v1-5-pruned-emaonly-fp16.safetensors from HuggingFace.

If you installed ComfyUI Desktop, you can download models directly through the interface without manual file management.

Building the Workflow

The default text-to-image workflow uses six types of nodes and seven nodes in total, because CLIP Text Encode appears twice: once for the positive prompt and once for the negative. Here's what each one does:

Load Checkpoint

Loads your AI model. A checkpoint typically bundles three components:

Component     Role
MODEL (UNet)  Predicts and removes noise during the diffusion process
CLIP          Converts your text prompts into numerical vectors the model understands
VAE           Translates between latent space (where the model works) and pixel space (what you see)

Empty Latent Image

Sets the canvas size. This node creates an empty latent tensor at the dimensions you choose; the KSampler then fills it with noise derived from the seed, and that noise is the starting point for generation. The width and height here determine your final image dimensions.

For SD1.5, stick to 512×512 for best results. The model was trained at this resolution.

CLIP Text Encode (x2)

You need two of these — one for your positive prompt (what you want) and one for your negative prompt (what to avoid). The CLIP encoder converts your text into semantic vectors that guide the denoising process.

KSampler

This is the heart of the workflow. It takes the noisy latent, the model, and your prompt conditions, then iteratively denoises the image over multiple steps.

Key parameters:

Parameter  What it controls
seed       Randomization. Same seed + same settings = same image
steps      Number of denoising iterations. More steps = finer detail, slower generation
cfg        How strictly the model follows your prompt. Too low = ignores prompt. Too high = artifacts
denoise    Noise strength. Keep at 1.0 for text-to-image (full generation from noise)
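
To make the cfg parameter more concrete, here is a minimal sketch of the classifier-free guidance formula that cfg usually refers to. It is not ComfyUI's sampler code: the function name apply_cfg and the three-value "predictions" are made up for illustration, and real samplers apply this to full latent tensors at every step.

```python
def apply_cfg(uncond: list[float], cond: list[float], cfg: float) -> list[float]:
    """Blend two noise predictions with classifier-free guidance.

    uncond: prediction for the negative (or empty) prompt
    cond:   prediction for the positive prompt
    cfg:    guidance scale; 1.0 returns the positive prediction unchanged
    """
    return [u + cfg * (c - u) for u, c in zip(uncond, cond)]


# Hypothetical numbers standing in for a tiny latent.
negative_pred = [0.10, 0.20, 0.30]
positive_pred = [0.20, 0.10, 0.30]

for scale in (1.0, 7.0, 15.0):
    # Larger scales push the result further away from the negative prediction.
    print(scale, apply_cfg(negative_pred, positive_pred, scale))
```

Under this formula a high cfg exaggerates the difference between the two predictions, which is one way to see why very high values tend to produce artifacts.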

VAE Decode

Converts the denoised latent back into a viewable image.

Save Image

Displays and saves your result to the ComfyUI/output folder.
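
The node graph described above can also be written down as data. The sketch below builds the same seven-node graph as a Python dict in the JSON "API format" that ComfyUI accepts over HTTP. The class_type names and input keys are assumed to match ComfyUI's built-in nodes; the node ids, default values, sampler_name and scheduler choices, filename_prefix, and server address are illustrative assumptions, so compare the result with a workflow exported from your own install before relying on it.

```python
import json
import urllib.request


def build_txt2img_workflow(ckpt_name, positive, negative,
                           seed=42, steps=20, cfg=8.0, width=512, height=512):
    """Return a text-to-image graph; links are [source_node_id, output_index]."""
    return {
        # Load Checkpoint: outputs MODEL (0), CLIP (1), VAE (2)
        "1": {"class_type": "CheckpointLoaderSimple",
              "inputs": {"ckpt_name": ckpt_name}},
        # Empty Latent Image: sets the canvas size
        "2": {"class_type": "EmptyLatentImage",
              "inputs": {"width": width, "height": height, "batch_size": 1}},
        # CLIP Text Encode (positive and negative)
        "3": {"class_type": "CLIPTextEncode",
              "inputs": {"text": positive, "clip": ["1", 1]}},
        "4": {"class_type": "CLIPTextEncode",
              "inputs": {"text": negative, "clip": ["1", 1]}},
        # KSampler: denoise=1.0 means full generation from noise
        "5": {"class_type": "KSampler",
              "inputs": {"model": ["1", 0], "positive": ["3", 0],
                         "negative": ["4", 0], "latent_image": ["2", 0],
                         "seed": seed, "steps": steps, "cfg": cfg,
                         "sampler_name": "euler", "scheduler": "normal",
                         "denoise": 1.0}},
        # VAE Decode: latent back to pixels
        "6": {"class_type": "VAEDecode",
              "inputs": {"samples": ["5", 0], "vae": ["1", 2]}},
        # Save Image: writes into the output folder
        "7": {"class_type": "SaveImage",
              "inputs": {"images": ["6", 0], "filename_prefix": "txt2img"}},
    }


def queue_prompt(workflow, server="http://127.0.0.1:8188"):
    """Send the graph to a running ComfyUI instance (address is an assumption)."""
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    request = urllib.request.Request(
        f"{server}/prompt", data=body,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

Calling build_txt2img_workflow("v1-5-pruned-emaonly-fp16.safetensors", "a cat", "blurry") only builds the dict; queue_prompt needs ComfyUI to be running and reachable at the given address.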

Writing Effective Prompts

Good prompts make a huge difference in output quality. Here are practical tips for SD1.5:

Do:

  • Write in English for best results
  • Use comma-separated phrases, not full sentences
  • Be specific: "golden sunset over calm ocean" beats "nice landscape"
  • Add quality boosters: masterpiece, best quality, highly detailed
  • Use weights for emphasis: (golden hour:1.2) makes that concept stronger

Don't:

  • Write long paragraphs — the model responds better to concise keywords
  • Forget negative prompts — they're essential for avoiding common artifacts

Example: Anime Style

Positive:

anime style, 1girl, long pink hair, cherry blossom background,
soft lighting, intricate details, masterpiece, best quality

Negative:

low quality, blurry, deformed hands, extra fingers

Example: Photorealistic Portrait

Positive:

(ultra realistic portrait:1.3), elegant woman,
soft cinematic lighting, (golden hour:1.2),
shallow depth of field, (skin texture:1.3),
warm color grading

Negative:

deformed, cartoon, anime, plastic skin, overexposed,
blurry, extra fingers
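
To see how the (phrase:weight) syntax in these examples breaks down, the sketch below splits a prompt into phrases and their weights. It is a deliberately simplified illustration, not ComfyUI's actual prompt parser: the function name parse_weights is hypothetical, and nested or escaped parentheses are not handled.

```python
import re

# Matches a whole chunk of the form "(some phrase:1.2)".
WEIGHTED = re.compile(r"\(([^():]+):([0-9]*\.?[0-9]+)\)")


def parse_weights(prompt: str) -> list[tuple[str, float]]:
    """Split a comma-separated prompt into (phrase, weight) pairs.

    Phrases without an explicit weight get the neutral weight 1.0.
    """
    pairs = []
    for chunk in prompt.replace("\n", " ").split(","):
        chunk = chunk.strip()
        if not chunk:
            continue
        match = WEIGHTED.fullmatch(chunk)
        if match:
            pairs.append((match.group(1).strip(), float(match.group(2))))
        else:
            pairs.append((chunk, 1.0))
    return pairs


example = """(ultra realistic portrait:1.3), elegant woman,
soft cinematic lighting, (golden hour:1.2)"""

for phrase, weight in parse_weights(example):
    print(f"{weight:.1f}  {phrase}")
```

Listing a prompt this way makes it easier to spot when too many phrases carry raised weights, which dilutes the emphasis each one gets.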

How It Works Under the Hood

Text-to-image is a reverse diffusion process:

  1. Start with pure random noise in latent space
  2. The model predicts what noise to remove at each step
  3. Your text prompts (encoded as vectors) steer the denoising direction
  4. After all steps complete, the VAE decodes the result into pixels
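
The loop structure behind these four steps can be sketched with a toy example. This is not a real diffusion model: an "oracle" that already knows the target stands in for the neural network and the prompt, and the names make_noise and toy_denoise are hypothetical. The point is only to show seeded noise being refined step by step.

```python
import random


def make_noise(seed: int, size: int) -> list[float]:
    """Seeded Gaussian noise: the same seed always gives the same values."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


def toy_denoise(seed: int, target: list[float], steps: int) -> list[float]:
    """Start from noise and remove a share of the predicted noise each step."""
    x = make_noise(seed, len(target))            # step 1: pure random noise
    for i in range(steps):
        remaining = steps - i
        # Step 2: "predict" the noise. A real model estimates this from x
        # and the prompt vectors (step 3); here the target is known directly.
        predicted_noise = [xi - ti for xi, ti in zip(x, target)]
        # Remove an equal share of the remaining noise.
        x = [xi - ni / remaining for xi, ni in zip(x, predicted_noise)]
    return x                                     # step 4 would decode this


target = [0.5, -0.25, 1.0]
print(toy_denoise(seed=42, target=target, steps=20))
```

Because the noise depends only on the seed, running this twice with the same seed and settings gives the same result, which mirrors the seed behavior of the KSampler.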

The latent space is a compressed mathematical representation — much smaller than the actual image. This is why diffusion models can run on consumer hardware. Think of it like working with a sketch (latent) before painting the final piece (pixels).
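
How much smaller the latent is can be worked out with a little arithmetic. The sketch below assumes the SD1.5 layout of 4 latent channels with each side downscaled by a factor of 8; other model families may use different values, and the function names are illustrative.

```python
def latent_shape(width: int, height: int, batch_size: int = 1,
                 channels: int = 4, downscale: int = 8) -> tuple[int, int, int, int]:
    """Shape of the latent tensor as (batch, channels, height, width)."""
    return (batch_size, channels, height // downscale, width // downscale)


def compression_ratio(width: int, height: int) -> float:
    """RGB pixel values divided by latent values for a single image."""
    _, channels, lat_h, lat_w = latent_shape(width, height)
    return (width * height * 3) / (channels * lat_h * lat_w)


print(latent_shape(512, 512))       # (1, 4, 64, 64)
print(compression_ratio(512, 512))  # 48.0
```

Under these assumptions, the sampler works on roughly 48 times fewer values than the final 512×512 RGB image contains.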

About SD1.5

Stable Diffusion 1.5 is one of the most widely used open-source image generation models:

  • Size: ~4 GB for the full-precision checkpoint (the fp16 file used in this tutorial is roughly half that); runs on GPUs with 6 GB+ VRAM
  • Sweet spot: 512×512 resolution
  • Ecosystem: Massive library of LoRAs, ControlNets, and community fine-tunes
  • Trade-offs: Can struggle with hands, complex lighting, and resolutions above 512px

Despite newer models like SDXL and Flux, SD1.5 remains an excellent starting point for learning ComfyUI because of its speed and hardware accessibility.

Common Issues and Fixes

Output is blurry or low quality

  • Increase steps — try 25–30 instead of the default 20
  • Raise cfg — try 7–9 for sharper prompt adherence
  • Add quality keywords — masterpiece, best quality, highly detailed, 4k in your positive prompt
  • Check resolution — SD1.5 works best at 512×512. Going higher without upscaling often degrades quality

Hands and fingers look deformed

This is a known limitation of SD1.5. Mitigations:

  • Add deformed hands, extra fingers, bad anatomy to your negative prompt
  • Use a hand-fixing LoRA (e.g. "detail tweaker" or "hand fix" LoRAs from Civitai)
  • Generate at 512×512 and upscale afterward

Output ignores my prompt

  • cfg too low — increase to 7–12 for stronger prompt following
  • Too many concepts — simplify your prompt. Fewer, more specific keywords work better than long descriptions
  • Wrong model — some checkpoints are fine-tuned for specific styles. Anime models won't produce photorealism well

"Load Checkpoint" shows null or empty

  • Verify your .safetensors file is in ComfyUI/models/checkpoints/
  • Refresh ComfyUI (F5) or restart it after adding new models
  • Check the file isn't corrupted (incomplete download)
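
The first and third checks can be done with a short script. This is a minimal sketch that assumes the folder layout named above (ComfyUI/models/checkpoints); the function name list_checkpoints is hypothetical, and it only looks at the top level of the folder.

```python
from pathlib import Path

# File extensions treated as checkpoints in this sketch.
MODEL_SUFFIXES = {".safetensors", ".ckpt"}


def list_checkpoints(comfy_root: str) -> list[tuple[str, float]]:
    """Return (file name, size in GiB) for each checkpoint file found."""
    folder = Path(comfy_root) / "models" / "checkpoints"
    if not folder.is_dir():
        return []  # folder missing: wrong path or nothing installed yet
    found = []
    for path in sorted(folder.iterdir()):
        if path.is_file() and path.suffix.lower() in MODEL_SUFFIXES:
            found.append((path.name, path.stat().st_size / 1024**3))
    return found


# "ComfyUI" is a placeholder; point this at your own install directory.
for name, size_gib in list_checkpoints("ComfyUI"):
    print(f"{size_gib:6.2f} GiB  {name}")
```

An empty result means the file is not where ComfyUI expects it. A file that is noticeably smaller than the size shown on its download page suggests an incomplete download.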

Generation is very slow

  • Insufficient VRAM: try launching ComfyUI with the --lowvram flag
  • Too many steps: 20 steps is fine for quick iterations; use 30+ only for final renders
  • Large resolution: generate at 512×512 and upscale instead of generating at 1024×1024

Next Steps

Now that you can generate images from text, explore these workflows:

  • Image to Image — Use a reference image to guide generation
  • LoRA Guide — Fine-tune your outputs with lightweight model adapters
  • Upscale Guide — Increase resolution with AI upscaling
