Text-to-Image
Generates images from input text. These models can be used to generate and modify images based on text prompts.
Example input: "A city above clouds, pastel colors, Victorian style"
About Text-to-Image
Use Cases
Data Generation
Businesses can generate data for their use cases by inputting text and getting image outputs.
Immersive Conversational Chatbots
Chatbots can be made more immersive if they provide contextual images based on the user's input.
Creative Ideas for Fashion Industry
Different patterns can be generated to obtain unique pieces of fashion. Text-to-image models make it easier for designers to conceptualize their designs before actually implementing them.
Architecture Industry
Architects can use these models to visualize an environment based on the requirements of a floor plan. This can also include the furniture that has to be placed in that environment.
Task Variants
You can contribute variants of this task here.
Inference
You can use diffusers pipelines to run inference with text-to-image models.
import torch
from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler

model_id = "stabilityai/stable-diffusion-2"

# Use the Euler scheduler instead of the pipeline's default scheduler
scheduler = EulerDiscreteScheduler.from_pretrained(model_id, subfolder="scheduler")
pipe = StableDiffusionPipeline.from_pretrained(model_id, scheduler=scheduler, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]
image.save("astronaut_rides_horse.png")
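If you would rather not pick a pipeline class yourself, recent versions of diffusers also provide AutoPipelineForText2Image, which resolves the appropriate pipeline from the checkpoint. A minimal sketch, reusing the model above:

import torch
from diffusers import AutoPipelineForText2Image

# AutoPipelineForText2Image picks the right pipeline class for the checkpoint
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-2", torch_dtype=torch.float16
).to("cuda")

image = pipe("a photo of an astronaut riding a horse on mars").images[0]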
You can use huggingface.js to run inference with text-to-image models on the Hugging Face Hub.
import { HfInference } from "@huggingface/inference";

const inference = new HfInference(HF_TOKEN);

await inference.textToImage({
  model: "stabilityai/stable-diffusion-2",
  inputs: "award winning high resolution photo of a giant tortoise/((ladybird)) hybrid, [trending on artstation]",
  parameters: {
    negative_prompt: "blurry",
  },
});
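The same hosted inference can also be reached from Python through huggingface_hub's InferenceClient. A minimal sketch (the prompt and model ID below are only examples):

from huggingface_hub import InferenceClient

# Uses the access token configured in your environment (e.g. via `huggingface-cli login`)
client = InferenceClient()

# text_to_image returns a PIL.Image.Image
image = client.text_to_image(
    "award winning high resolution photo of a giant tortoise",
    model="stabilityai/stable-diffusion-2",
    negative_prompt="blurry",
)
image.save("tortoise.png")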
Useful Resources
Model Inference
- Hugging Face Diffusion Models Course
- Getting Started with Diffusers
- Text-to-Image Generation
- Using Stable Diffusion with Core ML on Apple Silicon
- A guide on Vector Quantized Diffusion
- 🧨 Stable Diffusion in JAX/Flax
- Running IF with 🧨 diffusers on a Free Tier Google Colab
- Introducing Würstchen: Fast Diffusion for Image Generation
- Efficient Controllable Generation for SDXL with T2I-Adapters
- Welcome aMUSEd: Efficient Text-to-Image Generation
Model Fine-tuning
- Finetune Stable Diffusion Models with DDPO via TRL
- LoRA training scripts of the world, unite!
- Using LoRA for Efficient Stable Diffusion Fine-Tuning
This page was made possible thanks to the efforts of Ishan Dutta, Enrique Elias Ubaldo and Oğuz Akif.
Compatible libraries
Note One of the most powerful image generation models that can generate realistic outputs.
Note A powerful yet fast image generation model.
Note A text-to-image model that can generate coherent text inside images.
Note A powerful text-to-image model.
Note RedCaps is a large-scale dataset of 12M image-text pairs collected from Reddit.
Note Conceptual Captions is a dataset consisting of ~3.3M images annotated with captions.
Note A powerful text-to-image application.
Note A text-to-image application to generate comics.
Note A text-to-image application that can generate coherent text inside the image.
Note A powerful yet very fast image generation application.
Note A gallery to explore various text-to-image models.
Note An application to generate realistic images given photos of a person and a prompt.
- IS
- The Inception Score (IS) assesses the diversity and meaningfulness of generated images. It runs generated samples through a pretrained Inception classifier and compares each image's predicted label distribution with the marginal label distribution. A higher score signifies more diverse and meaningful images (see the formulas after this list).
- FID
- The Fréchet Inception Distance (FID) calculates the distance between the Inception-feature distributions of synthetic and real samples. A lower FID score indicates better similarity between the distributions of real and generated images (see the formulas after this list).
- R-Precision
- R-precision assesses how well the generated image aligns with the provided text description. It uses the generated images as queries to retrieve relevant text descriptions. The top 'r' retrieved descriptions are used to calculate R-precision as r/R, where 'R' is the number of ground-truth descriptions associated with the generated images. A higher R-precision value indicates a better model.
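For reference, a sketch of the standard definitions of IS and FID, where $p(y \mid x)$ is the Inception classifier's label distribution for a generated image $x$, $p(y)$ its marginal over generated images, and $(\mu_r, \Sigma_r)$, $(\mu_g, \Sigma_g)$ are the means and covariances of Inception features for real and generated images:

$$\mathrm{IS} = \exp\Big(\mathbb{E}_{x \sim p_g}\big[\mathrm{KL}\big(p(y \mid x)\,\|\,p(y)\big)\big]\Big)$$

$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\Big(\Sigma_r + \Sigma_g - 2\big(\Sigma_r \Sigma_g\big)^{1/2}\Big)$$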