Trying Image-to-Image with diffusers (Google Colab)

The diffusers documentation includes a guide on how to use Image-to-Image.

In this article, I try out Image-to-Image by following that guide.

Introduction
Usage
- What happens if I run text-to-image with the same prompt?
Comparing SD 1.5, SDXL, and Kandinsky 2.2
- SD 1.5
- SDXL
- Kandinsky 2.2
Pipeline parameters
- strength
- guidance_scale
Summary

Introduction

Image-to-Image takes an image in addition to a text prompt.
The output should be an image that follows both the input image and the prompt.

Usage

The following Python script is taken more or less as-is from the documentation.

import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image, make_image_grid

# Prepare the pipeline
model_id = "kandinsky-community/kandinsky-2-2-decoder"
pipe = AutoPipelineForImage2Image.from_pretrained(
    model_id, 
    torch_dtype=torch.float16, 
    use_safetensors=True
).to("cuda")

# Reduces GPU memory use by keeping idle sub-models on the CPU
pipe.enable_model_cpu_offload()

# Input image
init_image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png"
init_image = load_image(init_image_url)

# Run the pipeline
prompt = "cat wizard, gandalf, lord of the rings, detailed, fantasy, cute, adorable, Pixar, Disney, 8k"
image = pipe(prompt, image=init_image).images[0]
make_image_grid([init_image, image], rows=1, cols=2)

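The pipeline is AutoPipelineForImage2Image.
The model is kandinsky-community/kandinsky-2-2-decoder.
The input image is a photo of a real cat.
The prompt includes "cat wizard", "gandalf", "lord of the rings", "fantasy", and so on.

This looks likely to produce a fantasy cat wizard in the style of The Lord of the Rings.

Because the seed differs, I got a different cat wizard from the one in the documentation,
but the overall impression is much the same.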
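
The seed difference mentioned above can be controlled by passing a `generator`, which diffusers pipelines accept. The following is only a minimal sketch and not part of the script from the documentation: `run_img2img` and its `seed` argument are hypothetical names introduced here, and `pipe` is assumed to be an image-to-image pipeline like the one loaded above.

```python
def run_img2img(pipe, prompt, init_image, seed=None, **kwargs):
    """Run an image-to-image pipeline, optionally with a fixed seed.

    `pipe` is assumed to behave like the diffusers pipeline above:
    callable as pipe(prompt, image=..., **kwargs) and returning an
    object with an `.images` list.
    """
    if seed is not None:
        # torch is imported lazily, only when a seed is actually requested.
        import torch

        # A seeded generator makes the sampled noise repeatable between runs.
        kwargs["generator"] = torch.Generator(device="cpu").manual_seed(seed)
    return pipe(prompt, image=init_image, **kwargs).images[0]


# Usage (assumes `pipe`, `prompt`, and `init_image` from the script above):
# image_a = run_img2img(pipe, prompt, init_image, seed=0)
# image_b = run_img2img(pipe, prompt, init_image, seed=0)  # same seed as image_a
```

Even with a fixed seed, the result will not necessarily match the image shown in the documentation, since that also depends on the seed and library versions used there.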

What happens if I run text-to-image with the same prompt?

import torch
from diffusers import AutoPipelineForText2Image

# Prepare the pipeline
model_id = "kandinsky-community/kandinsky-2-2-decoder"
pipe = AutoPipelineForText2Image.from_pretrained(
    model_id, 
    torch_dtype=torch.float16, 
    use_safetensors=True
).to("cuda")

# Run the pipeline
prompt = "cat wizard, gandalf, lord of the rings, detailed, fantasy, cute, adorable, Pixar, Disney, 8k"
image = pipe(prompt).images[0]
image

Text-to-Image alone also produced a more than good enough result!
From this example, it is hard to see what Image-to-Image adds.

Comparing SD 1.5, SDXL, and Kandinsky 2.2

SD 1.5

import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import make_image_grid, load_image

model_id = "runwayml/stable-diffusion-v1-5"
pipe = AutoPipelineForImage2Image.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True
).to("cuda")
pipe.enable_model_cpu_offload()

# prepare image
init_image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png"
init_image = load_image(init_image_url)

# pass prompt and image to pipeline
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt, image=init_image).images[0]
make_image_grid([init_image, image], rows=1, cols=2)

The result is quite different from the one in the documentation.

SDXL

import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import make_image_grid, load_image

model_id = "stabilityai/stable-diffusion-xl-refiner-1.0"
pipe = AutoPipelineForImage2Image.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True
).to("cuda")
pipe.enable_model_cpu_offload()

# prepare image
init_image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-sdxl-init.png"
init_image = load_image(init_image_url)

# pass prompt and image to pipeline
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt, image=init_image, strength=0.5).images[0]
make_image_grid([init_image, image], rows=1, cols=2)

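This one is closer to the original image than the documentation's result.
It makes me curious what Text-to-Image would produce with this prompt,
though my guess is that Text-to-Image would give something completely different.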

Kandinsky 2.2

import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import make_image_grid, load_image

model_id = "kandinsky-community/kandinsky-2-2-decoder"
pipe = AutoPipelineForImage2Image.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    use_safetensors=True
).to("cuda")
pipe.enable_model_cpu_offload()

# prepare image
init_image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png"
init_image = load_image(init_image_url)

# pass prompt and image to pipeline
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt, image=init_image).images[0]
make_image_grid([init_image, image], rows=1, cols=2)

This one came out with a similar feel to the documentation's result.
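
The three scripts in this comparison differ only in the model ID, whether `variant="fp16"` is passed, the input image, and the call arguments. As a rough sketch of how that could be written once, the settings can be collected in a table. `MODELS`, `PROMPT`, `BASE_URL`, and `load_kwargs` are hypothetical names introduced here; the values are the ones used in the scripts above.

```python
BASE_URL = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/"
PROMPT = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"

# One entry per model compared above.
MODELS = [
    {"name": "SD 1.5", "model_id": "runwayml/stable-diffusion-v1-5",
     "variant": "fp16", "init_image_url": BASE_URL + "img2img-init.png",
     "call_kwargs": {}},
    {"name": "SDXL", "model_id": "stabilityai/stable-diffusion-xl-refiner-1.0",
     "variant": "fp16", "init_image_url": BASE_URL + "img2img-sdxl-init.png",
     "call_kwargs": {"strength": 0.5}},
    {"name": "Kandinsky 2.2", "model_id": "kandinsky-community/kandinsky-2-2-decoder",
     "variant": None, "init_image_url": BASE_URL + "img2img-init.png",
     "call_kwargs": {}},
]


def load_kwargs(config, torch_dtype):
    """Build the keyword arguments for from_pretrained for one table entry."""
    kwargs = {"torch_dtype": torch_dtype, "use_safetensors": True}
    if config["variant"] is not None:
        # Only the SD 1.5 and SDXL scripts above pass a variant.
        kwargs["variant"] = config["variant"]
    return kwargs


# Usage (needs a GPU runtime, as in the scripts above):
# import torch
# from diffusers import AutoPipelineForImage2Image
# from diffusers.utils import load_image
# for config in MODELS:
#     pipe = AutoPipelineForImage2Image.from_pretrained(
#         config["model_id"], **load_kwargs(config, torch.float16)
#     )
#     pipe.enable_model_cpu_offload()
#     init_image = load_image(config["init_image_url"])
#     image = pipe(PROMPT, image=init_image, **config["call_kwargs"]).images[0]
```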

Pipeline parameters

I tried several models above, but these parameters seem to matter more.

strength

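strength determines how closely the generated image resembles the input image.
- The higher the strength value, the more the generated image differs from the input image
  - With strength=1.0, the initial image is said to be more or less ignored
- The lower the strength value, the more the generated image resembles the initial image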
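
One way to see why strength behaves this way: in the Stable Diffusion image-to-image pipeline, strength decides how much noise is added to the input image, and therefore how many of the scheduled denoising steps actually run. The sketch below mirrors that arithmetic as I understand it from the diffusers source. `steps_actually_run` is a hypothetical helper name, and other pipelines may compute this differently.

```python
def steps_actually_run(num_inference_steps: int, strength: float) -> int:
    """Return how many denoising steps run for a given strength.

    Assumption: this follows the step-skipping logic of the Stable Diffusion
    image-to-image pipeline; it is an illustration, not the library code.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0.0 and 1.0")
    # Number of steps' worth of noise added to the input image.
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    # The first t_start steps of the schedule are skipped.
    t_start = max(num_inference_steps - init_timestep, 0)
    return num_inference_steps - t_start


for strength in (0.0, 0.5, 0.8, 1.0):
    print(strength, steps_actually_run(50, strength))
```

With 50 scheduled steps, strength=0.5 runs only half of them starting from a half-noised input, while strength=1.0 runs all 50 from what is close to pure noise, which is why the input image is largely ignored.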

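In the Image-to-Image examples above, the settings were:
SD 1.5 unspecified, SDXL strength=0.5, Kandinsky 2.2 unspecified.
When unspecified, the default appears to be 0.8 (defaults can differ from pipeline to pipeline), so it makes sense that SDXL, with strength=0.5, stayed closest to the original image.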
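
To compare strength values more directly, the same prompt and input image can be run several times while only strength changes. This is a sketch under assumptions: `sweep_strength` and `make_generator` are hypothetical names, and `pipe` is assumed to accept `image`, `strength`, and `generator` keyword arguments, as the diffusers image-to-image pipelines do.

```python
def sweep_strength(pipe, prompt, init_image, strengths, make_generator=None):
    """Generate one image per strength value, keeping everything else fixed."""
    images = []
    for strength in strengths:
        kwargs = {"image": init_image, "strength": strength}
        if make_generator is not None:
            # A fresh generator per run, so every run starts from the same seed.
            kwargs["generator"] = make_generator()
        images.append(pipe(prompt, **kwargs).images[0])
    return images


# Usage (assumes `pipe`, `prompt`, `init_image`, and `make_image_grid`
# from the scripts above):
# import torch
# strengths = [0.4, 0.6, 0.8, 1.0]
# results = sweep_strength(
#     pipe, prompt, init_image, strengths,
#     make_generator=lambda: torch.Generator("cpu").manual_seed(0),
# )
# make_image_grid([init_image, *results], rows=1, cols=len(results) + 1)
```

Fixing the seed for every run keeps the comparison about strength rather than about random noise.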

guidance_scale

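guidance_scale controls how closely the generated image follows the text prompt.
- The larger the guidance_scale value, the more closely the generated image matches the prompt
- The smaller the guidance_scale value, the further the image is said to drift from the prompt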
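
For intuition, guidance_scale in Stable Diffusion pipelines is the weight used in classifier-free guidance: at each step the model predicts noise once without the prompt and once with it, and the final prediction is pushed along the difference between the two. The sketch below shows only that formula. Plain Python lists stand in for tensors, and `apply_guidance` is a hypothetical name, not a diffusers function.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Combine unconditional and prompt-conditioned noise predictions.

    Implements: uncond + guidance_scale * (text - uncond), element by element.
    """
    return [
        u + guidance_scale * (t - u)
        for u, t in zip(noise_uncond, noise_text)
    ]


uncond = [0.0, 1.0]  # toy prediction without the prompt
text = [1.0, 3.0]    # toy prediction with the prompt

print(apply_guidance(uncond, text, 1.0))  # equals the prompt-conditioned prediction
print(apply_guidance(uncond, text, 2.0))  # pushed further in the prompt's direction
```

A scale of 1.0 reproduces the prompt-conditioned prediction, and larger values exaggerate whatever the prompt changes, which matches the description that higher values follow the prompt more strictly.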

I see. So the higher the value, the more faithful the image is to the prompt.
I had been using it without knowing that.

The default value of guidance_scale appears to be 7.5.
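
Rather than guessing at defaults, they can be read from the signature of a pipeline's `__call__` method with the standard `inspect` module. This is a small sketch: `call_default` is a hypothetical helper, and it assumes the pipeline class exposes its parameters through a normal Python signature.

```python
import inspect


def call_default(pipeline_cls, name):
    """Return the default of parameter `name` in pipeline_cls.__call__.

    Returns None if the parameter does not exist or has no default.
    """
    params = inspect.signature(pipeline_cls.__call__).parameters
    if name not in params:
        return None
    default = params[name].default
    return None if default is inspect.Parameter.empty else default


# Usage (assumes `pipe` was loaded as in the scripts above):
# print(call_default(type(pipe), "strength"))
# print(call_default(type(pipe), "guidance_scale"))
```

Since AutoPipelineForImage2Image returns a different pipeline class for each model, checking `type(pipe)` for each of SD 1.5, SDXL, and Kandinsky 2.2 shows whether they really share the same defaults.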

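Summary

Image-to-Image is easy to run too! strength and guidance_scale look important!

Next time, I want to look properly into the Image-to-upscaler-to-super-resolution part of the documentation.
Further on in that documentation there is also the Prompt weighting guide that id:touch-sp told me about...!!!
The Text-to-Image documentation also points to the Prompt weighting guide, which shows how little of the documentation I have actually read 🙃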