Trying out UniDiffuser in diffusers v0.17.0

In this article I try out UniDiffuser, which is listed in the diffusers v0.17.0 release notes.

The release notes describe it as follows:

"UniDiffuser introduces a multimodal diffusion process that is capable of handling different generation tasks using a single unified approach."

What is UniDiffuser?
Unconditional Image and Text Generation
Image-to-Text
- Try it out
- Try one more with an arbitrary image
Text Variation
Image Variation
Summary

What is UniDiffuser?

huggingface.co

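A single approach can apparently handle many different tasks.

What it can do

・Unconditional Image and Text Generation
・Text-to-Image
・Image-to-Text
・Image Variation
・Text Variation

The ones that look most interesting here are Unconditional Image and Text Generation, Image-to-Text, and Text Variation.

Text-to-Image is something we now take for granted.
As for Image Variation, I first assumed it would work like the Stable UnCLIP I covered in a previous article, but it turned out to be completely different (see the Image Variation section below for details).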

zako-lab929.hatenablog.com

It is impressive that a single model can do so many things!

Unconditional Image and Text Generation

"Unconditional image and text generation" presumably means that a random image is generated together with a matching prompt.

Try it out

import torch
from diffusers import UniDiffuserPipeline

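# Prepare the pipeline
pipe = UniDiffuserPipeline.from_pretrained(
    "thu-ml/unidiffuser-v1",
    torch_dtype=torch.float16
).to("cuda")

# Run the pipeline (no inputs, so an image and a text are generated together)
sample = pipe(num_inference_steps=20, guidance_scale=8.0)

# Results: print the generated text, then display the image (in a notebook)
print(sample.text[0])
sample.images[0]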

I generated a few samples.

(Left) A painting of orange flowers in a blue vase | (Center) Flooded street with cars parked on the | (Right) A white motorcycle is parked on the side of the road

Just as expected!

By setting a mode, it is apparently also possible to generate only an image or only a text.

Generating only an image

import torch
from diffusers import UniDiffuserPipeline

# Prepare the pipeline
pipe = UniDiffuserPipeline.from_pretrained(
    "thu-ml/unidiffuser-v1",
    torch_dtype=torch.float16
).to("cuda")

# Set the mode (image only)
pipe.set_image_mode()

# Run the pipeline
image = pipe(num_inference_steps=20, guidance_scale=8.0).images[0]
image

Generating only a text

import torch
from diffusers import UniDiffuserPipeline

# Prepare the pipeline
pipe = UniDiffuserPipeline.from_pretrained(
    "thu-ml/unidiffuser-v1",
    torch_dtype=torch.float16
).to("cuda")

# Set the mode (text only)
pipe.set_text_mode()

# Run the pipeline
text = pipe(num_inference_steps=20, guidance_scale=8.0).text[0]
print(text)

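The result was, of all things, "A woman in a navy blue bikini".

Because this is text-only mode, no image is produced, so I never got to see her...

To explicitly set the pipeline to output both an image and a text, use the following mode: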

pipe.set_joint_mode()

Image-to-Text

As the name suggests, this takes an image as input and outputs a prompt-like text describing it.

Try it out

import torch
from diffusers import UniDiffuserPipeline
from diffusers.utils import load_image

# Prepare the pipeline
pipe = UniDiffuserPipeline.from_pretrained(
    "thu-ml/unidiffuser-v1",
    torch_dtype=torch.float16
).to("cuda")

# Prepare the input image
init_image_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/unidiffuser/unidiffuser_example_image.jpg"
init_image = load_image(init_image_url).resize((512, 512))

# Run the pipeline (only an image is passed, so text is generated)
# Read the caption from the `text` field of the output instead of indexing the output itself
text = pipe(image=init_image, num_inference_steps=20, guidance_scale=8.0).text[0]
print(text)

Output text

An astronaut floating in space with the Earth in the background

To explicitly set Image-to-Text mode, do the following:

pipe.set_image_to_text_mode()

Try one more with an arbitrary image

Output text

A yellow bird sitting on a branch with white flowers

Text Variation

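Since this outputs variations of a text, I imagined it would take a text as input and directly output similar texts, but that is not how it works.

Instead, the image produced by Text-to-Image is fed into the Image-to-Text step described above, and the text that comes out serves as the variation.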
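The two-step flow described above could be sketched roughly as follows. The helper names `load_pipe`, `text_variation`, and `main`, as well as the prompt, are my own placeholders and not part of diffusers. The sketch assumes that no mode has been set manually, so the pipeline infers the mode from whichever of `prompt` or `image` is passed.

```python
def load_pipe():
    """Load UniDiffuser on a CUDA GPU (same setup as the earlier examples)."""
    import torch
    from diffusers import UniDiffuserPipeline

    return UniDiffuserPipeline.from_pretrained(
        "thu-ml/unidiffuser-v1", torch_dtype=torch.float16
    ).to("cuda")


def text_variation(pipe, prompt, steps=20, scale=8.0):
    """Hypothetical helper: prompt -> image -> new text."""
    # 1. Text-to-Image: passing only `prompt` produces an image
    image = pipe(
        prompt=prompt, num_inference_steps=steps, guidance_scale=scale
    ).images[0]
    # 2. Image-to-Text: passing only `image` produces a text
    text = pipe(
        image=image, num_inference_steps=steps, guidance_scale=scale
    ).text[0]
    return image, text


def main():
    # Requires a CUDA GPU and downloads the model on first use
    pipe = load_pipe()
    _, text = text_variation(pipe, "an elephant under the sea")
    print(text)
```

Calling `main()` on a machine with a CUDA GPU prints the varied text. The intermediate image is returned as well, in case you want to inspect it.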

Image Variation

As mentioned at the beginning of this article, I imagined something like Stable UnCLIP here too:
you input an image and get similar images as output.

zako-lab929.hatenablog.com

In practice, the text produced by the Image-to-Text step described above is fed into Text-to-Image, and the image that comes out serves as the variation.
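This reverse flow could be sketched in the same way. Again, `load_pipe`, `image_variation`, and `main` are hypothetical helper names of my own. The input image is the sample image used in the Image-to-Text example, and the sketch assumes that no mode has been set manually, so the pipeline infers the mode from its inputs.

```python
def load_pipe():
    """Load UniDiffuser on a CUDA GPU (same setup as the earlier examples)."""
    import torch
    from diffusers import UniDiffuserPipeline

    return UniDiffuserPipeline.from_pretrained(
        "thu-ml/unidiffuser-v1", torch_dtype=torch.float16
    ).to("cuda")


def image_variation(pipe, image, steps=20, scale=8.0):
    """Hypothetical helper: image -> text -> new image."""
    # 1. Image-to-Text: passing only `image` produces a text
    text = pipe(
        image=image, num_inference_steps=steps, guidance_scale=scale
    ).text[0]
    # 2. Text-to-Image: passing that text as `prompt` produces a new image
    new_image = pipe(
        prompt=text, num_inference_steps=steps, guidance_scale=scale
    ).images[0]
    return text, new_image


def main():
    # Requires a CUDA GPU and downloads the model on first use
    from diffusers.utils import load_image

    url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/unidiffuser/unidiffuser_example_image.jpg"
    init_image = load_image(url).resize((512, 512))
    text, new_image = image_variation(load_pipe(), init_image)
    print(text)
    new_image.save("image_variation.png")  # output file name is arbitrary
```

Because the only link between the input and the output is the intermediate text, the new image may differ considerably from the original in composition and style.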

Summary

The mode settings are as follows.

Mode	Setting
Unconditional image and text generation	set_joint_mode()
Image only	set_image_mode()
Text only	set_text_mode()
Text-to-Image	set_text_to_image_mode()
Image-to-Text	set_image_to_text_mode()
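Text-to-Image is the only mode in the table that I did not actually run above, so here is a minimal sketch of it with the mode set explicitly. The helper names `load_pipe`, `text_to_image`, and `main` and the prompt are hypothetical. As far as I understand, a manually set mode stays in effect for later calls until `reset_mode()` is called, so the sketch resets it afterwards.

```python
def load_pipe():
    """Load UniDiffuser on a CUDA GPU (same setup as the earlier examples)."""
    import torch
    from diffusers import UniDiffuserPipeline

    return UniDiffuserPipeline.from_pretrained(
        "thu-ml/unidiffuser-v1", torch_dtype=torch.float16
    ).to("cuda")


def text_to_image(pipe, prompt, steps=20, scale=8.0):
    """Hypothetical helper: run one Text-to-Image call with an explicit mode."""
    pipe.set_text_to_image_mode()
    try:
        return pipe(
            prompt=prompt, num_inference_steps=steps, guidance_scale=scale
        ).images[0]
    finally:
        # Go back to inferring the mode from the inputs
        pipe.reset_mode()


def main():
    # Requires a CUDA GPU and downloads the model on first use
    image = text_to_image(load_pipe(), "an elephant under the sea")
    image.save("text_to_image.png")  # output file name is arbitrary
```

Resetting in a `finally` block keeps a later call, for example one that passes only an image, from being stuck in Text-to-Image mode.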