Trying UniDiffuser's Image-to-Text on Zundamon image data

zako-lab929.hatenablog.com

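Yesterday I tried automatic captioning with a model called BLIP.

Since that is a kind of Image-to-Text (text output from an image), I got curious about what would happen with the UniDiffuser Image-to-Text mode I played with quite a while ago, so I'm going to try it here.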

My earlier post trying out UniDiffuser is here

Results
Code
Summary

Results

Training image	Generated caption
zundamon (1).png	A cartoon character with green hair and fairy wigs
zundamon (2).png	Neon Anime Character With Green And White Eyes
zundamon (3).png	An anime character with green dress and green eyes
zundamon (4).png	A cartoon girl with green hair and flowers on her head
zundamon (5).png	A cartoon girl with green hair and green dress
zundamon (6).png	A cartoon girl with green hair and short hair
zundamon (7).png	A cartoon girl with green hair and green eyes
zundamon (8).png	A cartoon girl with green hair and green eyes
zundamon (9).png	A cartoon girl with green hair and red eyes
zundamon (10).png	An anime girl with green hair and green eyes
zundamon (11).png	A cartoon girl with green hair and green dress
zundamon (12).png	A cartoon girl with green hair and white hair

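Phrases like cartoon girl, cartoon character, green hair, green eyes, red eyes, and green dress keep coming up, so overall the captions lean heavily on four things: "anime character", "hair", "eyes", and "clothes". Maybe the output looks more monotonous than BLIP's?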
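To put a rough number on that impression, here is a minimal sketch that counts how often each word appears in the twelve captions from the table. The stopword list and the helper name `word_counts` are my own choices for illustration, not part of the original experiment.

```python
from collections import Counter

# Captions copied from the results table.
captions = [
    "A cartoon character with green hair and fairy wigs",
    "Neon Anime Character With Green And White Eyes",
    "An anime character with green dress and green eyes",
    "A cartoon girl with green hair and flowers on her head",
    "A cartoon girl with green hair and green dress",
    "A cartoon girl with green hair and short hair",
    "A cartoon girl with green hair and green eyes",
    "A cartoon girl with green hair and green eyes",
    "A cartoon girl with green hair and red eyes",
    "An anime girl with green hair and green eyes",
    "A cartoon girl with green hair and green dress",
    "A cartoon girl with green hair and white hair",
]

# Assumption: a tiny hand-picked stopword list, just enough for these captions.
STOPWORDS = {"a", "an", "and", "with", "on", "her"}


def word_counts(texts):
    """Count lowercase words across all texts, skipping stopwords."""
    counts = Counter()
    for text in texts:
        counts.update(w for w in text.lower().split() if w not in STOPWORDS)
    return counts


counts = word_counts(captions)
distinct = len(set(captions))

print(counts.most_common(5))
print(f"{distinct} distinct captions out of {len(captions)}")
```

Running the same count over the BLIP captions from the previous post would give a like-for-like comparison, though with only twelve images it is a rough indicator at best.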

Code

Mounting Google Drive

from google.colab import drive
drive.mount('/content/drive')

!cp -r /content/drive/MyDrive/trains/dataset .
!rm ./dataset/zundamon/*.txt

This part is reused from when I was building my own LoRA.

Installing the modules needed to run diffusers

!pip install git+https://github.com/huggingface/diffusers
!pip install -U transformers accelerate peft controlnet_aux onnxruntime-gpu insightface

This is also reused from what I was using for AnimateDiff and IP-Adapter.

Preparing the diffusers pipeline

import torch
from diffusers import UniDiffuserPipeline

pipe = UniDiffuserPipeline.from_pretrained(
    "thu-ml/unidiffuser-v1",
    torch_dtype=torch.float16
).to("cuda")

Running the pipeline

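import os
from diffusers.utils import load_image

dir_path = "./dataset/zundamon"
for f in os.listdir(dir_path):
    file_path = f"{dir_path}/{f}"
    # Skip subdirectories and anything else that is not a regular file
    if not os.path.isfile(file_path):
        continue

    # Load the image and resize it to 512x512 before passing it in
    image = load_image(file_path).resize((512, 512))
    # Passing only an image makes the pipeline run Image-to-Text
    sample = pipe(image=image, num_inference_steps=20, guidance_scale=8.0)
    i2t_text = sample.text[0]
    print(f"{f}: {i2t_text}")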

I listed the Zundamon images in the directory and fed them to the pipeline one by one.
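One thing to watch: `os.listdir` returns names in arbitrary order, and a plain string sort would put `zundamon (10).png` before `zundamon (2).png`. As a minimal sketch, assuming you want the printed output to follow the numbering of the files, the list could be filtered and sorted like this. The helper names `natural_key` and `image_files` and the extension set are hypothetical, not something the original code uses.

```python
import os
import re

# Assumption: these are the extensions worth treating as images.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp"}


def natural_key(name):
    """Split a name into text and number runs so numbers compare numerically."""
    return [
        int(part) if part.isdigit() else part.lower()
        for part in re.split(r"(\d+)", name)
    ]


def image_files(names):
    """Keep only image file names and return them in natural (numeric) order."""
    images = [
        n for n in names
        if os.path.splitext(n)[1].lower() in IMAGE_EXTENSIONS
    ]
    return sorted(images, key=natural_key)


# Usage with a hand-written list; in the notebook this would be
# image_files(os.listdir(dir_path)).
names = ["zundamon (10).png", "zundamon (2).png", "zundamon (1).png", "notes.txt"]
for name in image_files(names):
    print(name)
```

Filtering by extension also keeps stray non-image files out of `load_image`, which matters less here because the `.txt` caption files were already deleted earlier.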