Combining Text-to-Video Zero with ControlNet OpenPose in diffusers v0.15.0 (2)

zako-lab929.hatenablog.com

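In the previous post I gave up on combining Text-to-Video Zero with ControlNet after hitting "CUDA out of memory". Since it was an out-of-memory error, I tried changing a few parameters to bring memory usage down a little, and to my surprise it ran. So here is the continuation.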

Working pattern 1
Working pattern 2
Trying an arbitrary model!
- Adding quality prompts
- Trying plain Text-to-Video Zero
Summary

Working pattern 1

I changed frame_count, the number of pose images read from the pose video, from 8 to 4.

  import imageio
  from PIL import Image

  # Path to the pose video
  pose_video_path = "./dance1_corr.mp4"

  # Read the pose images (frames) from the pose video
  reader = imageio.get_reader(pose_video_path, "ffmpeg")
- frame_count = 8
+ frame_count = 4
  pose_images = [Image.fromarray(reader.get_data(i)) for i in range(frame_count)]
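The snippet above takes the first frame_count consecutive frames, so with only 4 frames it covers a very short stretch of the dance. As a possible alternative, here is a minimal sketch that picks frame indices spread across the whole clip. The helper name evenly_spaced_indices is hypothetical, and it assumes you already know the total number of frames in the video; spaced-out frames cover more motion but will look jumpier than consecutive ones.

```python
def evenly_spaced_indices(total_frames: int, frame_count: int) -> list:
    """Return up to frame_count indices spread across [0, total_frames).

    Hypothetical helper, not part of imageio or diffusers.
    """
    if total_frames <= 0 or frame_count <= 0:
        raise ValueError("total_frames and frame_count must be positive")
    if frame_count >= total_frames:
        # Fewer frames available than requested: just use all of them.
        return list(range(total_frames))
    # Integer arithmetic: always starts at frame 0 and never reaches total_frames.
    return [i * total_frames // frame_count for i in range(frame_count)]


# Possible use with the reader from the post (total_frames is an assumption
# that you supply yourself):
# pose_images = [Image.fromarray(reader.get_data(i))
#                for i in evenly_spaced_indices(total_frames, 4)]
```

The memory cost should be the same as before, because what matters for memory is how many frames go through the pipeline in one batch, not which ones.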

Full code

Preparing the control images

import imageio
from PIL import Image

# Path to the pose video
pose_video_path = "./dance1_corr.mp4"

# Read the pose images (frames) from the pose video
reader = imageio.get_reader(pose_video_path, "ffmpeg")
frame_count = 4
pose_images = [Image.fromarray(reader.get_data(i)) for i in range(frame_count)]

Preparing ControlNet and the pipeline

import imageio
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.pipelines.text_to_video_synthesis.pipeline_text_to_video_zero import CrossFrameAttnProcessor

# Prepare ControlNet
controlnet = ControlNetModel.from_pretrained(
    # "lllyasviel/sd-controlnet-openpose",
    "lllyasviel/control_v11p_sd15_openpose",
    torch_dtype=torch.float16
)

# Prepare the pipeline
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16
).to("cuda")

# Set the attention processor
pipe.unet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))
pipe.controlnet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))

# fix latents for all frames
latents = torch.randn((1, 4, 64, 64), device="cuda", dtype=torch.float16).repeat(len(pose_images), 1, 1, 1)

Running the pipeline

prompt = "Darth Vader dancing in a desert"
result = pipe(prompt=[prompt] * len(pose_images), image=pose_images, latents=latents).images
imageio.mimsave("video.mp4", result, fps=4)

Result

Hooray! It finally produced output!

Working pattern 2

Here frame_count stays at 8, and I stopped passing latents to the pipeline.

prompt = "Darth Vader dancing in a desert"
- result = pipe(prompt=[prompt] * len(pose_images), image=pose_images, latents=latents).images
+ result = pipe(prompt=[prompt] * len(pose_images), image=pose_images).images
imageio.mimsave("video.mp4", result, fps=4)

Full code

Preparing the control images

import imageio
from PIL import Image

# Path to the pose video
pose_video_path = "./dance1_corr.mp4"

# Read the pose images (frames) from the pose video
reader = imageio.get_reader(pose_video_path, "ffmpeg")
frame_count = 8
pose_images = [Image.fromarray(reader.get_data(i)) for i in range(frame_count)]

Preparing ControlNet and the pipeline

import imageio
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.pipelines.text_to_video_synthesis.pipeline_text_to_video_zero import CrossFrameAttnProcessor

# Prepare ControlNet
controlnet = ControlNetModel.from_pretrained(
    # "lllyasviel/sd-controlnet-openpose",
    "lllyasviel/control_v11p_sd15_openpose",
    torch_dtype=torch.float16
)

# Prepare the pipeline
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16
).to("cuda")

# Set the attention processor
pipe.unet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))
pipe.controlnet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))


# fix latents for all frames
# (left over from pattern 1; it is not passed to pipe() below, so it is unused here)
latents = torch.randn((1, 4, 64, 64), device="cuda", dtype=torch.float16).repeat(len(pose_images), 1, 1, 1)

Running the pipeline

prompt = "Darth Vader dancing in a desert"
result = pipe(prompt=[prompt] * len(pose_images), image=pose_images).images
imageio.mimsave("video.mp4", result, fps=4)

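Result

Compared with the run that passed latents, the background seems to flicker and change from frame to frame.

That would fit: the latents in pattern 1 are a single noise tensor repeated for every frame, so every frame starts from the same initial noise. It seems fair to assume that passing latents is what pins the frames together?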
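To make the difference concrete, here is a small standard-library illustration of "one noise sample repeated for every frame" versus "fresh noise per frame". It is only an analogy for torch.randn((1, 4, 64, 64)).repeat(len(pose_images), 1, 1, 1): the function name make_frame_noise and the flat list-of-floats layout are made up for this sketch and have nothing to do with the diffusers API.

```python
import random


def make_frame_noise(frame_count: int, size: int, shared: bool, seed: int = 0) -> list:
    """Return one list of Gaussian noise values per frame.

    shared=True  -> sample once and copy it to every frame (like .repeat()).
    shared=False -> sample independently for every frame.
    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    if shared:
        base = [rng.gauss(0.0, 1.0) for _ in range(size)]
        # Each frame gets its own copy of the same values.
        return [list(base) for _ in range(frame_count)]
    return [[rng.gauss(0.0, 1.0) for _ in range(size)] for _ in range(frame_count)]


shared_noise = make_frame_noise(frame_count=4, size=16, shared=True)
independent_noise = make_frame_noise(frame_count=4, size=16, shared=False)

# With shared noise every frame starts from identical values.
print(all(frame == shared_noise[0] for frame in shared_noise))
# With independent noise the starting points differ between frames.
print(independent_noise[0] != independent_noise[1])
```

When latents is not passed, each frame in the batch presumably gets its own random starting noise, which would match the second case and the flickering background seen above.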

Trying an arbitrary model!

As usual, I'll use MeinaMix.

Prompt

dancing in a desert, A girl, super fine illustration, anime style

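With just "A girl dancing in a desert" the output was less anime-like than I expected, so I added a few more terms to the prompt.

Compared with the guitar video from before, something feels a little off.

Should I at least add quality prompts and a negative prompt?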

Adding quality prompts

Prompt

best quality, masterpiece, dancing in a desert, A girl, super fine illustration, anime style

Negative prompt

worst quality, ugly, bad anatomy, jpeg artifacts

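Note that when passing the negative prompt to pipe, I had to write it as negative_prompt=[negative_prompt] * len(pose_images).

Apparently it has to be given once per control image.
So the same prompt was being applied to each frame individually all along.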
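Since both prompt and negative_prompt have to be repeated once per control image, it may be convenient to build the two lists in one place so their lengths cannot drift apart. A minimal sketch follows; the helper name batched_prompt_kwargs is hypothetical, while the prompt and negative prompt are the ones used in this post.

```python
def batched_prompt_kwargs(prompt: str, negative_prompt: str, frame_count: int) -> dict:
    """Repeat the prompt and negative prompt once per frame.

    Hypothetical helper; it only builds plain Python lists.
    """
    if frame_count <= 0:
        raise ValueError("frame_count must be positive")
    return {
        "prompt": [prompt] * frame_count,
        "negative_prompt": [negative_prompt] * frame_count,
    }


prompt = "best quality, masterpiece, dancing in a desert, A girl, super fine illustration, anime style"
negative_prompt = "worst quality, ugly, bad anatomy, jpeg artifacts"

kwargs = batched_prompt_kwargs(prompt, negative_prompt, frame_count=4)

# Possible use with the pipeline from the post:
# result = pipe(image=pose_images, latents=latents,
#               **batched_prompt_kwargs(prompt, negative_prompt, len(pose_images))).images
```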

Hmm... what would happen with plain Text-to-Video Zero?

Trying plain Text-to-Video Zero

Prompt

A girl dancing in a desert

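Without specifying poses, the image quality went up a lot.

Maybe this model is just not good at the specified poses?

Summary

By reducing memory usage, for example by lowering the number of frames, I managed to get output!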
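If fewer frames per batch is what keeps memory under control, one idea for longer videos would be to process the frames in small chunks, always putting frame 0 at the head of each chunk so that cross-frame attention still has the same first frame to look at. The sketch below only computes the frame indices for each chunk. The helper name chunk_frame_ids is hypothetical, and I have not tried this with the ControlNet pipeline in this post, so treat it as an untested idea rather than a recipe.

```python
def chunk_frame_ids(video_length: int, chunk_size: int) -> list:
    """Split frame indices into chunks that each start with frame 0.

    chunk_size counts the leading anchor frame, so every chunk carries at
    most chunk_size - 1 new frames. Hypothetical helper for illustration.
    """
    if video_length <= 0:
        raise ValueError("video_length must be positive")
    if chunk_size < 2:
        raise ValueError("chunk_size must be at least 2 (anchor frame + 1 new frame)")
    step = chunk_size - 1
    chunks = []
    start = 0
    while start < video_length:
        end = min(start + step, video_length)
        # Frame 0 is prepended as the anchor, followed by this chunk's frames.
        chunks.append([0] + list(range(start, end)))
        start = end
    return chunks


for ids in chunk_frame_ids(video_length=8, chunk_size=4):
    print(ids)
# [0, 0, 1, 2]
# [0, 3, 4, 5]
# [0, 6, 7]
```

Each chunk would then be run as its own batch, using only the pose images for those indices, and the first output of every chunk (the repeated anchor frame) would be dropped before joining the results.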

Beyond this, the Text-to-Video Zero documentation linked below also covers combining it with ControlNet Canny, as well as Instruct Pix2Pix and DreamBooth. Covering all of them would make this too long, so I'll stop at OpenPose.

huggingface.co