Automating creation of music videos

Using AI to generate music videos from audio files

  ·   3 min read


Inspired by others' ideas, I created a script that combines OpenAI’s Whisper with Stable Diffusion to automatically generate a music video from any audio file that contains speech. It heavily relies on François Chollet’s Keras tutorial. The key idea from that tutorial is that we can walk through the latent space of a diffusion model such as Stable Diffusion. In other words, we can interpolate between prompts, which in our case are the lyrics of a song.
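To make that idea concrete, here is a minimal sketch of interpolating between two prompt encodings with KerasCV’s Stable Diffusion (the prompts and the number of steps are purely illustrative):

import keras_cv
import tensorflow as tf

model = keras_cv.models.StableDiffusion(img_width=512, img_height=512)

# Encode two prompts into the text-encoding space
enc_a = tf.squeeze(model.encode_text("a yellow submarine, oil painting"))
enc_b = tf.squeeze(model.encode_text("a walrus on the beach, oil painting"))

# Each point along this line decodes to an image that gradually
# morphs from the first prompt to the second
interp_encodings = tf.linspace(enc_a, enc_b, 10)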

Whisper transcription

First, I use the transcription model Whisper to extract the text from the audio file. Unlike simply looking up the lyrics, this approach also gives us a timestamp for each sentence. That will help us calculate how many frames we need to interpolate from sentence to sentence.

The following runs Whisper from the command line and saves the lyrics to a text file.

import os

# Run Whisper's CLI to transcribe the audio and write the result to save_folder
os.system(
    f"whisper {file} --model medium --language English "
    f"--output_dir {str(save_folder)} --verbose False --device {device}"
)
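
As an aside, Whisper also has a Python API that returns the segments and their timestamps directly; a minimal sketch (this is not what the script above uses):

import whisper

# Load the medium model and transcribe the audio file
asr_model = whisper.load_model("medium")
result = asr_model.transcribe(file, language="en")

# Each segment carries its text plus start and end times in seconds
for segment in result["segments"]:
    print(segment["start"], segment["end"], segment["text"])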

Next up, we parse the text file to get the individual sentences.

import srt

# Parse the .srt file that Whisper wrote
with open(file) as f:
    data = f.read()

lines = list(srt.parse(data))

# Keep only the text of each sentence and strip commas
lyrics = [line.content for line in lines]
lyrics = [line.replace(",", "") for line in lyrics]

To make the styling more consistent later, when the lyrics are used as prompts for image generation, I append a style word to each sentence. For instance, the style word can be oil painting.

if style is not None:
    lyrics = [line + ', ' + style for line in lyrics]

Furthermore, we can use the timestamps to calculate the duration of each sentence and convert it into a number of interpolation frames.


import datetime

import numpy as np

offsets = [datetime.timedelta(0.)] + [line.end for line in lines]

# Number of frames between consecutive sentences at the target frame rate
num_interpolation_steps = [
  int(np.ceil((end - start).total_seconds() * fps))
  for start, end in zip(offsets, offsets[1:])
]
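
As a quick sanity check (reusing the imports above), here is a hypothetical example with two sentences ending at 2.5 and 6.0 seconds, rendered at 24 fps:

fps = 24
offsets = [
    datetime.timedelta(0.),
    datetime.timedelta(seconds=2.5),
    datetime.timedelta(seconds=6.0),
]
num_interpolation_steps = [
    int(np.ceil((end - start).total_seconds() * fps))
    for start, end in zip(offsets, offsets[1:])
]
print(num_interpolation_steps)  # [60, 84]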

Video generation

Now that we have all the lyrics and know how many images are needed for a smooth video that aligns with the audio, we can start generating the images using Stable Diffusion.

import keras_cv
import tensorflow as tf
import tqdm
from PIL import Image
from tensorflow import keras

# Enable mixed precision
# (only do this if you have a recent NVIDIA GPU)
keras.mixed_precision.set_global_policy("mixed_float16")

# Instantiate the Stable Diffusion model
model = keras_cv.models.StableDiffusion(
  jit_compile=True, img_width=img_width, img_height=img_height)

# Fixed diffusion noise, so consecutive frames differ only in their prompt encoding
noise = tf.random.normal((img_height // 8, img_width // 8, 4), seed=seed)

# prompts holds the lyrics; num_frames the interpolation steps computed above
# repeat last frame
num_frames += [0]
prompts += [prompts[-1]]

enc_prev = tf.squeeze(model.encode_text(prompts[0]))
i = 0
all_images = []
with tqdm.tqdm(total=sum(num_frames)) as pbar:
    for prompt, steps in zip(prompts[1:], num_frames):
        enc = tf.squeeze(model.encode_text(prompt))

        # Interpolate between the previous and the current prompt encoding
        interp_encodings = tf.linspace(enc_prev, enc, steps)
        enc_prev = enc

        for batch in range(0, steps, batch_size):
            index = tf.range(batch, min(batch + batch_size, steps))
            encoding_batch = tf.gather(interp_encodings, index)
            images = model.generate_image(
                encoding_batch,
                batch_size=len(encoding_batch),
                num_steps=25,
                diffusion_noise=noise,
                unconditional_guidance_scale=unconditional_guidance_scale,
            )

            # Save each frame with a consecutive index
            images_batch = [Image.fromarray(image) for image in images]
            for _i, img in zip(range(i, i + len(images_batch)), images_batch):
                img.save(folder / f'{_i}.png')
            i += len(images_batch)

            all_images += images_batch

            pbar.update(len(encoding_batch))

To speed up the process, the interpolated encodings are processed in batches. All that is left now is converting the generated images into a single video.

import glob
from pathlib import Path

import cv2

frame_size = (img_width, img_height)
out_file = folder / "output_video.mp4"
out = cv2.VideoWriter(str(out_file), cv2.VideoWriter_fourcc(*'avc1'), fps, frame_size)

# Sort the frames numerically (0.png, 1.png, ...), not lexicographically
files = sorted(glob.glob(f'{folder}/*.png'), key=lambda x: int(Path(x).stem))
for filename in files:
    img = cv2.imread(str(filename))
    out.write(img)

out.release()
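
The video above is still silent, so to get an actual music video the audio has to be muxed back in. A minimal sketch with ffmpeg, assuming it is installed and that file still points at the original audio (the output name music_video.mp4 is arbitrary):

# Copy the video stream as-is, encode the audio to AAC,
# and stop at the end of the shorter stream
os.system(
    f"ffmpeg -i {out_file} -i {file} -c:v copy -c:a aac -shortest "
    f"{folder / 'music_video.mp4'}"
)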

Result

Check out the full code here: https://github.com/tristan-deep/music_video.

And here is the end result for I Am the Walrus by the Beatles. Unfortunately, the audio got a copyright strike, so you will have to play the music yourself ;).