How Generative AI Works w/ Images/Videos (Tech)

Explore Generative AI advancements in text generation models and video creation from text

Peleg Aran
November 17, 2022

What is Generative AI?

In short, it is technology that creates new content by learning from existing text, audio, or images: the machine generates something new rather than analyzing something that already exists.

The big acceleration this field has seen in recent years is due to two main reasons:

  • The models being developed finally reached human-level results, or even super-human level.
  • Computing in the cloud has become drastically cheaper.

These two factors have led research companies to give developers access to build applications.

In this article I’ll focus on technologies relevant to text and visual use cases.

Playing with DALL·E text to image - “Teddy bear is swimming in the ocean and goes underwater. The teddy bear keeps swimming under the water with colorful fishes.”

Key Concepts

Before digging into the different models on the market today, let’s explain three concepts that recur across the different algorithms.

Transformer
A neural network architecture invented by Google in 2017 (first described in the paper “Attention Is All You Need”). It uses an attention mechanism, meaning the model can be trained to read many words (a sentence or a full paragraph) together with their context: it pays attention to the relations between one word and another. To do this it uses encoders and decoders with attention: for each part of the input, it considers the relevance of the other parts.

Encoder
A component that takes an input sequence (text, for example) and maps it into a high-dimensional space (a mathematical representation such as a vector).

Decoder
A component that turns the encoder’s vector into an output sequence.
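The attention idea above can be sketched in a few lines of NumPy. This is a minimal, illustrative scaled dot-product attention with toy matrices, not a real trained model:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """For each query token, weight every value by how relevant its key is."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # relevance of each part to every other part
    weights = softmax(scores, axis=-1)  # attention weights sum to 1 per query
    return weights @ V                  # context-aware mix of the values

# Toy example: 3 tokens, each embedded in 4 dimensions
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((3, 4)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4): one context-aware vector per token
```

The output keeps one vector per token, but each vector now mixes in information from every other token, weighted by relevance.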

Generate Text

The purpose of developing text-generating models is to support different NLP tasks, and generally to produce human-like text. The most common use cases are creative writing, code generation, content creation and customer service. There are a few trained models around the internet; the most popular at the moment is GPT-3.

Developed by OpenAI, GPT-3 indexed 45TB of text data (English only) and has 175 billion parameters. Parameters are the parts of the model learned from historical training data, and they essentially define the model’s skill on a problem, such as generating text.

GPT-3 is a language model, meaning it can guess what the next word in a sentence should be. What makes this model unique is the lack of fine-tuning - which sounds weird, right? Usually, the more we tune a model to a specific use case, the better the results we get.

Well, most NLP models needed to be fine-tuned to a certain task on certain data - guessing the next word, for example. This was the first time a big unsupervised model beat a model fine-tuned for a specific task.

GPT-3 was trained only on the task of predicting the next word; by doing that it learned how language works, and therefore it can perform many other NLP tasks.
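To make the “predict the next word” task concrete, here is a toy bigram counter in plain Python. It is of course nothing like GPT-3’s Transformer; it only illustrates what “learning which word tends to come next” means:

```python
# Toy "language model": count which word follows which in a tiny corpus.
corpus = "the cat sat on the mat the cat ran".split()

counts = {}
for prev, nxt in zip(corpus, corpus[1:]):
    counts.setdefault(prev, {})
    counts[prev][nxt] = counts[prev].get(nxt, 0) + 1

def next_word_distribution(word):
    """Probability of each candidate next word, given the previous one."""
    following = counts[word]
    total = sum(following.values())
    return {w: c / total for w, c in following.items()}

print(next_word_distribution("the"))  # "cat" is twice as likely as "mat"
```

A real model replaces the count table with billions of learned parameters and conditions on the whole preceding context, not just one word, but the training signal is the same.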

Its model is based on Transformers, but with decoder layers only, trained on 300 billion tokens (words or parts of them) collected from the Internet and from books.
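Since “tokens” come up a lot, here is a rough sketch of how sub-word tokenization can split a word. GPT-3 actually uses byte-pair encoding with a learned vocabulary; this greedy longest-match version with a hand-written toy vocabulary only shows the idea:

```python
def greedy_tokenize(word, vocab):
    """Greedy longest-match sub-word split, a rough stand-in for BPE."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])  # longest known fragment wins
                i = j
                break
        else:
            tokens.append(word[i])  # unknown character becomes its own token
            i += 1
    return tokens

vocab = {"under", "water", "swim", "ming"}   # hypothetical toy vocabulary
print(greedy_tokenize("underwater", vocab))  # ['under', 'water']
print(greedy_tokenize("swimming", vocab))    # ['swim', 'ming']
```

This is why a count like “300 billion tokens” is larger than the number of whole words in the training data: rare words get split into several pieces.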

This model made so much noise in the world that, in a short time, many tools and products were built with GPT-3.

Other models that exist for different purposes and are gaining momentum:

LaMDA by Google
Its main purpose is text completion and text composition (chatbots, etc.). This model is built on the Transformer architecture and trained on dialogue and conversation data.

BERT by Google
Its main purpose is classification tasks, entity extraction, and question answering. It is based on the Transformer but with encoder layers only, meaning it uses only the attention layers; therefore this model can easily be trained on different languages.

Wu-Dao by Beijing Academy
Built on an architecture similar to GPT-3’s, but with 1.75 trillion parameters. It was trained on 4.9 terabytes of images and text (Chinese and English) and is considered the biggest AI model at the moment.

GPT-J by EleutherAI
Built on an architecture similar to GPT-3’s, but trained on a smaller, higher-quality dataset, with ~6 billion parameters. In terms of performance and accuracy, GPT-3 wins. The advantage of GPT-J is that it is open source and free to use.

Generate an Image

The era before diffusion models
Generative models of images became trendy after generative adversarial networks (GANs) were introduced in 2014. A GAN is a clever way of training two sub-models that compete with one another: the generator model is trained to generate new examples, while the discriminator model tries to classify examples as real or fake. Both models are trained together in a zero-sum game, until the discriminator fails 50% of its trials.
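The zero-sum objective can be written down directly. The snippet below (plain Python, no training loop) just evaluates the standard GAN losses at the equilibrium point, where the discriminator outputs 0.5 for everything, i.e. it fails half of its trials:

```python
import math

def discriminator_loss(d_real, d_fake):
    # The discriminator wants d_real -> 1 (real called real)
    # and d_fake -> 0 (fake called fake).
    return -(math.log(d_real) + math.log(1 - d_fake))

def generator_loss(d_fake):
    # The generator wants its fakes to be called real (d_fake -> 1).
    return -math.log(d_fake)

# At equilibrium the discriminator can no longer tell real from fake,
# so it outputs 0.5 for both kinds of input:
print(discriminator_loss(0.5, 0.5))  # 2 * ln(2) ≈ 1.386
print(generator_loss(0.5))           # ln(2) ≈ 0.693
```

During training, each model’s gradient step pushes its own loss down, which pushes the other model’s loss up; that is the zero-sum game.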

The problem with this approach is that it is very hard to train it to do something creative and interesting. Once the generator has solved the problem and beats the game, there is no incentive to generate something very different.

To solve this problem, researchers came up with the diffusion model: a model that takes a piece of data and gradually adds noise to it until it is no longer recognizable, then tries to reconstruct the image from that point back to its original form. By doing that, the model learns how to generate an image from any data. When you train this model, you can train it on different images with different amounts of noise; that way, the model learns to predict, for a new image, the noise at any level we want.
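The “gradually add noise” step (the forward diffusion process) is easy to sketch in NumPy. The beta schedule below is an assumption (a standard DDPM-style linear schedule), and the 8x8 array stands in for an image:

```python
import numpy as np

def add_noise(x0, t, betas, rng):
    """Forward diffusion: blend the clean image x0 with Gaussian noise.

    As the step t grows, alpha_bar shrinks toward 0 and x_t approaches pure noise.
    """
    alpha_bar = np.prod(1.0 - betas[: t + 1])
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise
    return x_t, noise  # the network is trained to predict `noise` from (x_t, t)

rng = np.random.default_rng(0)
x0 = np.ones((8, 8))                   # stand-in "image"
betas = np.linspace(1e-4, 0.02, 1000)  # assumed linear noise schedule
x_mid, _ = add_noise(x0, 500, betas, rng)
x_end, noise = add_noise(x0, 999, betas, rng)  # almost indistinguishable from noise
```

Generation then runs the learned process in reverse: start from pure noise and repeatedly subtract the predicted noise until an image emerges.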

Taken from the NVIDIA developer website.

“Art for the people…”
In January 2021 OpenAI unveiled the original DALL·E. The tool impressed AI experts and the public thanks to its ability to turn any written description into a unique image.

DALL·E 2 by OpenAI
Its main purpose is to generate realistic images from a text description in natural language. It can generate a new unique image, add new information to an existing image, or create different variations of an image. The architecture consists of two main parts:

  1. A prior that converts the caption into an image embedding (a vector)
  2. A decoder that turns the image representation into an image

The text and image representations come from another technology called CLIP: a neural network model that returns the best caption describing an image. It is a contrastive model, meaning it doesn’t classify an image but matches it to a caption. It was trained on image-caption pairs from the Internet. Using two encoders that turn the image and the text into their embedding representations, it finds the highest similarity value between the two vectors. (Above the dotted line you can see the CLIP training process.)
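The matching step can be sketched with cosine similarity. The random vectors below are hypothetical stand-ins for what CLIP’s image and text encoders would produce; only the similarity computation reflects the real mechanism:

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Hypothetical embeddings; in CLIP these come from the two trained encoders.
rng = np.random.default_rng(1)
image_embeddings = normalize(rng.standard_normal((3, 512)))    # 3 images
caption_embeddings = normalize(rng.standard_normal((3, 512)))  # 3 captions

# Entry (i, j) is the cosine similarity between image i and caption j.
similarity = image_embeddings @ caption_embeddings.T
best_caption_per_image = similarity.argmax(axis=1)
print(similarity.shape)  # (3, 3)
```

Training pushes the similarity of matching image-caption pairs up and of mismatched pairs down, which is what “contrastive” means.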

The prior, which is a diffusion model, takes the CLIP text embedding and creates a CLIP image embedding.

The decoder is also a diffusion model, combined with the GLIDE model, which supports image embeddings. So the decoder in this case is conditioned on the original image and, in addition, on the caption and the CLIP embedding. (Below the dotted line you can see the text-to-image process with the prior and decoder.)
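Putting the pieces together, the text-to-image flow is caption → CLIP text embedding → prior → image embedding → decoder → pixels. The functions below are placeholder stubs (the real components are large diffusion networks); only the data flow is meant to be accurate:

```python
import numpy as np

# Placeholder stubs for the real networks -- only the data flow is real.
def clip_text_encoder(caption: str) -> np.ndarray:
    """Caption -> CLIP text embedding (stub: deterministic random vector)."""
    seed = sum(map(ord, caption)) % 2**32
    return np.random.default_rng(seed).standard_normal(512)

def prior(text_embedding: np.ndarray) -> np.ndarray:
    """Diffusion prior: CLIP text embedding -> CLIP image embedding (stub)."""
    return 0.9 * text_embedding

def decoder(image_embedding: np.ndarray, caption: str) -> np.ndarray:
    """GLIDE-style diffusion decoder: embedding (+ caption) -> pixels (stub)."""
    return np.zeros((64, 64, 3))

caption = "a teddy bear swimming underwater"
text_emb = clip_text_encoder(caption)
image_emb = prior(text_emb)
image = decoder(image_emb, caption)
print(image.shape)  # (64, 64, 3)
```

The real system also upsamples the decoder’s output to the final resolution, but the three-stage shape of the pipeline is the same.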

DALL·E 2 is currently available as an API for everyone.

High-level overview of how DALL·E 2 works. Taken from the DALL·E 2 paper.

Stable Diffusion
An open-source alternative to DALL·E 2, developed by Stability AI and based on latent diffusion models. The model was trained on over 5 billion image-text pairs taken from LAION-5B, a publicly available dataset derived from Common Crawl (a non-profit organization that crawls the web). So how is it different from DALL·E 2? Well, if I had to pinpoint something, it would be the resolution: if you need a high-resolution image, Stable Diffusion can get up to 1024x1024 (!)

Stable Diffusion is available as an API.

Playing with Stable Diffusion. A photorealistic image of Pikachu fine dining with a view from a tower.

Midjourney
An independent research lab that creates images from textual descriptions. Their tool is currently in open beta, and they have a big community on Discord that generates artworks using a bot command (yeah, Discord is currently their UI for trying the service).

Generate a Video

There are some projects that take Generative AI to the next level. Generating videos from text is a very challenging task due to various factors, such as high computational cost, video length, and a lack of high-quality training data.

Phenaki (by Google)
This project is very ambitious because, unlike other projects, it focuses on creating long-form videos. I would call it text-to-video-scene: you describe in detail the scenes you want to generate, and it can generate a whole video from multiple scenes. It can also generate a video from a still image and a text description. Their paper doesn’t give many details about the models, but for the first time we have a study on generating a video according to time-variable prompts, meaning their algorithm knows how to treat the scenes as a sequence in time.

A 2-minute video generated using a long sequence of prompts. Taken from the Phenaki website.

Make-A-Video (by Meta)
A model that generates video from text. Meta trained the model on 2.3 billion text-image pairs and 20 million unlabeled videos.

A golden retriever eating ice cream on a beautiful tropical beach at sunset, high resolution. Taken from meta website.

Imagen Video (by Google Research)
One of the promises of this project is generating high-resolution video output. Their model is based on cascaded diffusion models, an effective method for scaling diffusion models to very high-resolution outputs.

What's Next?

There are useful Generative AI applications today for various use cases: images, text, audio, code and video. Once developers get access to text-to-video models, we will see a new wave of applications.
A very interesting case will be when these models are deployed in the Metaverse and Web3, which already include a lot of digital content assets.