Blog Hour

Sora by OpenAI Set to Revolutionize AI Video Generation

OpenAI, the company behind the well-known chatbot ChatGPT, has introduced Sora, a text-to-video model that can create videos up to one minute long from a text prompt. It can generate imaginative and realistic scenes from text instructions.

Sam Altman, the CEO of OpenAI, posted on X asking users to suggest complex prompts that would showcase Sora's capabilities. He then shared the resulting videos, which looked vivid and realistic.

Visual Data Training

Just as ChatGPT is trained on a large text dataset that is broken into tokens, Sora is trained on a large set of images and videos that are broken into small units called visual patches. According to the research team of OpenAI, “Whereas LLMs have text tokens, Sora has visual patches. Patches have previously been shown to be an effective representation for models of visual data. We find that patches are a highly scalable and effective representation for training generative models on diverse types of videos and images.”
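To make the idea of patches more concrete, here is a minimal sketch in Python of how a video can be cut into small blocks that span both space and time, much as text is cut into tokens. It is an illustration only, not OpenAI's code: the function name video_to_patches and the patch sizes are assumptions, and the sketch works on raw pixels, whereas OpenAI describes first compressing videos into a lower-dimensional latent space.

```python
import numpy as np

def video_to_patches(video, pt=2, ph=16, pw=16):
    """Split a video of shape (T, H, W, C) into flattened spacetime patches.

    pt, ph and pw are hypothetical patch sizes (frames, rows, columns).
    Returns an array of shape (num_patches, pt * ph * pw * C).
    """
    t, h, w, c = video.shape
    if t % pt or h % ph or w % pw:
        raise ValueError("video dimensions must be divisible by the patch size")
    # Split every axis into (number of blocks, size of each block).
    blocks = video.reshape(t // pt, pt, h // ph, ph, w // pw, pw, c)
    # Move the block indices to the front and the block contents to the back.
    blocks = blocks.transpose(0, 2, 4, 1, 3, 5, 6)
    # Flatten: one row per patch, similar to one row per token in text models.
    return blocks.reshape(-1, pt * ph * pw * c)

# A toy clip: 8 frames, 64 x 96 pixels, 3 colour channels.
video = np.zeros((8, 64, 96, 3), dtype=np.float32)
patches = video_to_patches(video)
print(patches.shape)  # (96, 1536): 4 * 4 * 6 patches, each 2 * 16 * 16 * 3 values
```

Because the number of patches simply grows or shrinks with the size of the clip, the same representation can describe videos of different durations, resolutions and aspect ratios.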

The Superiority of Sora

Sampling Flexibility

Sora can sample videos in various aspect ratios, from widescreen 1920x1080 to vertical 1080x1920 and everything in between. This allows Sora to create content that fits different devices at their native aspect ratios without manual adjustments. Additionally, Sora can quickly prototype content at lower resolutions before generating the final output at full resolution, all with the same model.
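The arithmetic behind a lower-resolution prototype can be sketched in a few lines of Python. This is not part of any Sora interface: the helpers aspect_ratio and draft_size are hypothetical, and snapping each side to a multiple of 8 is an assumption made here because many video models and codecs expect sizes divisible by a small block size.

```python
from math import gcd

def aspect_ratio(width, height):
    """Reduce a resolution to its simplest ratio, e.g. 1920x1080 -> (16, 9)."""
    g = gcd(width, height)
    return width // g, height // g

def draft_size(width, height, scale, multiple=8):
    """Scale a resolution down, snapping each side to a multiple of `multiple`.

    The snapping step is an assumption of this sketch and can shift the
    aspect ratio very slightly.
    """
    def snap(side):
        return max(multiple, round(side * scale / multiple) * multiple)
    return snap(width), snap(height)

for w, h in [(1920, 1080), (1080, 1920)]:
    print((w, h), aspect_ratio(w, h), draft_size(w, h, 0.25))
# (1920, 1080) (16, 9) (480, 272)
# (1080, 1920) (9, 16) (272, 480)
```

A quarter-scale draft has roughly one sixteenth of the pixels of the final frame, which is why prototyping at low resolution is so much cheaper.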

Improved Framing and Composition

Sora is trained on videos at their native aspect ratios, which improves composition and framing. Many other generative models are trained on videos that have been cropped to a square, which can lead to generated videos in which the subject is only partially in view.
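A small Python sketch shows why square cropping hurts framing. It uses a toy widescreen frame with a made-up "subject" near the left edge; the function center_crop_square and all the numbers are illustrative assumptions, not taken from any real training pipeline.

```python
import numpy as np

def center_crop_square(frame):
    """Crop the largest centred square out of a frame of shape (H, W, ...)."""
    h, w = frame.shape[:2]
    side = min(h, w)
    top = (h - side) // 2
    left = (w - side) // 2
    return frame[top:top + side, left:left + side]

# Toy 1920x1080 frame: zeros for background, ones for the subject.
frame = np.zeros((1080, 1920), dtype=np.uint8)
frame[400:700, 300:600] = 1  # subject occupies columns 300 to 599

cropped = center_crop_square(frame)
kept = cropped.sum() / frame.sum()
print(cropped.shape, f"{kept:.0%} of the subject survives the crop")
# (1080, 1080) 60% of the subject survives the crop
```

The crop keeps only columns 420 to 1499, so part of the subject is cut off. A model that only ever sees frames like this has little chance to learn good framing for off-centre subjects.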

Language Understanding

Text-to-video models need to be trained on a large set of videos with corresponding text captions, so that the model learns to connect a description with the objects and scenes it describes.

Sora was trained on videos that were captioned automatically by another AI model. Training on these highly descriptive captions helps the model produce videos that follow the prompt more accurately.

Image Generation

Sora can also create highly detailed images. The model can generate images of variable sizes, up to 2048 x 2048 resolution.

Prompting and Animating Using Images

Besides text, Sora can be prompted with pre-existing images or videos. It can, for example, animate a still image, which makes it very versatile.

Video-to-Video Editing

Yes, you read that correctly. Sora can edit videos, too. A video of a car driving through a desert can be edited to look as if the car is driving on a road through a lush green rainforest. The model can also extend a video either forward or backward in time.

It can also blend two videos with entirely different subjects and scenes, creating a smooth transition between them.

A Work in Progress

Sora can generate videos with dynamic camera motion: as the camera shifts and rotates, people and other scene elements move consistently with it. The research team is still working to improve and stabilise this.

Interaction with the world is an area that still needs work as the model evolves. For example, a person may take a bite out of a burger in an AI-generated video, yet the burger shows no bite mark afterwards.

The model can also simulate video games such as Minecraft: Sora can control the player and render the game world at the same time.

Sora still has limitations as a simulator. It does not accurately model the physics of many basic interactions, such as glass shattering, and other interactions, like eating food, only sometimes produce the correct change in an object's state.

Even so, this is a promising technology with the potential to change video editing, production and generation, and it will improve with time, much like a person perfecting a skill.