Exploring Sora: Unleashing the Power of AI
Estimated Reading Time: 4 minutes
In the vast landscape of technology and innovation, one name making waves is Sora. Whether you're a tech enthusiast, a developer, or simply curious about cutting-edge tools, it's a name to keep on your radar.
## **What is Sora?**
Sora is more than just a word; it is a text-to-video model developed by OpenAI that can create realistic and imaginative scenes from text instructions. That means you write a text prompt, and it generates a video matching the prompt's description. Here's an example from OpenAI:
> Prompt: A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors.
For more examples, visit [OpenAI](https://openai.com/sora).
## **How Does Sora Work?**
Like text-to-image generative AI models such as [**DALL·E 3**](https://www.datacamp.com/tutorial/a-comprehensive-guide-to-the-dall-e-3-api), [**Stable Diffusion**](https://www.datacamp.com/tutorial/stable-diffusion-web-ui-a-comprehensive-user-guide-for-beginners), and [**Midjourney**](https://www.datacamp.com/tutorial/how-to-use-midjourney-a-comprehensive-guide-to-ai-generated-artwork-creation), Sora is a diffusion model. That means it starts with each frame of the video consisting of static noise and uses machine learning to gradually transform those frames into something resembling the description in the prompt. Sora videos can be up to 60 seconds long.
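To make the denoising idea concrete, here's a minimal, runnable Python sketch. The `predict_noise` stub stands in for Sora's trained neural network (which is not public), and all shapes and step sizes are illustrative assumptions, not Sora's actual values:

```python
import numpy as np

def predict_noise(frames: np.ndarray, step: int) -> np.ndarray:
    """Stand-in for the trained denoising network. In a real diffusion
    model this is a neural net conditioned on the text prompt; here we
    return a placeholder estimate so the loop runs."""
    return frames * 0.1  # hypothetical: treat 10% of the signal as noise

def generate_video(num_frames=16, height=64, width=64, steps=50):
    # Each frame of the video starts as pure static noise.
    frames = np.random.randn(num_frames, height, width, 3)
    for step in reversed(range(steps)):
        # Gradually remove the predicted noise, one step at a time.
        frames = frames - predict_noise(frames, step)
    return frames

video = generate_video()
print(video.shape)  # (16, 64, 64, 3): frames x height x width x RGB
```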
### **Solving temporal consistency**
One area of innovation in Sora is that it considers several video frames at once, which solves the problem of keeping objects consistent when they move in and out of view. In the following video, notice that the kangaroo's hand moves out of the shot several times, and when it returns, the hand looks the same as before.
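As a rough illustration of why processing many frames jointly helps, the toy sketch below computes self-attention over patch tokens from all frames at once, so every token can draw information from every frame. The shapes and the absence of learned projections are simplifications, not Sora's actual design:

```python
import numpy as np

def joint_frame_attention(tokens: np.ndarray) -> np.ndarray:
    """Toy self-attention over patch tokens from all frames at once.

    Because every token can attend to tokens from every other frame,
    information about an object can persist even while it is off-screen.
    Real models add learned projections and many heads; this is the bare idea.
    """
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ tokens

# 8 frames x 4 patches per frame, 32-dimensional embeddings (illustrative).
tokens = np.random.randn(8 * 4, 32)
print(joint_frame_attention(tokens).shape)  # (32, 32)
```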
### **Combining diffusion and transformer models**
Sora combines the use of a diffusion model with a [**transformer architecture**](https://www.datacamp.com/tutorial/an-introduction-to-using-transformers-and-hugging-face), as used by GPT.
When combining these two model types, Jack Qiao [**noted**](https://github.com/lucidrains/DALLE-pytorch/discussions/375) that "diffusion models are great at generating low-level texture but poor at global composition, while transformers have the opposite problem." That is, you want a GPT-like transformer model to determine the high-level layout of the video frames and a diffusion model to create the details.
In [**a technical article on the implementation of Sora**](https://openai.com/research/video-generation-models-as-world-simulators), OpenAI provides a high-level description of how this combination works. In diffusion models, images are broken down into smaller rectangular "patches." For video, these patches are three-dimensional because they persist through time. Patches can be thought of as the equivalent of "tokens" in large language models: rather than being a component of a sentence, they are a component of a set of images. The transformer part of the model organizes the patches, and the diffusion part of the model generates the content for each patch.
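The sketch below shows what cutting a video into spacetime patches might look like in practice. The patch sizes and tensor layout are illustrative assumptions; OpenAI has not published Sora's actual values:

```python
import numpy as np

def video_to_patches(video: np.ndarray, pt=2, ph=8, pw=8) -> np.ndarray:
    """Cut a video into 3D "spacetime patches", the video analogue of
    text tokens.

    video: (time, height, width, channels); dimensions must divide evenly
    returns: (num_patches, pt * ph * pw * channels), one row per patch
    """
    t, h, w, c = video.shape
    patches = (
        video.reshape(t // pt, pt, h // ph, ph, w // pw, pw, c)
             .transpose(0, 2, 4, 1, 3, 5, 6)   # group the patch axes together
             .reshape(-1, pt * ph * pw * c)    # flatten each patch into a row
    )
    return patches

video = np.random.randn(16, 64, 64, 3)
tokens = video_to_patches(video)
print(tokens.shape)  # (512, 384): 8*8*8 patches, each flattened to 384 values
```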
Another detail of this hybrid architecture is that, to make video generation computationally feasible, the process of creating patches includes a [**dimensionality reduction**](https://www.datacamp.com/blog/introduction-to-unsupervised-learning) step, so that computation does not need to happen on every single pixel of every single frame.
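As a rough illustration of that dimensionality reduction step, the sketch below compresses each frame into a small latent vector before any further processing. In the real system a trained video autoencoder does this compression; the random linear projection here exists purely to show the shape change:

```python
import numpy as np

rng = np.random.default_rng(0)

def compress(frames: np.ndarray, latent_dim=64) -> np.ndarray:
    """Reduce each frame to a low-dimensional latent vector. A random
    projection stands in for a trained encoder, purely for illustration."""
    t, h, w, c = frames.shape
    flat = frames.reshape(t, h * w * c)                  # (16, 12288)
    projection = rng.normal(size=(h * w * c, latent_dim))
    return flat @ projection                             # (16, 64)

frames = rng.normal(size=(16, 64, 64, 3))
latents = compress(frames)
# Downstream patching and attention run on the small latents, not raw pixels.
print(frames.reshape(16, -1).shape, "->", latents.shape)  # (16, 12288) -> (16, 64)
```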
### **Increasing fidelity of video with recaptioning**
To faithfully capture the essence of the user's prompt, Sora uses a [**recaptioning**](https://arxiv.org/html/2401.11708v1) technique that is also used in DALL·E 3. This means that before any video is created, GPT rewrites the user's prompt to include much more detail. Essentially, it's a form of automatic prompt engineering.
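You can approximate this recaptioning step yourself with the OpenAI API. The sketch below is a stand-in, not Sora's actual pipeline: the model choice and system instruction are assumptions, since OpenAI has not published the real recaptioning prompt:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def recaption(user_prompt: str) -> str:
    """Rewrite a terse user prompt into a richly detailed one before
    video generation. Model and instruction below are illustrative."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model choice
        messages=[
            {"role": "system",
             "content": "Expand the user's video idea into a detailed "
                        "description covering subjects, setting, lighting, "
                        "camera movement, and visual style."},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content

print(recaption("a dog running on a beach"))
```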
## **What are the Limitations of Sora?**
OpenAI notes several limitations of the current version of Sora. Sora does not have an implicit understanding of physics, and so "real-world" physical rules may not always be adhered to.
One example is that the model does not understand cause and effect: in the following video of a basketball hoop exploding, notice that the net appears to be restored after the explosion.
### **Unanswered questions on reliability**
The reliability of Sora is currently unclear. All the examples from OpenAI are very high quality, but it is unclear how much cherry-picking was involved. When using text-to-image tools, it is common to generate ten or twenty images and then choose the best one. It is unclear how many videos the OpenAI team generated to get the ones shown in their announcement article. If you need to generate hundreds or thousands of videos to get a single usable one, that would be an impediment to adoption. To answer this question, we must wait until the tool is widely available.
- Tags:
- #technology