Sora: GenAI's Lumière Train Moment
🌐 How Sora Works and What People Aren't Talking About
Welcome to VP Land 🌐
Thursday evening, my friend texted me a link to OpenAI’s announcement of Sora, to which I replied 🥱
Whoops
I thought it was going to be another hype trailer like we saw with Pika 1.0 - a lot of flash and nice-looking shots, but not much of a real change in the quality of what could actually be generated.
Well, Sora’s demo is a lot different. So I started digging. And researching. And writing.
And that turned into an essay that I’ll just post in its entirety here.
The text-to-video demo videos are honestly the least interesting thing about Sora - the reasons why they look so good, what we can do with existing footage, and what this possibly means for the future of making media are far more interesting, and that’s what this digs into.
Let me know your thoughts, experiences, and if you’re being immediately affected by this.
Regular links and videos will resume next week.
Let’s get into it…
STAGE A
Sora: Video AI Has Left The Station
If you’ve been on the internet over the past two days you’ve undoubtedly encountered the newest, most famous woman in a red dress.
And she isn’t real.
OpenAI just dropped their newest model for creating video from text - Sora.
Like the audience watching the Lumière brothers' train arrive at the station, people have been freaking out.
I’m not one for hyperbole but the sample videos OpenAI shared are pretty extraordinary. It’s a big leap across the uncanny valley.
Sure, pay attention to the tiny details in the videos and you’ll notice little things that are off.
But quick glance, focusing on the subject - these are some of the most realistic looking AI videos we’ve seen.
And it happened a lot quicker than just about anyone was expecting (as many have pointed out, the Will Smith spaghetti video was just a year ago…though we haven’t seen Sora spit out its own version of Will Smith eating for reference).
Runway’s CEO Cristóbal Valenzuela sums it up:
A year's worth of progress is now happening in months. Months' worth of progress will start to happen in days. Days’ worth of progress will soon begin to happen in hours.
I’m not going to rehash the examples - there are a ton of threads doing that, or you can just check out Sora’s webpage.
Also, OpenAI CEO Sam Altman has been taking some publicly submitted prompts and posting the output. Look at those hands!
But lost in the “RIP Hollywood” threadboy posts are some interesting advancements and insights into what else is possible with this new tech and where things are going.
Big Jump in Object Permanence & Consistent Characters
One issue (and giveaway) with AI video is object permanence - say a person walks in front of a sign in the background, blocking it. When the person clears the sign, it has changed, because the AI forgot what was there.
Sora’s videos overcome this issue (most of the time). This is partly due to its ability to model environments (more on that below).
In one example, a group of people walk past a Dalmatian with pretty much zero change to the dog.
Or even in the Tokyo woman example, the street signs she walks past don’t change. Two Minute Papers does a really great visual demo of this (and is also a great dive into Sora’s abilities and breakdown of OpenAI’s paper).
Another big leap - consistent characters.
This is one of the biggest issues with GenAI - the one Michael Cioni mentioned in our chat as the reason he switched focus to UtilityAI: AI generates a new character (and environment, and props, and yada yada…) every time you prompt it.
Now, we’re still nowhere close to having fine-tuned control over the AI shot or putting the same character in a new setting, but Sora doesn’t just output one shot.
It can do up to 60 seconds of video with multiple shots.
Say you prompt it to generate a movie trailer: in that one output there will be multiple shots with the same characters (just look at the spaceman clip).
Connecting & Merging Videos
This feature didn’t get nearly as much attention, but I think it has more immediate, practical uses.
Like we’ve seen with Runway, Pika, and Firefly, the prompt doesn’t have to be text - it can be an existing photo or video.
In this video-to-video example, the setting the car is driving through is changed with a prompt (there are a lot more great examples of the same shot transforming in this thread from Martin Nebelong, including the look of the car itself).
It can also connect and interpolate two different videos, like adding this snowy scene into a Mediterranean cliffside.
Plus it’s got a few more tricks, like generating a looping video and stitching multiple videos together into a seamless edit.
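None of this is publicly available yet, so just to make the workflow concrete, here’s a purely hypothetical sketch of what a video-conditioned, interpolation-style request could look like. The endpoint and every parameter name below are invented for illustration - OpenAI has not published a Sora API.

```python
# Purely hypothetical sketch - this endpoint and all parameter names are invented
# for illustration; OpenAI has not published a Sora API.
import requests

def generate_bridge_clip(video_a_path: str, video_b_path: str, prompt: str) -> bytes:
    """Ask a (hypothetical) video model to interpolate between two clips,
    guided by a text prompt - e.g. blending a snowy scene into a Mediterranean cliffside."""
    with open(video_a_path, "rb") as a, open(video_b_path, "rb") as b:
        response = requests.post(
            "https://example.com/v1/video/interpolate",  # placeholder URL
            files={"video_start": a, "video_end": b},
            data={"prompt": prompt, "duration_seconds": 10},
            timeout=600,
        )
    response.raise_for_status()
    return response.content  # raw bytes of the bridging clip

# Usage (hypothetical file names):
# clip = generate_bridge_clip("snow_scene.mp4", "cliffside.mp4",
#                             "seamlessly transition from snowfall to a sunny coast")
# open("bridge.mp4", "wb").write(clip)
```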
Modeling the World (And Splatting It)
I’m not a data or AI scientist, so my understanding of this may be off, but Sora’s training lets it simulate aspects of the physical world, which is partly why the outputs look so realistic.
Just look at the reflections of Tokyo in the woman in red’s glasses. Or the clip that made me question reality the most - filming Tokyo through a train with reflections in the window.
As OpenAI puts it in the technical report: “These capabilities suggest that continued scaling of video models is a promising path towards the development of highly-capable simulators of the physical and digital world, and the objects, animals and people that live within them.”
Tim Brooks, who worked on Sora, posted the bigger vision behind this: “AGI [Artificial General Intelligence] will be able to simulate the physical world, and Sora is a key step in that direction.”
This direction isn’t new - Runway’s Valenzuela reshared a post from December 2023 about Runway’s General World Models.
Data scientist Ralph Brooks thinks Sora might’ve been trained on synthetic footage rendered in Unreal Engine.
With the ability to move the camera through this 3D world, people have been running the outputs through 3D Gaussian Splatting and NeRF pipelines to extract the environment and load it into Unreal or Unity, where you can do pretty much whatever you want with it.
This feels like the next generation of quickly creating virtual environments, truly creating a 3D space versus extracting 2.5D elements from a static background.
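If you want to try this yourself once you have a generated clip (from Runway or Pika today, Sora eventually), the rough workflow is: dump the video to frames, solve camera poses, then train a splat or NeRF on those frames. Here’s a minimal Python sketch of the first step, assuming OpenCV is installed; the filename, frame-skip value, and the nerfstudio commands in the comments are placeholders describing one common route, not the only one.

```python
# Rough sketch: turn a generated clip into frames for a Gaussian Splat / NeRF pipeline.
# Assumes OpenCV is installed (pip install opencv-python); "sora_clip.mp4" is a placeholder.
import os
import cv2

VIDEO_PATH = "sora_clip.mp4"   # hypothetical generated clip
OUT_DIR = "frames"
EVERY_N = 3                    # keep every 3rd frame to cut down on redundant views

os.makedirs(OUT_DIR, exist_ok=True)
cap = cv2.VideoCapture(VIDEO_PATH)

index = saved = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if index % EVERY_N == 0:
        cv2.imwrite(os.path.join(OUT_DIR, f"frame_{saved:05d}.png"), frame)
        saved += 1
    index += 1
cap.release()
print(f"Wrote {saved} frames to {OUT_DIR}/")

# From here you'd run structure-from-motion and training - for example with
# nerfstudio (commands are an assumption about your local setup):
#   ns-process-data images --data frames --output-dir processed
#   ns-train splatfacto --data processed    # or ns-train nerfacto for a NeRF
```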
There’s A Lot of Compute Making This Look So Good
Besides the realistic-looking video, my first thought when I saw the Sora clips was, “Why does this look so much better than DALL-E 3?”
Take any frame from any of the videos, and the photorealistic quality looks like something from Midjourney (MJ), not DALL-E 3, which is also from OpenAI.
Here’s a quick comparison showing MJ’s superior photorealism. And for the Sora/MJ comparison, Nick St. Pierre has a whole thread running the same prompts through MJ.
My tweet questioning the quality difference got a lot of replies.
Two explanations seem plausible: OpenAI is limiting the quality of its public models for security reasons, and it doesn’t have enough compute to render this level of quality for everyone.
Security: Very plausible. Sora is not public; access is limited to a select group of red teamers and a handful of artists and filmmakers. The realism is fun and mind-blowing until we start seeing hyperreal deepfakes and questioning everything (especially in a US presidential election year).
Compute: Sora’s paper breaks this down - the more compute, the better the quality. They’re also training the model on high-resolution videos with no cropping - something that also takes a lot of power.
How much compute this actually takes - and how much it would take if this goes public - is beyond me, but I imagine ‘a lot’ is an understatement.
There’s also the environmental and thermal footprint issue. As Alan Lasky points out - “it is going to take a lot of water to cool all these GPU and Custom AI Chip servers.”
What Does This Mean for Creatives?
Right now, not much.
This isn’t public (yet), but it’s obviously a hint of things to come - and, I think, a splash in the face for a lot of us: things are moving even faster than we thought they would.
We have a new baseline - this is the worst AI video will ever look.
It’s pretty clear that stock footage is going to be mostly replaced by AI. But that shouldn’t be a surprise - 6 months ago we covered how Shutterstock was integrating AI tools.
What about camera crews? Editors? Directors?
I don’t think those roles are going anywhere either.
What I do think is AI will replace teams with individuals.
One person will be able to do what takes a team of five or ten today.
What’s the best way to prepare for this? As James Knight put it: don’t get your AI info from others - dig in and mess around with it yourself.
And what about Runway and Pika and the other tools working on AI video?
I’ll end with this closing note from Valenzuela:
SPONSOR MESSAGE
MASV is a faster, easier way to transfer large files to anyone, anywhere in the world—with just an internet connection.
Trusted by media and entertainment organizations across the globe, MASV plays an essential role in thousands of video workflows, giving production professionals absolute confidence in media delivery.
Unlike other file transfer services, MASV has no size limits and always delivers blazing-fast transfer speeds.
CASTING CALL
VP Gigs
Virtual Production Stage Technician, StageCraft
Industrial Light & Magic
Virtual Production Manager, CBS News
Paramount/CBS
Senior Concept Artist
ILM London
Producer/Project Manager
Cuebric
Unreal FX Artists
WildBrain Studios
CALL SHEET
📆 Upcoming Events
February 28-March 1
Mo-Sys Academy: 3-Day Camera Technician’s Course
🆕 February 27-April 2
Color Management for Virtual Production - Online
🆕 March 4-7
A Digital Davos - LEAP x DeepFest
March 11-15
Mo-Sys Academy: 5-Day ICVFX Course
🆕 February 11-April 22
Color Management for Virtual Production - On Set
View the full event calendar and submit your own events here.
ABBY SINGER
Thanks for reading VP Land!
Have a link to share or a story idea? Send it here.
Interested in reaching media industry professionals? Advertise with us.