79 | OpenAI's Sora: From Text to Video in the Blink of an AI
OpenAI's revolutionary new tool may spur a resurgence of professional journalism
Introduction
Today’s StrefaTECH topic is a departure from prior themes because I can’t let what may be a watershed moment in the development of AI pass without comment: OpenAI’s announcement of Sora, its truly revolutionary tool for creating videos from text prompts.
Sora isn’t available yet, and OpenAI hasn’t announced its pricing plans. It may be that few of us need to generate videos. But just as ChatGPT brought us AI-based conversations and DALL-E made it easy to create really good images, Sora promises to make it possible for anyone to create a video just by describing it.
There’s a theme that underpins all of these generative AI technologies: it’s “easy” to create “really good” outputs, whether that’s stories and essays in ChatGPT, cartoons and photorealistic images in DALL-E, or (soon) movie-like scenes and fantasy videos in Sora. The “easy” part is that the technology is available to anyone with a computer or phone who can interact in one of many languages and has the patience to iterate until the AI tool generates something acceptable. The “really good” output is seldom perfect: ChatGPT hallucinates false information, DALL-E struggles with spelling and often fails to incorporate simple elements of the user’s request, and Sora’s videos can include objects that mysteriously appear or disappear.
But if you can use something that’s less than perfect, either by correcting the AI-generated draft or by accepting its imperfections, the ease with which these outputs are generated is absolutely amazing!
What Sora Creates
From the OpenAI announcement, “Sora is an AI model that can create realistic and imaginative scenes from text instructions… Sora can generate videos up to a minute long while maintaining visual quality and adherence to the user’s prompt.”
The OpenAI announcement page includes dozens of sample videos that OpenAI asserts were “generated directly by Sora without modification.” Watch the first minute or two of their announcement video to see samples.
How It Works
From my reading of their research paper:
You type a prompt, similar to one you’d use with DALL-E. For example, the prompt for one of my favorites in their announcement was: “A litter of golden retriever puppies playing in the snow. their heads pop out of the snow”
Sora then generates a more descriptive caption, much as DALL-E does.
From that expanded caption, Sora uses some revolutionary approaches to create the videos. For the more technical among you, read the report here. For the rest of us, here’s ChatGPT’s summary (including my prompt asking it to explain what’s going on in layman’s terms!), and see the sketch just below for a taste of one core idea.
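That core idea, per the technical report, is that Sora represents videos as “spacetime patches”: small blocks that span a few frames in time and a small region in space, each flattened into a token for a transformer-based diffusion model. Here’s a minimal sketch of just the patchifying step in Python. To be clear, this is my own illustration of the concept, not OpenAI’s code: the function name and all of the patch sizes are made up, and the real system works on compressed latent representations rather than raw pixels.

```python
# A minimal sketch (not OpenAI's code) of the "spacetime patches" idea
# from the Sora technical report: a video is cut into small 3-D blocks,
# each spanning a few frames and a small pixel region, and each block
# is flattened into one token. All sizes here are made up.
import numpy as np

def to_spacetime_patches(video, pt=4, ph=16, pw=16):
    """Split a video of shape (T, H, W, C) into flattened spacetime patches.

    Returns an array of shape (num_patches, pt * ph * pw * C), where each
    row is one patch spanning `pt` frames and a `ph` x `pw` pixel region.
    Assumes T, H, and W are divisible by pt, ph, and pw for simplicity.
    """
    T, H, W, C = video.shape
    patches = (
        video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
             .transpose(0, 2, 4, 1, 3, 5, 6)   # group the patch indices first
             .reshape(-1, pt * ph * pw * C)    # one row per flattened patch
    )
    return patches

# Example: a 16-frame, 64x64 RGB clip becomes 64 tokens of length 3072.
clip = np.random.rand(16, 64, 64, 3)
tokens = to_spacetime_patches(clip)
print(tokens.shape)  # (64, 3072)
```

Operating on patch tokens like these, rather than on whole fixed-size frames, is what the report credits for Sora’s ability to train on and generate videos of varying durations, resolutions, and aspect ratios.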
What It Means
As if trying to get your head around what ChatGPT does well (and does poorly or scarily) … and then trying to figure out how to beat DALL-E or Midjourney into submission to give you an image that doesn’t have hands with six fingers or a bunch of mispelled wrds … weren’t enough, now there’s this video generator that raises questions like, “what the heck would I do with a video?”1 Sheesh!
But what I believe this actually means is that the way we communicate with each other through our computers and smartphones is going to change. It took a while for emojis to catch on outside of teens texting each other; now they’re common in all kinds of grown-up social media posts, blogs, emails, text messages, and more. Along the way, we’re all becoming more comfortable with photos and videos taken on smartphones, including exploring the boundaries of what’s “OK.” Next up are computer-generated images and, now, videos.
And of course, just because the tools are available for almost anyone’s use doesn’t mean everyone *should* use them. It also goes without saying that this technology opens up a whole new world of potential nastiness—deep-faked audio and images are an enormous fear at the start of this US election year. Now comes video, which is even more compelling, whether used for good or ill.
OpenAI has asserted that the release of Sora will take place only after extensive safety testing and measures.2 However, history has proven repeatedly that the snake-oil salesmen of 150 years ago have counterparts in every generation; it’s just that what they’re offering becomes more and more dangerous as technology advances.
The skepticism you (hopefully!) have when reading an email supposedly from a Nigerian prince offering a once-in-a-lifetime opportunity to become rich? You now need to strap on those same skeptic’s glasses when you see an image or a video.
How To Live In This Era
I envision a resurgence of trusted journalism as the source of information. Over the last decade or so, the amateurs telling us what’s going on in the world via social media, blogs, podcasts, and videos have ruled, pushing out the traditional, mainstream media.
Ironically, I predict that the next generation of tech innovation will reverse the very change that earlier developments (the internet, digital cameras, social media, smartphones) made possible.
In a world where the good folks sharing stories are mixed in with the hucksters telling lies, we will want to return to sources of information that we can trust.
If a fake ‘fact’, story, recording, image, or video gets past those in charge of a news outlet like the New York Times or a political campaign and is later found out, the reputational consequences could be enormous. That makes those sources much more trustworthy than the various influencers and amateur reporters who have displaced them. For those folks, being caught putting out fake content carries little consequence; most can just create a new persona or move to another outlet.
Social media, podcasts, blogs, and the like can still have a valuable place in interpreting what has happened or is going on, and in educating and training. We consumers still have to decide which of them are trustworthy enough to listen to.
But when it comes to important news and facts, I picture and indeed hope for the return of the professionals. They and we need to be sure that when we use these amazing new tools to create content, we can stand behind what we put out there. While our stories, images, recordings, and soon videos may be ‘artificial’, they shouldn’t be misleading. And it’s our reputations on the line, not the technology’s.
So as you explore these tools—and please do!—remember, it’s incumbent on YOU to…
Make Good Choices
How to do that is a question I intend to explore with the various folks I’m working with on AI tools. This could be the “year of video” in many ways, and while it’s really exciting to see what’s possible, it’s also yet one more thing to ponder, learn, and incorporate safely and smartly. Try not to hyperventilate!
From the OpenAI Sora announcement:
We’ll be taking several important safety steps ahead of making Sora available in OpenAI’s products. We are working with red teamers — domain experts in areas like misinformation, hateful content, and bias — who will be adversarially testing the model.
We’re also building tools to help detect misleading content such as a detection classifier that can tell when a video was generated by Sora. We plan to include C2PA metadata in the future if we deploy the model in an OpenAI product.
In addition to us developing new techniques to prepare for deployment, we’re leveraging the existing safety methods that we built for our products that use DALL·E 3, which are applicable to Sora as well.
For example, once in an OpenAI product, our text classifier will check and reject text input prompts that are in violation of our usage policies, like those that request extreme violence, sexual content, hateful imagery, celebrity likeness, or the IP of others. We’ve also developed robust image classifiers that are used to review the frames of every video generated to help ensure that it adheres to our usage policies, before it’s shown to the user.
We’ll be engaging policymakers, educators and artists around the world to understand their concerns and to identify positive use cases for this new technology. Despite extensive research and testing, we cannot predict all of the beneficial ways people will use our technology, nor all the ways people will abuse it. That’s why we believe that learning from real-world use is a critical component of creating and releasing increasingly safe AI systems over time.
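For a feel of what this kind of prompt-level screening can look like in practice, here is a minimal sketch using OpenAI’s publicly available Moderation API. This is my own illustration, not the actual classifier OpenAI describes above for Sora; the helper function and the flow around it are hypothetical.

```python
# A minimal sketch of prompt screening using OpenAI's public Moderation
# API. This illustrates the *kind* of check described in the announcement;
# it is not the actual classifier OpenAI uses for Sora.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def prompt_is_allowed(prompt: str) -> bool:
    """Hypothetical helper: False if the moderation endpoint flags the prompt."""
    result = client.moderations.create(input=prompt)
    return not result.results[0].flagged

prompt = "A litter of golden retriever puppies playing in the snow"
if prompt_is_allowed(prompt):
    print("Prompt passed moderation; hand it to the video model.")
else:
    print("Prompt flagged; refuse to generate.")
```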