Text-to-video AI has arrived — and it's terrifying

MagusWazir/Twitter | 3deal/Reddit

Artificial intelligence is evolving at an alarmingly rapid pace, and keeping up with it is nearly impossible.

Just a few years ago, we saw the rise of deepfakes. Then came ChatGPT, followed by image generators like DALL-E 2, Stable Diffusion, and Midjourney. When prompted correctly, these programs can now produce detailed and realistic images.

Despite most people having a minimal understanding of AI-generated images, a new monster has already entered the digital arena: text-to-video.

You may have seen a few bizarre AI-generated clips circulating on social media over the past few days. The short videos feature simple yet obscure concepts like Will Smith eating spaghetti or Donald Trump meeting Godzilla.

“Will Smith eating spaghetti” generated by Modelscope text2video

credit: u/chaindrop from r/StableDiffusion pic.twitter.com/ER3hZC0lJN

— Magus Wazir (@MagusWazir) March 28, 2023

Trump VS Godzilla – ModelScope + Img2Img
by u/3deal in StableDiffusion

You would probably argue that these AI-generated files look quite clunky and awkward, but it’s only a matter of time before they are indistinguishable from a real video clip. And as one can imagine, that brings with it a batch of new problems.

To better understand the new technology, Daily Hive spoke with Rijul Gupta, a machine-learning engineer named a “DeepFake thought leader” by Forbes. Gupta is also the founder and CEO of DeepMedia.AI, an American company that builds datasets to power high-accuracy DeepFake detection for the US Government and incorporates synthetic faces and voices into its Universal Translator to help people communicate.

He says we are less than a calendar year away from a massive boost in AI-generated video quality.

“The quality of these text-to-video models is kind of where these Stable Diffusion [text-to-image] models were six to 12 months ago. So it’s normal to think that six to 12 months from now, we’ll have text-to-image or text-to-video models that are really, really high resolution,” Gupta says, adding that “soon, we will have text-to-voice models and voice-to-video models that are really, really high quality. And so, you’ll be able to type in some words, and instead of having an image of the Pope in a puffy jacket, you’ll have a video of the Pope rapping and breakdancing in a puffy jacket.”

At the moment, the entrepreneur sees two problems with AI videos. Firstly, individual frames are not yet fully realistic. Secondly, the temporal qualities don’t match on a frame-to-frame basis. Both will likely be solved shortly, though.

“It would follow the same kind of pattern that happened with text-to-image models,” he explains. “You’re going to get really high-quality individual frames, then they’re going to scrape a bunch of stock videos, try the temporal networks on the stock videos instead of the stock images (…) then integrate those new techniques into these text video models.”

With the emergence of AI voice matching and the ability to alter and create realistic videos of just about anything at everyone’s fingertips, there is an immeasurable level of harm that can be done by falsifying the words and actions of public figures and society at large.

When asked whether the technology has entered a danger zone, Gupta agrees that we are already past a point of no return. “It’s over. The genie’s out of the bottle.”

Despite that grim realization, though, he is optimistic that we can steer things in a better direction.

“At this point, the only thing we can do is build detection software and integrate that across all video media audio platforms. There’s a solution here. It’s not impossible. But governments and corporations need to step up,” Gupta explains. “Because you can’t expect the average person to be that truth-teller. It’s the responsibility of companies like Twitter and Facebook and the government to say that this is real and this is not real.”

While Gupta and his team at DeepMedia try to stay ahead of the curve, he says his ultimate goal is “to give people, governments, and institutions the ability to determine truth from fiction.”

As for whether software companies behind these art generators should be held responsible for the product their AI generates based on user prompts, he says they should.

“I don’t believe that as tech leaders, our responsibility ends with the code. Our responsibility is building applications that cannot be abused,” he says.

“You wouldn’t normally hear that answer from a tech CEO, right? Most tech CEOs try and pass the buck and say, ‘We create the application. What people do with it is their own business,’” Gupta explains after revealing that his company has had the tools to release an AI celebrity voice app for quite some time but opted not to for ethical reasons.

“We’ve had that capability for the past three years. What we’ve been working on is how to build applications around it that exist in walled gardens. So when people upload videos of themselves or someone like Barack Obama, the only thing they can do is translate it and vocalize it in foreign languages because that is an ethical-only application.”

The tech CEO says those working in the field are not surprised by the current level of messiness and chaos, even comparing it to the “Wild West.” Even so, he believes that AI still has the power to make a positive impact.

“The first applications are going to be porn and scams. But the applications that come first are not the applications that change the world. The applications that change the world are the ones that provide a lot of value to people and a lot of economic value to individuals and businesses.”