How deep learning could give us more Firefly

Take my love, take my land. Take me where I cannot stand.

07 October 2020


We all want more Firefly, so let’s do a thought experiment about how deep learning could get us there!

The Screenplay

Technologies like GPT-3 show that we now have deep learning models that can generalize and reproduce a broad variety of text outputs. For instance, GPT-3 allegedly went undetected on r/AskReddit for a week or so. GPT-3 is what’s known as a “one-shot” or “few-shot” learner: it has a baked-in ability to recognize the type of output you want from a single example or a handful of examples. So let’s say we fine-tune GPT-3 on a bunch of TV show scripts and sci-fi novels, then show it the actual original Firefly screenplay and see what it produces. “Fine-tuning” is something you can already do with GPT-2: you continue training on more data to make the model more purpose-built for your particular task. It lets you specialize a giant model without starting from scratch every time.
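To make the “few-shot” idea a little more concrete, here is a minimal sketch of how such a prompt could be assembled. It only builds the text; no model is called. The function name `build_few_shot_prompt` is hypothetical, and the example scene is invented placeholder text, not a line from any real Firefly script.

```python
def build_few_shot_prompt(examples, new_heading):
    """Join example scenes and an unfinished scene heading into one prompt."""
    parts = []
    for heading, body in examples:
        parts.append(f"{heading}\n\n{body}\n")
    # The trailing heading is the cue: a language model would be asked
    # to continue the text from here in the same style as the examples.
    parts.append(f"{new_heading}\n\n")
    return "\n".join(parts)


# Invented placeholder scene (an assumption for illustration only).
examples = [
    ("INT. CARGO BAY - DAY", "CAPTAIN\nEverybody stay calm."),
]
prompt = build_few_shot_prompt(examples, "INT. GALLEY - NIGHT")
print(prompt)
```

The more example scenes you put ahead of the final heading, the more the prompt moves from “one-shot” toward “few-shot”.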

Stay tuned, I think I will give this a shot on my own sometime soon! If I get around to it, I will post an update here, with the results and the code!

EDIT: Oh look, someone already did it! There’s a lot of attention on GPT-3 for this task already.

What a screenplay looks like

Just for reference, screenplays are highly standardized with very specific syntax and formats so they are universally accessible and interpretable by directors and producers. They include descriptions, actions, and dialog.
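Because the format is so regular, even a few rules can pull descriptions, actions, and dialog apart. The sketch below is a rough illustration, not a full screenplay parser. It assumes scene headings start with INT. or EXT., that an all-caps line is a character cue, and that everything up to the next blank line after a cue is dialog. The function name and the sample scene are made up for this example.

```python
def classify_screenplay(text):
    """Label each non-blank line as heading, character, dialog, or action."""
    labeled = []
    in_dialog = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            in_dialog = False  # a blank line ends a dialog block
            continue
        if line.startswith(("INT.", "EXT.")):
            kind = "heading"
            in_dialog = False
        elif in_dialog:
            kind = "dialog"
        elif line.isupper():
            kind = "character"  # assumed convention: cues are all caps
            in_dialog = True
        else:
            kind = "action"
        labeled.append((kind, line))
    return labeled


# Invented sample scene (placeholder text, not from a real script).
sample = """INT. CARGO BAY - DAY

The crew gathers around a crate.

CAPTAIN
Nobody touches it until I say so.
"""

for kind, line in classify_screenplay(sample):
    print(f"{kind:10} {line}")
```

Labeled lines like these are the kind of structure a text model could learn to reproduce, and that a later video or speech stage could consume.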

Here’s an example of what one looks like on paper

The Video

First we need to break it down into chunks. We don’t need a deep neural network to generate an entire 44-minute episode in one go. Episodes are broken down into scenes, and scenes are broken down into cuts, or “transitions”. Every time the camera cuts away to a new perspective, a new film segment begins. Some directors and producers make use of rapid cuts, meaning we would only need to generate a few seconds of video at a time and then stitch it together. Firefly, which I just rewatched with my partner, seems to be pretty standard. A long cut in that show would be 30 to 60 seconds, but most are shorter: close-ups of faces and dialog, with some action sequences and establishing shots of Serenity and scenery. DVD-GAN is already on the way to full video synthesis.
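As a back-of-the-envelope check on the chunking idea, the sketch below counts how many separately generated clips a 44-minute episode would need at different average cut lengths. The 30- and 60-second values come from the “long cut” range above; the 5-second average is purely an assumed figure for a typical short cut, and `clips_needed` is a hypothetical helper.

```python
import math

EPISODE_MINUTES = 44  # episode length used in this post


def clips_needed(episode_minutes, avg_cut_seconds):
    """Number of clips to generate and stitch for one episode."""
    return math.ceil(episode_minutes * 60 / avg_cut_seconds)


# 5 s is an assumption; 30 s and 60 s are the "long cut" bounds.
for avg_cut in (5, 30, 60):
    count = clips_needed(EPISODE_MINUTES, avg_cut)
    print(f"{avg_cut:>2} s average cut -> {count} clips")
```

Under these assumptions an episode is somewhere between 44 and 528 clips, so the generation problem is many short videos rather than one long one.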

Scenery and Settings

GANs (Generative Adversarial Networks) have become exceptionally good at generating realistic images from basic information. NVIDIA released a library called Imaginaire that attempts to standardize this technology, making it more and more accessible. I think it’s only a matter of time before this increases in sophistication and quality. Right now, text-to-image technology leaves a bit to be desired! As GPUs get more powerful and data increases, we will inevitably see better models.

Characters

We humans (and yes, I’m a human) are finely calibrated to recognize faces. The uncanny valley has been the death of many early technologies, from CGI to video games. Sites like This Person Does Not Exist, however, demonstrate that we are well past the uncanny valley of face generation. Nathan Fillion has already been deep-faked into live-action footage. So why not Firefly?

The Audio

Speech

Text-to-speech is nothing new. The latest and greatest speech synthesis adds inflection, tone, and style tags. The subtle quality of human emotion poured into speech can be summed up as “prosody”. Here’s a creepy-realistic example of prosody embedding!

Sound Effects

Back in 2016, MIT published some work about natural sound generation for video. Today, Adobe can provide this as a service, and it’s eerily high quality. Seriously, check out this train.

Music

Synthetic music is also nothing new; it’s just getting way better. And today, music AI is reaching an entirely new level of beautiful.

Implications

Netflix and GAN?

So what does this mean? I think the only logical conclusion is we’re going to see a massive explosion of consumer media. I suspect that Netflix and Amazon are already hard at work creating fully synthetic consumer media, and if they aren’t, they soon will be. We’ll probably see books and short stories first, but it’s only a matter of time before that morphs into TV and movies. Imagine this: Netflix creates a library full of tens of thousands of movies and shows, all procedurally generated and rated by the masses. Those that are good percolate up. They are filled with actors who never existed, and written and produced by directors who never existed. Right now, though, this level of automated media production is still prohibitively expensive. It took millions of dollars just to train GPT-3, which can only do generalized text tasks. It will take something a bit more powerful and sophisticated to do high-quality TV screenplays and 4K 60 fps film.

IP Laws

I haven’t the foggiest clue as to how this is going to play out. I suspect that IP (intellectual property) laws will say that AI models trained in-house are proprietary and that any data they use is part of the model. Thus, companies will need rights or a license to train on other TV shows and movies. You can’t just copy the screenplays of every Marvel movie and not expect Disney to sue your pants off, even if the AI is just learning from the screenplay the same way another director might. I could see that going to the Supreme Court.

Personalized TV, Movies, Music, and Books

As GPU technology advances, which is happening rapidly due to demand, it will become cheaper and cheaper to train giant models. My personal desktop computer, with its NVIDIA RTX 2070, is more powerful than ASCI Red, the top supercomputer from 1997. Commercial industry is usually about 10 years behind the cutting edge of massive computing power, and private homes are about 20 years behind. It took supercomputer-level processing to train GPT-3, so we can expect run-of-the-mill businesses to be able to do that by 2030, and everyone else by 2040. As the cost of producing these giant models comes down, and the quality of data and output increases, we will soon see hyper-personalized entertainment. Want more Firefly? Just ask Netflix or Amazon. Want more Game of Thrones, except the ending is way different? That could be possible, too! And I don’t mean recommender systems, either. I mean stuff that is generated just for you.