The filmmaker David Mamet once said the main question in storytelling is "where do I put the camera?"
There are interesting events and compelling characters in the virtual world of your story. When in that timeline do you put the camera? Where do you point it in those moments?
Using newfangled video generation tools like Runway is kind of like putting a camera in a machine that can travel anywhere in time and space. Plop it somewhere in the timeline of the multiverse and point it at an interesting event. And another, and another, ideally connected by a compelling theme.
If you skim the hubs where people share their AI creations, you’ll see a lot of people doing the obvious thing and leaning into the limitations of the tools. If you’re just trying to generate videos of a concept from a few text prompts, you’re going to get a bunch of abstract clips best thrown together in a montage set to music that leans on surrealism or absurdity.
But there’s a reason abstract art is mostly appreciated at small, unprofitable showings and in museums. It can evoke a strong reaction from you, disgust or laughter or whatever, but it doesn’t get its hooks in you the way a narrative does: characters shifting in relative status through interactions, making choices that do and don’t work out in pursuit of their individual goals.
I really want to see generative AI tools applied as well as they possibly can be. I want their successful use to encourage more experimentation with what’s out there by people with taste and more development on underlying generative power by people obsessed with the technology under the hood.
To that end, here’s a set of practices that I think would level up the collective output if followed by Runway and Zeroscope’s early adopters:
- Ignore the compulsion to go full abstract/montage and ground the video in some narration or dialogue between characters implied by the footage. People want to see a story composed of interesting events involving people. If you’re making a 1-3 minute short film with these tools, think of it as making a trailer for a really good full-length movie. What are the most evocative interactions in the timeline of your story? You don’t need to show A and B shots of dialogue; just imply the interaction.
- Avoid generating from text alone if possible. The takeaway from those “Wes Anderson does Star Wars” videos is that you can make some really compelling shots from nice Midjourney generations with a little bit of movement added. In the old shoestring-budget way of doing things, the cliche is that $100 of lighting equipment and a new iPhone do more to make something feel “professional” or “cinematic” than a $20,000 camera and anamorphic lenses. The new way to get a ton of bang for your buck is to nail shot composition and lighting at the still-image stage, because we essentially have storyboards that can come to life. Before you generate a single video clip, take advantage of how easy it is to produce new, similar images to check what a moment calls for: a triumphant low angle looking up at your character, a distant wide shot that isolates them, or a high angle that makes them seem small and weak (there’s a sketch of this angle-scouting step after this list). It’s going to be a while before these models can pull off awesome tracking shots through fictional worlds, but if you lock the “camera” in place and make things move, you can still do beautiful “photography”.
- Everything else about filmmaking still matters, so keep sweating the other details. Once you’ve got generated footage, don’t just paste it together end-to-end on a timeline. If you’ve done the work to show different perspectives of characters in the same interaction, edit to cut between them: as a character makes a point, cut to their counterpart’s reaction before the sentence finishes (a minimal cutting sketch follows this list). Not to repeat the last bullet, but if you know you’re going to approach things this way, it can even feed back into your image generation step: what emotion do I need my “actor” to show in this clip? Make it look as good and true as possible up front.
- Music and foley work can also do a lot to ground footage that’s deep in the uncanny valley. Music signals what emotions to feel and when to feel them, and can imbue whatever you’re showing with weight and importance. A few bangs, clanks, and whirs go a long way toward keeping disbelief good and suspended. Spend some time getting this right. If you’re using something like ElevenLabs to generate character speech, don’t settle for the first generation (see the last sketch below). Be intentional about the emotional shifts of characters in dialogue and get the little tonal inflections as true as you can.
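To make the angle-scouting idea concrete: Midjourney doesn’t expose a public API, so this sketch substitutes Stable Diffusion via Hugging Face’s diffusers library as the still-image generator. The model ID, prompt, and angle phrasings are all placeholder assumptions, not a prescribed workflow.

```python
# Angle-scouting sketch: Stable Diffusion (via diffusers) stands in for
# Midjourney, which has no public API. Model ID and prompts are placeholders.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# One description of the moment, varied only by camera angle, so you can
# compare framings cheaply before committing any of them to video.
base = "a lone astronaut on a ruined bridge at dusk, cinematic lighting, 35mm film"
angles = {
    "triumphant": "low-angle shot looking up at the subject",
    "isolated": "extreme wide shot, subject tiny in the frame",
    "diminished": "high-angle shot looking down on the subject",
}

for mood, angle in angles.items():
    image = pipe(f"{base}, {angle}").images[0]  # returns a PIL image
    image.save(f"storyboard_{mood}.png")
```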
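For the cutting advice, here’s a minimal sketch using moviepy to assemble a reaction cut from two generated clips. The file names and timings are placeholders, and it assumes the clips are silent (as Runway and Zeroscope output typically is), with dialogue laid in on its own track later.

```python
# Reaction-cut sketch with moviepy. Assumes two silent generated clips:
# the speaker ("A") and their counterpart ("B"). All specifics are placeholders.
from moviepy.editor import VideoFileClip, concatenate_videoclips

speaker = VideoFileClip("shot_A_speaking.mp4")
listener = VideoFileClip("shot_B_reacting.mp4")

# Cut away to the listener before the speaker's beat finishes, then cut
# back, instead of pasting the two clips end-to-end.
sequence = concatenate_videoclips([
    speaker.subclip(0.0, 2.2),   # most of the line
    listener.subclip(0.0, 1.5),  # reaction lands before the line would end
    speaker.subclip(2.2, 4.0),   # back to the speaker for the button
])
sequence.write_videofile("scene_cut.mp4", codec="libx264")
```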
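And for the dialogue advice, a sketch of rendering several takes of one line through the ElevenLabs text-to-speech REST API at a few different stability settings, so you can pick the read whose inflections land truest. The API key, voice ID, and line are placeholders; lower stability tends to produce more varied, expressive reads.

```python
# Multiple-takes sketch against the ElevenLabs text-to-speech endpoint.
# API key, voice ID, and the line itself are placeholders.
import requests

API_KEY = "your-api-key"    # placeholder
VOICE_ID = "your-voice-id"  # placeholder
LINE = "You knew, and you said nothing."

for take, stability in enumerate([0.3, 0.5, 0.7], start=1):
    resp = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
        headers={"xi-api-key": API_KEY},
        json={
            "text": LINE,
            "voice_settings": {"stability": stability, "similarity_boost": 0.75},
        },
    )
    resp.raise_for_status()
    with open(f"take_{take}_stability_{int(stability * 100)}.mp3", "wb") as out:
        out.write(resp.content)  # response body is the rendered audio
```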
The fantasy or fear around these tools is a single text prompt generating an entire movie in one shot. Maybe one day such a tool will exist, firing off generative agents trying to work on different creative tasks while following reasonable best practices or heuristics like the above.
But between now and that future world, these tools are best used in the hands of people with perspective and intentionality. I’m saddened by the cultural divide between the old guard of professional creatives (and the aspirants to that class) and those playing with these new toys. What could be seen as the next generation of tools, affording new techniques to people without the resources or connections to even begin a modest indie production, is instead seen as an existential threat.
I don’t think essays or Twitter threads will bridge this divide. It will probably require the emergence of a new Tarantino or Robert Rodriguez: someone who does something utterly undeniable, channeling a passion for the history of cinema through a new, low-budget process and aesthetic that triggers a wave of inspired newcomers and commercial imitators.
I can’t wait to see it!