From Kling to Gemini: AI Videos Bid Farewell to the 'Gacha Mode'—Is the Director Model About to Take Off?

06/05 2026 526

Video generation finally no longer relies on luck.

The gacha era is coming to an end.

Over the past year or so, our experience with AI videos can be summed up in one word: gacha. You input a prompt, click generate, and stare at the progress bar as the model spits out a few seconds of footage. If it looks good, you keep it; if not, you tweak the wording and try again. It can indeed produce stunning clips, but what it offers creators has never been usable material they can keep working with. It is more like a card you draw once, keep if lucky, and discard if not.

The most frustrating aspect of the gacha era wasn't that the footage lacked realism but that it was uncontrollable. You wanted a polished nine-out-of-ten final product, but the model gave you ten clips, each scoring seven or eight but not aligning with each other. You couldn't negotiate with it—'Keep this shot but change the character's action'—all you could do was roll the dice again and hope for a better result.

But this approach has started to change recently. Over the past month or two, several new video models have emerged almost simultaneously, differing in product form, technical approach, and target market. Yet, the signal they release is surprisingly consistent: the competitive focus is no longer on who can generate a more visually appealing video in one go but on whose output can be continuously modified, controlled, and reused. In other words, AI videos are transforming from a mere output machine into a production toolkit.

(Image source: Google)

The question then arises: as AI videos reach this stage, will the creator's competitive edge shift from editing to something more akin to directing? After all, we no longer have to 'gamble' on what the video model generates. Will better expression and shot design become the future focus of AI video creation?

The most talked-about names in 'editable' AI video recently are probably Google and Runway.

Runway introduced Aleph 2.0, which focuses on modifying content based on the original video's context. In essence, it no longer treats each generation as a blank slate but recognizes what's in your existing footage, enabling localized edits while understanding the original clip—rather than starting from scratch every time. Google, on the other hand, offers Gemini Omni, which takes a different route, emphasizing conversational continuous editing. You can make requests one sentence at a time, like chatting with a person, allowing the model to refine the previous version iteratively instead of starting over for every new demand.

(Image source: Runway)

For example, we asked Gemini to generate an ad-quality video featuring a white ceramic cup on a wooden table, with the camera slowly pushing in. Beside the cup lie a notebook and a black pen, under natural daylight, shot with a realistic phone camera feel, set against an ordinary studio background. In the first round, Gemini's output was already quite satisfying.

(Image source: Lei Technology Graphics)

Gemini generated a still-life shot, with no people in frame, of a white ceramic cup, notebook, and black pen on a wooden table. The main subjects (the cup, notebook, pen, and table) were clear, and the camera slowly pushed in from a medium-long shot to a close-up, as we had asked. However, it didn't quite look like an ad.

(Image source: Lei Technology Graphics)

So we asked Gemini directly to make the footage look more like a coffee brand ad, for example by adding subtle steam to the coffee and soft highlights on the cup's rim.

(Image source: Lei Technology Graphics)

It's evident that the cup, pen, notebook, and even the background scene remained unchanged. What did change? The timing of the coffee's appearance, the camera movement techniques, and the steam effects.

This is precisely the intermediate state of AI videos transitioning from generation to editing. In the past, you wrote a prompt and waited for the model to output a clip; now, you first generate a base material and then tell the model what's lacking. Creators begin to direct modifications like a director, though the model still can't obey as precisely as editing software. It's no longer just gacha, but it hasn't fully matured into a true post-production tool yet.

Gemini's conversational editing approach is just one path. Chinese players like Kling and Seedance 2.0 are pushing 'editable' capabilities in a more systematic direction, each from a different entry point.

Kling O1's strategy is to consolidate the entire workflow into a single engine. Generation, modification, referencing, style redrawing, and shot extension, tasks that were either impossible or required jumping between multiple tools, are now meant to be done end-to-end in one place. This approach is clever because it positions itself not as a generator with one strong feature but as a creative workstation. For creators, the most exhausting part has never been the difficulty of any single step but the inefficiency of moving a project across seven or eight tools, importing and exporting repeatedly. Kling aims to tackle precisely this friction in connecting the steps of a workflow.

(Image source: Kling)

Seedance 2.0, meanwhile, focuses on multimodality. It turns text, images, videos, and audio into references that can be fed into the system to enhance reference-based generation, video extension, and audio-visual synchronization. Traditionally, we've only focused on visual appeal when discussing video models, but a video is never just moving images—it's the interplay of visuals, motion, sound, and rhythm. By incorporating sound and motion into controllable parameters, Seedance reminds us that video models can't just 'paint'; they must also understand rhythm and know where to make cuts.

(Image source: Seedance 2.0)

Put more bluntly, in terms of where video models are heading overall, the gacha era has thoroughly ended. Next comes the 'editable era.' Whichever model can streamline the entire process, give users the most intuitive prompt optimization, and offer secondary editing will hold the high ground.

Circling back to the initial question: When AI video generation is no longer gacha, will the creator's role in the workflow change? My judgment is yes.

In the past, an excellent video creator relied on hands-on skills like editing, color grading, transitions, and soundtracking, meticulously crafting their style frame by frame. These abilities won't become obsolete, but when models can understand commands like 'Keep this camera movement but adjust the texture to resemble an ad,' what truly sets creators apart becomes another set of skills: describing shots, controlling rhythm, and judging where to preserve versus rebuild. In essence, it's the ability to 'direct the model.'

AI videos won't immediately replace editing, nor will creators become mere prompt engineers. Both extremes are oversimplifications. The more accurate shift is that the focus of video production is moving from 'material processing' to 'intent orchestration.' In the past, you manually pieced together footage; moving forward, you'll primarily instruct the model on what you want, don't want, and where the current version falls short.

(Image source: Lei Technology Graphics)

This orchestration ability has a threshold. Who can translate a vague creative idea into model-understandable shot language? Who can instantly assess a model's output for usability and identify gaps? That person will resemble the 'model director' of the future. A director doesn't necessarily operate the camera or make every cut, but they know what the entire film needs and which path to take at each fork. This is what creators will do once AI videos mature.

Tools evolve, and so do the barriers to entry. Yet, the core of creation remains unchanged: a clear vision in your mind and the willingness to iterate until the model delivers. The gacha era is ending, and gamblers will dwindle. What's truly scarce are those who know what they want and can make the model deliver it.

Whenever a new tool automates a craft, someone inevitably cries, 'My job is doomed!' But history shows that tool upgrades eliminate not the craftspeople but the most mechanical parts of their work.

The classic example is spreadsheets. Before VisiCalc and later Excel, accountants and finance professionals spent hours calculating cell by cell and recording transactions manually. Spreadsheet software automated these repetitive tasks, but accountants didn't lose their jobs—they transformed into 'model builders, trend analysts, and decision advisors.' The most tedious execution was automated, freeing up mental energy to make the profession more valuable.

Before nonlinear editing software like Premiere and Final Cut, editing meant physically cutting film with a blade or rewinding tapes frame by frame—hence the term 'cutting video.' These tools eliminated physical 'cuts,' but editors didn't disappear; they shifted focus to rhythm, narrative, and emotion—higher-level judgment calls. Tools replaced manual labor but preserved intellectual decision-making.

(Image source: Seedance 2.0)

After AI programming assistants emerged, programmers initially panicked: 'Will I still need to write code?' The reality is that time spent on boilerplate code shrank, allowing more focus on reviewing model-generated code, clarifying architecture and boundaries, and judging which segments to trust or rewrite. Coding remains vital, but the scarcer skill became knowing what to ask the model to write. Today's vibe coding lowers the barrier to entry somewhat, but it often falls short when it comes to delivering a complete project from scratch.

For AI videos, the next phase isn't about visual realism but stability, controllability, and editability. Creators won't just write prompts; they'll act as model directors, knowing what to preserve, modify, reference, and iterate. Editing skills won't vanish, but the most valuable ability shifts from 'software proficiency' to 'model orchestration precision.'

Tools keep advancing, and workers must keep working to stay irreplaceable. The gacha era is ending, gamblers will fade, and what remains scarce are those who know what they want and can make the model deliver it.

Google, Kling, Seedance, video large models, AI videos

Source: Lei Technology

Images in this article come from: 123RF Royalty-Free Library.
