Tencent Hunyuan Embarks on a Fresh Start: Hy3-Preview Tested - Yao Shunyu's Maiden Attempt Yields Mixed Outcomes



04/24 2026

Yao Shunyu leads the rebuild of Tencent's large model.

Competition among domestic large models has intensified lately to a dizzying degree. Every few weeks a new model appears with an impressive benchmark screenshot, and then quietly replaces its predecessor in the apps on our phones. Unless you follow the news, you may not even know which version you are currently using.

Today, Tencent quietly updated Yuanbao with its new Hunyuan Hy3-Preview model, billed as the "inaugural work rebuilt from scratch." Leading the effort is Chief AI Scientist Yao Shunyu, the author of the ReAct framework and a prominent researcher who joined Tencent last year.

(Image Source: Tencent Hunyuan)

Interestingly, Hy3-Preview deliberately sidesteps the benchmark race and advocates "authentic evaluation." Tencent has chosen to step back from public leaderboards, which are prone to manipulation, and to gauge the model's true capabilities with self-constructed questions and human evaluation instead. The official announcement highlights significant improvements in three areas: complex reasoning, coding, and agent capabilities.

(Image Source: Leitech Graphics / Official Promotional Web Game)

So, let's bypass the so-called benchmarks, data, and leaderboards and delve straight into practical testing to assess Hy3-Preview's performance in these three crucial domains.

Hy3 Coding Test: Complex Tasks Pose Challenges, but Generation Speed Impresses

Our hands-on test covered four areas: webpage generation, game development, interactive modeling, and SVG animation. To approach it from an average user's perspective, we used plain natural-language prompts, such as "create an interactive music visualization website" or "develop a Roguelike dungeon exploration game." The goal was to see what decisions Hy3-Preview would make on its own and how far it could get without explicit guidance.

(Image Source: Leitech Graphics)

For the first round, we chose a moderately challenging SVG star map animation. The task amounts to drawing a moving night sky that users can rotate with a finger and whose constellations can be clicked to reveal their stories, the kind of exhibit often seen in planetariums.

Using the latest Hy3-Preview model in the Yuanbao client, we had the code about 30 seconds after entering the prompt, which is notably fast. The result, however, was mediocre. The basic framework was clear, and the approach to generating stars and planetary orbits was correct, but the meteor effect was missing, the drag interaction was buggy, and only two constellations were included.

(Image Source: Leitech Graphics)

To verify the prompt's feasibility, we also tested Codex. Under the same prompt, Codex took nearly 5 minutes to generate a webpage. It too failed to create the meteor effect, opting for particle effects instead. However, the constellation stories were complete, and the click-and-drag interactions functioned properly.

(Image Source: Leitech Graphics / Produced by Codex)

Next, we asked Hy3-Preview for an SVG animation of a city nightscape. This time it met every requirement: layered buildings, randomly flickering windows, moving car lights, and double-flash lightning, with the flickering windows handled especially carefully.

(Image Source: Leitech Graphics)

We then moved on to harder tests, such as having Yuanbao build a web-based construction simulation game. Surprisingly, it produced a complete game framework, including an economic system in which income, expenses, taxes, and maintenance fees are settled monthly. It even modeled factors such as traffic, noise, and greenery, and added random events with notifications like "new residents moving in" or "tax increases."

(Image Source: Leitech Graphics)
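To make the monthly settlement described above concrete, here is a minimal Python sketch of how such an economy loop might work. It is not the code Yuanbao generated, which the article does not reproduce. The class name `CityBudget`, its fields, and the choice to treat taxes as a cost deducted from income are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class CityBudget:
    """Hypothetical economy state for a small construction sim."""
    balance: float
    monthly_income: float            # assumed: revenue earned each month
    monthly_expenses: float          # assumed: fixed running costs
    tax_rate: float                  # assumed: fraction of income paid as tax
    maintenance_per_building: float  # assumed: upkeep per building
    buildings: int

    def settle_month(self) -> float:
        """Apply one monthly settlement and return the net change."""
        tax = self.monthly_income * self.tax_rate
        maintenance = self.maintenance_per_building * self.buildings
        net = self.monthly_income - self.monthly_expenses - tax - maintenance
        self.balance += net
        return net


if __name__ == "__main__":
    city = CityBudget(
        balance=1000.0,
        monthly_income=800.0,
        monthly_expenses=200.0,
        tax_rate=0.25,
        maintenance_per_building=20.0,
        buildings=5,
    )
    print(city.settle_month(), city.balance)
```

Under this assumed model, a random event such as a "tax increase" would simply raise `tax_rate` before the next call to `settle_month`.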

When tasked with creating a classic Roguelike game, Yuanbao fell slightly short. While it designed three classes—warrior, ranger, and mage—and a reasonably structured dungeon map, it overlooked the most crucial element: enemies. Without enemies, the protagonist could only wander the map without gaining experience or leveling up.

(Image Source: Leitech Graphics)

Finally, in the interactive modeling section, we prompted Yuanbao to create an interaction where clicking a position would generate realistic water ripples. The result was impressive: it utilized pixel-level ripple superposition, directly manipulating Canvas pixel data frame by frame, combining the intensity of multiple ripples to create the water effect. Additionally, all three controls were fully functional.

The only regret was that the interference effect of ripple superposition wasn't pronounced enough. When two ripples intersected, the "brightening" overlap effect was weak.

(Image Source: Leitech Graphics)
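The superposition idea described above can be sketched in a few lines of Python. This is an illustrative model only, not the Canvas code Yuanbao produced. The `Ripple` class, the damped-cosine wave shape, and every parameter (`speed`, `wavelength`, `decay`, `base`, `gain`) are assumptions chosen for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Ripple:
    """A ripple started by a click at (x, y) at time start_time (seconds)."""
    x: float
    y: float
    start_time: float


def ripple_height(r: Ripple, px: float, py: float, t: float,
                  speed: float = 80.0, wavelength: float = 24.0,
                  decay: float = 0.01) -> float:
    """Height contributed by one ripple at pixel (px, py) at time t."""
    age = t - r.start_time
    if age < 0:
        return 0.0
    dist = math.hypot(px - r.x, py - r.y)
    if dist > speed * age:
        return 0.0  # the wavefront has not reached this pixel yet
    phase = 2 * math.pi * (dist - speed * age) / wavelength
    return math.cos(phase) * math.exp(-decay * dist)


def pixel_brightness(ripples: list[Ripple], px: float, py: float, t: float,
                     base: int = 128, gain: float = 60.0) -> int:
    """Sum all ripple heights at a pixel and map the total to 0..255."""
    total = sum(ripple_height(r, px, py, t) for r in ripples)
    return max(0, min(255, round(base + gain * total)))


if __name__ == "__main__":
    a = Ripple(0.0, 0.0, 0.0)
    b = Ripple(100.0, 0.0, 0.0)
    # Midpoint between the two clicks, at the moment both crests arrive.
    print(pixel_brightness([a], 50.0, 0.0, 0.625))
    print(pixel_brightness([a, b], 50.0, 0.0, 0.625))
```

In a model like this, how visibly two ripples brighten each other where they cross depends largely on `gain` and `decay`. A small gain or strong damping would produce the kind of faint overlap we observed.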

In these coding tests, Hy3-Preview showed real ability in creative execution and interface presentation, which makes it well suited to demos. On more demanding tasks, however, Yuanbao may build only the framework first and then ask about further requirements step by step. The speed is impressive, but the results are not yet flawless.

Logical Reasoning: Is Yuanbao Misled by Superficial Appearances?

If programming tests assess whether a model can "execute tasks," reasoning tests evaluate whether it can "think clearly." To challenge its reasoning ability, we presented Hy3-Preview with four common-sense reasoning questions, requiring real-world understanding without formulas.

The results were unexpected: it stumbled on the trickiest question but held steady on the one that most tested its patience.

The first question was a carefully designed trap: "A bottle of water and a block of ice are placed in the same sealed cooler. After 24 hours, does the amount of water in the cooler increase or decrease?" The intended answer is that it stays the same, because a sealed cooler retains all of its mass: ice melts into water and water evaporates into vapor, but the total amount of water in all its forms remains constant. Hy3's answer was that the water increases.

(Image Source: Leitech Graphics)

Its reasoning seemed plausible: ice sublimates in the cooler, and water vapor condenses on the cold bottle walls, increasing liquid water. While sublimation and condensation are real phenomena, it overlooked the sealed condition. Sublimated vapor and condensed liquid remain inside, meaning any increase in liquid water corresponds to a decrease elsewhere, maintaining total mass.

This is a classic oversight of details. The word "sealed" was crucial, yet it focused solely on sublimation and condensation, leading to a plausible but incorrect answer.
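The bookkeeping behind this argument can be written out as a short Python sketch. The starting masses and the amounts moved between phases are made-up illustrative numbers, not measurements, and the helper names `move` and `simulate_sealed_cooler` are hypothetical.

```python
def move(state: dict[str, int], src: str, dst: str, grams: int) -> dict[str, int]:
    """Move up to `grams` of water from one phase to another.

    Nothing enters or leaves the sealed cooler, so every gram removed
    from `src` is added to `dst`.
    """
    moved = min(grams, state[src])
    updated = dict(state)
    updated[src] -= moved
    updated[dst] += moved
    return updated


def simulate_sealed_cooler() -> tuple[dict[str, int], dict[str, int]]:
    """Return (start, end) phase masses in grams for an assumed scenario."""
    start = {"ice": 500, "liquid": 500, "vapor": 0}
    state = move(start, "ice", "liquid", 300)   # melting
    state = move(state, "ice", "vapor", 2)      # sublimation
    state = move(state, "liquid", "vapor", 5)   # evaporation
    state = move(state, "vapor", "liquid", 4)   # condensation
    return start, state


if __name__ == "__main__":
    start, end = simulate_sealed_cooler()
    print(start, sum(start.values()))
    print(end, sum(end.values()))
```

With these assumed numbers the liquid phase does grow, which is the part Hy3 noticed, but only because the ice shrinks by a matching amount. The sum over all three phases is the same before and after.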

However, its performance improved on subsequent questions.

The second question asked: "Leaving home in the morning, you notice your neighbor's newspaper isn't collected, their car is still there, curtains are drawn, and lights are off. What reasonable explanations can you infer, and which is most likely?" This question has no standard answer but tests layered inference.

(Image Source: Leitech Graphics)

Its response was thoughtful. It listed the possibilities along with "supporting points" and "doubts" for each, and concluded that the neighbor most likely has not woken up yet. The car's presence suggests they are probably home, drawn curtains and dark rooms are consistent with sleeping, and the uncollected newspaper follows naturally. The chain of inference was clear, and it avoided dramatic conclusions and did not give priority to low-probability scenarios like "something happened." Choosing the most mundane explanation first is actually one of the hardest things to do in reasoning.

The third question asked why restaurants put their most expensive, rarely ordered dishes on the first page of the menu. It correctly identified the "price anchoring effect," explaining that the dish is not there to be ordered but to make the dishes that follow seem reasonably priced. It also noted that placing it on the first page (rather than the last) maximizes the influence of the first number on later judgments, a detail it worked out on its own and that deserves praise.

(Image Source: Leitech Graphics)

Across the four questions, Hy3 exhibited an interesting pattern in common-sense reasoning: it was more prone to errors on questions requiring slow, deliberate thought but performed steadily on those requiring elaboration.

In other words, it excels at organizing answers coherently but sometimes struggles to identify key conditions in a question, occasionally hindered by its knowledge base. The first question exemplifies this—knowing too much caused it to overlook the two most critical words.

This isn't unique to Hy3; it's a common phenomenon among large models in common-sense reasoning. The true challenge isn't whether they know about sublimation and condensation but whether they pause to read the question thoroughly amid a flood of knowledge.

Beyond logic questions, though, Hy3-Preview felt more "human" this time. For instance, when we told it, "I was criticized by my boss today and feel bad," it offered comfort instead of urging self-reflection. Whether or not that is the right approach, it at least provided emotional support, which is what many people want in such moments.

(Image Source: Leitech Graphics)

Frankly, answering a reasoning question correctly is one thing, but saying the right thing at the right moment is harder. The former relies on knowledge, while the latter demands understanding. Hy3-Preview seems to have a slightly better grasp of this than its predecessors.

Hy3-Preview: A Blend of Surprises and Regrets

After these tests, a clear contrast emerges: the model knows what it should be doing but has not fully mastered doing it.

On the positive side, creativity and expression are Hy3-Preview's strongest suits. The city nightscape animation showcased aesthetic flair and attention to detail, the water ripple effect demonstrated the right approach, the neighbor scenario analysis was layered and clear, and chat responses lacked the typical "AI tone." Together, these indicate significant progress in understanding needs, organizing language, and fine-tuning expression. For chatting, writing, or creative tasks, the experience is genuinely good.

However, gaps emerge with harder tasks. Physical logic in mechanical motion was mostly incorrect, the cooler question was derailed by excessive knowledge, and the Roguelike game remained a shell. These cases point to the same issue: it can articulate ideas well but falls slightly short in execution.

Nevertheless, within the industry context, Hy3-Preview is solidly above average.

Over the past two years, the domestic large model competition has centered on two things: parameter count and leaderboard rankings. Whoever had more parameters or higher scores on MMLU or GSM8K dominated at product launches. This approach made sense early on, since it established a common evaluation standard and quickly stratified the industry, much like smartphone benchmarks where a higher score implies a better phone.

(Image source: Tencent Hunyuan)

However, its problems have become increasingly apparent. The gap between the rankings and real-world user experience has long been felt by users. A model that ranks highly on a mathematical reasoning leaderboard might, when asked to "polish this paragraph for me," produce a result that is even more "AI-like" than your original text. The distance between evaluation questions and real-world tasks can sometimes be much greater than people imagine.

To some extent, the direction Tencent has chosen this time is a response to this issue. They propose not chasing public leaderboards but instead using real-world scenarios to validate model capabilities. This approach itself represents a new level of maturity in the industry, where the focus is not on who scores higher but on who is truly more useful.

From this perspective, the significance of Hy3-Preview lies not only in what it can currently achieve but also in the fact that it has chosen a more difficult yet correct path: abandoning the shortcut of leaderboard chasing and rebuilding everything from pre-training to reinforcement learning. After more than three months of work, the results, as tested by Leikeji, offer both surprises and a few regrets.

Hy3-Preview is currently adequate in terms of expression and creativity but still needs time on tasks requiring strict accuracy. For ordinary users, it is worth trying for chatting, writing, and handling daily information. For those with higher expectations, Tencent has stated that the official version is still on the way and that larger-scale models are also in training.

Moreover, the "Preview" suffix signals that this is not yet the official release. Perhaps once the suffix is dropped, we will see what Hunyuan is truly capable of.


Source: Leikeji

Images in this article are from the 123RF licensed image library.
