07/11 2025
Grok is already facing hurdles.
"This is the smartest AI in the world," declared Musk, the founder of xAI. Although the launch event started nearly an hour late, he unveiled the new-generation large model, Grok 4, today at noon (July 10, Beijing time).
On paper, Grok 4 surpasses all competitors, including top large models like OpenAI o3, Gemini 2.5 Pro, and Claude 4, in traditional benchmarks, SAT exams (American college entrance exams), and GRE-level tests across various disciplines.
However, more intriguing than these somewhat tedious traditional benchmarks is that Grok 4 also took the "Humanity's Last Exam" (HLE test), known as "humanity's final closed-book exam," and surpassed previous models, achieving a maximum accuracy rate of 44.4%.
Image/ xAI
During the livestream, Musk also claimed that Grok 4 is smarter than almost all graduate students in every discipline and that, at least on academic questions, it performs above the doctoral level in every subject, "with no exceptions."
This is not yet the full potential of Grok 4. According to Musk, the seventh version of the Grok 4 base model will be completed this month, followed by post-training with reinforcement learning (RL), and the model will eventually also have excellent video understanding and tool invocation capabilities. According to the roadmap, xAI will also launch code models, multimodal agents, and video generation models in the coming months.
Image/ xAI
In addition, xAI has launched a higher-tier subscription service, SuperGrok Heavy, which allows access to the "most powerful model," Grok 4 Heavy.
However, despite its seemingly invincible capabilities on paper, Grok still makes relatively low-level errors during actual demonstrations. More intriguingly, just hours before the Grok 4 launch, xAI's chief scientist Igor Babuschkin suddenly announced his resignation.
From a technical perspective, Grok 4 is not just a "routine iteration." During this 40+ minute livestream launch, the message that xAI tried to convey is that this is not just a new model challenging human intelligence but also an AI with enormous application potential.
Musk's claim that Grok 4 is "superior to doctoral levels in all disciplines" is not entirely an exaggeration for marketing purposes. In mainstream benchmarks such as AIME25, HMMT25, and GPQA, Grok 4 has further pushed the performance of large models to the extreme, with Grok 4 Heavy even scoring a perfect score on AIME25 (American Invitational Mathematics Examination).
Image/ xAI
But the more telling results are the ARC-AGI and HLE tests. The former gained industry attention when OpenAI's o3 was tested on it, and it focuses on an AI's ability to learn rather than on acquired skills. Grok 4 achieved a 66% accuracy rate on v1, surpassing o3, and on the latest v2 it significantly outperformed other large models with a 15.9% accuracy rate.
As for the HLE test, it is meant to probe the limits of human knowledge. It consists of 2,500 professional questions contributed jointly by experts around the world, covering disciplines such as mathematics, biology, computer science, chemistry, physics, engineering, and anthropology, hence the name "Humanity's Last Exam."
Image/ xAI
Before Grok 4, the top-ranked model, Gemini 2.5 Pro, had an accuracy rate of 21.6%, followed by OpenAI o3 with 20.3%. In contrast, Grok 4 raised the accuracy rate to 25.4%, and its full form reaches 44.4% with the aid of tools.
During the live demonstration, xAI showed Grok 4 answering an expert-level question from the HLE test, one that Musk believes only a very small number of humans can answer correctly. The test contains 2,499 other questions of this kind.
Additionally, there is Vending-Bench (a vending machine benchmark) based on simulated commercial scenarios, which requires the AI to manage inventory, contact suppliers, set prices, and so on. According to the test results, Grok 4 runs the business more efficiently than both Claude Opus 4 and real humans, creating more than five times the net value of the human participants.
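To put the HLE numbers in perspective, the short Python sketch below computes how far each Grok 4 result sits from the previous best score. It uses only the accuracy figures quoted above; the dictionary keys and the helper name `gap_vs_baseline` are illustrative labels, not anything published by xAI.

```python
# HLE accuracy figures (percent) as quoted in this article.
# The labels are illustrative; only the numbers come from the text.
hle_accuracy = {
    "Gemini 2.5 Pro": 21.6,
    "Grok 4 (no tools)": 25.4,
    "Grok 4 (full form, with tools)": 44.4,
}


def gap_vs_baseline(scores, baseline):
    """Return {model: (gap in percentage points, relative gain in %)}."""
    base = scores[baseline]
    return {
        name: (round(value - base, 1), round((value - base) / base * 100, 1))
        for name, value in scores.items()
        if name != baseline
    }


gaps = gap_vs_baseline(hle_accuracy, "Gemini 2.5 Pro")
for name, (points, relative) in gaps.items():
    print(f"{name}: {points:+.1f} points, {relative:+.1f}% relative")
```

By this arithmetic the tool-free result is a modest step of a few points over the previous leader, while the tool-assisted figure is roughly double it, which is why the 44.4% number drew most of the attention.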
During the livestream, xAI also conducted multiple demonstrations, including real-time capture of posts on platform X, compiling a timeline of participants in the HLE test, or identifying the team member with the strangest avatar. This not only showcased Grok 4's capabilities but also emphasized the advantages of deep integration with platform X.
Image/ xAI
The longest demonstration of the livestream was Grok 4's analysis and prediction of the 2025 MLB World Series champion. The highlight was its use of tools and data, including browsing many odds websites and running calculations on their figures. The entire process took nearly four and a half minutes.
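The livestream did not spell out how Grok 4 combined the figures it gathered. As a rough illustration of what a calculation over odds from several sites can look like, the Python sketch below converts American moneyline odds into implied probabilities, averages them across sites, and normalizes the result. The site names, team names, and odds are all hypothetical, and this is one common approach rather than Grok 4's actual method.

```python
def implied_probability(american_odds: int) -> float:
    """Convert American moneyline odds to the bookmaker's implied probability."""
    if american_odds > 0:
        return 100 / (american_odds + 100)
    return -american_odds / (-american_odds + 100)


def consensus(odds_by_site: dict[str, dict[str, int]]) -> dict[str, float]:
    """Average implied probabilities across sites, then rescale to sum to 1."""
    sites = list(odds_by_site.values())
    teams = sites[0].keys()
    averaged = {
        team: sum(implied_probability(site[team]) for site in sites) / len(sites)
        for team in teams
    }
    total = sum(averaged.values())
    return {team: p / total for team, p in averaged.items()}


# Hypothetical odds for illustration only; not real bookmaker data.
odds_by_site = {
    "site_1": {"Team A": 250, "Team B": 400, "Team C": 900},
    "site_2": {"Team A": 275, "Team B": 380, "Team C": 1000},
}

probabilities = consensus(odds_by_site)
favorite = max(probabilities, key=probabilities.get)
print(favorite, round(probabilities[favorite], 3))
```

The normalization step removes the bookmakers' built-in margin, so the resulting numbers can be read as relative chances among the listed teams only.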
Furthermore, Grok 4 can peruse key papers and materials to develop a webpage simulating changes that occur when two black holes come into contact. Musk also stated that they will provide Grok 4 with real professional tools, including specialized simulation software used by physicists, and predicted that Grok 4 may discover new physical laws next year.
This sounds overly exaggerated and lacks substantial support, but Musk's AI narrative may not be about surpassing Google and OpenAI but about changing the goal itself. From a product design perspective, xAI is attempting to turn Grok 4 into an AI tool tightly coupled with the information stream rather than just a robot that can answer questions.
Image/ xAI
In terms of model understanding, Grok 4 also demonstrates partial capabilities for multimodal input. Although there was no formal demonstration of image understanding and generation capabilities on-site, Musk emphasized that it is "under training." This means that the full form of Grok 4 will still be a multimodal large model rather than a reasoning model that only supports text like DeepSeek-R1.
In other words, this also means that Grok 4 can handle more complex sensory inputs, further expanding its applicable scenarios in the real world – such as humanoid robots, autonomous driving, scientific research modeling, etc.
It is worth noting that Musk described "Grok 4 Heavy" during the livestream as currently the most powerful version, surpassing the general model in reasoning, coding, and even the understanding of physical principles. However, the Heavy version is still in beta testing and is not yet available to the public.
Behind the launch of Grok 4 lies 10 times the training computing power of Grok 3, as well as the supercomputing cluster "Colossus" deployed by xAI in Memphis, USA, a few months ago. According to disclosures, this supercomputer is equipped with 100,000 NVIDIA H100 GPUs and may be the first to deploy GB200 computing nodes.
If you only look at the model itself, Grok 4 indeed demonstrates strength that cannot be ignored. Especially during this livestream, Grok's speech capabilities have also been upgraded – not only can it naturally switch tones, but it has also added multiple voice roles, including British accents. xAI even demonstrated that Grok can "sing" and recite poetry upon command.
The problem also lies here. During interaction, Grok was asked to "sing a song" but entered a "reciting poetry" state, reading out the lyrics in a declarative tone. Although it's a small mistake, it exposes the fact that the speech model's understanding of multimodality is still unstable – singing is not just pronunciation but a coordinated output of melody, tone, and rhythm, and Grok is clearly not ready yet.
Image/ xAI
Similar minor incidents ran through the entire launch. The livestream started an hour later than planned without any explanation. Although the content was rich, the overall pace was slightly rushed, and there was a lack of transitional logic between feature demonstrations. Some demonstrations were clearly pre-prepared. This somewhat hasty pace, combined with news of executive departures the day before, inevitably gives rise to concerns about internal instability.
On the day of the launch, xAI's chief scientist Igor Babuschkin announced his resignation, and earlier, X CEO Linda Yaccarino also resigned, leaving a meaningful remark: "Now, as X enters a new chapter with xAI, the best is yet to come."
The departure of the two executives, the rushed launch event, and Musk's repeated expressions of concern during the livestream about AI being "too smart" together create a subtle sense of unease: Grok 4 may indeed be very powerful, but xAI's organizational structure and product rhythm may not be ready to embrace the "intelligence leap" it has created.
Image/ xAI
A more practical problem is that Grok 4 must still face the two strongest opponents in the world: OpenAI's ChatGPT and Google's Gemini. Now that technical capabilities are gradually converging, the real dividing line often lies not in whether a model can answer a test question correctly but in the platform, the ecosystem, and the users.
More troublingly, Grok has always maintained a "different" stance: it has a personality, dares to speak, and is freer. This is the persona Musk designed for it. But it is precisely this persona that makes Grok more prone to mishaps. In the past few months, for example, it has caused controversy by generating extreme content.
So, this generation of Grok 4 is indeed very powerful and may already be smarter than graduate students and even PhDs. But technological leadership does not equal user trust or product maturity. We still need to see how the model performs in actual use.
During the livestream, Musk once expressed some concern about whether "AI's intelligence far surpassing humans" is good or bad for us but emphasized, "I have somehow accepted this reality. Even if it's not good, I at least want to live to see it happen."
Source: LeTech
Images in this article are from: 123RF Licensed Image Library