Behind Musk's xAI and Oracle Negotiation Breakdown: How Many GPUs Are Needed to Fulfill the AI Giant's Ambitions?

July 11, 2024

Image Source: Visual China

Blue Whale News, July 11 (Reporter Zhu Junxi) - On July 10, Elon Musk confirmed in a post on the social platform X that his AI startup xAI has ended negotiations with software company Oracle over expanding their existing agreement and will instead build an in-house system for training AI models.

"When our destiny depends on being by far the fastest company, we must have our own hands on the steering wheel rather than be a backseat driver," Musk said. In May, xAI closed a new $6 billion funding round, the second-largest raised by any AI startup after OpenAI, lifting the company's valuation to $24 billion.

xAI's Ambitions Require More Computing Power Support

According to previous media reports, xAI is building a supercomputer that will connect 100,000 NVIDIA H100 GPUs to supply computing power for the next-generation Grok model. Musk recently said the system is being built by xAI itself, that training is slated to begin later this month, and that it will be the world's most powerful training cluster.

Musk's ambitions may be what caused the new negotiations with Oracle to collapse. Silicon Valley tech outlet The Information reported that Oracle considered the pace at which Musk wanted the supercomputer built to be unachievable, and that Oracle was also concerned about potential power-supply problems at xAI's preferred site. The outlet reported in May that xAI and Oracle were close to an agreement to expand their partnership, under which xAI intended to spend $10 billion over the next few years renting Oracle's cloud servers.

After news of the breakdown emerged, Oracle's share price fell by as much as 4.8%. On July 9, Eastern Time, Oracle closed down 3% at $140.68 per share.

As a major global cloud service provider, Oracle offers AI companies computing resources, databases for storing and processing data, AI development platforms, and other services. Last September, Oracle announced an agreement with xAI to provide the cloud infrastructure needed to train AI models, without disclosing the contract's value or term. Musk has said that xAI's agreement with Oracle covers 24,000 NVIDIA H100 chips, on which training of the second-generation large language model Grok-2 was launched. Grok-2 is still undergoing fine-tuning and bug fixing and is expected to launch officially next month.

However, the end of this new negotiation does not mean the breakdown of the relationship between xAI and Oracle.

In the post, Musk also praised Oracle, calling it "a great company." Larry Ellison, Oracle's co-founder, chairman, and chief technology officer, is a close friend of Musk's: he served on Tesla's board of directors and helped finance Musk's acquisition of Twitter. At an event last December, Ellison said publicly that xAI wanted more computing power and that Oracle was providing it.

In addition to xAI, Oracle's major customers include Microsoft, Google, and NVIDIA.

During its fiscal 2024 fourth-quarter earnings call in June, Oracle announced a partnership with Microsoft and OpenAI to extend Microsoft's Azure AI platform to Oracle's cloud infrastructure, giving OpenAI additional computing capacity for training its large AI models. OpenAI subsequently issued a statement emphasizing that this would not change its "strategic cloud partnership" with Microsoft, which is both a cloud service giant and OpenAI's largest investor.

Anurag Rana, an analyst at Bloomberg Intelligence, said Musk's decision to build AI training infrastructure in-house highlights the expansion challenges cloud computing providers face even when funding is ample. "These issues are not limited to Oracle; they may also plague Microsoft and Amazon's AWS, driven not only by the shortage of specialized chips but also by power shortages."

As global competition over large AI models intensifies, the industry's heavy electricity consumption is drawing attention. Musk, OpenAI CEO Sam Altman, and Meta co-founder and CEO Mark Zuckerberg have all warned of the energy crisis the industry will face. AI chips, meanwhile, remain in short supply, but the worst of the shortage has passed: NVIDIA's delivery times have shortened significantly, other chipmakers such as AMD and Intel have begun releasing alternative products, and cloud service providers have launched a series of new services that make chips easier to rent.

Whoever Secures GPU Resources Gains the First-Mover Advantage

Backed by Musk, xAI can afford to spend lavishly; if cloud providers cannot meet its training demands, it will build its own in-house system. Smaller AI startups face more practical difficulties: they must compete with deep-pocketed large companies, which find it easier to sign contracts with cloud providers and to secure high-quality AI chips.

Some venture capital firms see a business opportunity here. According to a recent report from The Information, the prominent Silicon Valley venture firm Andreessen Horowitz (a16z) has stockpiled a large number of AI chips, plans to expand its holdings to 20,000 GPUs, and is leasing them to AI startups in exchange for equity.

a16z has named the program "Oxygen." Its first customers include Luma AI, the startup behind the recently launched video generation model Dream Machine. Luma co-founder and CEO Amit Jain told the media that although other venture firms offered to invest at higher valuations, Luma ultimately chose a16z because it promised to provide computing resources.

In China, technology giants such as Alibaba, Baidu, and Tencent are all using their own cloud platforms to develop and train large AI models. Alibaba Group Chairman Joseph Tsai said at an event in late May that, compared with foreign companies such as Microsoft, OpenAI, and Amazon, Alibaba is one of the few companies with both proprietary in-house AI capabilities and its own cloud service.

These large companies also stand behind domestic AI startups. Alibaba, for example, has successively invested in large-model upstarts such as MiniMax, Moonshot AI (Dark Side of the Moon), 01.AI, Baichuan Intelligence, and Zhipu AI, providing cloud computing quotas in exchange for equity.

On July 8, Moonshot AI founder Yang Zhilin became the newest spokesperson for Alibaba Cloud. The official announcement poster reads: "Building the popular Kimi intelligent assistant is impossible without cloud computing. Alibaba Cloud's powerful computing and large-model service platform improve our model inference efficiency, helping the Kimi intelligent assistant accelerate its technical breakthroughs."
