The Era of Large Models in Edge AI Begins with Minwall Intelligence


05/26/2026

The most powerful compression technology in human history is not ZIP or JPEG, but text.

A single Chinese character such as 火 ('fire') packs combustion, temperature, color, danger, and energy into one symbol. Written in just a few strokes, it carries an extremely high density of information and takes almost no effort to decode.

Large model compression does essentially the same thing: it designs a more efficient 'script' that carries the most knowledge with the fewest strokes.

The semiconductor market in 2026, however, is sharply divided. On one side, institutions such as Goldman Sachs keep raising their forecasts: HBM (High Bandwidth Memory) is in short supply, DDR5 prices are surging, and the 'memory wall' has become a physical bottleneck for computing power. On the other side, edge AI devices are constrained by power consumption and size and cannot keep adding memory, so deploying large models on them seems trapped under an invisible cost ceiling.

Computing power keeps growing, but terminal devices cannot hold the models built on it. The 'old script' takes up too much space: traditional large models represent every parameter in FP16, which is like writing essays in an elaborate alphabet, so the text runs enormously long. The industry urgently needs a 'new script' with fewer strokes and higher information density.

On May 23, BitCPM-CANN made its debut at the Huawei Kunpeng Ascend Developer Conference (KADC 2026), with Li Yuxuan, Head of AI Infra at Minwall Intelligence and a postdoctoral researcher at Tsinghua University's Institute of High-Performance Computing, sharing technical details.

Simply put, Minwall Intelligence has built the world's first end-to-end 1.58-bit (extremely low-bit) training stack on Huawei's Ascend platform and scaled it to 8B parameters, with performance close to that of full-precision models of the same size.

It sends a clear signal: on a domestic computing power foundation, a world-leading training paradigm can emerge.

So, how was this 'new script' designed? How will it reshape the industrial rules of edge AI?

Before exploring the significance of Minwall Intelligence's breakthrough, it is worth examining the real challenges the edge AI industry faces today. On the surface, the AI industry in 2026 looks prosperous: large models are moving rapidly from the cloud to phones, PCs, and cars, and concepts like 'AI Phone' and 'AI PC' keep appearing, as if everyone already had access to the most powerful AI.

Beneath this enthusiasm, however, a hidden battle over costs is unfolding.

The root of the problem lies in the 'script' we use to carry AI knowledge.

Traditional large models use a rather extravagant 'script system.' Each parameter is stored in FP16, so every 'stroke' takes up 16 bits (2 bytes). A 7-billion-parameter model therefore needs about 14 GB of memory (7 billion × 2 bytes) just to 'write down' its weights. Add the operating system and other apps, and even a flagship phone with 16 GB of memory cannot hold it.
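The 14 GB figure is plain arithmetic: parameter count times bits per parameter. The minimal Python sketch below compares FP16 with the theoretical 1.58-bit density and with a practical 2-bit packing of ternary weights. It assumes decimal gigabytes and counts weights only, leaving out the KV cache, activations, and runtime overhead. The function name is illustrative.

```python
import math


def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    params = 7e9  # the 7-billion-parameter model from the text
    formats = [
        ("FP16", 16),
        ("ternary, log2(3) bits", math.log2(3)),  # information-theoretic minimum
        ("ternary, 2-bit packed", 2),             # a simple byte-aligned layout
    ]
    for label, bits in formats:
        print(f"{label:<24} {weight_memory_gb(params, bits):6.2f} GB")
```

Under these assumptions the FP16 model needs 14 GB. Its ternary weights would fit in roughly 1.4 to 1.75 GB, depending on how tightly they are packed.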

This 'old script' not only takes up space; its 'writing costs' are also soaring. Goldman Sachs' latest forecast shows that, driven by demand for AI servers, the memory chip market is in a supercycle: HBM is in short supply, and prices of mainstream memory such as DDR5 are expected to rise by as much as 280%. Edge device manufacturers face a cruel choice. They can absorb rising BOM (bill of materials) costs and squeeze already thin profit margins, or they can cut memory configurations and turn AI features into marketing gimmicks that 'can be installed but cannot run.'

The inherent 'memory wall' deepens the dilemma. In the von Neumann architecture, compute units and memory units are physically separate, so data must constantly move between them. Even if an edge chip has a high TOPS rating, most of its compute will sit idle when memory bandwidth cannot 'feed' it data fast enough.
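Here is a rough way to see the memory wall. In single-stream text generation, each new token typically requires reading every weight from memory once, so memory bandwidth, not TOPS, caps the speed. The sketch below computes that ceiling under simplifying assumptions: batch size 1, weight traffic only, and perfect bandwidth utilization. The 60 GB/s bandwidth is a hypothetical round number, not a figure for any particular chip, and the function name is illustrative.

```python
def decode_tokens_per_s_ceiling(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-stream decode speed.

    Assumes batch size 1, that every generated token streams all weights from
    memory exactly once, and that nothing else (KV cache, activations) uses
    bandwidth. Compute throughput (TOPS) does not appear: the bound is set by memory.
    """
    return bandwidth_bytes_per_s / weight_bytes


if __name__ == "__main__":
    bandwidth = 60e9             # hypothetical edge-device memory bandwidth: 60 GB/s
    fp16_bytes = 7e9 * 2         # 7B parameters x 2 bytes
    ternary_bytes = 7e9 * 2 / 8  # 7B parameters x 2 bits (packed ternary)
    for label, size in [("FP16", fp16_bytes), ("ternary", ternary_bytes)]:
        print(f"{label:<8} <= {decode_tokens_per_s_ceiling(size, bandwidth):.1f} tokens/s")
```

Shrinking the weights eight-fold raises this ceiling eight-fold with no change in compute. That is why low-bit weights attack the memory wall directly.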

Moreover, training and deploying domestic large models has long depended heavily on NVIDIA's CUDA ecosystem. Many domestic models first had to complete core validation on NVIDIA clusters and were then laboriously migrated to the Ascend platform. This 'detour' lengthens R&D cycles and raises trial-and-error costs. It also keeps domestic computing platforms playing catch-up, which makes it hard to build a truly independent technical system.

Under these layered pressures, edge AI today is caught in an 'impossible triangle' of capability, cost, and power consumption. Stronger models require more expensive hardware, while controlling cost and power means sacrificing capability. There is little room for compromise among the three.

Conventional solutions such as model distillation and pruning essentially trade accuracy for memory, which amounts to sloppier 'handwriting.' What the industry truly needs is a script system with fewer strokes and higher information density. BitCPM-CANN, unveiled at KADC 2026, targets exactly this pain point.

On May 23, 2026, at the Huawei Kunpeng Ascend Developer Conference, Li Yuxuan, Head of AI Infra at Minwall Intelligence, formally presented BitCPM-CANN: an open-source, extremely lightweight 1.58-bit ternary large model trained entirely on domestic computing power. The entire pipeline, from the underlying operators to the training framework and the final model output, runs natively on Huawei's Ascend platform.

Many may wonder, what exactly is 1.58-bit? In the most relatable terms, it is like an extremely streamlined script system in the AI world.

Traditional large models use 16-bit floating-point parameters that can represent a very wide range of values and look highly precise. In practice, however, much of this precision is redundant. It is like using an elaborate Latin alphabet to write a simple everyday phrase: it wastes space and serves no purpose. BitCPM-CANN compresses each parameter to just three values: -1, 0, and +1. Three possible values carry log2(3) ≈ 1.58 bits of information, which is where the name '1.58-bit' comes from. If traditional parameters are a full English writing system with uppercase and lowercase letters, symbols, and special fonts, then 1.58-bit reduces everything to the three most basic strokes ('dot,' 'horizontal,' and 'vertical'), carrying the core information with minimal expression.
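To make the idea concrete, here is one published way of mapping real-valued weights to {-1, 0, +1}: the 'absmean' scheme from the BitNet b1.58 work. It divides by the mean absolute weight, rounds, and clips. The article does not say which quantization function BitCPM-CANN uses, so treat this as an illustrative sketch. The function and constant names are hypothetical.

```python
import math

# Information carried by one weight that can take 3 values.
BITS_PER_TERNARY_WEIGHT = math.log2(3)  # about 1.585


def ternarize_absmean(weights, eps=1e-8):
    """Quantize a weight tensor (here a flat list) to {-1, 0, +1} plus one scale.

    Absmean scheme: scale = mean(|w|), q = clip(round(w / scale), -1, 1).
    The dequantized approximation of w is q * scale.
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    q = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return q, scale


if __name__ == "__main__":
    w = [0.9, -0.05, 0.4, -1.2, 0.02, 0.6]
    q, scale = ternarize_absmean(w)
    print(q)  # [1, 0, 1, -1, 0, 1]
    print([round(qi * scale, 3) for qi in q])  # dequantized approximation
    print(f"{BITS_PER_TERNARY_WEIGHT:.3f} bits per weight")
```

Python's `round()` rounds exact halves to the nearest even integer. That tie-breaking rule is incidental here and would rarely matter for real-valued weights.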

Beyond having 'fewer strokes,' BitCPM-CANN also turns this new script into a complete system.

First, it offers a complete, usable character library. Earlier research on low-bit models existed, but most results were isolated demos showcasing a single model size or metric. That left manufacturers unable to judge the technique's stability or practical applicability. BitCPM-CANN breaks this limitation by releasing four complete models at the 0.5B, 1B, 3B, and 8B scales, each evaluated across all dimensions against a full-precision model of the same size. Technically, it uses a ternary quantizer with a straight-through estimator (STE). The quantizer retains residuals during training so that learning continues, and it outputs strictly ternary weights at export to minimize precision loss. The test results are compelling. The 1B, 3B, and 8B models retain 95.7% to 97.2% of the full-precision models' capability, and the 8B model reaches 93% to 99% of full-precision performance on key tasks such as ARC, CMMLU, and GSM8K, a production-ready level. Even the 0.5B model retains 90.1%, which gives a clear direction for future optimization. This coverage from micro to medium-sized models is like preparing a complete script system for the AI industry, spanning short phrases, essays, and full-length works. Phone, automotive, and other device manufacturers can pick a model according to their needs instead of starting from scratch.
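The STE idea described above can be sketched in a few lines of plain Python. This is a toy, not BitCPM-CANN code. It keeps full-precision 'latent' weights during training, which is one common reading of 'retaining residuals'. It uses the ternary values in the forward pass, applies the resulting gradient straight through to the latent weights, and exports only the ternary values. The fixed scale of 1, the ±0.5 threshold, and all function names are simplifying assumptions.

```python
def ternarize(w, threshold=0.5):
    """Map one latent weight to -1, 0 or +1 (fixed scale of 1, for simplicity)."""
    if w > threshold:
        return 1
    if w < -threshold:
        return -1
    return 0


def train_ste(latent, data, lr=0.05, epochs=10):
    """SGD on squared error with a straight-through estimator (STE).

    Forward: predictions use the ternary weights.
    Backward: the gradient w.r.t. each ternary weight is applied unchanged to
    the matching full-precision latent weight, as if quantization were the
    identity function.
    """
    latent = list(latent)
    for _ in range(epochs):
        for x, target in data:
            q = [ternarize(w) for w in latent]
            pred = sum(qi * xi for qi, xi in zip(q, x))
            err = pred - target
            # d(err^2)/dq_i = 2 * err * x_i, passed straight through to latent[i]
            latent = [w - lr * 2 * err * xi for w, xi in zip(latent, x)]
    return latent


def export_ternary(latent):
    """At export only the ternary values are kept; the latent weights are dropped."""
    return [ternarize(w) for w in latent]


# Toy task: the ideal ternary weights are [1, 0, -1].
DATA = [((1, 0, 0), 1), ((0, 1, 0), 0), ((0, 0, 1), -1)]

if __name__ == "__main__":
    start = [0.3, 0.4, -0.2]  # all three start out quantized to 0
    final = train_ste(start, DATA)
    print("latent:", [round(w, 2) for w in final])
    print("export:", export_ternary(final))  # [1, 0, -1]
```

The first updates to the first and third weights do not change their ternary values. Instead, they accumulate in the latent weights until a threshold is crossed. This accumulation is what lets training keep making progress despite the coarse quantization.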

Second, it establishes a mature, stable 'typesetting standard.' Many low-bit projects stop at 'it runs,' declaring success once the model trains and the metrics improve. Such code is often disposable and must be re-tuned for each size or task. BitCPM-CANN instead integrates low-bit capabilities into the MindSpeed training infrastructure. Built on the Megatron-LM framework, it embeds pluggable QAT (quantization-aware training) parallel linear layers, unifies model storage formats, and supports 32K long-sequence training. The main scheme combines QAT with distillation in post-training and reduces training throughput by only 5%, adding almost no cost. This is the first time domestic NPUs have had their own 1.58-bit training stack. Teams no longer need to validate in the foreign CUDA ecosystem before migrating, which makes the stack a genuine accumulation of infrastructure-level technology. Any team pursuing low-bit training on Ascend can now start directly from this foundation.
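The article does not specify the distillation objective. One standard choice, assumed here, is the classic knowledge-distillation loss (Hinton et al., 2015). It softens the teacher's and student's output logits with a temperature T, measures the KL divergence from teacher to student, and scales the result by T² so gradient magnitudes stay comparable across temperatures. Here the teacher would be the full-precision model and the student the ternary one. The function names, the default T = 2, and the logit values are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) between softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0]  # e.g. full-precision model logits (made-up values)
    close = [1.8, 0.6, -0.9]    # ternary student that tracks the teacher well
    far = [-1.0, 0.5, 2.0]      # ternary student that disagrees
    print(distillation_loss(close, teacher), distillation_loss(far, teacher))
```

In practice a loss like this is usually mixed with the ordinary next-token loss. How BitCPM-CANN weights the two, if it does, is not stated in the article.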

Third, its intelligence density is extremely high. Joint optimization by Minwall Intelligence and Huawei Ascend shows that, with the 1.58-bit training paradigm, about 6x more model parameters fit into the same memory capacity. This 6x advantage comes from three layers. The first is the storage saved by compressing weights from 16 bits to 1.58 bits. The second is the compute freed by replacing floating-point with integer arithmetic. The third is deep optimization by the Ascend team from the instruction set down to the operator level.
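Two of those three layers are easy to illustrate. On storage, the bit-width ratio alone (16 / 1.58 ≈ 10) is an upper bound for the weights; the article does not break down how its ~6x whole-model figure is reached. On compute, a weight of -1, 0, or +1 turns every multiply in a matrix-vector product into an add, a subtract, or a skip. The sketch below assumes integer activations (for example, int8-quantized) and a single per-tensor scale applied once per output. It is not the Ascend kernel, and the function name is hypothetical.

```python
def ternary_matvec(weights, x, scale):
    """Compute scale * (W @ x) for W with entries in {-1, 0, +1}.

    weights: list of rows, each a list of -1 / 0 / +1
    x:       integer activations (e.g. after int8 quantization)
    scale:   one floating-point scale, applied once per output element
    The inner loop uses only integer additions and subtractions.
    """
    out = []
    for row in weights:
        acc = 0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi
            elif w == -1:
                acc -= xi
            # w == 0: the term is skipped entirely
        out.append(acc * scale)
    return out


if __name__ == "__main__":
    W = [[1, 0, -1], [-1, 1, 1]]
    x = [10, -3, 4]
    print(ternary_matvec(W, x, 0.5))  # [3.0, -4.5]
```

Only one floating-point multiply per output remains (the final scale), instead of one per weight.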

The scale and maturity of BitCPM-CANN are the result of years of sustained investment. While the industry was still skeptical of ultra-low-bit QAT, Minwall Intelligence had already committed to the ≤2-bit route.

At the time, the gap between Chinese and U.S. computing power was vast, AI infrastructure as a whole lagged behind, and domestic chips were not up to training large models. To train large models with limited resources, Minwall developed its distributed training framework BMTrain early on. BMTrain was not merely an engineering implementation on par with DeepSpeed or Megatron; it embodied the 'density law.' It made it possible to train 10-billion-parameter models on just 32 cards or even fewer, significantly lowering the barrier to entry for large models.
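The memory accounting from the ZeRO paper shows why sharding matters for the '32 cards' figure: mixed-precision Adam keeps about 16 bytes of model state per parameter. The sketch assumes every byte of that state is evenly sharded across all cards (ZeRO-3 style) and ignores activations and buffers. It illustrates the arithmetic rather than describing BMTrain's internals, and the names are illustrative.

```python
# Bytes of model state per parameter for mixed-precision Adam (ZeRO paper accounting):
# FP16 weights (2) + FP16 gradients (2) + FP32 master weights (4)
# + FP32 Adam first moment (4) + FP32 Adam second moment (4) = 16
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4


def model_state_gb_per_card(num_params, num_cards):
    """Model-state memory per card if all of it is evenly sharded (ZeRO-3 style).

    Activations, communication buffers and fragmentation are ignored.
    """
    return num_params * BYTES_PER_PARAM / num_cards / 1e9


if __name__ == "__main__":
    for cards in (1, 8, 32):
        print(f"10B params on {cards:>2} cards: {model_state_gb_per_card(10e9, cards):6.1f} GB per card")
```

A 10B model needs about 160 GB of model state in total, far beyond a single card, but only about 5 GB per card when spread over 32.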

Minwall Intelligence and DeepSeek are known in the industry as the two 'companies most adept at architectural improvements,' but their battlefields are vastly different: DeepSeek focuses on cloud-side high-computing-power scenarios, squeezing every bit of value from 10,000-card clusters; Minwall targets single edge chips, pursuing extreme efficiency under strict constraints of power consumption, heat dissipation, and memory bandwidth. While many large model companies adopt conservative traditional architectures, training multiple model sizes with the same data, Minwall optimizes extensively for edge chip characteristics, including sparse computing and near-memory computing. Through long-term exploration, the team has refined a stable, transferable hyperparameter methodology by iteratively testing core variables like learning rates, distillation strategies, and data ratios.

These foundational accumulations ultimately enabled BitCPM-CANN to 'invent a new script' on Ascend.

If the previous sections discussed 'the problems of old scripts' and 'how to create new scripts,' this section broadens the perspective to examine the industrial significance of BitCPM-CANN's open-sourcing of this 'new script.' It is not just about running a single model but transforming the 6x memory efficiency gain on the inference side into a reusable capability, turning low-bit training into a migratable, scalable, and continuously optimizable foundation for Ascend.

First, consider the technological gap: BitCPM-CANN fills a long-standing void for domestic AI chips. For years, those chips faced an awkward situation. Their hardware specifications were impressive, but the 'script system' supporting them was controlled by others. The CUDA ecosystem is like a mature Latin alphabet: useful, but under external control. For true autonomy, domestic computing power must be able to create its own scripts, mastering everything from underlying algorithms to training frameworks.

BitCPM-CANN breaks this deadlock. It is the first public, systematic 1.58-bit ternary training adaptation on a domestic NPU platform, scaled to 8B in one go and evaluated 1:1 against Minwall Intelligence's full-precision model family. This marks the first time the industry can see a complete capability spectrum of a low-bit model on domestic computing power.

Looking beyond the technology to deployment scenarios, BitCPM-CANN's value extends well past the Ascend platform to the core needs of the entire edge AI industry. Technology ultimately serves real-world applications, and BitCPM-CANN targets the most pressing pain points of terminals such as phones, PCs, and cars. For terminal manufacturers, combining 1.58-bit ternary models with MoE (Mixture of Experts) could bring 60B-level model capability onto phones. More critically, the actual memory savings reach 6x compared with the traditional BF16 format. Devices can therefore carry stronger AI capabilities without additional physical memory, a crucial advantage amid rising global memory prices and stubbornly high hardware costs.

Meanwhile, there is a noticeable mismatch between supply and demand. Qualcomm's new-generation chip platform already supports native 2-bit inference, so the hardware is ready, but the market has lacked truly deployable, stable, and usable low-bit weights. Open-sourcing BitCPM-CANN fills this gap. It lets chips put their hardware capabilities to full use, and it lets ordinary developers experience the real low-bit performance of domestic computing power with no barrier to entry. This 'two-way convergence' between models and chips is the true starting line for edge AI to leave the lab and reach large-scale deployment.
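A ternary weight needs about 1.58 bits in theory, but it fits naturally into a 2-bit slot. This is one reason 2-bit inference hardware is a good match for ternary models. The sketch shows a simple layout with four weights per byte and code = weight + 1. The layout is an assumption for illustration, since the article describes neither Qualcomm's nor Ascend's actual weight formats. Function names are hypothetical.

```python
def pack_ternary_2bit(weights):
    """Pack ternary weights four per byte, 2 bits each, lowest bits first.

    Encoding (illustrative): code = w + 1, so -1 -> 0, 0 -> 1, +1 -> 2; code 3 is unused.
    """
    packed = bytearray((len(weights) + 3) // 4)
    for i, w in enumerate(weights):
        if w not in (-1, 0, 1):
            raise ValueError(f"not a ternary weight: {w!r}")
        packed[i // 4] |= (w + 1) << (2 * (i % 4))
    return bytes(packed)


def unpack_ternary_2bit(packed, count):
    """Recover `count` ternary weights from the packed bytes."""
    return [((packed[i // 4] >> (2 * (i % 4))) & 0b11) - 1 for i in range(count)]


if __name__ == "__main__":
    w = [1, 0, -1, -1, 1]
    blob = pack_ternary_2bit(w)
    print(blob.hex(), len(blob), "bytes")  # 0602 2 bytes
    assert unpack_ternary_2bit(blob, len(w)) == w
```

The unused fourth code is the price of alignment. Each weight costs 2 bits instead of log2(3) ≈ 1.58, in exchange for simple shift-and-mask decoding.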

Moreover, as the world's first ternary model trained natively and entirely on domestic computing power, it proves that Ascend can train not only large models but also extremely low-bit ones. This overturns the outside perception that domestic chips are 'strong at inference, weak at training.' It achieves full synergy among domestic NPUs, domestic AI models, and domestic training frameworks, demonstrating that Chinese teams can create world-class AI 'new scripts' without relying on overseas computing power or the CUDA ecosystem.

Turning back to Minwall Intelligence itself, BitCPM-CANN marks a clear watershed in its growth trajectory.

Before this, Minwall Intelligence was positioned in the industry as a large model company dedicated to AGI. While most of the industry was still chasing parameter scale, cloud-side competition, and leaderboard rankings, Minwall Intelligence had already built up capabilities spanning everything from the underlying training framework to edge-side compression. It had long established itself as the definer of China's edge-side large model technology roadmap.

The open-sourcing of BitCPM-CANN is more than a release of results. It sends a clear technical signal: the core contradiction of edge-side large models lies in memory and efficiency, and the solution lies in rebuilding the compression paradigm itself. Rather than following overseas routes as an adapter, Minwall Intelligence chose to become a rule-setter on the harder and more foundational path of extremely low-bit technology.

The essence of this watershed is Minwall Intelligence's leap from model provider to definer of technical methodology.

Of course, authority is never established by a single breakthrough but by systematic output. BitCPM-CANN is just the tip of the iceberg. Beneath the surface lies Minwall Intelligence's complete ecosystem, from BMTrain to MindSpeed and from low-bit methodology to closed-loop edge-side deployment.

Looking back, the true significance of BitCPM-CANN lies in providing a verifiable starting point for domestic computing power in the direction of extremely low-bit training. This 'new script' has already been written, with dictionaries and sample texts open-sourced. The creation of more remarkable works now depends on the subsequent efforts of the industry, but at least, the pen has been handed to everyone.
