02/26 2026

Recently, a chip company named Taalas has emerged, drawing significant attention from the industry.
Founded in 2023 by industry veterans including Ljubisa Bajic, the Toronto-based startup Taalas has shaken up the AI hardware market with its HC1 chip. The company has broken away from traditional AI hardware design by etching the weights of a large AI model directly into the chip's metal interconnect layers, achieving extreme memory-computation integration. This enables the chip to reach an inference speed of 17,000 tokens per second, far surpassing NVIDIA's H200 (~230 tokens/sec) and B200 (~2,000 tokens/sec). This innovative approach has prompted the industry to reconsider: Is etching large models directly into chips a new direction for breaking through AI hardware bottlenecks, or merely a niche attempt that will be stranded by technological iteration?
01
Sacrificing Universality for Ultimate Performance and Energy Efficiency
The HC1 chip from Taalas essentially abandons the universal approach of "one chip for all models" and instead adopts a "custom silicon structure for specific models." Utilizing TSMC's 6nm process and Mask ROM technology, the chip hardcodes model weights directly onto the silicon wafer, eliminating data movement between computation and storage at the physical level and significantly addressing the industry's memory wall problem. Additionally, it abandons liquid cooling and HBM memory, opting for air cooling instead, which reduces power consumption and hardware costs. The accompanying software stack is also highly simplified due to the hardware fixation of model weights and structure, eliminating the need for complex optimization layers and further enhancing performance and energy efficiency.
This extreme customization gives the HC1 chip significant advantages in performance and cost: its token throughput is nearly 10 times that of NVIDIA's most powerful GPU, at roughly 1/20th the hardware cost and 1/10th the power consumption of traditional GPU solutions. The price is a complete loss of universality—the HC1 chip can only run one specific model, Llama 3.1 8B, and any model update or iteration requires taping out a new chip. This extreme specialization can, however, be extended to larger models. Taalas has provided simulated data for DeepSeek R1 671B: a 671B-parameter model would require approximately 30 chips working in tandem, with each chip handling about 20B parameters (using the MXFP4 format and moving SRAM onto separate chips to increase density). Thirty chips mean 30 incremental tape-outs, but Bajic points out that since only two mask layers change each time, the cost of these incremental tape-outs is modest.
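The partitioning and speedup figures above are easy to sanity-check. The following back-of-envelope sketch uses only numbers quoted in this article (not official Taalas or NVIDIA specifications):

```python
# Back-of-envelope check of the figures quoted above.
# Every constant is taken from the article, not from vendor specs.

model_params_b = 671      # DeepSeek R1 parameter count, in billions
chips = 30                # chips working in tandem (article estimate)

per_chip_b = model_params_b / chips
print(per_chip_b)         # ~22.4B, consistent with "about 20B per chip"

hc1_tps = 17_000          # HC1 inference speed, tokens/sec (article figure)
h200_tps = 230            # NVIDIA H200, tokens/sec (article figure)
b200_tps = 2_000          # NVIDIA B200, tokens/sec (article figure)

print(hc1_tps / h200_tps) # ~74x the H200
print(hc1_tps / b200_tps) # 8.5x the B200 -- the "nearly 10x" claim
```

The numbers hang together: 671B spread over 30 chips is about 22B per chip, and 17,000 tps over the B200's ~2,000 tps is where the "nearly 10 times" figure comes from.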
This characteristic also determines Taalas's market positioning. Rather than aiming to become the "next NVIDIA," it targets niche segments of AI inference to become a dedicated supplier in this field. Its approach is similar to Groq's LPU, but it goes even further in specialization.
Currently, Taalas is still exploring its business model, with three main possibilities: building infrastructure to provide API services, selling chips directly, and collaborating with model developers to customize dedicated chips. Whether this extremely specialized solution will be accepted by the market mainly depends on the latency sensitivity of specific application scenarios and the long-term stability of the models themselves. Despite its obvious limitations, for scenarios highly sensitive to latency and with relatively stable models, such as high-frequency financial trading, autonomous driving, and military equipment, the HC1 chip's technical solution still holds irreplaceable value.
02
Exploring Diverse Technical Routes in the Inference Chip Race
In the AI hardware field, GPUs remain unbeatable in training, but their "expensive and slow" shortcomings in inference have made inference chips a new arena for innovation among startups. Besides Taalas's hardcoding approach, the industry has seen the emergence of various technical routes, each sacrificing a traditional design element to achieve performance breakthroughs in inference, resulting in distinct technical explorations.

Taalas chooses to abandon software and adopt a hardwired approach, turning model weights and data flows directly into physical connections. In its design logic, software is purely overhead, instruction sets are wasteful, and even compilers are unnecessary—once the model is determined, the chip is directly taped out. This design minimizes power consumption and cost but reduces fault tolerance to zero, as any model changes render the chip useless.
Etched chooses to etch the architecture into the chip. Its first AI chip is an application-specific integrated circuit (ASIC) that the company claims outperforms NVIDIA's H100 in large language model (LLM) inference. As an ASIC, it hardcodes the transformer architecture onto the chip: by baking the computational logic of transformers—attention mechanisms, matrix multiplications, and activation functions—directly into the circuit design, efficiency improves dramatically. But this also means a complete loss of flexibility: the chip cannot run recurrent neural networks (RNNs), recommendation-system models, or any non-transformer AI workload.
Groq has introduced its own LPU (Language Processing Unit), which adopts a pure-SRAM architecture and drops traditional design elements such as hardware schedulers, cache-coherence protocols, and branch prediction. Its core logic is to maintain 100% hardware determinism: data movement and computation are planned entirely in advance, at the cycle level, by the compiler. This yields extremely fast inference at small batch sizes (Batch=1), and Groq's core competitiveness lies not in the chip itself but in a compiler capable of scheduling massive numbers of parallel instructions.
Cerebras's core product, the WSE (Wafer-Scale Engine), breaks away from traditional chip-cutting approaches by using an entire wafer as a single large chip, integrating massive amounts of SRAM and computational cores. The underlying logic of this design is to address the memory wall problem of inter-chip data transmission at the physical level, as data interaction between chips is the slowest and most energy-intensive link. This approach grants the chip unparalleled bandwidth but also pushes the physical engineering challenges of manufacturing, cooling, and fault tolerance to the extreme.
Tenstorrent (founded by chip legend Jim Keller) embraces open source and decoupling, adopting the RISC-V instruction set paired with matrix computation units (Tensix) to create a highly programmable dataflow architecture. It is the least "ASIC-like" of the companies discussed here. Jim Keller believes AI algorithms are still iterating rapidly and hardware design must not be rigid, so Tenstorrent uses a flexible RISC-V instruction set to handle control flow and links large numbers of small compute cores and chips through a network. This brings the hardware closer to a "general-purpose computer," betting that future AI will not be a single transformer architecture but will evolve into complex software engineering involving extensive conditional judgments and logical reasoning.
03
Looking Back at History: Lessons from Hardwired Hardware and Binding Risks at Different Levels
The idea of etching programs into hardware is not new to Taalas; there have been precedents in the history of technology, and the rise and fall of these precedents provide important references for the approach of etching large models into chips.
In the late 1990s, 3dfx's Voodoo graphics card was the benchmark in 3D graphics. Its success and its failure stemmed from the same design logic—hardwiring the steps of 3D rendering. It turned the rasterization stage of 3D games (texture mapping, depth testing, blending, etc.) into a "fixed pipeline" hardcoded into the circuit, giving it a decisive speed advantage over contemporaries and making it synonymous with 3D graphics cards. But after 1999, developers began exploring richer 3D effects, such as water reflections and realistic skin, which the Voodoo, with its hardware fixed in place, could not support. It was eventually displaced by NVIDIA's GeForce line, which moved toward programmable shaders; 3dfx went bankrupt and was acquired by NVIDIA.
From 2016 to 2018, the dominant AI algorithms were CNNs (convolutional neural networks) for image recognition, and a large number of chip startups built specialized "convolution acceleration engines" in hardware around CNN computation patterns. These chips were fast and energy-efficient at image-recognition tasks such as facial recognition and autonomous driving—an approach highly similar to Taalas's today. However, "Attention Is All You Need" (2017) and the emergence of BERT (2018) shifted the underlying mathematics of large models from local convolutions to global self-attention. Companies that had hardcoded CNN logic into their chips lacked general matrix-computation capability and ran transformer architectures extremely inefficiently, if at all. This killed off most of the startups focused on specific vision algorithms in the first wave of AI-chip hype.
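The shift described above can be made concrete: self-attention is built on dense, data-dependent matrix products over the whole sequence—a workload a fixed sliding-window convolution engine cannot express. A minimal pure-Python sketch of scaled dot-product attention on toy 2-token inputs (all values are arbitrary illustrative numbers):

```python
# Scaled dot-product attention, out = softmax(Q K^T / sqrt(d)) V,
# on a toy 2-token, 2-dimensional example. The key point for hardware:
# every step is a general matrix multiply, not a local convolution.
import math

def matmul(a, b):
    # Naive dense matrix product: the primitive a CNN-only engine lacks.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

Q = [[1.0, 0.0], [0.0, 1.0]]   # queries  (seq_len x d), illustrative values
K = [[1.0, 1.0], [0.0, 1.0]]   # keys
V = [[2.0, 0.0], [0.0, 2.0]]   # values

d = len(Q[0])
scores = matmul(Q, list(map(list, zip(*K))))           # Q @ K^T
scaled = [[s / math.sqrt(d) for s in row] for row in scores]
weights = [softmax(row) for row in scaled]             # each token attends to ALL tokens
out = matmul(weights, V)                               # weighted mix of values
print(out)
```

Because the attention weights depend on the data itself and span the entire sequence, a chip whose datapath is wired for fixed convolution kernels has no efficient way to run this computation—the failure mode that doomed the CNN-era chips.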
Comparing these two cases with Taalas's approach reveals an essential difference in the degree of hardware fixation: the Voodoo card fixed the rendering pipeline, so even as technology iterated it could still run 3D games, just with dated visuals; CNN chips fixed the algorithm and could still serve traditional scenarios like facial recognition, though their applicable scenarios narrowed sharply; Taalas fixes a specific model, so once the model is updated, the chip is simply obsolete. This extreme binding makes Taalas's approach the riskiest of the three—it bets that AI algorithms have entered a plateau, with little room for architectural breakthroughs. In reality, the iteration cycle of AI models is now sometimes measured in weeks. As long as industry competition continues, no standard model will emerge, and technological change at the AI frontier will remain the Sword of Damocles hanging over this chip.
04
Not a Universal Solution, But Valuable in Specific Scenarios
From the perspective of frontier AI research, the technical route of etching large models into chips is clearly not feasible, but this does not mean the approach has no market. In many scenarios with relatively fixed model demands, it precisely addresses the pain point of excessive inference latency in large models, demonstrating unique application value.
In the industrial sector, deploying large models on the factory floor is a growing trend. Many scenarios do not need top-performing models; distilled lightweight models (such as qwen2.5) can solve problems that previously required custom software development. These scenarios prioritize model stability over iteration speed, and etching a lightweight model into a chip neatly solves the inference-latency problem. In government deployments, models are typically cut off from the internet after installation and cannot be updated online; there, hardcoding the model into hardware and swapping the hardware for subsequent updates can be more convenient than software updates. In consumer electronics, integrating dedicated chips for small models such as translation and TTS into smartphones can guarantee offline intelligence: even if better models emerge later, the chip still meets users' basic needs, while offering far better battery life than running the model in software on the phone.
In special scenarios with extreme requirements for latency and offline operation, etching large models into chips shows irreplaceable advantages. In autonomous driving, a vehicle facing an unexpected situation—a temporary road closure, a traffic officer directing traffic on site—needs reflex-like logical reasoning with latencies below 1 millisecond. Traditional autonomous-driving chips excel at rapid image recognition but cannot handle this kind of complex reasoning, while cloud-based large models suffer from round-trip latency. A dedicated chip with a hardcoded large model can run ultra-fast local inference to handle such situations. In high-frequency quantitative finance, releases such as Federal Reserve statements, non-farm payrolls, and corporate earnings often trigger rapid market moves; a dedicated chip can parse the information, judge the market's direction, and convert it into trading signals at extreme speed—effectively an "edge" for quantitative trading. In the military sector, a large-model chip hardcoded and physically isolated can operate independently on disconnected "information islands," relying on the knowledge encoded in its parameters at the time of manufacture to perform rapid tactical analysis and confidential decision-making.
At the same time, there is no need to worry that large models hardcoded into chips will remain stuck at the knowledge level of their manufacture due to an inability to iterate. In reality, what is hardcoded into the chip is only the model's architecture and weights; while the model cannot evolve further, its logical reasoning and knowledge retrieval capabilities remain top-tier. Moreover, hardcoded large models are not deprived of internet connectivity; they can still access the latest information online for analysis and problem-solving, just without upgrading their capabilities through model iteration.
05
Controversy and Future Possibilities: A Game of Iteration Cycles
The future development prospects of the approach of etching large models into chips will revolve around a game between iteration cycles and costs, which is also the main point of controversy in the industry.
One of Taalas's competitive advantages is its claim to shorten the cycle of "converting large models into custom chips" from the traditional one year to just two months. Additionally, by abandoning expensive HBM chips and adopting a 6nm process for dedicated chips, its hardware costs are only 1/20th of GPU solutions like NVIDIA's H100. From a cost perspective, this approach has significant advantages. Based on a processing speed of 17,000 tps, a single HC1 chip's processing capability is comparable to an NVIDIA 8-card server. As long as the total cost per chip does not exceed $10,000, it will have strong market competitiveness.
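The "comparable to an 8-card server" claim is simple arithmetic on the article's own throughput figures. In the sketch below, the server price is an illustrative assumption (not a quoted number), used only to show the shape of the cost-per-throughput comparison:

```python
# Rough cost-per-throughput comparison implied by the passage.
# Throughput numbers are the article's claims; prices marked "assumed"
# are illustrative, not vendor pricing.

hc1_tps = 17_000                      # single HC1 chip, tokens/sec (article figure)
gpu_tps = 2_000                       # one high-end GPU, tokens/sec (article's B200 figure)
server_gpus = 8

server_tps = gpu_tps * server_gpus    # 16,000 tps for an 8-card server
print(hc1_tps / server_tps)           # ~1.06 -> "comparable" holds

hc1_cost = 10_000                     # the article's break-even ceiling per chip
server_cost = 300_000                 # ASSUMED price for an 8-GPU server, for illustration only

print(hc1_cost / hc1_tps)             # dollars per token/sec, HC1
print(server_cost / server_tps)       # dollars per token/sec, GPU server
```

Under these assumptions the cost per unit of throughput differs by more than an order of magnitude, which is why the article treats $10,000 per chip as the competitiveness threshold.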
The real issue is the iteration cycle. Even though a two-month tape-out cycle is a major reduction, AI models currently iterate roughly monthly. Two months is enough for competitors to launch a newer generation of models, leaving the chip obsolete at mass production—the most fatal flaw of this approach. More fundamentally, binding software's fastest-iterating, least stable element (the model) to hardware's slowest-iterating, most stable one sacrifices technological abstraction for short-term headline performance, and this is the industry's central criticism of the approach.
However, Taalas has designed LoRA (low-rank adaptation) mounting into its chips, which can partially compensate for the inability to iterate the model. Beyond that, this approach is ultimately an economic question. As large-model technology matures, architectures and capabilities will approach their limits and update cycles will lengthen. When model iteration slows below chip tape-out speed, the economics of this approach will become prominent. Taalas is betting that large-model technology will enter a plateau: once the technology matures and base models no longer need frequent updates, its early investment in dedicated-chip solutions will put it ahead of the industry.
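LoRA is what makes a partially updatable hardwired chip plausible: the large weight matrix W stays frozen (as if in Mask ROM) while only two tiny low-rank factors B and A live in writable storage, with the effective weight W + B·A. A pure-Python sketch with illustrative shapes and values (this illustrates the LoRA idea generically, not Taalas's actual implementation):

```python
# LoRA sketch: y = x @ (W + B @ A), with W frozen ("in ROM") and only
# the small factors A (r x d) and B (d x r) updatable. Shapes illustrative.

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

d, r = 4, 1                                 # hidden size and LoRA rank (illustrative)
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen base weights
B = [[0.1] for _ in range(d)]               # d x r, writable
A = [[0.0, 1.0, 0.0, 0.0]]                  # r x d, writable

# Only d*r + r*d = 8 values are writable, versus d*d = 16 frozen ones;
# at real model scale the writable fraction becomes tiny.
W_eff = add(W, matmul(B, A))                # effective weight W + B @ A

x = [[1.0, 2.0, 3.0, 4.0]]
y = matmul(x, W_eff)
print(y)
```

The design point: swapping in a new adapter only rewrites B and A, so behavior can be adjusted after manufacture without touching the mask-programmed weights—though, as the article notes, this adapts rather than replaces the base model.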
06
Conclusion
Integrating large models into chips is not a universally applicable technical route that can upend the AI hardware market. In cutting-edge AI research and development, the rapid iteration of models greatly magnifies the approach's limitations, making it difficult to become mainstream. Still, Taalas's attempt offers a fresh perspective on AI chip development: sacrificing versatility for ultimate performance and energy efficiency precisely meets demand in the niche segment of AI inference, and it offers useful lessons for the design of memory-compute-integrated and customized hardware.
The future of this technical route ultimately depends on the balance between the iteration speed of AI models and the demands of industry scenarios. When large model technology enters a stable phase, fixed scenarios with extreme requirements for latency and offline operation will eventually become the market for dedicated chips for large models. Even if technological iteration continues at a rapid pace, the innovative thinking behind this approach will drive the industry to continually explore more efficient AI hardware designs, propelling the development of AI hardware towards diversification and scenario-specific directions.