AI Protein Folding: Navigating Through the Universe of Life and Glimpsing the Splendor of the Bioeconomy

09/17 2024

In "The Cambridge World History of Food" published in 2000, an anecdote is recorded: In 1728, Italian scholar Jacopo Beccari announced the discovery of something in white flour that possessed all the characteristics of "animal matter." His method involved kneading and washing raw dough in water to remove fine white starch granules, leaving behind a sticky gluten mass. As no one knew where it came from, people assumed it originated from animals. Beccari believed that these "animal matter" components made wheat particularly nutritious. However, as a whole, flour did not exhibit animal matter properties due to the overwhelming presence of starch.

Beccari's research, crude as it seems by modern standards, inadvertently opened a door to the microscopic world of life for future generations. A century later, in 1838, Dutch physician Gerrit Mulder published an article stating that all significant "animal matter" he analyzed shared the same basic composition: 40 carbon atoms, 62 hydrogen atoms, 10 nitrogen atoms, and 12 oxygen atoms, represented as C40H62N10O12. These "animal matter" components exhibited different properties solely due to the number of sulfur or phosphorus atoms attached to them. Mulder called the substance "protein," a name suggested to him by the Swedish chemist Jöns Jacob Berzelius and derived from the Greek proteios, meaning "primary." His research was the first to establish protein as one of the fundamental building blocks of animals and plants.

As history progressed, humanity began to understand and study life at the molecular level in the 20th century. Alongside the revelation of DNA's secrets, which sparked a significant leap forward in the field of life sciences, the importance of proteins as the material basis of life and primary executors of biological activities gradually emerged for scientists. Protein research, particularly on their three-dimensional structures, advanced slowly for a long time before AI finally cracked the code in the first two decades of the 21st century. "AI Protein Folding" became a pivotal achievement in life sciences and scientific research as a whole.

A new panorama of the bioeconomy gradually unfolds before us: utilizing AI to design proteins (rather than selecting them from nature) and produce products tailored to human needs, such as medicines, foods, condiments, new materials, nutritional supplements, cosmetics, and more. This shift drives a transition from a society built on highly polluting, energy-intensive chemical raw materials to a green, sustainable bio-based one, a goal that scientists and industry alike have long pursued.

Returning to the present, we often speak of humanity navigating two universes in the 21st century: one outward, towards the depths of the cosmos; the other inward, towards the mysteries of life sciences. China's rapid rise in aerospace engineering has reignited humanity's long-stalled exploration of space. Similarly, the spaceship called "AI Protein Folding" has also taken off, powered by Chinese scholars, breaking through the atmosphere of humanity's exploration into the universe of life.

Midway through the year, it's a time for reflection and summary. Let's look back together at its origins, launch, and future trajectory.

Let's rewind to the beginning and reacquaint ourselves with something both familiar and mysterious: proteins.

They are familiar because, in today's affluent world, the term "protein" appears frequently. Articles and videos on diet and health constantly remind us that certain products are rich in specific proteins, making it clear that proteins are essential nutrients for our bodies. Yet, they remain mysterious because most people have a limited understanding of their roles, values, and underlying mechanisms.

From a life sciences perspective, proteins are one of the four primary macromolecules in living organisms (alongside nucleic acids, polysaccharides, and lipids). DNA, as the carrier of genetic information, stores hereditary data. Research, technology, and applications surrounding DNA constitute one of the most significant advancements in human life sciences in the 20th century. From the discovery of the double helix structure in the 1950s to the emergence of various cutting-edge medical technologies today, the story of DNA is no longer alien to us.

However, what many people overlook is that genetic information must be transcribed and translated into proteins to execute various functions in living organisms. Proteins are indispensable for all life activities, including growth, development, movement, heredity, reproduction, and more. They are thus termed the "material basis of life" and "primary executors of biological activities."

So, how do proteins perform such diverse roles?

The answer lies in their rich and complex spatial structures, which dictate their functions. Proteins consist of amino acids as basic building blocks. The particular order (sequence) of amino acids and the way the chain subsequently folds produce a specific three-dimensional structure, which in turn enables a specific function. With 20 standard amino acids that can link in any order and to any length, the number of theoretically possible proteins is astronomical: chains of about 1,000 residues alone admit 20^1000 sequences, which is more than 10^1300 and vastly surpasses the number of atoms in the universe. This is the source of their rich and complex functionality.
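
The scale of that number can be checked with a few lines of arithmetic. The sketch below assumes an alphabet of 20 standard amino acids and a handful of illustrative chain lengths; the function name and the chosen lengths are assumptions made for this example, not figures from any cited study.

```python
import math

STANDARD_AMINO_ACIDS = 20  # assumed alphabet size: the 20 standard amino acids


def log10_sequence_count(length: int, alphabet: int = STANDARD_AMINO_ACIDS) -> float:
    """Return log10 of the number of distinct chains of the given length."""
    # There are alphabet ** length sequences; the logarithm keeps the number readable.
    return length * math.log10(alphabet)


if __name__ == "__main__":
    # Illustrative chain lengths (assumptions chosen for this example).
    for n in (100, 400, 500, 1000):
        print(f"length {n}: about 10^{log10_sequence_count(n):.0f} sequences")
```

Because log10(20) is about 1.301, every additional residue multiplies the count by 20, and a 1,000-residue chain already corresponds to roughly 10^1301 sequences.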

In an ideal scenario, deciphering the three-dimensional structures of proteins formed by amino acids during folding would enable a profound understanding of their roles and mechanisms. Furthermore, designing, modifying, or even creating novel proteins with specific functions based on this knowledge would yield immeasurable value. For instance, in drug development, targets, antibodies, peptides, protein vaccines, and fusion proteins are all proteins. Designing innovative protein-based drugs could significantly increase the chances of addressing complex medical challenges. In food science, developing high-quality, safe, and affordable alternative protein sources could enrich human nutrition and address food shortages. In materials science, optimizing proteins to create biodegradable and recyclable eco-friendly biomaterials could promote sustainable development.

Yet, the reality is far from ideal. Merely unraveling the composition and structure of proteins has consumed nearly a century of scientific endeavor. Hermann Emil Fischer, the 1902 Nobel Laureate in Chemistry, pioneered the theory that amino acids link through peptide bonds to form proteins, initiating protein structure research. However, it wasn't until half a century later, in 1959, that Max Perutz and John C. Kendrew, using emerging X-ray crystallography, first visualized the details of hemoglobin and myoglobin molecules, earning them the 1962 Nobel Prize in Chemistry. Around the same time, Christian Boehmer Anfinsen proposed that all information necessary for a protein's final conformation is encoded in its amino acid sequence, a theory known as "Anfinsen's dogma," which laid the foundation for protein structure prediction and earned him the 1972 Nobel Prize in Chemistry.

Subsequent protein structure research progressed slowly over half a century. Scientists employed various experimental techniques, including X-ray crystallography, nuclear magnetic resonance, and cryo-electron microscopy, to determine protein three-dimensional coordinates. However, these methods were time-consuming, expensive, and had low success rates. Coupled with the vast diversity of proteins, experimental efforts remained insufficient.

In the 1990s, as computing power grew, computational methods based on energy optimization were introduced. They rest on "Anfinsen's dogma," under which a protein folds into its lowest-energy state: the computer starts from a candidate conformation and adjusts it step by step to lower its computed energy, and the lowest-energy conformation found is taken as the predicted structure.
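
The general idea of energy optimization can be illustrated with a deliberately simplified sketch. It assumes a two-dimensional chain of beads, a made-up energy function (harmonic bonds plus a Lennard-Jones-style term), and a greedy random search; none of this is a real force field or any published folding program, and all names and constants are hypothetical.

```python
import math
import random


def energy(coords):
    """Toy energy: harmonic bonds between neighbours plus a Lennard-Jones-style
    term between non-neighbouring beads. Not a real force field."""
    total = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(coords[i], coords[j])
            if j == i + 1:
                total += 10.0 * (d - 1.0) ** 2  # keep bonded beads about 1 unit apart
            else:
                d = max(d, 0.3)  # clamp to avoid a numerical blow-up at tiny distances
                # Repulsive core plus attraction; the minimum (-1) sits at d = 1.
                total += (1.0 / d) ** 12 - 2.0 * (1.0 / d) ** 6
    return total


def minimize(coords, steps=5000, step_size=0.1, seed=0):
    """Greedy random search: nudge one bead, keep the move only if energy drops."""
    rng = random.Random(seed)
    coords = [tuple(p) for p in coords]
    best = energy(coords)
    for _ in range(steps):
        i = rng.randrange(len(coords))
        x, y = coords[i]
        trial = list(coords)
        trial[i] = (x + rng.uniform(-step_size, step_size),
                    y + rng.uniform(-step_size, step_size))
        e = energy(trial)
        if e < best:
            coords, best = trial, e
    return coords, best


if __name__ == "__main__":
    extended = [(float(i), 0.0) for i in range(8)]  # fully stretched chain
    folded, e_final = minimize(extended)
    print(f"start energy {energy(extended):.2f}, final energy {e_final:.2f}")
```

Even in this toy setting the outcome depends entirely on the chosen energy function and on where the search starts, and a greedy search can stall in a local minimum.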

While these energy optimization methods yielded some results, predictions remained unsatisfactory and deviated significantly from experimental outcomes. Proteins are complex systems composed of thousands of atoms, so the space of conformations to be searched is enormous. Furthermore, while researchers agree that proteins fold to their lowest energy state, there is no consensus on the specific form of the energy function.

The significant research value, coupled with limited means and sluggish progress, has made protein structure research the "crown jewel" of modern molecular biology. Within the last four decades of the 20th century alone, protein-related achievements garnered seven Nobel Prizes, underscoring its difficulty and value.

Thus, delving deeper into the vast protein universe to unravel more life mysteries emerges as a clear navigation point in the exploration of the life cosmos.

Entering the 21st century, machine learning gradually emerged as a crucial research direction in computer science, influencing protein structure research. Traditional machine learning methods directly mapped protein amino acid sequences to three-dimensional conformations, slightly outperforming physics- or statistics-based approaches but failing to achieve a fundamental shift.

Then, a new key emerged.

A milestone in AI was the emergence of deep learning. In 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton published "ImageNet Classification with Deep Convolutional Neural Networks," introducing AlexNet, which won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 by a substantial margin and set new records. This propelled deep learning into the limelight and, after years of dormancy, set off the third wave of AI.

Deep learning algorithms employ multi-layer neural networks loosely modeled on the neurons of the brain. Their strength lies in learning the mapping from input to output as a whole rather than relying on hand-specified intermediate steps. In protein structure research, researchers train a model on amino acid sequences paired with their known structures so that it learns to predict the structure of a new sequence on its own. This approach broke with conventional thinking and made AI-based protein structure prediction feasible.

Some scientists persevering in protein structure research had keenly noticed this new tool, but initial attempts yielded marginal improvements over traditional machine learning methods. The first encounter between deep learning and protein analysis did not produce overwhelming results.

It was a Chinese scholar's research that truly ushered in a dawn for this field.

In 2014, Jinbo Xu (Xu Jinbo), a professor at the Toyota Technological Institute at Chicago, designed a new deep learning algorithm to tackle a simpler problem: predicting protein secondary structures, which concerns the spatial arrangement of peptide backbone atoms rather than amino acid side chains. Tests confirmed its effectiveness. In 2015 and 2016, Professor Xu further developed improved deep learning algorithms that directly predict protein three-dimensional structures.

In summer 2016, Professor Xu's RaptorX-Contact algorithm demonstrated that deep residual convolutional neural networks significantly enhanced protein structure prediction performance. At the Critical Assessment of Protein Structure Prediction (CASP12) competition that year, RaptorX-Contact scored highest in predicting protein contact maps, garnering academic attention. Prior to this, CASP averages hovered around 30 points, but Professor Xu's algorithm reached 60 points, doubling the previous level. The research was published in PLoS Computational Biology in 2017 and later received the PLoS Computational Biology Innovation Award.
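
For readers unfamiliar with the term, a contact map is a square table that records which pairs of residues sit close together in the folded structure. The sketch below computes one from known coordinates. The 8-angstrom cutoff is a commonly used convention assumed here, and the coordinates and function name are hypothetical; this is not RaptorX-Contact, which predicts such a map from sequence alone.

```python
import math


def contact_map(coords, threshold=8.0):
    """Return an L x L matrix of 0/1 flags: 1 if two residues lie within `threshold`."""
    n = len(coords)
    return [[1 if math.dist(coords[i], coords[j]) <= threshold else 0
             for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    # Hypothetical coordinates (in angstroms) for a five-residue, hairpin-like toy chain.
    toy = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
           (7.6, 3.8, 0.0), (3.8, 3.8, 0.0)]
    for row in contact_map(toy):
        print(row)
```

A contact predictor outputs an estimate of this matrix for a sequence whose structure is unknown, and the estimate then constrains the search for the three-dimensional fold.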

The door to the universe of proteins has finally been opened by the key of 'AI'.

Since then, Professor Xu Jinbo has continued to optimize and promote this algorithm. His core idea was quickly adopted by other researchers in the field, who used it in developing a succession of AI protein folding algorithms, and for a time deep learning results on protein structure appeared in rapid succession. Professor Xu himself soon realized that the distances between amino acids cannot be predicted pair by pair but must be predicted for all pairs together, and he once again took the lead by developing an end-to-end model. The results were published in the Proceedings of the National Academy of Sciences (PNAS) in August 2019, marking the first time globally that AI had been applied to predicting the distances between amino acids (atoms) in proteins. The work further improved the accuracy of three-dimensional structure prediction and allowed scientists to complete the task on nothing more than a laptop, pushing AI protein structure prediction to new heights.

As for the rest of the story, everyone knows it. DeepMind's AlphaFold 2, introduced at CASP14 in 2020, predicted most protein structures to within about one atom's width of the true structures, matching the accuracy of experimental determination with complex instruments such as cryo-electron microscopes and causing a tsunami-like sensation in the global scientific community. That year, Science magazine named AI protein structure prediction one of its top ten scientific breakthroughs; in 2021 it named it the top scientific breakthrough, and in 2022 MIT Technology Review selected it as one of its ten breakthrough technologies.

However, there is another little-known story during this period.

In the fall of 2016, Professor Xu Jinbo held a small report meeting to introduce the research results of RaptorX-Contact to academics. One of the participants was John Jumper, a postdoctoral fellow at the University of Chicago's Biophysics Department who later led the DeepMind team and designed AlphaFold. After listening to the report, Jumper fully embraced deep learning methods and joined DeepMind within a month or two.

Later, it was generally believed in the industry that the early versions of AlphaFold did not introduce many innovations in their implementation, but were based on the algorithmic ideas of RaptorX-Contact. The key idea in AlphaFold 2, namely the end-to-end model, which directly outputs the three-dimensional structure based on the characteristics of the sequence, is similarly in line with the research results published by Professor Xu Jinbo in 2019. Because of this, the achievements of AlphaFold also caused some controversy in the industry: compared to scientific research activities in university campuses, do the achievements of commercial laboratories supported by large enterprises represent more sophisticated engineering techniques rather than innovative scientific insights?

Of course, there is now a consensus on this historical period. John Moult, the founder of the CASP competition and a professor in the Department of Cell Biology and Molecular Genetics at the University of Maryland, once said, "DeepMind has done a great job in developing a very effective method. However, the concepts and methods behind this work did not come out of thin air; the key technology is the application of deep learning methods. There is no doubt that DeepMind is directly built on the work of Xu Jinbo."

At present, the tremendous impact of AlphaFold on the life sciences cannot be denied. However, the pioneering and groundbreaking achievements of Chinese scholar Xu Jinbo in promoting AI protein research and AI for Science should also not be forgotten.

As mentioned in the first part of this article, determining the three-dimensional structure of proteins will greatly aid our understanding of how life functions and help us explore its mysteries. On this basis, if we can redesign proteins to perform specific functions or even generate entirely new proteins, the value will be immeasurable. In this regard, Professor Xu Jinbo and his RaptorX-Contact have made a start, but this is just the beginning. After all, many unknowns are still waiting to be discovered in the vast universe of life. One direction is to optimize AI protein structure prediction methods so as to uncover more protein structures and gain a deeper understanding of how life functions; another, more imaginative one is AI protein optimization and design with applications in mind.

AlphaFold 2 excels at predicting the structure of individual proteins, but it can only predict, not design, and it depends heavily on MSAs (multiple sequence alignments of homologous proteins) and on the co-evolutionary information and sequence profiles derived from them. Moreover, the immense complexity of the protein world means that there is still much room for exploration in protein structure prediction, such as the interaction of proteins with other molecules, the impact of single-point mutations on protein structure and function, orphan protein structure prediction, and protein side-chain prediction. Therefore, even in the field of AI protein structure prediction, cutting-edge achievements have continued to emerge since the advent of AlphaFold 2.
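
To make "co-evolutionary information" concrete: when two positions in an alignment of related sequences tend to change together, they are often in contact in the folded protein. A classical, much simpler measure of that signal is the mutual information between two alignment columns. The sketch below uses that measure on a hypothetical four-sequence alignment; it illustrates the concept only and is not how AlphaFold 2 processes an MSA.

```python
import math
from collections import Counter


def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    col_i = Counter(seq[i] for seq in msa)   # residue counts in column i
    col_j = Counter(seq[j] for seq in msa)   # residue counts in column j
    pairs = Counter((seq[i], seq[j]) for seq in msa)  # joint counts
    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((col_i[a] / n) * (col_j[b] / n)))
    return mi


if __name__ == "__main__":
    # Hypothetical toy alignment: columns 1 and 3 always change together,
    # while column 2 varies independently of column 1.
    msa = ["AKLE", "ARLD", "AKVE", "ARVD"]
    print("columns 1 and 3:", mutual_information(msa, 1, 3))
    print("columns 1 and 2:", mutual_information(msa, 1, 2))
```

An orphan protein has few or no known homologs, so an alignment like this cannot be built for it, which is why MSA-dependent methods struggle there.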

For example, in 2021, David Baker, a recipient of the Breakthrough Prize in Life Sciences and a professor at the University of Washington known as the "Hand of God," led teams from the University of Washington, Harvard University, and the University of Texas Southwestern Medical Center, among others, to release the AI tool RoseTTAFold, which achieves accuracy comparable to AlphaFold 2 in protein structure prediction but is faster and requires less computing power. It can predict the structure not only of individual proteins but also of protein complexes. However, like AlphaFold 2, it relies on MSA and templates of similar protein structures to achieve optimal performance. In 2022, Meta also launched ESMFold, which is comparable to AlphaFold 2 in predicting the three-dimensional structure of proteins and can predict the structures of orphan proteins, with a computational speed an order of magnitude faster than AlphaFold 2 and significantly better accuracy when single sequences are input. However, Meta later disbanded the team and stopped investing heavily in this area. Besides these two well-known teams, other research and development teams continue to surpass earlier results in areas where AlphaFold 2 does not perform well.

Here's a little aside. On May 8, 2024, DeepMind, a subsidiary of Google, and Isomorphic Labs officially released AlphaFold 3, the latest AI model in the field of protein structure prediction. DeepMind claims that AlphaFold 3 can predict the structure of complexes containing almost all molecular types in the Protein Data Bank, including how ligands (small molecules), proteins, and nucleic acids (DNA and RNA) aggregate and interact, as well as predict the structural effects of post-translational modifications and ions on these molecular systems, thereby helping us observe the structure of biomolecular systems precisely at the atomic level. However, this new version is temporarily closed-source and will only be made available to academia six months later, along with the code and model weights. Therefore, the extent to which the new version surpasses its predecessor remains to be seen.

While AI protein structure prediction continues to make breakthroughs, some far-sighted scientists have turned their attention to AI protein optimization and design with greater industrial application value.

Taking biomedical drugs as an example, previously, the development of biopharmaceuticals was somewhat limited due to insufficient understanding of protein structure and function. However, if AI can be used to optimize and design proteins, it may accelerate the improvement of protein drug properties and achieve more desirable functions. AI can even be used to rapidly generate new protein drugs or even entirely new drug molecules that do not exist in nature based on specific targets, potentially curing diseases that were previously considered untreatable.

Similarly, in broader fields such as synthetic biology, agriculture, food, and new materials, AI protein optimization and design technologies leave even more room for imagination. For example, in the rapidly developing field of synthetic biology, enzymes (also a type of protein) are widely used for biocatalysis. If the structure and function of enzymes can be designed and modified to improve catalytic efficiency, stability, and selectivity, it will greatly enhance the efficiency of biosynthesis, catalysis, and conversion. Alternatively, proteins with specific functions can be designed directly: alternative protein foods that are more nutritious and easier for the human body to absorb, green biopesticides that are safe for humans and the environment, powerful plastic-degrading catalysts to help eliminate pollution, more ductile and tougher fiber materials for the aerospace industry, and proteins that raise crop yield and quality while making cultivation more environmentally friendly. With powerful protein optimization and design tools, many application directions remain to be explored.

However, compared to protein structure prediction, protein design is a more challenging problem.

First, the protein sequence space is vast. There are 20 standard types of amino acids in nature. Assuming we need to design a protein with 100 amino acids, there are 20^100 (about 10^130) possible sequences for this protein. However, only a tiny fraction of these sequences can stably fold and possess the specific functions we desire. Therefore, finding a suitable amino acid sequence in this vast space is like looking for a needle in a haystack.

Second, designing proteins based on specific functions requires a deep understanding of protein structure and function, which remains a challenge for scientists and industry.

Third, the diverse and complex demands of industry for proteins, such as designing protein drugs based on specific targets, designing enzymes that can catalyze specific substrates, or improving the catalytic efficiency of existing enzymes, undoubtedly increase the complexity of protein design research.

Take P450 enzymes (CYP), known as "universal biocatalysts," as an example. As a vast enzyme family widely distributed in organisms (consisting of multiple families, subfamilies, and individual enzymes with high diversity and complexity), they can catalyze various reaction types and recognize a wide range of substrates (substances that can undergo biochemical reactions with them), giving them great potential in drug synthesis and synthetic biology. Since naturally occurring P450 enzymes cannot perfectly meet industrial demands, the need arises to modify existing enzymes or design new P450 enzymes with new functions to broaden their application range. However, most P450 proteins are about 400-500 amino acids long, meaning that the number of possible sequences for a new P450 enzyme lies between 20^400 and 20^500, far exceeding the number of atoms in the universe (estimated at 10^78 to 10^82) and making it nearly impossible to find the right one. Moreover, since the catalytic reactions of P450 enzymes require compatible coenzymes, designing new P450 enzymes with new functions also requires considering the interactions with other proteins, which further multiplies the complexity of the design task.

Before AI technology, the scientific community was already using methods to search for potentially valuable protein molecules in the vast universe of proteins and to optimize and design protein molecules for better human use, such as directed evolution and rational design. The former mainly simulates the process of natural selection, involving multiple rounds of mutation and screening experiments on target genes until the desired superior variants are obtained. However, this technique is limited by low screening rates and the vast number of variants in sequence space. The latter selects fewer critical sites for precise modification based on sequence and structural information to construct a smaller mutation library, but it requires in-depth understanding of structural and functional information, and cannot be adjusted when experimental results do not match predictions. For enzymes like P450, researchers may spend their entire careers trying to find ideal new molecules without success. Since P450 enzymes were first discovered in the 1950s, the research community has never been able to artificially design new molecules, only partially modifying and optimizing existing ones. The research community needs more powerful tools and methods to design proteins that meet requirements faster and more accurately.
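
The mutate-and-screen loop of directed evolution can be mimicked in a few lines. In the sketch below the "assay" is simply the number of positions that match a hidden optimal sequence, a stand-in chosen for illustration; real screens measure activity in the laboratory, and every name, sequence, and parameter here is hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard amino acids


def fitness(seq, target):
    """Stand-in for a wet-lab assay: count positions matching a hidden optimum."""
    return sum(a == b for a, b in zip(seq, target))


def mutate(seq, rng):
    """Return a copy of seq with one randomly chosen position substituted."""
    pos = rng.randrange(len(seq))
    return seq[:pos] + rng.choice(AMINO_ACIDS) + seq[pos + 1:]


def directed_evolution(start, target, rounds=30, library_size=50, seed=0):
    """Each round: build a mutant library from the current parent, screen it,
    and carry the best variant forward only if it beats the parent."""
    rng = random.Random(seed)
    parent = start
    history = [fitness(parent, target)]
    for _ in range(rounds):
        library = [mutate(parent, rng) for _ in range(library_size)]
        best = max(library, key=lambda s: fitness(s, target))
        if fitness(best, target) > fitness(parent, target):
            parent = best
        history.append(fitness(parent, target))
    return parent, history


if __name__ == "__main__":
    final, history = directed_evolution("AAAAAAAAAA", "MKTAYIAKQR")
    print(final, history)
```

Each round here costs 50 simulated assays for a ten-residue toy; for a protein of several hundred residues, the library needed to sample even single and double substitutions grows far faster than any laboratory screen can follow.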

Since 2018, Professor Xu Jinbo has taken the lead in expanding his research scope to AI protein optimization and de novo design, introducing a pre-training mechanism to further explore the industrial application path of AI protein technology. He has successively launched more than ten technologies, such as an algorithm that can be used for both protein side-chain prediction and sequence design, a single-sequence structure prediction algorithm with performance comparable to ESMFold, and a complex prediction algorithm with accuracy surpassing AlphaFold 3. He has also integrated AI with molecular dynamics, quantum chemistry, and other technologies to solve scientific and industrial problems. These technologies have not only demonstrated world-leading performance in tests but have also been verified in wet-lab experiments and quickly adopted by multinational pharmaceutical companies and biotechnology companies. At the end of 2021, he returned to China and founded MoleculeAI, whose MoleculeOS is the industry's first fully functional AI protein optimization and design platform, to bring the application and social value of these research results to fruition more quickly.

Besides Professor Xu Jinbo, other teams have also been successively publishing AI protein design algorithms and exploring the generation of various functional proteins, although the results are limited to the computational level, and no industrial application results have been announced. In September 2022, David Baker's team developed ProteinMPNN, a deep learning tool for de novo protein design, which determines an amino acid sequence corresponding to a given protein structure and can generate entirely new sequences in just a few seconds, although the user cannot require the protein to possess particular properties. In July 2023, the team released RoseTTAFold Diffusion, a deep learning method for de novo protein design based on diffusion models, which can generate various functional proteins, including topologies never seen in natural proteins. However, like ProteinMPNN, it cannot perform precise conditional generation to impart specific properties to proteins. Earlier, in December 2022, Generate Biomedicines had announced Chroma, a project that uses diffusion models to generate entirely new protein structures not found in nature and even produced proteins in the shapes of the 26 English letters and 10 Arabic numerals. However, Chroma cannot generate proteins based on functional requirements and offers no guidance on assessing the functionality of the proteins it generates, so it remains closer to scientific research, and its value for industrial applications has yet to be explored.

AI protein optimization and design tools are springing up like mushrooms, pushing the exploration of the AI protein universe into a deeper space.

As we enter 2023, AI protein research, which has already ventured deep into the unknown, has received a new boost in the form of large models.

At the end of 2022, large language models represented by ChatGPT sparked a new wave of AI enthusiasm, and using AI large models to solve industrial problems has become a new trend. In the eyes of scientists, biology is a highly digitalized system with interpretable and programmable characteristics, so the generative capabilities of large models can also be applied in the field of life sciences, making the two a perfect match.

However, AI large models such as ChatGPT focus on content generation in general domains like text, images, and videos, and cannot meet the in-depth industrial needs for tasks like protein generation. The reason is that the structure formed by protein sequences is much more complex than that of natural language, and the data is also more intricate, involving highly specialized and diverse protein big data. The underlying architecture of modern general large models cannot accurately model these multimodal protein data, and to generate proteins effectively, it is necessary to build newer and more powerful AI modeling technologies from the ground up. Therefore, building large AI protein generation models and improving the efficiency and success rate of protein design have become new directions of industry interest.

In recent years, the research community has gradually produced some results. For example, in 2020, ProGen, jointly developed by AI research institute Salesforce Research, synthetic biology company Tierra Biosciences, and a group of researchers from the University of California, San Francisco, was able to generate protein sequences across multiple protein families with predictable functions in a manner similar to "sentence construction." However, it could only accept sequence signals and not structural signals, and could not simultaneously consider information such as structure, function, interaction, and evolution, resulting in a low success rate and an inability to deliver precisely the functionality that industrial applications require. In China, in 2023, BioMap and Tsinghua University jointly proposed xTrimoPGLM, a protein language model with 100 billion parameters, exploring the compatibility and potential for joint optimization between the two goals of protein understanding and protein generation. It can model individual proteins, protein interactions within cells, cells themselves, and cellular systems. In June 2024, EvolutionaryScale, founded by former Meta AI researchers, released ESM3, a protein language model that surpasses the capabilities of the aforementioned two models and supports simultaneous reasoning over sequence, structure, and function. However, it still faces issues such as insufficient generation accuracy, extreme complexity of use, and an inability to be fine-tuned.
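
The "sentence construction" analogy means that a sequence is generated one residue at a time, each choice conditioned on what has already been written. The sketch below shows that generation loop with the simplest possible stand-in, a bigram (first-order Markov) model trained on three hypothetical peptides. Real protein language models such as those named above are large neural networks that condition on far more context; nothing here reflects their actual architecture or training data.

```python
import random
from collections import Counter, defaultdict


def train_bigram(sequences):
    """Count which residue follows which, using ^ and $ as start/end markers."""
    counts = defaultdict(Counter)
    for seq in sequences:
        padded = "^" + seq + "$"
        for prev, nxt in zip(padded, padded[1:]):
            counts[prev][nxt] += 1
    return counts


def generate(counts, rng, max_len=50):
    """Sample one residue at a time, each conditioned on the previous residue."""
    out, prev = [], "^"
    while len(out) < max_len:
        options = counts[prev]
        nxt = rng.choices(list(options), weights=list(options.values()))[0]
        if nxt == "$":  # the model chose to end the sequence
            break
        out.append(nxt)
        prev = nxt
    return "".join(out)


if __name__ == "__main__":
    # Hypothetical training peptides, invented for this illustration.
    training = ["MKTAYIAK", "MKVLAAGI", "MATAYLAK"]
    model = train_bigram(training)
    rng = random.Random(0)
    for _ in range(3):
        print(generate(model, rng))
```

A model of this kind sees only sequence, which mirrors the limitation described above: it has no way to take a target structure or function as input.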

The first to achieve validation in industrial applications was Professor Xu Jinbo and his team. In 2023, not long after MoleculeAI was established, the team launched "NewOrigin (Darwin)," the industry's first industrial-grade AI protein generation large model that integrates sequence, structure, function, and evolution. It not only boasts high success rates and universal applicability, but also uses computation to reduce the heavy reliance of traditional methods on large-scale wet-lab (i.e., biological) experiments, improving production efficiency and reducing costs. Furthermore, it enables biologists without an AI algorithm background to interact with the large model through conversation.

Since then, the team has actively applied it to industrial projects, continuously obtaining feedback and optimizing it in industrial practice. In less than a year, multiple industrial application results have been achieved. For example, in the field of biomaterials, the team used NewOrigin to help a partner optimize a key protein that is an industry bottleneck but has significant commercial value. Without using any industrial scenario data, the AI designed an important enzyme structure that raised the strain's yield to 5 times that of the wild-type strain, a potential leap in performance for a protein that has been modified continuously for decades, promising a significant increase in yield and a substantial reduction in costs. In the field of innovative drug research and development, multi-objective optimization was conducted for factors such as the stability and expression level of a certain protein vaccine. Animal experiments showed that the AI-optimized vaccine produced neutralizing antibody titers several times higher than those of publicly disclosed patents and similar vaccines from large pharmaceutical companies, and broke through related vaccine stability patents. Meanwhile, an AI-designed cytokine pipeline maintained tumor-inhibiting activity while reducing peripheral activity (and thus toxicity) by a factor of hundreds, and the tolerated dose in monkeys reached tens of times that of similar pipelines. These successful industrial application results have confirmed the powerful capabilities of the AI protein large model.

The initial performance of large models has instilled confidence. With their support, the traditional "trial and error" approach to protein research will give way to a new "map-based" approach, and it may even become possible to "invent" entirely new proteins with specific functions from scratch. Programmable protein design will also meet demands that traditional methods cannot, significantly enhancing R&D efficiency and reducing costs in fields such as drug development, synthetic biology, new materials, food, agriculture, and environmental protection. A future in which AI protein large models serve as the underlying technology that drives a thriving biomanufacturing industry is already within sight.

It is worth mentioning that in September 2024, MoleculeAI announced the completion of its Series A funding round, which raised hundreds of millions of yuan. The round was led by SinoNova Capital and Shenzhen Capital Group Venture Capital, with follow-on investments from SenseTime Capital and Jiuyi Capital. Including this round, MoleculeAI has completed three funding rounds in total, with earlier investors including the leading synthetic biology company Cathay Industrial Biotech, Sequoia Capital China, Baidu Ventures, and Lenovo Capital. MoleculeAI can now be said to have grown into a benchmark AI biomacromolecule design platform company, opening a new chapter in the construction of China's AI biological infrastructure.

Professor Xu Jinbo stated that this round of funding will be used to further expand the company's top-tier interdisciplinary team of technical and industrial talent, and to advance the construction of high-performance computing platforms and intelligent high-throughput wet laboratories. It will also be used to deepen work on bioeconomy infrastructure such as the AI protein foundational large model and MoleculeOS, the AI protein optimization and design platform, further promoting the industrial application and commercial development of AI protein technology.

With the boost of large models, the stars in the depths of the AI protein universe are increasingly within reach.

In the latter half of the 20th century, people witnessed rapid advances in biotechnology, represented by genetic technology, and the resulting improvements in medicine and healthcare, agriculture, and livestock production. With the arrival of the 21st century, a new generation of biotechnology emerged, represented by synthetic biology and AI protein folding. A new technological pathway, one that does not rely on fossil fuels but instead drives social development through biomanufacturing and bioproducts, has captured the imagination of human society.

In a research report published in 2020, the McKinsey Global Institute noted that as much as 60% of the physical inputs to the global economy could be produced through biotechnology, with a potential economic value of up to $4 trillion. Faced with such enormous economic value, along with the depletion of fossil fuels and worsening environmental pollution, countries around the world have begun top-level design and forward-looking planning for the bioeconomy and for innovative applications of biotechnology, hoping to seize the initiative in the great transformation of the bioeconomy era.

Currently, more than 60 countries and regions, including China, the United States, Japan, and the European Union, have formulated specific policies for biomanufacturing or the bioeconomy, updated their national and regional bioeconomy development strategies, and drawn up biomanufacturing roadmaps and action plans.

In the United States, the White House launched the National Biotechnology and Biomanufacturing Initiative in 2022 and in 2023 released a timeline of "Biotechnology and Biomanufacturing Goals," establishing the National Bioeconomy Board to significantly accelerate the speed, success rate, and innovation efficiency of biomanufacturing and to address issues unsolvable by biological experimentation. In March 2024, the European Commission released a policy paper titled "Building the future with nature: Boosting Biotechnology and Biomanufacturing in the EU." It proposed a series of targeted measures to promote biotechnology and biomanufacturing in the EU: making effective use of research results and promoting innovation, stimulating market demand, simplifying regulatory pathways, encouraging public and private investment, developing and updating standards, and pursuing international cooperation. In May 2024, the Japanese government proposed achieving a bioeconomy market size of 100 trillion yen by 2030. In biomanufacturing, it will promote the establishment of a microbial and cellular design platform that integrates biotechnology with digital technologies such as AI, and will improve infrastructure such as biofactories. In China, the "14th Five-Year Plan for the Development of the Bioeconomy," issued in 2022, was the country's first five-year plan devoted specifically to the bioeconomy, and it identified biomanufacturing as a strategic emerging industry direction for the bioeconomy. In 2024, "biomanufacturing" was included for the first time in the government work report delivered at the Two Sessions, where it was named a new growth engine.

Against this backdrop, AI protein folding holds immense significance as a pivot on which many other developments turn. The technology sits at the intersection of rapidly advancing AI and the high-value bioeconomy, and by combining the strengths of both it has achieved results that neither could reach alone.

In the AI protein field, Isomorphic Labs, which builds on DeepMind's research, has formed strategic partnerships with Novartis and Eli Lilly in AI drug development. The AI-driven protein design company Generate Biomedicines signed a collaboration with biotechnology giant Amgen, worth up to $1.9 billion, to develop protein therapies. Ginkgo Bioworks, a representative synthetic biology company, is collaborating with Google Cloud to develop new large language models for applications such as drug discovery and biosafety, and with the United States Defense Advanced Research Projects Agency (DARPA) to explore how cell-free protein synthesis (CFPS) technology can be used for on-demand protein production. In 2023, NVIDIA invested in nine startups that apply generative AI to drug research and development... The combined involvement of capital, technology, and application partners will accelerate the development of AI protein technology and bring faster, larger-scale deployment.

From the vantage point of 2024, it seems certain that the route through the universe of life unlocked by AI protein folding will soon bring greater splendor to the bioeconomy and to human health.