As the world delves deeper into the potential of GPT-4 and the competition with Claude 3.5 Sonnet, the AI research lab EvolutionaryScale, founded by former Meta engineers from the now-disbanded protein-folding team, is venturing into new territory: making biology programmable.
This task is challenging, but EvolutionaryScale, a year-old company, is already making significant strides. It recently launched ESM3, a versatile generative language model capable of following prompts to design new proteins. Remarkably, the model has generated a novel green fluorescent protein (esmGFP), a result the company estimates would take nature hundreds of millions of years to reach.
“esmGFP has a sequence that is only 58% similar to the closest known fluorescent protein. Given the natural rate of GFP diversification, this is like simulating over 500 million years of evolution,” the company shared in a pre-print paper on their website.
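For readers curious how a figure like "58% similar" is typically computed, the sketch below shows one common convention: percent identity over two sequences that have already been aligned. The function name and the toy sequences are hypothetical illustrations, not the company's method or data, and other conventions (for example, dividing by the shorter sequence length) give slightly different numbers.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns where both sequences carry the same residue.

    Assumes the two strings are already aligned to equal length, with '-'
    marking gaps. Columns where both sequences have a gap are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    # Keep every column except those that are gaps in both sequences.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if not (a == "-" and b == "-")]
    if not columns:
        raise ValueError("no comparable columns")

    # A column counts as a match only if the residues agree and are not gaps.
    matches = sum(1 for a, b in columns if a == b and a != "-")
    return 100.0 * matches / len(columns)


# Toy example (made-up sequences, not real GFP data).
toy_a = "MKT-AY"
toy_b = "MRTGAY"
print(f"{percent_identity(toy_a, toy_b):.1f}% identical")
```

Under this convention, a value of 58% would mean that a little over four in ten aligned positions differ from the closest known protein.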
Additionally, the startup announced it has secured $142 million in seed funding, led by investors Nat Friedman, Daniel Gross, and Lux Capital, with participation from Amazon and Nvidia’s venture capital arm. The smallest version of their model was also open-sourced to spur further research.
However, building ESM3 is just the beginning. Its real-world impact remains to be seen.
Why EvolutionaryScale is Targeting Biology with AI
Generative AI models have advanced significantly in understanding and reasoning with human language, prompting curiosity about whether these models can interpret the fundamental language of life and create new molecules. The core molecules—RNA, proteins, and DNA—evolved over 3.5 billion years through natural processes. Programming biology to design new molecules could help tackle major challenges like climate change, plastic pollution, and diseases such as cancer.
Organizations like Google DeepMind and Isomorphic Labs are already exploring this area, and now EvolutionaryScale has joined the effort. Founded in 2023, the company has developed several protein language models, but its latest, ESM3, is the largest and most versatile.
ESM3, a frontier generative model for biology, was trained with about 1 trillion teraFLOPs of compute (roughly 10^24 floating-point operations) on 2.78 billion natural proteins from diverse organisms and biomes and 771 billion unique tokens. The model reasons over three tracks (protein sequence, structure, and function), each represented as discrete tokens in its input and output. Users can provide partial inputs across any of these tracks, and the model fills in the rest to generate novel proteins.
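To make the idea of partial inputs across tracks more concrete, here is a minimal sketch of how such a prompt could be represented. It is not EvolutionaryScale's API: the `ProteinPrompt` class, the `make_prompt` helper, and the token values are all hypothetical, and the per-position layout of each track is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ProteinPrompt:
    """Hypothetical three-track prompt; None marks a masked position."""
    sequence: List[Optional[str]]   # amino-acid letters
    structure: List[Optional[int]]  # discrete structure token ids (made up)
    function: List[Optional[str]]   # function keywords (made up)

    def __post_init__(self) -> None:
        # Every track must describe the same number of positions.
        if len({len(self.sequence), len(self.structure), len(self.function)}) != 1:
            raise ValueError("all tracks must have the same length")

    def masked_positions(self) -> Dict[str, List[int]]:
        """Positions on each track that a generative model would need to fill."""
        tracks = {
            "sequence": self.sequence,
            "structure": self.structure,
            "function": self.function,
        }
        return {name: [i for i, tok in enumerate(track) if tok is None]
                for name, track in tracks.items()}


def make_prompt(length: int,
                sequence: Optional[Dict[int, str]] = None,
                structure: Optional[Dict[int, int]] = None,
                function: Optional[Dict[int, str]] = None) -> ProteinPrompt:
    """Build a prompt of the given length, masking everything not specified."""
    def fill(known):
        known = known or {}
        return [known.get(i) for i in range(length)]
    return ProteinPrompt(fill(sequence), fill(structure), fill(function))


# Fix two residues and one function keyword; leave everything else masked.
prompt = make_prompt(5, sequence={0: "M", 2: "G"}, function={2: "chromophore"})
print(prompt.masked_positions())
```

In this picture, the user pins down only the positions they care about, and the model's job is to propose tokens for every masked slot on all three tracks at once.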
“ESM3’s multimodal reasoning allows scientists to create new proteins with unprecedented control. For instance, it can combine structure, sequence, and function to suggest a possible structure for PETase, an enzyme that breaks down plastic waste,” explained the company.
In one instance, the model was prompted to design a new green fluorescent protein, a rare type of protein that researchers use to tag other proteins with its fluorescence. The lab found that the generated protein is about as bright as natural fluorescent proteins, even though, by the company's estimate, nature would need some 500 million years of evolution to arrive at a sequence this different.
EvolutionaryScale’s team also noted that ESM3 can self-improve, using feedback from lab experiments or existing experimental data to refine its outputs.
Impact Remains to be Seen
ESM3 comes in small, medium, and large sizes. The smallest model, with 1.4 billion parameters, is open-sourced on GitHub for non-commercial use. Meanwhile, the medium and large versions, with up to 98 billion parameters, are available commercially through EvolutionaryScale’s API and platforms from partners like Nvidia and AWS.
EvolutionaryScale hopes researchers will use this technology to address some of the world’s biggest problems, benefiting human health and society. While its broader applications are still uncertain, pharmaceutical companies might be the most significant beneficiaries, using the technology to develop new medicines for severe conditions.
Previous models from the company have been used to enhance antibody characteristics and detect COVID-19 variants that pose significant public health risks.