ZymCTRL is the world's first open-source, text-based enzyme
generation model and can be used across multiple industries,
including therapeutics and sustainability initiatives.
The AI model produced sequences that yielded functional
enzymes with desirable characteristics for industrial
applications.
LONDON, June 18, 2024 /PRNewswire/ -- Basecamp Research,
a world leader in artificial intelligence (AI)-based design of
proteins and other biological systems, in partnership with the
Ferruz Laboratory at the Institute of Molecular Biology of
Barcelona, today announced the
release of ZymCTRL ("enzyme control"), a ChatGPT-like tool that
generates new enzyme sequences from scratch when a user simply
types in an enzyme identification code (an EC number), which
specifies the desired activity.
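For readers unfamiliar with enzyme identification codes, the following minimal Python sketch shows the standard four-field Enzyme Commission (EC) number format, for example 1.1.1.27 for L-lactate dehydrogenase and 4.2.1.1 for carbonic anhydrase. It only illustrates the code format; it is not ZymCTRL's own input handling, and the function names `parse_ec_number` and `ec_class_name` are hypothetical.

```python
import re

# Top-level enzyme classes in the Enzyme Commission nomenclature.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
    7: "Translocases",
}

# A fully specified EC number has four dot-separated integer fields.
EC_PATTERN = re.compile(r"(\d+)\.(\d+)\.(\d+)\.(\d+)")


def parse_ec_number(ec):
    """Return the four integer fields of an EC number such as '1.1.1.27'.

    Hypothetical helper for illustration; raises ValueError on bad input.
    """
    match = EC_PATTERN.fullmatch(ec.strip())
    if match is None:
        raise ValueError(f"not a four-field EC number: {ec!r}")
    fields = tuple(int(group) for group in match.groups())
    if fields[0] not in EC_CLASSES:
        raise ValueError(f"unknown top-level EC class: {fields[0]}")
    return fields


def ec_class_name(ec):
    """Return the top-level class name for an EC number."""
    return EC_CLASSES[parse_ec_number(ec)[0]]


if __name__ == "__main__":
    for code in ("1.1.1.27", "4.2.1.1"):
        print(code, parse_ec_number(code), ec_class_name(code))
```

The first field gives the broad reaction class and each later field narrows it, which is why a single short code is enough to specify the desired activity.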
Large language models (LLMs), such as ChatGPT, have proven
useful in helping scientists design and generate protein
sequences. However, current models require further training as
well as conditioning on a known protein starter sequence ("seed
sequence") for protein generation.
ZymCTRL is a next-generation end-to-end protein LLM that offers
rapid, cost-effective design capabilities for generating artificial
enzymes. In contrast to other LLMs, the tool requires no seed
sequence, giving end users complete control. Another important
feature is ZymCTRL's ability to create enzyme sequences that work
but share only 30% resemblance to those in the training set –
expanding the possibilities for designing new enzymes.
"With ZymCtrl, generating highly specific enzymes is as easy as
interacting with a chatbot," said Noelia Ferruz, who has been
partnering with Basecamp Research for over two years. The Ferruz lab
is considered a pioneer in the field of AI for protein design,
having previously built ProtGPT2, a deep unsupervised language
model for protein design.
"Even before the release of ChatGPT, we began working on
large language models with Noelia because we think these models
represent the future of biological research and protein design,"
said Dr. Philipp Lorenz, CTO of
Basecamp Research. "We're deeply excited by these results and
ZymCTRL's ability to create functional enzymes that can solve some
of today's biggest challenges, from finding new ways to treat
devastating diseases to building greener and more sustainable
catalytic processes in bioindustry."
The open-source ZymCTRL model has been independently reviewed by
academics in Structural Biology and ChemBioChem, both peer-reviewed
scientific journals. In ChemBioChem, researchers at the Institute
of Biochemistry at Austria's Graz University of Technology
cited ZymCTRL's efficiency and ease of use. "ZymCtrl designs
putative enzyme variants on consumer GPUs within seconds and,
remarkably, it creates these sequences with only an EC number as
input," wrote Horst Lechner,
principal investigator at the institute, which focuses on
enzyme design that differs from what's seen in nature.
Basecamp Research is sharing ZymCTRL open source with
researchers and sees an array of potential applications, including
designing enzymes for disease treatment and diagnostics, biofuel
production, sustainable agriculture innovations and much more.
While ZymCTRL was initially trained on publicly available
datasets, it can also be integrated with other datasets, including
Basecamp Research's proprietary BaseGraph database, to further
optimise the model and improve sequence outputs.
Highlights
- ZymCTRL was first trained on the BRENDA enzyme database,
comprising 37M enzyme sequences.
- From this, the team generated sets of carbonic anhydrases,
enzymes that accelerate the conversion of carbon dioxide to
bicarbonate, helping capture and store CO2, and lactate
dehydrogenases, enzymes that help convert sugar into energy in our
cells, with no further fine-tuning for the AI model.
After producing and purifying the proteins, several showed enzyme
activity despite less than 40% of their sequences resembling
proteins seen in the public database. This happened with no
additional adjustments to the model.
- To correct for potential biases in public databases, which have
uneven sampling due to a lack of biodiversity, ZymCTRL was
adjusted using a wider range of lactate dehydrogenase sequences
from Basecamp Research's proprietary BaseGraph dataset.
- With this fine-tuning, the team created lactate dehydrogenases
with higher quality scores in silico (in computer
simulations), such as better predicted local distance difference
test (pLDDT) values, compared to sequences generated without this
fine-tuning.
- Remarkably, the active enzymes continued to show significant
activity at a high temperature of 45°C as well as across a broad pH
range of 4.5 to 9.5 – meaning they can work or stay stable in
slightly acidic to slightly basic environments – offering
significant industry advantages over naturally occurring lactate
dehydrogenases. This excellent pH tolerance allows a single enzyme
to be used in many different processes with different pH levels,
making the enzyme very useful and adaptable for many
applications.
- Two of the artificial lactate dehydrogenase enzymes were
produced in larger amounts and successfully freeze-dried. They kept
their activity and showed they could work in complex reactions
under harsh conditions, supporting their potential for industrial
use.
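The resemblance figures quoted in this release (30% and under 40%) describe how closely generated sequences match known ones. As a rough illustration, the Python sketch below computes percent identity for two sequences that have already been aligned, with "-" marking gaps. The release does not say which measure or alignment tool the authors used, so the function name `percent_identity`, the choice of denominator and the toy sequences are all assumptions.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity between two pre-aligned sequences of equal length.

    Assumption for this sketch: the denominator is the number of alignment
    columns in which at least one sequence has a residue. Other conventions
    exist (for example, dividing by the shorter sequence length).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap and b == gap:
            continue  # column is a gap in both sequences; ignore it
        compared += 1
        if a == b:
            matches += 1  # identical residues (a gap cannot reach here)

    if compared == 0:
        raise ValueError("no comparable columns in the alignment")
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Toy, made-up aligned fragments; not real enzyme sequences.
    generated = "MKV-LA"
    reference = "MRVQLA"
    print(f"{percent_identity(generated, reference):.1f}% identity")
```

In practice the alignment itself would come from a dedicated tool; this sketch only covers the final counting step.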
"Beyond the obvious excitement of being able to generate truly
de novo proteins, the results are a further testament to the
ability of Basecamp Research's dataset to produce better results
compared to publicly available datasets, which barely scratch the
surface of the Earth's immense biodiversity," added Dr.
Glen Gowers, co-founder of Basecamp
Research. "Earlier we were able to show that our BaseFold model,
also powered by our dataset, outperformed AlphaFold2 in predicting
protein structures. Generative AI is going to have a huge impact
across biotech, and we're dedicated to collecting the data and
tools needed to make its potential a reality."
The full preprint can be found
here: https://www.biorxiv.org/content/10.1101/2024.05.03.592223v1
Basecamp Research invites the research community to try ZymCTRL
and has released it for public use on Hugging Face:
https://huggingface.co/AI4PD/ZymCTRL
For media and other inquiries, please contact
press@basecamp-research.com, +44 07867 488769
About Basecamp Research
Basecamp Research is a leader in mapping biodiversity for
AI-based design of biological systems. We match and refine novel
proteins for our partners' exact industrial, therapeutic or
diagnostic applications using BaseGraph™, a new generation of AI
design that is powered by the first-ever high-resolution map of
global genetic biodiversity.
Understanding the full genetic, evolutionary, and environmental
context of each protein allows Basecamp Research to design
tailored proteins for specific applications without
the need for expensive and time-consuming directed evolution
campaigns. We're a team of explorers, scientists and policy experts
driven by our ambition to protect and learn from nature's
diversity, whilst delivering life-changing breakthroughs to those
who need them most.
For more information, visit www.basecamp-research.com.