Over the past few months, it’s become clear that AI can be trained to imitate human language; just look at ChatGPT. Now, research shows that, with adequate training, similar language models can imitate human biology and evolution, and even put their own spin on them.
In a study published Thursday in Nature Biotechnology, researchers tested the ability of a language model (Salesforce’s ProGen) to generate amino acid sequences, specifically enzymes, that could potentially work in real-life scenarios. The project was a collaboration among multiple parties, including Salesforce Research and researchers at the University of California, San Francisco, and the University of California, Berkeley.
But why use a language model—something that’s been used to generate essays and articles, for example—to generate biology? Proteins can be represented as a language made up of amino acids, the 20 molecules that make up every protein.
“In the same way that words are strung together one-by-one to form text sentences, amino acids are strung together one-by-one to make proteins,” Director of AI Research at Salesforce Research Nikhil Naik wrote in an email to Motherboard. “Building on this insight, we apply neural language modeling to proteins for generating realistic, yet novel protein sequences.”
Basically, instead of learning English, the team developed AI to learn the language of proteins, Ali Madani, Ph.D., a former scientist at Salesforce Research involved with the study, wrote in an email to Motherboard.
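To make the "proteins as language" idea concrete, here is a minimal Python sketch of how a protein might be turned into tokens and next-token training examples. It is an illustration only, not code from the study: the `TOKEN_ID` vocabulary, the helper names, and the short example fragment are all assumptions made for the demo.

```python
# The 20 standard amino acids, written as one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical vocabulary: each letter maps to an integer ID.
TOKEN_ID = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence):
    """Turn a protein string into a list of integer token IDs."""
    return [TOKEN_ID[aa] for aa in sequence.upper()]


def next_token_pairs(sequence):
    """Return (context, next amino acid) pairs, the kind of examples
    a left-to-right language model learns from."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]


example = "KVFGRC"  # illustrative fragment, chosen only for the demo
print(encode(example))
print(next_token_pairs(example))
```

Each pair asks the same question a text model answers about words: given everything so far, which amino acid comes next?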
Like other AI programs, the model had to be trained. ProGen was first trained on 280 million proteins. After two weeks, the team fine-tuned the model on a dataset of about 56,000 proteins from five different families. The model then generated one million artificial sequences. The team focused on 100 of them to see how they compared to natural proteins, and whether they had adequately followed the so-called “grammar” of amino acid composition.
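One simple way to picture a check on amino acid "grammar" is to compare how often each amino acid appears in generated versus natural sequences. The Python sketch below does that with a total variation distance. This metric, the function names, and the toy sequences are assumptions for illustration; the article does not say which measures the team actually used.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequences):
    """Fraction of each standard amino acid across a list of sequences."""
    counts = Counter(aa for seq in sequences for aa in seq)
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def composition_distance(seqs_a, seqs_b):
    """Total variation distance between two compositions:
    0.0 means identical usage, 1.0 means no overlap at all."""
    a, b = composition(seqs_a), composition(seqs_b)
    return 0.5 * sum(abs(a[aa] - b[aa]) for aa in AMINO_ACIDS)


# Made-up toy sequences, not data from the study.
natural = ["KVFGRCELAA", "KIFGKCELAR"]
generated = ["KVYGRCELAA"]
print(round(composition_distance(natural, generated), 3))
```

A small distance only says the letters are used in similar proportions. It says nothing about whether the protein folds or works, which is why lab testing still matters.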
Of those 100, the team created five of the artificial proteins and tested their functionality in cells, comparing them with an enzyme found in chicken eggs, aptly named “hen egg white lysozyme” (HEWL). Two of the proteins demonstrated activity similar to HEWL, breaking down bacterial cell walls.
“The enzymes work (out-of-the-box) as well as proteins that have evolved over millions of years of evolution,” Madani said. The team also found that the model was able to capture evolutionary patterns, without specifically being trained to do so.
While AI has been used to generate proteins, this study differs a bit from prior research and further expands the idea of what is possible with language models.
“Our work uses conditional language models that allow for significantly more control over what types of sequences are generated, making them more useful for designing proteins with specific properties,” Naik wrote. “We have also validated our results in a wet lab.”
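A rough way to see what "conditional" means here: the model is told which kind of sequence to write before it starts writing. The Python sketch below uses a tiny lookup-table model (counting which letter follows which) as a stand-in for a neural network, with one table per tag. It is a much-simplified illustration, not ProGen; the tag names `family_a` and `family_b` and the toy sequences are hypothetical.

```python
import random
from collections import defaultdict

START, END = "^", "$"  # markers for the beginning and end of a sequence


def train(tagged_sequences):
    """Count which residue follows which, separately for each tag.
    tagged_sequences: list of (tag, sequence) pairs."""
    counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for tag, seq in tagged_sequences:
        tokens = [START, *seq, END]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[tag][prev][nxt] += 1
    return counts


def generate(counts, tag, rng, max_len=50):
    """Sample one sequence residue by residue, conditioned on the tag."""
    if tag not in counts:
        raise KeyError(f"unknown tag: {tag}")
    out, prev = [], START
    while len(out) < max_len:
        options = counts[tag][prev]
        nxt = rng.choices(list(options), weights=list(options.values()))[0]
        if nxt == END:
            break
        out.append(nxt)
        prev = nxt
    return "".join(out)


# Toy training data: two hypothetical families with very different "styles".
model = train([("family_a", "AAAA"), ("family_b", "CCCC")])
rng = random.Random(0)
print(generate(model, "family_a", rng))
print(generate(model, "family_b", rng))
```

Changing the tag changes what comes out, which is the kind of control Naik describes, here reduced to a toy scale.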
The methods described in the paper are also available on GitHub to enable the research community to build on this work and accelerate research on AI for protein design. As Madani sees it, proteins are the workhorses of life.
“Everything that can go wrong or right in a human body is reliant on proteins, and so designing new ones can allow us to more effectively treat diseases or even avoid them in the first place,” Madani wrote. “We can use AI to engineer these solutions.”