Submitted Abstract
Directed evolution (DE) has enabled biochemists to create new enzymes for a variety of useful applications. This revolutionary method can, however, be employed only once initial activity of an enzyme toward a substrate has been established, which creates a considerable bottleneck for engineering novel, useful enzyme properties. During our engineering efforts, we carry forward only the best-performing sequences and discard a huge amount of data. I propose to exploit this wealth of untapped information using machine learning (ML). With this approach, I will predict and identify new sequences that lead to a variety of useful traits, mitigating that key initial hurdle in enzyme engineering. Creating a novel way to make significant, data-driven advances in enzyme discovery is a crucial step for next-generation enzyme engineering. This project will develop ML strategies that make use of rapidly growing enzyme databases, in particular the large number of databases on the role of cytochrome P450s in drug metabolism. The proposed ML model, combined with directed evolution techniques pioneered in the Arnold lab, will allow me to develop enzymes much more rapidly than state-of-the-art techniques and produce a wide range of substrates critical for evaluating pharmacological and toxicological properties of drugs. Further applications of the ML model are envisioned to impact pharmaceutical sciences, chemistry, agriculture, and many other fields. This project will eventually create a publicly available database of enzyme sequences that can be matched to substrates and reactions. Such a resource will greatly expand the synthetic scope of biocatalysis, a sustainable and environmentally friendly route to making chemicals.