Identifying, Designing, and Recommending DNA/RNA Sequences Using Machine Learning
Introduction to AI in Synthetic Biology
Artificial intelligence (AI) is used in fields such as retail, security and surveillance, and healthcare. However, there is comparatively little research on its applications in synthetic biology, a field that uses engineering principles to build new biological systems and devices or to redesign biological systems found in nature. To understand how AI applies to synthetic biology, it helps to first understand the main types of machine learning.
A Brief Explanation of Machine Learning / Deep Learning
Machine learning (ML) is a branch of artificial intelligence (AI), and deep learning (DL) is a subfield of ML; both extend the capability of a machine to imitate intelligent human behavior. For example, when we look at a phone, our brains immediately recognize it as a phone. Why? We have seen phones throughout our lives, so we have learned to associate the physical object with the word. Similarly, computers pass over a dataset multiple times to “learn” the patterns that let them make accurate predictions and classifications. Different types of machine learning suit different use cases.
There are three main types of machine learning: supervised, unsupervised, and reinforcement.
- Supervised learning trains on labeled data to predict outcomes for new inputs. Part of the labeled data is held out as testing data, whose purpose is to measure how accurately the algorithm will perform on data it has not seen.
- Unsupervised learning identifies “clusters” within unlabeled data and groups the data points accordingly. For example, if 10 points sit at the top right of a graph and another 10 points sit near the bottom left, the program can tell the two groups apart even though the data points aren’t labeled.
- In reinforcement learning, the model learns from its past mistakes through trial and error. Rewards and punishments act as signals for positive and negative behavior. Although reinforcement learning may sound similar to supervised and unsupervised learning, its main goal is different: to find a strategy that maximizes the total reward of the agent.
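The clustering example from the list above (10 points at the top right, 10 at the bottom left) can be sketched in plain Python. This is a minimal illustration of k-means, one common clustering method, and not the only way unsupervised learning is done. The coordinates, the function names, and the farthest-point starting rule are all assumptions chosen to keep the example small and repeatable.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two 2D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans(points, k=2, iterations=10):
    """Group 2D points into k clusters without using any labels."""
    # Simple deterministic start (an assumption of this sketch): take the first
    # point, then repeatedly add the point farthest from the centers chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return labels, centers

# Made-up data: 10 points near the top right, 10 near the bottom left.
rng = random.Random(42)
top_right = [(rng.uniform(8, 10), rng.uniform(8, 10)) for _ in range(10)]
bottom_left = [(rng.uniform(0, 2), rng.uniform(0, 2)) for _ in range(10)]
points = top_right + bottom_left

labels, centers = kmeans(points)
print(labels)  # the first ten points share one label, the last ten share the other
```

The program never sees which group a point came from; it recovers the two groups from the distances alone.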
Current Issues in Synthetic Biology & AI’s Intervention
One of the main concerns in synthetic biology is that it may unintentionally recreate known pathogenic viruses or produce toxic biochemicals. Because of this, it can be difficult to safely test synthetic biology systems before implementing them. Artificial intelligence can help design and optimize biological systems by identifying the most promising combinations of genetic components and predicting how the resulting system will behave in various environments. Specifically, AI can be used to design an RNA sequence that folds into a specific structure and therefore performs a specific biological function. This could be monumental for biomedical research, as labs would be able to manipulate RNA sequences using safer and more accurate methods.
Gathering the Data
So how do labs get the data needed to model synthetic biology systems? To design and model a system, labs need to know its 3D structure in order to predict its behavior. There are two main experimental methods for determining the 3D structure of RNA, DNA, and proteins: nuclear magnetic resonance (NMR) spectroscopy and X-ray crystallography. These two methods are crucial to computational synthetic biology models because the structures they provide let scientists model molecular sequences and study how a molecule’s structure relates to its function.
The Basics of NMR
NMR uses the inherent magnetic properties of specific atomic nuclei to reveal the structure, concentration, and behavior of molecules in solid or liquid samples. The biological sample is placed in a sample tube and positioned in a strong magnetic field. It is then hit with electromagnetic pulses in the radio frequency (RF) range, tuned specifically to the nuclei being studied. The nuclei absorb energy from the pulse and move to a higher energy state. After the pulse ends, the nuclei relax back to the lower energy state, releasing energy in the process. The NMR instrument repeats this cycle multiple times and averages the results to strengthen the signal and reduce noise. Next, the instrument performs a Fourier transform (FT) on the recorded signal to separate out the individual frequencies that make up the composite signal. The FT converts a function of time into a form that describes the frequencies present in the original function. These frequencies make up the final NMR spectrum, which is analyzed to reveal the structure and behavior of the molecule.
However, the NMR spectra of biomolecules with large molecular weights are very complicated and can be difficult to interpret. NMR also requires large amounts of pure sample to achieve a usable signal-to-noise ratio.
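The two signal-processing ideas in this paragraph, averaging repeated scans and applying a Fourier transform, can be illustrated with a toy signal. The sketch below is an assumption-laden stand-in rather than real NMR processing: the two frequencies (50 Hz and 120 Hz), the decay rate, the noise level, and all function names are invented for the example, and a slow textbook discrete Fourier transform is used instead of an optimized library routine.

```python
import cmath
import math
import random

SAMPLE_RATE = 500                # samples per second (assumed)
N = 500                          # one second of data, so DFT bins are 1 Hz apart
FREQS = [(50, 1.0), (120, 0.6)]  # (frequency in Hz, amplitude), made up

def one_scan(rng, noise=0.3):
    """One simulated decaying signal plus random noise."""
    scan = []
    for n in range(N):
        t = n / SAMPLE_RATE
        decay = math.exp(-t / 0.5)  # the signal fades as the nuclei relax
        clean = decay * sum(a * math.cos(2 * math.pi * f * t) for f, a in FREQS)
        scan.append(clean + rng.gauss(0, noise))
    return scan

def average(scans):
    """Average repeated scans point by point so random noise partly cancels."""
    return [sum(values) / len(values) for values in zip(*scans)]

def dft_magnitudes(signal):
    """Magnitude of each frequency bin from 0 Hz up to half the sample rate."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(signal)))
        for k in range(n // 2)
    ]

def find_peaks(mags, rel_threshold=0.4):
    """Bins that are local maxima and at least rel_threshold of the tallest bin."""
    cutoff = rel_threshold * max(mags)
    return [
        k for k in range(1, len(mags) - 1)
        if mags[k] >= cutoff and mags[k] > mags[k - 1] and mags[k] > mags[k + 1]
    ]

rng = random.Random(0)
scans = [one_scan(rng) for _ in range(16)]
averaged = average(scans)
mags = dft_magnitudes(averaged)
peaks_hz = [k * SAMPLE_RATE // N for k in find_peaks(mags)]
print(peaks_hz)  # frequencies recovered from the averaged signal
```

The time-domain signal is a single wiggly decay, yet the transform separates it into the two frequencies that were mixed together, which is the role the FT plays in turning the raw signal into a spectrum.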
The Basics of X-Ray Crystallography
X-ray crystallography can reveal the structure of matter at the atomic level. The method works by exposing a crystallized sample of a molecule to X-rays, usually with an X-ray camera. The resulting image shows the pattern of the X-rays diffracted as they pass through the crystal. From this pattern, a computer program reconstructs a map of the 3D molecular structure.
However, the main drawback of this method is that the sample must be crystallizable. Biological macromolecules with high molecular weights can be difficult to crystallize; membrane proteins are especially challenging because they are difficult to solubilize.
The Basics of Designing RNA and DNA
Designing DNA and RNA is the basis for creating synthetic tissues. These are the main steps to designing gRNA and DNA for synthetic systems.
Guide RNA (gRNA) is one of the main types of RNA used for gene editing and synthetic biology. A guide RNA directs a Cas protein to the target DNA or RNA, where the target can then be deleted, altered, or used as an insertion site. Three things must be defined to design a gRNA:
- The target region or gene
- The version of the Cas9 protein to be used for gene editing (including which PAM sequences it recognizes)
- The promoter that will be used for in vitro or in vivo expression of the gRNA
A promoter is a region of DNA upstream of a gene where proteins (such as RNA polymerase and transcription factors) bind to initiate transcription of that gene.
The final step of transcription produces an RNA molecule that can be duplicated, altered, and implemented to create synthetic tissues and systems.
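The first two design inputs listed above, the target region and the PAM recognized by the chosen Cas9, are enough to list candidate gRNA sites. The sketch below assumes the commonly used Cas9 from Streptococcus pyogenes, which recognizes an NGG PAM directly after a 20-nucleotide target. The function name and the toy sequence are hypothetical, only the given strand is scanned, and the result is a list of candidates, not a ranking of how well each guide would work.

```python
import re

def find_grna_targets(dna, protospacer_len=20):
    """Return (start, protospacer, pam) for each site followed by an NGG PAM.

    Only the strand given is scanned; a fuller tool would also scan the
    reverse complement and score each candidate.
    """
    dna = dna.upper()
    # The lookahead (?=...) matches without consuming text, so overlapping
    # candidate sites are all reported.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % protospacer_len)
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(dna)]

# Toy target region (made up): a 20-nt site followed by the PAM "AGG".
target_region = "TT" + "GATTACAGATTACAGATTAC" + "AGG" + "TTT"

for start, protospacer, pam in find_grna_targets(target_region):
    print(start, protospacer, pam)  # prints: 2 GATTACAGATTACAGATTAC AGG
```

A different Cas9 variant would change only the PAM part of the pattern, which is why the variant has to be fixed before the guide is designed.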
DNA is a double helix of two polymer strands that carries the genetic instructions for the development, functioning, growth, and reproduction of organisms and many viruses. There are five main steps to designing DNA:
- Define project goals
- Design the sequence
- Synthesize oligos (short, synthetic strands of DNA)
- Assemble oligos into linear fragments and larger constructs (if applicable)
- Verify and test the sequence of the gene fragment or cloned product
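The middle steps of the list above (split a design into oligos, assemble them, verify the result) can be mimicked on a computer. This is a simplified sketch under stated assumptions: the 100-nucleotide design is random, the oligo length of 40 and overlap of 10 are arbitrary, the function names are hypothetical, and every oligo is taken from the same strand, whereas laboratory assembly methods typically alternate strands and depend on chemistry that is not modeled here.

```python
import random

def split_into_oligos(seq, oligo_len=40, overlap=10):
    """Cut a designed sequence into oligos that overlap their neighbors."""
    step = oligo_len - overlap
    oligos = []
    for start in range(0, len(seq), step):
        oligos.append(seq[start:start + oligo_len])
        if start + oligo_len >= len(seq):  # this oligo reaches the end
            break
    return oligos

def assemble(oligos, overlap=10):
    """Join oligos end to end, checking that each overlap matches exactly."""
    assembled = oligos[0]
    for oligo in oligos[1:]:
        if assembled[-overlap:] != oligo[:overlap]:
            raise ValueError("overlap mismatch between neighboring oligos")
        assembled += oligo[overlap:]
    return assembled

# Made-up 100-nt design, generated with a fixed seed so the run is repeatable.
rng = random.Random(1)
design = "".join(rng.choice("ACGT") for _ in range(100))

oligos = split_into_oligos(design)
product = assemble(oligos)

# Verification step: the assembled product must match the original design.
print(len(oligos), product == design)  # prints: 3 True
```

The final comparison plays the role of the verification step: in the lab this is done by sequencing the product and comparing it with the intended design.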
Overall, designing RNA and DNA strands is the basis of synthetic biology and creating artificial proteins and tissues.
So How Does Machine Learning Utilize Synthetically Created DNA/RNA?
Deep reinforcement learning can be used to train a network to design an entire RNA sequence, one nucleotide at a time, for a specified target structure. Once the target structure is given to the program, the reinforcement learning model proposes an RNA sequence intended to fold into that structure and so carry out the associated function. Using machine learning to design RNA sequences is beneficial because it allows scientists to examine and test synthetic biological systems in silico before implementing them in a person. Deep learning models can assist with the sequence design steps listed in the previous section, drawing on structural data from NMR and X-ray crystallography.
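To make the idea of sequential design concrete, the sketch below builds a sequence one nucleotide at a time for a target written in dot-bracket notation, where matching parentheses mark paired positions and dots mark unpaired ones. It is not reinforcement learning: the hand-written rule stands in for the policy a trained network would supply, and the reward is a simplified assumption (the fraction of target pairs whose two nucleotides can pair), whereas a real system would fold the candidate and compare the predicted structure with the target. All names are hypothetical.

```python
import random

# Nucleotide pairs allowed to bond: Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def parse_pairs(structure):
    """Map every paired position to its partner in a dot-bracket string."""
    stack, partner = [], {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            j = stack.pop()
            partner[i], partner[j] = j, i
    if stack:
        raise ValueError("unbalanced structure")
    return partner

def reward(sequence, structure):
    """Fraction of target pairs whose two nucleotides are able to pair."""
    pairs = [(i, j) for i, j in parse_pairs(structure).items() if i < j]
    if not pairs:
        return 1.0
    return sum((sequence[i], sequence[j]) in PAIRS for i, j in pairs) / len(pairs)

def design_sequentially(structure, choose):
    """Build the sequence left to right, asking `choose` for each nucleotide."""
    partner = parse_pairs(structure)
    seq = []
    for i in range(len(structure)):
        seq.append(choose(i, seq, partner))
    return "".join(seq)

def complement_policy(rng):
    """Hand-written stand-in for a learned policy."""
    def choose(i, seq, partner):
        j = partner.get(i)
        if j is not None and j < i:
            return COMPLEMENT[seq[j]]  # close the pair opened earlier
        return rng.choice("ACGU")      # otherwise pick any nucleotide
    return choose

target = "((((...))))"  # a small hairpin: four pairs around a three-base loop
designed = design_sequentially(target, complement_policy(random.Random(0)))
print(designed, reward(designed, target))
```

In a reinforcement learning setup, the reward would be fed back to adjust the policy after each attempt; here the policy is fixed, so the example only shows the design loop and the scoring.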
Case Study | Reinforcement Learning for RNA Design
In June 2018, researchers at Stanford University used reinforcement learning to train a machine learning model for RNA design. The function of an RNA molecule is largely determined by the structure it folds into, which is in turn determined by the nucleotides that make it up. Many current innovations in the use of artificial intelligence in synthetic biology focus on modeling RNA, since it has much greater structural flexibility than DNA and more diverse functions (similar to proteins). Designing an RNA molecule to perform a specific function requires solving the inverse folding problem for RNA: given a target structure, the computational model should produce an RNA sequence that folds into that structure. The computational model was able to find sequences for target structures by using an in silico structure prediction technique to check how candidate sequences fold.
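Checking whether a candidate sequence “folds into the target structure” requires some way to predict a fold on a computer. The sketch below uses a Nussinov-style algorithm, a classic textbook method that simply maximizes the number of base pairs. It is a toy stand-in chosen for brevity, not the prediction technique used in the study, and practical tools rely on energy models instead. The function name, the minimum loop size of 3, and the example sequence are assumptions.

```python
# Nucleotide pairs allowed to bond: Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def fold(seq, min_loop=3):
    """Predict a structure (dot-bracket) that maximizes the number of base pairs."""
    n = len(seq)
    # best[i][j] = most pairs that can form within seq[i..j]
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # option 1: position j stays unpaired
            for k in range(i, j - min_loop):  # option 2: j pairs with some k
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    # Trace back through the table to recover one best structure.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:  # too short to contain a pair
            continue
        if best[i][j] == best[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in PAIRS:
                left = best[i][k - 1] if k > i else 0
                if left + 1 + best[k + 1][j - 1] == best[i][j]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return "".join(structure)

target = "((((....))))"
candidate = "GGGGAAAACCCC"  # made-up design for the hairpin target
predicted = fold(candidate)
print(predicted, predicted == target)  # prints: ((((....)))) True
```

A design loop would call a predictor like this on every candidate and treat a match with the target as success, which is the check the inverse folding problem depends on.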
Conclusion + Further Thoughts
AI has the potential to be a disruptive and innovative force in the field of synthetic biology. At the same time, synthetic biology has become a prominent topic in bioethics. One of the main ethical concerns is that it may result in the creation of organisms that fall between living beings and machines. In the future, regulations may be needed to define the point at which genetic editing of a living organism should stop. Despite these ethical concerns, synthetic biology has the potential to save the lives of millions of patients, today and in the future.
Resources
- https://www.researchgate.net/publication/354960266_Machine_Learning_Techniques_for_Personalised_Medicine_Approaches_in_Immune-Mediated_Chronic_Inflammatory_Diseases_Applications_and_Challenges
- https://www.nanalysis.com/nmready-blog/2019/6/26/what-is-nmr-spectrography-and-how-does-it-work
- https://mp.bmj.com/content/53/1/8
- https://www.news-medical.net/life-sciences/What-is-X-ray-Crystallography.aspx
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5014588/
- https://openreview.net/pdf?id=ByfyHh05tQ
- https://www.azenta.com/blog/beginner-guide-artificial-dna-synthesis
- https://www.thermofisher.com/blog/behindthebench/what-is-an-oligo/
- https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006176