PolyAMiner-Bulk: A new deep-learning tool developed to analyze RNA dynamics on a large scale


A team of computational biologists led by Dr. Hari Krishna Yalamanchili, an assistant professor at Baylor College of Medicine and an investigator at the Jan and Dan Duncan Neurological Research Institute (Duncan NRI) at Texas Children's Hospital, has developed a cutting-edge deep-learning algorithm called PolyAMiner-Bulk to decode the alternative polyadenylation status of messenger RNAs (mRNAs) from a variety of bulk RNA-sequencing datasets. The study was published in Cell Reports Methods.

According to the central dogma of molecular biology, DNA in a gene is transcribed into mRNA, which is then translated into protein. Alternative polyadenylation (APA) is a post-transcriptional regulatory mechanism that produces multiple mRNA molecules of different lengths from the same gene. There is increasing recognition of the pivotal role APA plays in regulating gene expression in fundamental cellular processes, and of how its misregulation results in several human diseases, including neurodegenerative disorders and cancer.

PolyAMiner-Bulk is an extension of an earlier tool, PolyA-miner, which was developed by Dr. Yalamanchili in 2020 to analyze sequencing datasets specifically designed to identify alternative mRNA isoforms (e.g., PAC-Seq, PAS-Seq, 3’READS).

“Since its launch, PolyA-miner has proved extremely useful in extracting information about RNA dynamics from APA-specific datasets. However, we realized these newer APA-aware datasets only represent a fraction of all the currently available transcriptomic data, most of which is bulk RNA sequencing data,” Dr. Yalamanchili said. “We realized there was an urgent need to extend PolyAMiner technology to leverage existing bulk RNA-seq datasets to decipher APA dynamics accurately and precisely.”

The new tool is freely available and will allow RNA researchers around the world to study alternative polyadenylation in large datasets generated by data consortia using bulk RNA-sequencing technology. Examples of large datasets that can be analyzed with this new tool include the Religious Orders Study/Memory and Aging Project (ROSMAP), which contains extensive bulk RNA-sequencing data of the human frontal cortex for aging and Alzheimer’s disease; The Cancer Genome Atlas (TCGA), which contains over 11,000 samples from control and primary cancer disease populations spanning 33 cancer types; and the Answer ALS data portal, which contains over 1,200 samples from control and neurodegenerative disease populations.

This is the first tool with an attention-based machine-learning architecture to identify different mRNA ends. Unlike approaches that scan for fixed patterns, attention-based models learn how strongly each position in a sequence should influence every other position. As a result, PolyAMiner-Bulk does not rely on the presence of predetermined motifs; instead, it models RNA as a language by capturing the hidden grammar and semantic dependency between multiple RNA sequence features.
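To make the idea of treating RNA as a language more concrete, the following is a minimal sketch of scaled dot-product self-attention applied to a short RNA sequence. It is not PolyAMiner-Bulk's implementation: the function names, the 3-nucleotide "words," the toy sequence, and the count-based embedding are all assumptions chosen for illustration, and a real model would learn its embeddings and projection weights from data.

```python
import math

def kmer_tokens(seq, k=3):
    """Split an RNA sequence into overlapping k-mers (the 'words' of the sequence)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def embed(token):
    """Toy embedding (assumption): counts of A, C, G, U in the token.
    A trained model would use learned embeddings instead."""
    return [float(token.count(base)) for base in "ACGU"]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = vectors.
    Learned projection matrices are omitted to keep the sketch short."""
    dim = len(vectors[0])
    weights, outputs = [], []
    for query in vectors:
        # Similarity of this token to every token in the sequence.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        row = softmax(scores)
        weights.append(row)
        # Each output is a weighted mix of all token vectors.
        outputs.append(
            [sum(w * vec[i] for w, vec in zip(row, vectors)) for i in range(dim)]
        )
    return weights, outputs

# Arbitrary toy sequence, used only to exercise the functions.
tokens = kmer_tokens("AAUAAAGCUGU")
weights, outputs = self_attention([embed(t) for t in tokens])
```

Each row of `weights` describes how much one k-mer draws on every other k-mer in the sequence, which is the sense in which attention captures dependencies between sequence features without a predefined motif list.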

“PolyAMiner-Bulk is a significant leap forward in alternative polyadenylation research, holding the potential to broaden our understanding of RNA dynamics and their implications in various diseases,” Dr. Yalamanchili said.


Others involved in the study were Venkata Soumith Jonnakuti, Eric J. Wagner, Mirjana Maletić-Savatić, and Zhandong Liu. They are affiliated with one or more of the following institutions: the Jan and Dan Duncan Neurological Research Institute at Texas Children's Hospital, Baylor College of Medicine, University of Rochester School of Medicine and Dentistry, and the USDA/ARS Children’s Nutrition Research Center at Baylor College of Medicine. The study was supported by the United States Department of Agriculture (USDA/ARS), the NRI Zoghbi Scholar Award, the Gulf Coast Consortia, and the National Library of Medicine Training Program in Biomedical Informatics and Data Science.