Orthrus: Towards Evolutionary and Functional RNA Foundation Models

Philip Fradkin1, 2, *, Ruian Shi1, 2, 3, *, Keren Isaev4, 5,
Brendan Frey1, 2, 6, Quaid Morris3, ‡, Leo J. Lee1, 6, ‡, Bo Wang1, 2, 7, ‡
1. Vector Institute 2. DCS, University of Toronto 3. CSBP, Sloan Kettering Institute
4. NY Genome Center 5. Systems Biology, Columbia
6. ECE, University of Toronto 7. University Health Network

*Indicates Equal Contribution
‡Indicates Equal Advising

Abstract

In the face of rapidly accumulating genomic data, our understanding of the RNA regulatory code remains incomplete. Pre-trained genomic foundation models offer an avenue to adapt learned RNA representations to biological prediction tasks. However, existing genomic foundation models are trained using strategies borrowed from the textual and visual domains, such as masked language modelling or next token prediction, that do not leverage biological domain knowledge. Here, we introduce Orthrus, a Mamba-based RNA foundation model pre-trained using a novel self-supervised contrastive learning objective with biological augmentations. Orthrus is trained by maximizing embedding similarity between curated pairs of RNA transcripts, where pairs are formed from splice isoforms of 10 model organisms and transcripts from orthologous genes in 400+ mammalian species from the Zoonomia Project. This training objective results in a latent representation that clusters RNA sequences with functional and evolutionary similarities. We find that the generalized mature RNA isoform representations learned by Orthrus significantly outperform those of existing genomic foundation models on five mRNA property prediction tasks, and require only a fraction of the fine-tuning data.

Towards Biologically Inspired Genomic FMs

Many recently released genomic foundation models (FMs) have illustrated the promise of self-supervised deep learning for DNA and RNA representation. Leveraging the ever-increasing amount of sequencing data and recent architectural advances, genomic FMs aim to model the complex regulatory code governing the central dogma. Successful genomic FMs offer low-cost, data-efficient prediction of many biological processes, unlocking potential advances in biological discovery and drug design.

Despite promising early results, we argue that the self-supervised training objectives commonly used by existing genomic FMs (i.e. next token prediction and masked language modelling) are ill-equipped to handle the unique characteristics of genomic sequences. Recent estimates suggest that 90%-95% of nucleotide positions are not under evolutionary constraint, meaning mutations to these nucleotides would not impact fitness. Pre-training a genomic FM by reconstructing these nucleotides essentially conditions it to predict noise, resulting in poor representations for several relevant downstream tasks. In our work, we tackle this challenge by designing a biologically inspired self-supervised objective that we use to train a powerful new RNA foundation model.

Orthrus: An RNA foundation model

Orthrus is a mythological two-headed dog with a snake tail.

We're excited to introduce Orthrus, an RNA foundation model built on the Mamba architecture and trained with a novel, biologically inspired contrastive learning objective that captures more than 870 million interactions between over 49 million unique RNA transcripts. Orthrus is pre-trained to associate RNA transcripts that are related through function and evolution. This biologically motivated representation allows Orthrus to perform strongly when predicting the functional and regulatory properties of RNA. We find that fitting simple linear models on Orthrus embeddings yields performance approaching that of task-specific supervised models. Orthrus is also highly data efficient, achieving high accuracy relative to supervised approaches when fine-tuned on as few as 45 labelled samples. Below, we provide an animated visualization of the Orthrus foundation model. The code and weights for Orthrus can be found at: [ Code ] [ Weights ].

Orthrus Contrastive Loss


Left: Visualization of RNA transcript embeddings throughout Orthrus training. Transcripts are randomly distributed in embedding space prior to training. The Orthrus loss function pulls transcripts related through splicing and orthology closer, while pushing unrelated transcripts apart. Right: Orthrus is trained using the decoupled contrastive learning (DCL) loss.

Orthrus is trained using contrastive learning, which constructs a representation space for RNA transcripts by maximizing embedding similarity between a given transcript and a transcript randomly sampled from its set of functionally or evolutionarily related transcripts. The embedding similarity between all unrelated transcripts is simultaneously minimized. We construct our contrastive dataset with a simple hypothesis: RNA transcripts related by alternative splicing or orthology should be closer in latent space than those that are not. To this end, we collect these positive pair interactions using splicing annotations from GENCODE and RefSeq, and ortholog alignments from the Zoonomia Project. In all, we obtain 870 million relationships that serve as positive pairs for contrastive learning. The Orthrus model is trained using the DCL loss on the mean-pooled representations of RNA transcripts, which are encoded using the Mamba architecture.
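To make the objective concrete, below is a minimal PyTorch sketch of a DCL-style contrastive loss applied to a batch of positive pairs of mean-pooled transcript embeddings. The batching scheme, temperature, and treatment of negatives are illustrative assumptions rather than the exact Orthrus implementation.

```python
import torch
import torch.nn.functional as F

def dcl_loss(z_a, z_b, temperature=0.1):
    """Decoupled contrastive loss over a batch of positive pairs.

    z_a, z_b: (batch, dim) mean-pooled transcript embeddings, where
    (z_a[i], z_b[i]) is a positive pair (e.g. two splice isoforms or
    orthologous transcripts) and all other pairings act as negatives.
    """
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    batch = z_a.size(0)

    # Cosine similarity between every anchor and every candidate.
    sim = z_a @ z_b.t() / temperature          # (batch, batch)
    pos = sim.diag()                           # positive-pair similarities

    # DCL removes the positive term from the denominator: the log-sum-exp
    # runs over negatives only (off-diagonal entries).
    diag_mask = torch.eye(batch, dtype=torch.bool, device=sim.device)
    neg = sim.masked_fill(diag_mask, float("-inf"))
    return (-pos + torch.logsumexp(neg, dim=1)).mean()
```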

Orthrus for RNA Property and Function Prediction


The Orthrus foundation model is trained to accurately represent both the regulatory and functional properties of mature RNA transcripts. We evaluate Orthrus and other genomic FMs on the following prediction tasks:

  • RNA Half-Life: This regression task measures the decay rate of mRNA in cells, an important cellular property because of its role in regulating protein expression. We use the Agarwal and Kelley (2022) dataset for this benchmark, which consists of 10,432 human and 11,008 mouse RNA sequences with corresponding measurements.
  • Mean Ribosome Load: This task estimates the translational efficiency of an mRNA, measured as the number of ribosomes translating a single mRNA molecule at a point in time. Accurate mean ribosome load (MRL) measurement offers insights into the efficiency of protein translation, a key process in cellular function. The dataset, derived from the HP5 workflow, captures this metric across 12,459 mRNA isoforms from 7,815 genes.
  • Protein Subcellular Localization: Protein function is often linked to subcellular location, which can be determined from immunofluorescently stained cells. We downloaded a dataset of 10,409 genes whose protein localizations were determined by the Human Protein Atlas, and benchmark classification among the 12 most common localizations for the canonical isoform of each gene, as defined by the APPRIS database.
  • Protein GO Term Classification: Gene Ontology (GO) terms are a hierarchical classification system used for assigning function to genes and their products. The GO hierarchy allows fine-grained annotation of function, with broader terms at the top and increasingly specific terms toward the bottom. To annotate genes with GO terms, we subset GO classes to those three levels from the root and label all available genes (a minimal traversal sketch follows this list).
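As a rough illustration of the GO subsetting step, the sketch below collects the terms first reached three levels from the root by breadth-first traversal. The `children` mapping is a hypothetical parent-to-children dictionary built from the GO DAG; the actual pipeline and its handling of multi-parent terms may differ.

```python
def go_terms_at_depth(children, root, depth=3):
    """Return GO terms first reached `depth` levels from the root.

    children: hypothetical mapping {term_id: [child_term_ids]} derived from
    the GO DAG (terms can have multiple parents, so visited terms are tracked).
    """
    frontier, seen = {root}, {root}
    for _ in range(depth):
        nxt = set()
        for term in frontier:
            for child in children.get(term, []):
                if child not in seen:
                    seen.add(child)
                    nxt.add(child)
        frontier = nxt
    return frontier

# Toy ontology rooted at "GO:0008150" (biological_process).
toy = {"GO:0008150": ["GO:A"], "GO:A": ["GO:B"], "GO:B": ["GO:C", "GO:D"]}
print(go_terms_at_depth(toy, "GO:0008150"))  # {'GO:C', 'GO:D'}
```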

We benchmark the performance of Orthrus against several genomic foundation models, as well as supervised task-specific models, in both linear probing and fine-tuning settings. For linear probing, we simply train a linear model on top of fixed FM embeddings, while for fine-tuning we perform end-to-end training on the labelled prediction dataset. Overall, we see that Orthrus outperforms existing genomic FMs while approaching the performance of supervised models designed specifically for RNA property prediction, such as Saluki. In our experiments, simply fitting a linear model on frozen Orthrus embeddings achieves the performance of tuned deep neural networks. Orthrus also shows strong data efficiency when fine-tuned, highlighting its potential for extrapolating from small experimental datasets.
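For reference, linear probing amounts to fitting a simple regularized linear model on frozen embeddings. The sketch below shows this for a regression task such as RNA half-life; the file names and the choice of ridge regression are illustrative assumptions, not the exact evaluation protocol.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# Hypothetical precomputed, frozen Orthrus embeddings (one mean-pooled vector
# per transcript) and matching RNA half-life measurements.
X = np.load("orthrus_embeddings.npy")   # (n_transcripts, embed_dim)
y = np.load("half_life_targets.npy")    # (n_transcripts,)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Linear probe: ridge regression on top of the fixed foundation-model embeddings.
probe = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X_train, y_train)
r, _ = pearsonr(probe.predict(X_test), y_test)
print(f"Test Pearson r: {r:.3f}")
```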


Benchmarking linear probing performance on RNA property prediction tasks for self-supervised genomic foundation models. Individual bars represent the performance of foundation model variants, which typically differ in parameter count and pre-training dataset. Error bars show 95% confidence intervals, constructed using 10 runs with randomized data splits. The grey dashed line indicates the performance of the fully supervised Saluki method trained with access to labels.

Select Additional Orthrus Results

Conclusions

We hope that Orthrus, trained with an alternative self-supervised objective, will inspire other researchers to further innovate on inductive biases that are specifically helpful for genomic sequence modelling. Whether it is a unique, constraint-aware tokenization procedure or an alternative masking methodology, we believe there is ample space for innovation. A significant amount of information can be learned from evolutionarily related sequences, which is what we aim to exploit in this work. We believe that this loss can be combined with reconstruction-based losses such as masked language modelling to further enhance the quality of the learned representations, improve RNA property prediction, and open the door to generative capabilities. Improving RNA property prediction and representation quality has significant implications for therapeutic and diagnostic domains. We're only just beginning to uncover the role of mRNA in cellular function and its potential as a therapeutic modality.

Video Presentation

BibTeX

@article{orthrus_fradkin_shi_2024,
  title = {Orthrus: Towards Evolutionary and Functional RNA Foundation Models},
  url = {http://dx.doi.org/10.1101/2024.10.10.617658},
  DOI = {10.1101/2024.10.10.617658},
  publisher = {Cold Spring Harbor Laboratory},
  author = {Fradkin, Philip and Shi, Ruian and Isaev, Keren and Frey, Brendan J and Morris, Quaid and Lee, Leo J and Wang, Bo},
  year = {2024},
  month = oct 
}