← Philip Fradkin

PhD in the Shifting Sands of 2020s Computational Biology

25 Apr 2026

I

A PhD committee meeting is sort of like a roast session with the best intentions: it exists to make sure the PhD student's research project is on track. Halfway through my PhD, I was presenting to my committee on a foundation model for predicting RNA properties. One of my committee members, a computational biology professor, started quizzing me on the mechanisms behind mRNA decay. I could answer parts of it (I had recently read up on the role of m6A and exon junction complexes).

As he kept grilling me, I realized that we were talking about different approaches and ways of thinking, while still ostensibly both studying computational biology. He comes from a research tradition that values understanding biological mechanisms, validating against known biology, and diving deep into narratives around observational and experimental data. I was interested in the same problem from a different direction, one that embraces the contemporary paradigm of deep learning. Using RNA half-life as an example: could you learn enough about RNA from unlabeled sequence data that a model could predict properties like half-life, especially when there wasn't much labeled experimental data to work with? We were both studying computational biology, even similar molecular mechanisms, but our philosophies of approaching the problem were completely different. As a PhD student you're typically at the mercy of the committee, and I didn't have much leeway to push back. For unrelated reasons, he stepped off the committee and I was able to continue the foray into self-supervised learning for genomics, but it's funny to think my work on self-supervised modeling could have ended there.

The Vector crew

II

During my PhD I witnessed two paradigm shifts: (1) foundation models arriving in biology, and (2) the use of LLMs as research tools. Thomas Kuhn wrote about this in The Structure of Scientific Revolutions: scientific discovery alternates between "normal science" and revolutionary periods where the rules get rewritten. During these revolutionary breaks, there's a lot of confusion about what constitutes good work. What counts as a valid question? How do you design experiments rigorous enough to reject a hypothesis? Retrospectively these trends get rewritten as natural progressions in the evolution of the field, and in hindsight it all makes sense. But when you're a graduate student on the inside, you strongly experience the competing incentives: be innovative enough to matter in the eyes of the scientific community, but rigorous and thorough enough to survive its scrutiny. The past five years of my graduate work have been defined by this tension.

The first revolutionary period, foundation modeling, consisted largely of people copying techniques from computer vision (CV) and natural language processing (NLP) into genomics. Honestly, it made a lot of sense, and there was a lot of excitement in the community following the wild and manic success of these models in NLP. A lot of the thinking was: if transformers could learn language, maybe we could take what worked there and apply it directly to finally decipher DNA and other discrete biological "languages". The early work consisted of porting architectures over without rethinking whether the assumptions translated. In the first year of my PhD, I trained a BERT model on the reference genome, and then, feeling unsatisfied and seeing RoBERTa come out, trained that too. Engaging in an ML researcher's favorite activity, over a two-week stretch I would refresh Weights & Biases and watch the loss go down. At the end of training I was left with a checkpoint, and I realized I wasn't actually sure what to do with it. I'm grateful for this early lesson: it really shaped my thinking on how science should be done. In the absence of an ImageNet equivalent for genomics, it wasn't clear how to validate whether these models were actually learning.
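The pretraining recipe was genuinely that direct a port. A toy sketch of the BERT-style masked objective applied to DNA (illustrative only, not my actual training code; the vocabulary, masking rate, and function names here are my own stand-ins):

```python
import numpy as np

# BERT-style masked pretraining, ported to DNA: hide ~15% of nucleotide
# tokens and train the model to recover them from the surrounding sequence.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "[MASK]": 4}

def mask_sequence(seq, mask_prob=0.15, rng=None):
    """Return (input_ids, labels). labels are -100 at unmasked positions
    (ignored by the loss) and the original token id at masked ones."""
    if rng is None:
        rng = np.random.default_rng(0)
    ids = np.array([VOCAB[b] for b in seq])
    labels = np.full_like(ids, -100)
    masked = rng.random(len(ids)) < mask_prob
    labels[masked] = ids[masked]          # loss is computed only here
    ids[masked] = VOCAB["[MASK]"]
    return ids, labels

inp, lab = mask_sequence("ACGTACGTACGTACGTACGT")
```

The objective transfers mechanically; the open question back then was what a low masked-token loss on a genome actually certifies, absent an ImageNet-style downstream benchmark.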

My conclusion from this whole shebang was that to make new methods stick, you have to ground them in what already works. If you can show that your approach improves on a contribution people already trust, you're speaking their language and can win them over. Trust is the true scientific currency.

III

I had a hunch that the thing that would really move the needle wasn't which architecture to use (although that turned out to matter too, with S4 and Mamba handling long sequences well). It was that there was a mismatch between the training objectives people were using and the actual biology. This came from working with variant effect data: ClinVar, massively parallel reporter assays. A lot of positions in the genome just weren't that sensitive to sequence changes, and you could see an accumulation of non-pathogenic mutations across the reference genome. I started searching for different losses to play around with that could make use of the things that make self-supervised learning actually work: diverse, large, quality data.

I came from a splicing lab. Most people who think about splicing think about disease: about the cases where it goes wrong. Something like 15% of genetic diseases are caused by missplicing, and most of those mutations hit the core splice site dinucleotide. But if you think about splicing as a diversity-generating mechanism instead, you’d expect it to preserve the functional properties of the underlying sequence. I was excited about contrastive learning at the time, and this seemed like exactly the right setting for it.

Nowadays this view is more accepted, with people thinking about how splicing could be a way to regulate gene expression. But at the time, it wasn't how most splicing people thought, and I got pushback early on. My partner was doing her PhD at Columbia at the time, in a lab more exposed to wet-lab biology. I was visiting for a couple of weeks since we were long distance. Over lunch, I was telling one of the postdocs about Orthrus, and being a little too facetious, I said something like: the whole idea was that splicing isn't actually that important. She looked at me and said, "Surely you can't mean that splicing is irrelevant." I got a little stressed, as she was welcoming me into her lab, and I wanted to be respectful of her research. And in reality, I actually agreed with her: splicing is clearly critical in certain contexts, like development and neuronal tissues. My actual argument, regarding the model's inductive bias, was a bit more nuanced: treating isoforms as functionally similar turns out to be an extremely effective training signal, even if it's not literally true in every case. To restore my reputation, I went out and got a cake with a custom engraving that said "I love splicing." I'm glad we were able to mend the misunderstanding, and we've actually talked about collaborating since then.

The I love splicing cake
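The inductive bias behind that argument can be sketched as a standard InfoNCE contrastive loss, where two isoforms of the same gene form a positive pair and isoforms of other genes in the batch serve as negatives. This is a minimal illustration in the spirit of the idea, not the actual Orthrus implementation; all names here are mine:

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE loss: z1[i] and z2[i] are embeddings of two isoforms of
    gene i; all other rows act as in-batch negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature              # cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # diagonal = positive pairs

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 32))
loss_aligned = info_nce(z, z)                     # isoform pairs agree
loss_random = info_nce(z, rng.normal(size=(8, 32)))  # no shared structure
```

Minimizing this pulls isoforms of the same gene together in embedding space, which is exactly the "isoforms are functionally similar" assumption: effective as a training signal even where it is biologically imperfect.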

IV

I did my PhD at the University of Toronto and the Vector Institute. One of the best things about Vector is that it’s extremely collaborative. You get to hang out with people from all sorts of different subfields: information theory, optimization, computer vision, all of which turn out to be relevant in weird ways to biology.

As language models were getting better, around 2024 to 2025, our lunchtime conversations about the devaluation of intellectual labor took on a more anxious tone. They would usually end with us sitting in the lunchroom, slightly spiraling, sharing stories of how we were using LLMs for our research. On Friday nights we'd go to the local pub (shout out to Ronnie's) and talk about whether it even makes sense to do a PhD anymore.

Friday nights at Ronnie's

Five years is a long time to become an expert in something when an LLM can extract the most relevant and latest ideas from papers and catch you up to the frontier of a field without your having to put in all that time. Of course, the point of a PhD isn't the knowledge you gain but the ability to push the boundary of what is known. Still, in the professional world you often get paid for your understanding of a particular subject, which is exactly where LLMs are helpful.

The thing that I'm excited about now is that LLMs feel like an incredible tool for speeding up the process of scientific research. Looking back, I'm not sure how I got anything done before these tools existed. And working at the intersection of fields right now feels so powerful, because these tools let you pull in ideas from everywhere and prototype so fast.

In biology specifically, language models are missing something fundamental. They have a very weak conception of things like proteins, genetic regions, transcription factors, and small molecules, because language is just an inherently poor way to describe them. You can write about a protein's function, but you can't capture its shape or its binding dynamics in a sentence. The other piece is that so much of biology is experimental, and a lot of the knowledge about what actually works in the lab is tribal, passed down within labs: which assays are reliable, how to get a protocol to succeed, the little tricks that never make it into a methods section. Self-driving labs might eventually solve that challenge, but for now, LLMs are useful in biology while also being clearly incomplete.

V

During periods of upheaval, good scientists need to do two things. First, suspend criticism long enough to understand what a new paradigm enables and what the vision is for making it useful. Second, understand the strengths of the existing systems well enough to know how to integrate the old and the new. Most people do only one. Some hold allegiance to the old paradigm and call out the shoddy science, which is often fair, but miss the real benefits that come along with the new. Others just apply the latest techniques without thinking deeply about the domain they're working in. These disagreements regularly spill onto Twitter, where they turn into flame wars that are honestly very entertaining to watch but probably counterproductive and fracturing for the scientific community.

The two paradigm shifts I lived through are starting to converge. On one side, there are foundation models for specialized domains that can reason about biological sequences and structures directly. On the other, there are LLM agents that can peruse all of the published research and make scientific recommendations. I’m excited about integrating the two: giving language model agents a way to reason about biological entities themselves, not just the text we write about them. That would mean direct improvements in foundation models translate into agents that can better contextualize scientific protocols and better interpret the results of experiments. That’s the future I want to build toward.
