<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://philechka.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://philechka.com/" rel="alternate" type="text/html" /><updated>2026-04-26T02:09:27+00:00</updated><id>https://philechka.com/feed.xml</id><title type="html">Philip Fradkin</title><subtitle>Personal site of Philip Fradkin — PhD candidate at the University of Toronto working on machine learning for genomics.</subtitle><author><name>Philip Fradkin</name><email>phil.fradkin@gmail.com</email></author><entry><title type="html">PhD in the Shifting Sands of 2020s Computational Biology</title><link href="https://philechka.com/2026/04/25/reflections-from-my-phd.html" rel="alternate" type="text/html" title="PhD in the Shifting Sands of 2020s Computational Biology" /><published>2026-04-25T00:00:00+00:00</published><updated>2026-04-25T00:00:00+00:00</updated><id>https://philechka.com/2026/04/25/reflections-from-my-phd</id><content type="html" xml:base="https://philechka.com/2026/04/25/reflections-from-my-phd.html"><![CDATA[<h2 id="i">I</h2>

<p>A PhD committee meeting is sort of like a roast session with the best intentions: its purpose is to make sure the PhD student’s research project is on track. Halfway through my PhD, I was presenting to my committee on a foundation model for predicting RNA properties. One of my committee members, a computational biology professor, started quizzing me on the mechanisms behind mRNA decay. I could answer parts of it (I had recently read up on the role of m6A and exon junction complexes).</p>

<p>As he kept grilling me, I realized that we were talking about different approaches and ways of thinking, while still ostensibly both studying computational biology. He comes from a research tradition that values understanding biological mechanisms, validating against known biology, and diving deep into narratives around observational and experimental data. I was interested in the same problem from a different direction, one that embraces the contemporary paradigm of deep learning. Using RNA half-life as an example: could you learn enough about RNA from unlabeled sequence data that a model could predict properties like half-life, especially when there wasn’t much labeled experimental data to work with? We were both studying computational biology, even similar molecular mechanisms, but our philosophies for approaching the problem were completely different. As a PhD student you’re typically at the mercy of the committee, and I didn’t have much leeway to push back. For unrelated reasons, he stepped off the committee and I was able to continue my foray into self-supervised learning for genomics, but it’s funny to think my work on self-supervised modeling could have ended there.</p>

<p><img src="/images/vector_crew_header.jpg" alt="The Vector crew" width="700" /></p>

<h2 id="ii">II</h2>

<p>During my PhD I was able to witness two paradigm shifts: (1) foundation models arriving in biology, and (2) the use of LLMs as research tools. Thomas Kuhn wrote about this in <em>The Structure of Scientific Revolutions</em>: scientific discovery alternates between “normal science” and revolutionary periods where the rules get rewritten. During these revolutionary breaks, there’s a lot of confusion about what constitutes good work. What counts as a valid question? How do you design experiments that can actually reject a hypothesis? Retrospectively these trends get rewritten as natural progressions in the evolution of the field, and in hindsight they make sense. But when you’re a graduate student on the inside, you strongly experience the competing incentives: be innovative enough to matter in the eyes of the scientific community, but rigorous and thorough enough to survive the scrutiny. The past five years of my graduate school have been defined by this tension.</p>

<p>The first revolutionary period of foundation modeling consisted largely of people copying techniques from computer vision (CV) and natural language processing (NLP) into genomics. Honestly, it made a lot of sense, and there was a lot of excitement in the community following the wild and manic success of these models in NLP. A lot of the thinking was: if transformers could learn language, maybe we could plug in what had worked there and directly adopt it to finally decipher DNA and other discrete biological “languages”. The early work consisted of porting architectures over without rethinking whether the assumptions translated. In the first year of my PhD, I trained a BERT model on the reference genome, and then, feeling unsatisfied and seeing RoBERTa come out, trained that too. Engaging in an ML researcher’s favorite activity, over a two-week stretch I would refresh Weights &amp; Biases and watch the loss go down. At the end of training I was left with a checkpoint, and I realized I wasn’t actually sure what to do with it. I’m grateful for this early lesson: it really shaped my thinking on how science should be done. In the absence of an ImageNet equivalent for genomics, it wasn’t clear how to validate whether these models were actually learning.</p>
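<p>For concreteness, the pretraining objective behind those early models was just masked language modeling applied to tokenized DNA. Below is a minimal sketch of the masking step, assuming a PyTorch-style setup; the function name, tokenization scheme, and 15% masking rate are illustrative assumptions rather than the exact configuration I used.</p>

<pre><code class="language-python">import torch

def mask_dna_tokens(token_ids, mask_id, mask_prob=0.15):
    """BERT-style masking over tokenized DNA (nucleotide or k-mer tokens).

    Returns the corrupted input and the labels, with -100 marking
    positions that should not contribute to the loss."""
    labels = token_ids.clone()
    mask = torch.rand(token_ids.shape) &lt; mask_prob
    labels[~mask] = -100             # loss is computed only on masked positions
    corrupted = token_ids.clone()
    corrupted[mask] = mask_id        # replace masked positions with the [MASK] token
    return corrupted, labels

# Training then asks the model to reconstruct the original tokens, e.g.:
# loss = torch.nn.functional.cross_entropy(
#     logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100)
</code></pre>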

<p>My conclusion from this whole shebang was that to make new methods stick, you have to ground them in what already works. If you can show that your approach improves on a contribution people already trust, you’re speaking their language and can win their trust. Trust is the true scientific currency.</p>

<h2 id="iii">III</h2>

<p>I had a hunch that the thing that would really move the needle wasn’t which architecture to use, although that turned out to matter too, with S4 and Mamba handling long sequences well. It was that there was a mismatch between the training objectives people were using and the actual biology. This came from working with variant effect data: ClinVar, massively parallel reporter assays. A lot of positions in the genome just weren’t that sensitive to sequence changes, and you could see an accumulation of non-pathogenic mutations across the reference genome. I started searching for different losses to play around with, ones that could make use of the things that make self-supervised learning actually work: diverse, large, quality data.</p>

<p>I came from a splicing lab. Most people who think about splicing think about disease: about the cases where it goes wrong. Something like 15% of genetic diseases are caused by missplicing, and most of those mutations hit the core splice site dinucleotide. But if you think about splicing as a diversity-generating mechanism instead, you’d expect it to preserve the functional properties of the underlying sequence. I was excited about contrastive learning at the time, and this seemed like exactly the right setting for it.</p>
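<p>To make that concrete, here is a minimal sketch of the kind of contrastive objective I have in mind: embeddings of two isoforms from the same gene are treated as a positive pair, with the other genes in the batch acting as negatives. The function name, batching scheme, and temperature are illustrative assumptions, not the exact Orthrus loss.</p>

<pre><code class="language-python">import torch
import torch.nn.functional as F

def isoform_contrastive_loss(anchor_emb, positive_emb, temperature=0.1):
    """InfoNCE-style loss: each row of anchor_emb and the matching row of
    positive_emb embed two isoforms of the same gene; rows from other
    genes in the batch serve as negatives."""
    anchor = F.normalize(anchor_emb, dim=-1)       # (batch, dim)
    positive = F.normalize(positive_emb, dim=-1)   # (batch, dim)
    logits = anchor @ positive.T / temperature     # pairwise similarity matrix
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)        # pull matched isoforms together
</code></pre>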

<p>Nowadays this view is more accepted, with people thinking about how splicing could be a way to regulate gene expression. But at the time, it wasn’t how most splicing people thought, and I got pushback early on. My partner was doing her PhD at Columbia at the time, in a lab that is more exposed to wet lab biology. I was visiting for a couple of weeks since we were long distance. Over lunch, I was telling one of the postdocs about Orthrus and, being a little too facetious, I said something like the whole idea was that splicing isn’t actually that important. She looked at me and said, “Surely you can’t mean that splicing is irrelevant.” I got a little stressed: she was welcoming me into her lab, and I wanted to be respectful of her research. And in reality, I actually agreed with her: splicing is clearly critical in some contexts, like development and neuronal tissues. My actual argument about the model’s inductive bias was a bit more nuanced: treating isoforms as functionally similar turns out to be an extremely effective training signal, even if it’s not literally true in every case. To restore my reputation, I went out and got a cake with a custom engraving that said “I love splicing.” I’m glad we were able to mend the misunderstanding, and we’ve actually talked about collaborating since then.</p>

<p><img src="/images/splicing_cake.jpg" alt="The I love splicing cake" width="500" /></p>

<h2 id="iv">IV</h2>

<p>I did my PhD at the University of Toronto and the Vector Institute. One of the best things about Vector is that it’s extremely collaborative. You get to hang out with people from all sorts of different subfields: information theory, optimization, computer vision, all of which turn out to be relevant in weird ways to biology.</p>

<p>As language models were getting better, around 2024–2025, our lunchtime conversations about the devaluation of intellectual labor took on a more anxious tone. We would usually end up sitting in the lunchroom, slightly spiraling, sharing stories of how we were using LLMs for our research. On Friday nights we’d go to the local pub (shout out to Ronnie’s) and talk about whether it even makes sense to do a PhD anymore.</p>

<p><img src="/images/ronnies_pub.jpg" alt="Friday nights at Ronnie's" width="500" /></p>

<p>Five years is a long time to become an expert in something when an LLM can extract the most relevant and latest ideas from papers and catch you up to the frontier of a field without necessarily having to put in all that time. Of course, it’s not the knowledge you gain during a PhD that matters but the ability to push the boundary of what is known. Still, in the professional world you often get paid for your understanding of a particular subject, which is exactly where LLMs are helpful.</p>

<p>The thing that I’m excited about now is that LLMs feel like an incredible tool for speeding up the process of scientific research. Looking back, I’m not sure how I got anything done before these tools existed. And working at the intersection of fields right now feels so powerful, because these tools let you pull in ideas from everywhere and prototype so fast.</p>

<p>In biology specifically, language models are missing something fundamental. They have a very weak conception of things like proteins, genomic regions, transcription factors, and small molecules, because language is just an inherently poor way to describe them. You can write about a protein’s function, but you can’t capture its shape or its binding dynamics in a sentence. The other piece is that so much of biology is experimental, and a lot of the knowledge about what actually works in the lab is tribal, passed down within labs: which assays are reliable, how to get a protocol to succeed, the little tricks that never make it into a methods section. Self-driving labs might eventually solve that challenge, but for now, LLMs are useful in biology while also being clearly incomplete.</p>

<h2 id="v">V</h2>

<p>During periods of upheaval, good scientists need to do two things. First, suspend criticism long enough to understand what a new paradigm enables and what the vision is for making it useful. Second, understand the strengths of the existing systems well enough to know how to integrate the old and the new. Most people tend to do only one. Some hold allegiance to the old paradigm and call out the shoddy science, which is often fair, but miss the real benefits that come along with the new. Others just apply the latest techniques without thinking deeply about the domain they’re working in. These disagreements regularly spill onto Twitter, where they turn into flame wars that are honestly very entertaining to watch but probably counterproductive and fracturing for the scientific community.</p>

<p>The two paradigm shifts I lived through are starting to converge. On one side, there are foundation models for specialized domains that can reason about biological sequences and structures directly. On the other, there are LLM agents that can peruse all of the published research and make scientific recommendations. I’m excited about integrating the two: giving language model agents a way to reason about biological entities themselves, not just the text we write about them. That would mean that direct improvements in foundation models translate into agents that can better contextualize scientific protocols and better interpret the results of experiments. That’s the future I want to build toward.</p>]]></content><author><name>Philip Fradkin</name><email>phil.fradkin@gmail.com</email></author><summary type="html"><![CDATA[Reflections on two paradigm shifts during a computational biology PhD: foundation models in genomics and LLMs as research tools.]]></summary></entry><entry><title type="html">Writing</title><link href="https://philechka.com/2020/10/30/writing.html" rel="alternate" type="text/html" title="Writing" /><published>2020-10-30T00:00:00+00:00</published><updated>2020-10-30T00:00:00+00:00</updated><id>https://philechka.com/2020/10/30/writing</id><content type="html" xml:base="https://philechka.com/2020/10/30/writing.html"><![CDATA[<p>Part of my reasoning for going to grad school is that it would give me the opportunity to work on my skills in an environment where there are outside constraints. In a work environment the main goal is to deliver value to the company. Sometimes the incentives of the individual and the company are aligned, so you learn how to be a better scientist, communicator, or clear thinker. More often I found that, even in a research role, the need to deliver immediate value forces you to look beyond your personal development. One of the things that I am interested in improving is communication. I would like to become a better oral and written communicator. As part of my goal to improve my written communication, I am aiming to start writing about my experiences, thoughts, and impressions of media.</p>

<p>I’m TAing a course this semester, and I have now experienced firsthand that to be confident in your conceptual understanding of something, you need to be able to explain it to someone. I’ve read multiple sources arguing that writing is a way of clearing up your own thinking as well. Writing seems to me like a more formal, constrained way to communicate your ideas. This means that the delivery needs to be more structured and your word choice can be more precise. I can view writing as a way of prodding my ideas and forcing them into a more definitive shape. This requires my conclusions to be more crisp and accurate. It means that, in theory, I would be able to generalize what I learn to new areas more efficiently.</p>

<p>I’ve previously thought about how an individual can have the best, brightest idea, but if they are unable to communicate it, it’s worthless. I wonder how much information and insight is lost to humanity because people are unable to communicate their thoughts to society. Imagine everyone had a lossless, succinct way of delivering their ideas to society. Language feels like a very lossy technology for expressing oneself. How many times, just in my own experience, has the meaning of a message been lost in text?</p>

<p>Ignaz Semmelweis was the first person to suggest that washing hands before treating patients could increase patient survival by preventing infection. This practice was revolutionary since it reduced patient mortality across all procedures, most importantly childbirth. Infamously, however, it took an incredibly long time to spread and be acknowledged by the scientific community. This is a tragedy, since for every year that the practice was not accepted and known in scientific circles, the price was human lives. In part, the reason this finding took so long to circulate is that Semmelweis was a poor communicator. He was slow to publish his findings and was combative with the rest of the medical community. This is a tragic example of devaluing communication as a key instrument in the scientific toolbox. In this case it cost the lives of many people. While I don’t think in most cases the outcome would be as horrendous as Semmelweis’s, boring talks and dry publications still deter progress.</p>

<p>This writing experiment is mostly aimed at myself. I want to become a better communicator, and as part of that I’d like to keep myself accountable by making these notes public. It will also be an interesting experience to reflect on how my writing changes over time.</p>]]></content><author><name>Philip Fradkin</name><email>phil.fradkin@gmail.com</email></author><summary type="html"><![CDATA[On using writing to sharpen thinking, why communication is the under-appreciated half of science, and starting a public writing practice during grad school.]]></summary></entry></feed>