Fun to see talk of "a compiler for DNA"---I've been hoping for that for a long time.
I have to admit, at a _glance_ this feels like a promising idea with few results and lots of marketing. I'll try to be clear about my confusion; feel free to explain if I'm off base.
- There's not a lot of talk of your "ground truth" for evaluations. Are you using mRNABench?
- Has your mRNABench paper been peer reviewed? You linked a preprint. (I know paper submission can be tough or stressful, and it's a superficial metric to be judged on!)
- Do any of your results suggest that this foundation model might be any good on out of sequence mRNA sequences? If not, then is the (current) model supposed to predict properties of natural mRNA sequences rather than of synthetic mRNA sequences?
- Did a lot of mRNA sequences have experimental verification of their predicted properties? At a quick glance, I see this 66 number in the paper---but I truly have no idea.
I'm super happy to praise both incremental progress and putting forth a vision, I just also want to have a clear understanding of the current state-of-the-art as well!
> ground truth
Hey yes, the ground truth for our evaluations is measured experimental data. Our models are benchmarked using mRNABench, which aggregates results from high-throughput wet lab experiments.
Our goal, however, is to move beyond predicting existing experimental outcomes. We intend to design novel sequences and validate their function in our own lab. At that stage, the functional success of the RNA we design will become the ground truth.
> peer reviewed?
Both mRNABench and Orthrus are in submission (at a big ML conference and a big-name journal) - unfortunately the academic systems move slowly, but we're working on getting them out there.
> synthetic mRNA sequences
I think you're asking about generalizing out of distribution to unnatural sequences. There are two ways that we do this: (1) There are screens called Massively Parallel Reporter Assays (MPRAs), and we evaluate on, for example, https://pubmed.ncbi.nlm.nih.gov/31267113/
Here all the sequences are synthetic and randomly designed and we do observe generalization. Ultimately it depends on the problem that we're tackling: some tasks like gene therapy design require endogenous sequences.
(2) The other angle is variant effect prediction (VEP). It can be thought of as a counterfactual prediction problem where you ask the model whether a small change in the input predicts a large change in the output. This is a good example of such a study: https://www.biorxiv.org/content/10.1101/2025.02.11.637758v2
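To make the counterfactual framing concrete, here is a minimal sketch of how a variant effect score could be computed: predict on the reference sequence, predict on the mutated sequence, and take the difference. The names `apply_snv`, `variant_effect`, and `toy_predictor` are hypothetical and exist only for illustration. In particular, `toy_predictor` is a GC-fraction placeholder standing in for a trained model head, not anything from Orthrus or mRNABench.

```python
from typing import Callable


def apply_snv(seq: str, pos: int, alt: str) -> str:
    """Return a copy of seq with the base at 0-based index pos replaced by alt."""
    if not 0 <= pos < len(seq):
        raise IndexError(f"position {pos} is outside a sequence of length {len(seq)}")
    if len(alt) != 1 or alt not in "ACGU":
        raise ValueError(f"alt must be one of A, C, G, U; got {alt!r}")
    return seq[:pos] + alt + seq[pos + 1:]


def variant_effect(predict: Callable[[str], float], ref_seq: str, pos: int, alt: str) -> float:
    """Counterfactual score: prediction for the mutated sequence minus the reference."""
    return predict(apply_snv(ref_seq, pos, alt)) - predict(ref_seq)


def toy_predictor(seq: str) -> float:
    """Placeholder for a real model: returns the GC fraction of the sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


ref = "AUGGCUAAGC"
# Substitute the U at index 1 with G and see how much the prediction moves.
delta = variant_effect(toy_predictor, ref, pos=1, alt="G")
print(f"predicted effect of U2G: {delta:+.3f}")
```

A large absolute score flags a variant the model considers consequential. Whether that ranking agrees with measured variant effects is what the benchmark has to establish.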
> experimental verification of their predicted properties
All our model evaluations are predictions of experimental results! The datasets we use are collections of wet lab measurements, so the model is constantly benchmarked against ground-truth biology.
The evaluation method involves fitting a linear probe on the model's learned embeddings to predict the experimental signal. This directly tests whether the model's learned representation of an RNA sequence contains a linear combination of features that can predict its measured biological properties.
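As a rough illustration of the linear-probe idea, here is a minimal sketch that fits a ridge regression on frozen embeddings and scores it on held-out sequences. Everything in it is an assumption made for illustration: the embeddings and labels are synthetic random data, and `fit_linear_probe`, `predict`, and `pearson_r` are hypothetical helper names rather than the mRNABench API. With a real model, `X` would be the per-sequence embeddings and `y` the measured experimental signal.

```python
import numpy as np


def fit_linear_probe(embeddings, targets, l2=1.0):
    """Closed-form ridge regression on frozen embeddings; returns (weights, bias)."""
    X = np.asarray(embeddings, dtype=float)
    y = np.asarray(targets, dtype=float)
    # Center the data so the bias term is not penalized by the L2 term.
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + l2 * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b


def predict(embeddings, w, b):
    """Apply the fitted probe to new embeddings."""
    return np.asarray(embeddings, dtype=float) @ w + b


def pearson_r(a, b):
    """Pearson correlation between predictions and measurements."""
    return float(np.corrcoef(a, b)[0, 1])


# Synthetic stand-in for (embedding, measurement) pairs; not real data.
rng = np.random.default_rng(0)
n, d = 400, 16
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = X @ true_w + 0.1 * rng.normal(size=n)

# Fit on one split and evaluate on held-out sequences only.
train, test = slice(0, 300), slice(300, n)
w, b = fit_linear_probe(X[train], y[train], l2=1.0)
r = pearson_r(predict(X[test], w, b), y[test])
print(f"held-out Pearson r: {r:.3f}")
```

Because the probe is linear and the embeddings are frozen, a high held-out correlation reflects what the representation already encodes rather than what a large downstream model could learn on top of it.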
Thanks for the feedback. I understand the caution around preprints. We believe a self-supervised learning approach is well-suited for this problem because it allows the model to first learn patterns from millions of unlabeled sequences before being fine-tuned on specific, and often smaller, experimental datasets.
I am totally on board with the premise (as a TechBio-adjacent person), and some of the approaches you're taking (focused domain-specific models like Orthrus, rather than massive foundation models like Evo2).
I'm curious about what your strategy is for data collection to fuel improved algorithmic design. Are you building out experimental capacity to generate datasets in house, or is that largely farmed out to partners?
Cool. Could we train a "potential oncoprotein" classifier on Orthrus embeddings? IMO self-serve diagnosis and detection is a far larger market than synthesis.
Maybe another application could be the ranking of candidate variants for cancer immunotherapy? As far as I know, lncRNAs are sometimes assessed.
How are the RNA sequences used? Are there any clinical trials running?
There are a number of different technologies. Some of the big ones are:
- mRNA therapies: These therapies deliver a synthetically created messenger RNA (mRNA) molecule, typically protected within a lipid nanoparticle (LNP), to a patient's cells. The cell's own machinery then uses this mRNA as a temporary blueprint to produce a specific protein.
The big example here is CAR-T therapy from Capstan, which just got acquired for $2.1B. Their asset, CPTX2309, is currently in Phase 1. Previously, to do CAR-T therapy you had to extract a patient's T cells and genetically engineer them in a special facility. Now the mRNA gets delivered directly to the patient's T cells, which significantly lowers the cost and technical hurdles.
- RNA interference (RNAi): Knocks down gene expression by harnessing a natural cellular mechanism that cells use for viral detection. The big example here is Alnylam, with 5 approved therapies and a number more in clinical trials.
- Antisense Oligonucleotides (ASOs): Short single-stranded RNA molecules that are delivered directly to the cell and target an existing mRNA. The big win here is Spinraza, the first approved treatment for Spinal Muscular Atrophy (SMA). The Spinraza clinical trial (ENDEAR) was so effective that it was deemed unethical to continue, because the control arm wasn't receiving the treatment. Before Spinraza, most patients would pass away before two years of age.
The other day I paired an article on pyroptosis caused by a marine Spongiibacter exopolysaccharide with an mRNA cancer vaccine article. I started to just forward the article on bacterially induced pyroptosis to the cancer vaccine researchers, but stopped to ask an LLM whether the approaches shared common pathways or mechanisms of action and, fish my wish, they are somehow similar, and I had asked a very important question that touches on a very active area of research.
How would your AI solution help with finding natural analogs of or alternatives to or foils of mRNA procedures?
Can EPS3.9 cause pyroptosis, which causes an IFN-I response, which causes epitope spreading for cancer treatment?
Re: "Sensitization of tumours to immunotherapy by boosting early type-I interferon responses enables epitope spreading" (2025) https://www.nature.com/articles/s41551-025-01380-1
How is this relevant to mRNA vaccines?:
"Ocean Sugar Makes Cancer Cells Explode" (2025) https://scitechdaily.com/ocean-sugar-makes-cancer-cells-expl... ... “A Novel Exopolysaccharide, Highly Prevalent in Marine Spongiibacter, Triggers Pyroptosis to Exhibit Potent Anticancer Effects” (2025) DOI: 10.1096/fj.202500412R https://faseb.onlinelibrary.wiley.com/doi/10.1096/fj.2025004...