Fascinating stuff.
For some reason, this reminds me of the way video encoders compress video:
https://en.wikipedia.org/wiki/Video_compression_picture_type...
It makes me wonder if you could use a similar technique (I-frames, B-frames, or P-frames) to get the diff against a "normal" WSI and then train on pattern recognition of those diffs.
These different frame types are used to reduce network transmission costs, but the context window feels similar if you squint at it as a throughput problem rather than a context-window-size problem.
It feels like there would be a lot of tools and codecs you could leverage here.
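To make the analogy concrete, here is a rough sketch of what that diff-against-normal idea could look like. Everything here is hypothetical: it assumes sample tiles are already registered against tiles from a "normal" reference slide, and the threshold is made up.

    import numpy as np

    def residual_tile(sample_tile: np.ndarray, reference_tile: np.ndarray) -> np.ndarray:
        # Signed difference between a sample tile and a registered "normal" reference
        # tile, analogous to the prediction residual a video encoder stores for a P-frame.
        assert sample_tile.shape == reference_tile.shape
        return sample_tile.astype(np.int16) - reference_tile.astype(np.int16)

    def tiles_worth_sending(sample_tiles, reference_tiles, threshold=12.0):
        # Keep only tiles whose mean absolute residual exceeds a threshold, i.e. regions
        # that deviate enough from "normal" to be worth transmitting or training on.
        kept = []
        for idx, (s, r) in enumerate(zip(sample_tiles, reference_tiles)):
            diff = residual_tile(s, r)
            if np.abs(diff).mean() > threshold:
                kept.append((idx, diff))
        return kept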
I've been thinking a bit more about better ways to build the tooling around it. To be fully transparent, I don't know much about video compression, but I will read up on it.
I have been running into some problems with memory management here, since each later frame needs a degree of context from the previous frames... (currently I just do something simple like passing the previous frame and the first reference frame into context). Maybe I can look into video compression and see if there is any inspiration there.
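For what it's worth, a minimal sketch of that "first reference frame + previous frame" strategy might look something like this (the Frame class is made up for illustration, not from any particular API):

    from dataclasses import dataclass

    @dataclass
    class Frame:
        index: int
        image_b64: str   # base64-encoded crop of the slide
        note: str = ""   # any commentary carried over from the previous step

    def build_context(history: list[Frame], current: Frame) -> list[Frame]:
        # Keep the first (reference) frame and the most recent frame instead of
        # the whole history, then append the current frame.
        context = []
        if history:
            context.append(history[0])      # first reference frame
        if len(history) > 1:
            context.append(history[-1])     # previous frame only
        context.append(current)
        return context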
I wonder if navigation plays a significant role in performance. If you just randomly select 15 frames (presumably with interesting pixels), will the model perform similarly well?
Thought about this too. I think there are two broad LLM capabilities that are currently kind of tangled up in this eval:
1. Can an LLM navigate a slide effectively (i.e. find all relevant regions of interest)?
2. Given a region of interest, can an LLM make the correct assessment?
I need to come up with a better test here in general, but yep, I'm thinking about this.
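One hypothetical way to untangle the two (not something from the post) would be to score the same assessment step twice: once on regions the model navigated to itself, and once on oracle-annotated regions of interest.

    def run_ablation(cases, navigate, assess, oracle_rois):
        # navigate(case) -> regions chosen by the model (capability 1)
        # assess(case, rois) -> prediction given regions of interest (capability 2)
        # oracle_rois[case_id] -> pathologist-annotated regions
        results = {"model_navigation": [], "oracle_navigation": []}
        for case in cases:
            model_rois = navigate(case)
            results["model_navigation"].append(assess(case, model_rois))
            results["oracle_navigation"].append(assess(case, oracle_rois[case.id]))
        # A big gap between the two columns points at navigation as the bottleneck;
        # a small gap points at the assessment step itself.
        return results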
How would a human classify the cancers? I assume the LLM training data does not include a whole bunch of cancer samples, so presumably there are some rules that it follows?
> While there exists several pathology-focused AI models
Would also be curious how the LLM compares to this and other approaches. What's the performance of the models trained specifically on this task, and random guessing, compared to the expert pathologist? Correct me if I'm wrong but this seems like the sort of task where being right 90% of the time is not good enough, so even if the LLM beats other approaches, it still needs to close the gap to human performance.
> What's the performance of the models trained specifically on this task, and random guessing, compared to the expert pathologist?
I should probably clarify first: the disease classification tasks are about subtyping the cancer (e.g. classifying a case as invasive ductal carcinoma of the breast) rather than just binary malignant/benign classification, so random guessing fares much worse, which makes the model's performance more impressive.
> Would also be curious how the LLM compares to this and other approaches.
There aren't a lot of public general-purpose pathology benchmarks. There are some, like (https://github.com/sinai-computational-pathology/SSL_tile_be...), but they focus on just binary benign/malignant classification tasks and binary biomarker detection tasks.
I am currently working on self-hosting the available open-source models.
> this seems like the sort of task where being right 90% of the time is not good enough, so even if the LLM beats other approaches, it still needs to close the gap to human performance
Yep, your intuition is right here, and actually the expectation is probably closer to the mid-to-high 90s, especially for FDA approval (and most AI tools position themselves as co-pilots at the moment). There is obviously a long way to go, but what I find interesting about this approach is that it allows LLMs to generalize across (1) a variety of tissue types and (2) pathology tasks such as IHC H-score scoring.
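For anyone unfamiliar with that last task: the standard IHC H-score is a weighted sum of staining-intensity percentages, so the scoring itself is simple arithmetic (the numbers below are illustrative only, not from the post):

    def h_score(weak_pct: float, moderate_pct: float, strong_pct: float) -> float:
        # Standard IHC H-score: 1*(% weakly stained cells) + 2*(% moderate) + 3*(% strong),
        # giving a value between 0 and 300. The hard part is estimating the percentages
        # from the slide, which is what the model is being asked to do.
        assert 0 <= weak_pct + moderate_pct + strong_pct <= 100
        return 1 * weak_pct + 2 * moderate_pct + 3 * strong_pct

    print(h_score(weak_pct=20, moderate_pct=30, strong_pct=10))  # -> 110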
Did you fine-tune GPT-5, Sonnet 4.5, or any of the other models? Or were they able to do this "out of the box"?
Nope, I just did some prompt engineering on out-of-the-box models. I thought about doing some fine-tuning on something like Qwen, but I think there is still more performance to be squeezed out with just prompts here.
none of the models you mentioned are open source...
Do you think fine-tuning these LLMs would produce results comparable to models trained specifically for this?
I think so. It feels like there is more to be squeezed out of just better prompts, but I was going to play around with fine-tuning Qwen3.
Fair enough. I wonder if fine-tuning over different modalities like IMC, H&E, etc. would help it generalize better across all of them.
Yeah, I think one of the interesting things would be to see how well it generalizes across tasks. The existence of pathology foundation models suggests there is certainly a degree of generalizability (at least across tissues), but I am not sure yet about generalizability across different modalities (there are some cool biomarker-prediction models, though).
Wow this is pretty interesting. Excited to see the benchmark!
Very cool. Have you tried some of the newer segmentation models to see if they make a difference? I've seen some in the past two weeks that look really effective... I wonder if they could help out the RL environment.
Nope, I haven't. I can take a look and see if I can fit it in.
You should reach out to Eric Topol...