While most research is commendable, I think this one starts from the wrong premise.
Unified memory has become a thing (Apple machines, Nvidia AI machines like the GH200, recent AMD "AI" machines), and as people are aware, AI workloads (like DB workloads) are bandwidth bound (that's why we often use 4-bit and 8-bit values today). To become compute bound you'd need to do more expensive work than graphics shaders do, which is uncommon in DB queries.
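To make the bandwidth-bound point concrete, here's a back-of-envelope sketch: for a pure column scan, runtime is just bytes moved divided by memory bandwidth, so halving the value width halves the scan time regardless of compute. The row count and bandwidth figure below are illustrative assumptions, not measured numbers for any specific machine.

```python
# Back-of-envelope: a table scan is bandwidth bound, so narrower values
# translate directly into proportionally faster scans.
# ROWS and BW are illustrative assumptions, not measured figures.

def scan_time_s(rows: int, bytes_per_value: float, bandwidth_gbs: float) -> float:
    """Time to stream one column of `rows` values at `bandwidth_gbs` GB/s."""
    return rows * bytes_per_value / (bandwidth_gbs * 1e9)

ROWS = 1_000_000_000   # 1B-row column (assumption)
BW = 500               # assumed unified-memory bandwidth in GB/s

for label, width in [("fp32", 4), ("fp16/int16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{label:>10}: {scan_time_s(ROWS, width, BW) * 1000:.1f} ms")
```

Going from fp32 to int4 is an 8x reduction in scan time with zero extra FLOPs, which is exactly why quantization pays off in bandwidth-bound settings.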
So, the focus of research should be:
A: How do queries in these setups compare to simply running on unified-memory machines? Is the win for discrete GPUs big enough to justify the complexity? (The GH200 perf advantage seems to partially answer this, since IIRC it's unified.)
B: What is the overhead of firing off query operations vs. just running on-CPU? Is query compilation overhead noticeable if queries are mostly novel and non-cached?
C: For keeping data on the GPU, are there options today for streaming directly to the GPU, bypassing host RAM entirely?
(Did skim a bit, heading out)