Awk? https://adamdrake.com/command-line-tools-can-be-235x-faster-...
Honestly this benchmark feels completely dominated by the instance's NIC capacity.
They used a c5.4xlarge, which peaks at 10Gbps of network bandwidth. Even at a constant 100% saturation, loading those 650GB from S3 would take somewhere in the ballpark of 9 minutes, making those 9 minutes your best case scenario for pulling the data (without even considering writing it back!)
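The napkin math, for anyone who wants to check it (assuming a sustained 10 Gbps and ignoring S3 request overhead):

    # 650 GB over a 10 Gbps NIC at full saturation, best case.
    data_gb = 650                      # compressed parquet sitting in S3
    nic_gbps = 10                      # c5.4xlarge peak network bandwidth
    seconds = data_gb * 8 / nic_gbps   # 5200 gigabits / 10 Gbps = 520 s
    print(f"{seconds:.0f} s, about {seconds / 60:.1f} min")   # ~520 s, ~8.7 min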
Minute differences in how these query engines schedule IO would have drastic effects on the benchmark outcomes, and I doubt the query engine itself was kept constantly fed during this workload, especially when evaluating DuckDB and Polars.
The irony of workloads like this is that it might be cheaper to pay for a gigantic instance to run the query and finish it quicker, than to pay for a cheaper instance taking several times longer.
> They used a c5.4xlarge, which peaks at 10Gbps of network bandwidth. Even at a constant 100% saturation, loading those 650GB from S3 would take somewhere in the ballpark of 9 minutes, making those 9 minutes your best case scenario for pulling the data (without even considering writing it back!)
The query being tested wouldn't scan the full files; in most sane engines it would process much less than 650GB of data, i.e. just one column out of all of them: a timestamp, which is also correlated with the partition keys. Nowadays what I would mostly be worried about is the distribution of file sizes (because of API call overhead plus skew), or whether the query is so different from the common access patterns that it can't use the metadata/columnar nature of the underlying Parquet (i.e. it ends up doing an effective "full scan" over all row groups and/or columns).
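Roughly the kind of query I mean, as a sketch in DuckDB's Python API (the bucket layout, column names and the hive-partitioning assumption are all made up, and it needs the httpfs extension plus S3 credentials configured):

    import duckdb

    con = duckdb.connect()
    # Projection pushdown: only the `ts` column's pages get fetched from S3.
    # Partition pruning: the filter on the hive partition key skips whole files
    # before a single byte of them is read.
    con.sql("""
        SELECT count(*), min(ts), max(ts)
        FROM read_parquet('s3://bucket/events/*/*.parquet', hive_partitioning = true)
        WHERE date >= '2024-01-01'
    """).show()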
> The irony of workloads like this is that it might be cheaper to pay for a gigantic instance to run the query and finish it quicker, than to pay for a cheaper instance taking several times longer.
That's absolutely right.
It would be amusing to run this on a regular desktop computer or even a moderately nice laptop (with a fan - give it a chance!) and see how it does. 650GB will stream in quite quickly from any decent NVMe device, and those 8-16 cores might well be considerably faster than whatever cores the cloud machines are giving you.
S3 is an amazingly engineered product, operates at truly impressive scale, is quite reasonably priced if you think of it as warm-to-very-cold storage with excellent durability properties, and has performance that can't hold a candle to any decent modern local storage device.
Absolutely. I recently reworked a bunch of tests and found my desktop to outcompete our (larger, custom) GitHub Actions runner by roughly 5x. And I expect this delta to increase a lot as you lean on the local I/O harder.
It really is shocking how much you're paying given how little you get. I certainly don't want to run a data center and handle all the scaling and complexity of such an endeavour. But wow, the tax you pay to have someone manage all that is staggering.
Everyone wants a data lake when what they have is a data pond.
10Gbps only? At Google, where this type of processing would automatically be distributed, machines had 400Gbps NICs, not to mention other innovations like better TCP congestion control algorithms. No wonder people are tired of distributed computing.
"At Google" is doing all the heavy lifting in your comment here, with all due respect. There is but one Google but remain millions of us who are not "At Google".
I’m merely describing the infrastructure that at least partially led to the success of distributed data processing. Also 400Gbps NIC isn’t a Google exclusive. Other clouds and on-premise DCs could buy them from Broadcom or other vendors.
The infra might have a 400Gbps NIC, but if you're buying a small compute slice on that infra, you don't get all the capability.
I'm kind of surprised they didn't choose an EC2 instance with higher throughput. S3 can totally eke out 100s of Gbps with the right setup.
BUT the author did say this is the simple, stupid, naive take, in which case DuckDB and Polars really shined.
c5 is such a bad instance type; m6a would be so much better and even cheaper. I would love to see this on an m8a.2xlarge (the 7th and 8th generations don't use SMT), which is even cheaper and has up to 15 Gbps.
If I understand correctly, polars relies on delta-rs for Delta Lake support, and that is what does not support Deletion vectors: https://github.com/delta-io/delta-rs/issues/1094
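For reference, the Polars side of that is roughly just this (a sketch; the table URI is made up, it needs the deltalake package installed, and S3 would also need credentials/storage options passed in):

    import polars as pl

    # Tables written with deletion vectors enabled can error out here until
    # delta-io/delta-rs#1094 is resolved.
    lf = pl.scan_delta("s3://bucket/delta/events")
    print(lf.select(pl.len()).collect())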
It seems like these single-node libraries can process a terabyte on a typical machine, and you'd have to have over 10TB before moving to Spark.
> It seems like these single-node libraries can process a terabyte on a typical machine, and you'd have to have over 10TB before moving to Spark.
I'm surprised by how often people jump to Spark because "it's (highly) parallelizable!" and "you can throw more nodes at it easy-peasy!" And yet, there are so many cases where you can just do things with better tools.
Like the time a junior engineer asked for help processing hundreds of ~5GB JSON files with a script that turned out to be doing crazy amounts of string concatenation in Python (don't ask). It was taking something like 18 hours to run, IIRC, and writing a simple console tool to do the heavy lifting and letting Python's multiprocessing tackle it dropped the time to something like 35 minutes.
Right tool for the right job, people.
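For flavor, the shape of that fix was roughly this (a minimal sketch, not the original tool; the paths and the per-file work are hypothetical):

    import json
    import multiprocessing as mp
    from pathlib import Path

    def process_file(path: Path) -> int:
        # Stand-in for the real per-file work; stream records instead of
        # building giant strings in memory.
        count = 0
        with path.open() as f:
            for line in f:             # assumes newline-delimited JSON
                record = json.loads(line)
                count += 1             # ... do the actual transformation here
        return count

    if __name__ == "__main__":
        files = sorted(Path("data").glob("*.json"))
        with mp.Pool() as pool:        # one worker per core by default
            totals = pool.map(process_file, files)
        print(sum(totals), "records processed")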
I used PySpark some time ago, when it was first introduced at my company, and I realized it was slow when you used Python libraries in the UDFs rather than PySpark's own functions.
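A minimal sketch of the difference (the column and data are made up; the point is the built-in function vs the Python UDF):

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

    # Slow path: a Python UDF forces every row across the JVM <-> Python boundary.
    upper_udf = F.udf(lambda s: s.upper(), StringType())
    df.select(upper_udf("name").alias("name_upper")).show()

    # Fast path: the built-in function stays inside the JVM / Catalyst.
    df.select(F.upper("name").alias("name_upper")).show()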
Presto (a.k.a. AWS Athena) might be a faster/better alternative? I'd also like to see how it does if the 650GB of data were available locally.
DuckDB has a new "DuckLake" catalog format that would be another candidate to test. https://ducklake.select/
The DuckLake format has an unresolved, built-in chicken-and-egg conflict: it requires a SQL database to represent its catalog. But that is exactly what some people are running away from when they choose the Parquet format in the first place. Parquet = easy, SQL = hard; adding SQL to Parquet makes the resulting format hard. I would expect the catalog to be in Parquet format as well; then it becomes something self-bootstrapping and usable.
It is not a chicken and egg problem, it is just a requirement to have an RDBMS available for systems like DuckLake and Hive to store their catalogs in. Metadata is relatively small and needs to provide ACID r/w => great RDBMS use case.
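And if I remember the DuckLake docs right, the catalog can just be a local DuckDB file, so "SQL database" doesn't have to mean running a server. Roughly something like this (paths are made up; check the DuckLake docs for the exact ATTACH options):

    import duckdb

    con = duckdb.connect()
    con.sql("INSTALL ducklake")
    con.sql("LOAD ducklake")
    # The catalog lives in a plain DuckDB file; the table data itself still
    # ends up as parquet files under DATA_PATH.
    con.sql("ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 'data_files/')")
    con.sql("CREATE TABLE lake.events AS SELECT 42 AS id")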
One thing that I never really see mentioned in these types of articles is that a lot of DuckDB’s functionality does not work if you need to spill to disk. iirc, percentiles/quartiles (among other aggregate functions) caused DuckDB to crash out when it spilled to disk.
I’m pretty sure I’ve done this and not had any issues. Can you share a minimum reproducible example?
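Not the OP, but a repro would presumably look something like this: cap the memory so the aggregate is forced to spill, then run a percentile over a lot of rows (numbers picked only to force spilling; I haven't verified this crashes):

    import duckdb

    con = duckdb.connect()
    con.sql("SET memory_limit = '500MB'")   # force the aggregation to spill to disk
    con.sql("SET temp_directory = '/tmp/duckdb_spill'")

    # ~100M rows of random doubles, then a percentile aggregate per group.
    con.sql("""
        SELECT g, quantile_cont(x, 0.95) AS p95
        FROM (
            SELECT range % 1000 AS g, random() AS x
            FROM range(100000000)
        )
        GROUP BY g
    """).show()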
650GB refers to the size of the Parquet files, which are compressed; in reality it's way more data than that.
32 GB of parquet cannot fit in 32GB of RAM
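The gap is easy to see from the file metadata alone (a sketch; the path is hypothetical, and the decoded in-memory size can be larger still, since this "uncompressed" figure still counts Parquet's encodings):

    import pyarrow.parquet as pq

    # Compare on-disk (compressed) vs uncompressed size for one file.
    meta = pq.ParquetFile("events.parquet").metadata
    compressed = uncompressed = 0
    for rg in range(meta.num_row_groups):
        for col in range(meta.num_columns):
            chunk = meta.row_group(rg).column(col)
            compressed += chunk.total_compressed_size
            uncompressed += chunk.total_uncompressed_size
    print(f"compressed: {compressed / 2**30:.1f} GiB, "
          f"uncompressed: {uncompressed / 2**30:.1f} GiB, "
          f"ratio: {uncompressed / compressed:.1f}x")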
This would speed things up since it looks like the bottleneck here is I/O.
650GB? We have 72PB in S3, and know people who have multiple EB in S3.
That has to be multiple data sets. How big are your individual nightly jobs, and what are you processing them with?
This is not a game of "mine is bigger than yours". Many, many workloads in the wild are smaller than this.
MotherDuck has a few posts about how few people have "big data". https://motherduck.com/blog/redshift-files-hunt-for-big-data...
I hate these screenshots of commands and outputs everywhere.
$6 of data does not a compelling story make. This is not 1998.