-- "Improve the efficiency of planning queries accessing many partitions (Ashutosh Bapat, Yuya Watari, David Rowley)"
..."The actual performance increases here are highly dependent on the number
of partitions and the query being planned. Performance increases can be
visible with as few as 8 partitions, but the speedup is marginal for
such low numbers of partitions. The speedups become much more visible
with a few dozen to hundreds of partitions. With some tested queries
using 56 partitions, the planner was around 3x faster than before. For
use cases with thousands of partitions, these are likely to become
significantly faster. Some testing has shown planner speedups of 60x or
more with 8192 partitions."
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d69d45a5a
Note: PostgreSQL 18 includes many optimisations related to partitions; because of this, the improvement may be smaller. ( https://www.postgresql.org/docs/18/release-18.html )
You can do the same thing in MySQL, though its hash partition selection algorithm is far simpler: MOD(expr, N), where expr is your chosen expression on a given column, e.g. TO_DAYS(created_at), and N is the number of partitions you specified for the table.
So you can precompute the partitions in the app, and then explicitly specify the partitions in the query. Though there isn’t a ton of value in this for any large range of dates, since you’ll end up hitting all partitions anyway.
For something like a user id, it might make sense. If you’re using something alphanumeric as a user id, you can pass it through CRC32() first, or just use KEY partitioning on the column directly.
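A rough sketch of the MySQL idea above (the table and values are illustrative, not from the thread): with HASH partitioning the chosen partition is MOD(expr, N), so the app can precompute the index and name the partition explicitly.

-- Illustrative MySQL table, 16 hash partitions named p0..p15 by default
CREATE TABLE events_mysql (
  user_id bigint NOT NULL,
  payload text,
  created_at datetime NOT NULL
)
PARTITION BY HASH (user_id)
PARTITIONS 16;

-- user_id 123 lands in MOD(123, 16) = 11, i.e. partition p11
SELECT * FROM events_mysql PARTITION (p11) WHERE user_id = 123;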
I’d be interested to see a benchmark of actually selecting rows by manually specifying the partition vs not.
This benchmark seems to be pure computation of the hash value, which I don’t think is helpful to test the hypothesis. A lot can happen at actual query time that this benchmark does not account for.
That's fair. I ran an EXPLAIN ANALYZE, which shows the time it takes for planning and execution, and sending the query directly to the table outperforms that.
More to come soon.
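For reference, a hedged sketch of that comparison in PostgreSQL (table names follow the example later in the thread; which partition a value hashes to is an assumption):

-- Query via the parent: planning must consider the partition hierarchy
EXPLAIN ANALYZE SELECT * FROM events WHERE user_id = 12345;
-- Query the partition directly (assuming user_id 12345 hashes to remainder 0),
-- then compare the reported Planning Time and Execution Time
EXPLAIN ANALYZE SELECT * FROM events_0 WHERE user_id = 12345;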
Without a comparison against letting Postgres calculate the partition, this is just useless.
And who in their right mind would calculate a hash using a static SQL query that isn't even using the pg catalog hashing routine but a reimplementation?
I'm baffled.
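For what it's worth, a sketch of leaning on the server's own hashing instead of a reimplementation, assuming a parent table events hash-partitioned into 16 partitions on a bigint user_id:

-- Ask PostgreSQL which remainder a given key hashes to, using the same routine
-- it uses in hash partition constraints
SELECT r AS remainder
FROM generate_series(0, 15) AS r
WHERE satisfies_hash_partition('events'::regclass, 16, r, 12345::bigint);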
Tangential: is "without requiring knowledge of data patterns" a frequently useful requirement? I.e. isn't knowledge of data patterns basically required for any performance optimization?
This sounds really interesting. The query planner acquires a lock on every table partition and its indexes. There are only 16 fast-path lock slots per backend; exceed that number and locking becomes significantly more expensive, slowing down queries. The workaround I have used so far was to set plan_cache_mode=force_generic_plan, which prevents re-planning queries when parameters change.
This approach might be a better option, but sadly the app needs to be modified to make use of it.
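A minimal sketch of that workaround (the prepared statement and table are illustrative): with a generic plan the statement is planned once, rather than being re-planned for every new parameter value.

SET plan_cache_mode = force_generic_plan;
PREPARE recent_events (bigint) AS
SELECT * FROM events WHERE user_id = $1 ORDER BY created_at DESC LIMIT 10;
-- Reuses the cached generic plan; partition pruning happens at run time rather
-- than at each re-plan
EXECUTE recent_events (12345);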
Is this correct?
-- Parent table
CREATE TABLE events (
  id bigint,
  user_id bigint,
  event_type integer,
  payload text,
  created_at timestamp
) PARTITION BY HASH (user_id);

-- First level: 16 partitions by user_id
CREATE TABLE events_0 PARTITION OF events
FOR VALUES WITH (modulus 16, remainder 0)
PARTITION BY HASH (event_type);
shouldn't that be by user_id for the first 16 tables?
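For clarity: the FOR VALUES WITH clause on events_0 is governed by the parent's HASH (user_id) key, so the first 16 tables are indeed split by user_id; the trailing PARTITION BY HASH (event_type) only declares how events_0 is itself sub-partitioned. A sketch of the second level (the modulus of 4 is an assumption, not from the thread):

-- Illustrative second level: events_0 already holds the user_id remainder 0
-- slice; its sub-partitions split that slice by event_type
CREATE TABLE events_0_0 PARTITION OF events_0
FOR VALUES WITH (modulus 4, remainder 0);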
This looks like a maintenance nightmare going forward, but I could be wrong.
If you are stuck on a specific PG version for a while, maybe it's worth it.
The hash won't change on a PG version upgrade, because that would force a massive reshuffling of data by pg_upgrade: something PG has managed to avoid so far.
Nice to see Ruby getting ahead.