By "SEALED", do you mean "READ ONLY"?
On 9/6/22 14:39, Levi Aul wrote:
My company runs some large OLAP data warehouses with append-only,
time-partitioned datasets. Our workloads involve aggregations and joins,
and query the data in ways not amenable to constraint exclusion; and we
serve a high number of these queries concurrently from a single DB.
In other words, our workload is inherently one that acquires "way too
many locks." Our largest performance bottleneck, according to
pg_wait_sampling, is the LockManager itself. Despite most of our queries
spending only milliseconds actually executing, they often spend seconds
during planning, waiting to acquire hundreds of ACCESS SHARE locks.
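For concreteness, this is visible with a query along the following lines
over pg_wait_sampling's profile view (pg_wait_sampling_profile, with the
columns the extension documents):

    -- Aggregate sampled wait events across all backends; in our case,
    -- the LWLock "LockManager" event dominates the result.
    SELECT event_type, event, sum(count) AS samples
      FROM pg_wait_sampling_profile
     GROUP BY event_type, event
     ORDER BY samples DESC
     LIMIT 10;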
Given that our datasets are append-only, all our partitions for each table
save for the one "active" one (the one for the current time period) are
effectively immutable. No DML-triggered writes will occur to these. I
think this is pretty common in data-warehouse use-cases of PG.
If PG could avoid the need to acquire the locks for these
effectively-immutable partitions, then the remaining number of locked
relations would be low enough to fit into the per-backend fast-path lock
slots (16 per backend), and so avoid contention on the shared
lock-manager LWLocks. I believe this could be a large optimization not
just for our use-case, but in a number of other high-concurrency OLAP
use-cases.
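You can watch locks spill out of the fast-path slots directly in
pg_locks, which exposes a boolean "fastpath" column:

    -- Relation locks taken via the per-backend fast-path slots (true)
    -- vs. through the shared lock manager (false). Roughly: once a
    -- backend holds more than 16 weak relation locks, the remainder
    -- all show up as "false".
    SELECT fastpath, count(*)
      FROM pg_locks
     WHERE locktype = 'relation'
     GROUP BY fastpath;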
My proposal for how this "lock elision under large numbers of immutable
partitions" could be accomplished:
1. Add some DDL statement to mark partitions as sealed/unsealed. (ALTER
TABLE ... SEAL PARTITION foo)
2. When query-planning DML against a partition or a partitioned table,
treat a sealed partition as if it had an always-false check constraint.
3. Define a "locking group" abstraction, where many entities can register
themselves under the same lock, such that access to all members of the
locking group requires only acquiring the single locking-group lock. All
sealed partitions of the same table would share a locking group.
Under such a setup, querying a time-based partitioned table with one
active (unsealed) partition would only ever require acquiring, at most,
two locks — the one for the active partition, and the one for the
sealed-partitions locking group.
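To sketch this with the proposed (not currently existing) syntax, and
illustrative table names:

    -- Hypothetical DDL: seal last month's partition of "events".
    ALTER TABLE events SEAL PARTITION events_2022_08;

    -- Planning any query over "events" would now acquire at most two
    -- relation locks: one on the active partition (events_2022_09),
    -- and one on the sealed-partitions locking group.
    SELECT count(*) FROM events;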
The trade-off for this is that acquiring an exclusive-access lock on the
sealed-partitions locking-group for a table becomes much more expensive
than it would have been to acquire for a single partition. But this isn't
a problem in practice, because hot-path operations that take an
exclusive-access lock (DML writes) are disallowed against sealed
partitions. The only time the lock-group would need to be acquired with
exclusive access would be to change its membership, an administrative DDL
operation.
Besides being useful operationally, such a mechanism would also be helpful
on the business-logic level, as you could rely on partition sealing to
turn an accidental insertion of new data into anything but the active
partition(s) into a constraint violation. (Currently, achieving this
requires maintaining a separate trigger on each sealed partition.)
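Today that workaround looks something like the following, repeated for
every sealed partition (names illustrative):

    -- One shared trigger function...
    CREATE FUNCTION reject_sealed_dml() RETURNS trigger
    LANGUAGE plpgsql AS $$
    BEGIN
      RAISE EXCEPTION 'partition % is sealed', TG_TABLE_NAME;
    END;
    $$;

    -- ...but one trigger per sealed partition, maintained by hand.
    CREATE TRIGGER events_2022_08_sealed
      BEFORE INSERT OR UPDATE OR DELETE ON events_2022_08
      FOR EACH ROW EXECUTE FUNCTION reject_sealed_dml();

A declarative SEAL would replace all of that with a single catalog flag.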
And, with knowledge of the administrative intent for a table to be
immutable, further operational optimizations could be performed. A few off
the top of my head:
1. Running CLUSTER or VACUUM (FULL, FREEZE) after the partition is marked
as immutable could rewrite the table using an implicit "heap_immutable"
access method (still reported as "heap"), which would drop the per-row
xmin column (as everything in a sealed table is guaranteed to be
always-visible), and thus remove the table from consideration for
anti-wraparound (xid-freeze) vacuuming. Such partitions would then require
a rewrite back to "heap" if unsealed.
2. Alternatively, such storage-rewriting DDL statements could switch the
table, and its indices, over to an entirely different access method, which
would store the data and indices in "perfect" packed forms, to maximize
read performance while also minimizing disk usage.
3. ANALYZE could have an (otherwise-impractical) "EXACT" argument, to
populate statistics with exact aggregate values, requiring reading all
rows rather than sampling rows. This could pre-bake table-level aggregates
for most columns, like having a single, table-sized BRIN block-range.
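In operator-facing terms (with the EXACT option being proposed syntax,
not something ANALYZE accepts today), the post-seal maintenance pass for
1. and 3. might look like:

    -- Existing syntax; under the proposal, running this on a sealed
    -- partition would rewrite it into the immutable representation:
    VACUUM (FULL, FREEZE) events_2022_08;

    -- Proposed syntax: compute exact, whole-table statistics:
    ANALYZE (EXACT) events_2022_08;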
If this concept of "marking as sealed" were extended to tables rather than
only partitions, then further work could be done to optimize bulk loads.
For example, CREATE TABLE AS ... SEALED could avoid generating WAL for the
table as it is populated, treating the table as UNLOGGED during
population; then, after creation, the entire finalized/sealed table's
backing files could either be passed directly to archive_command / sent
directly to WAL receivers, or split and streamed as retrospective WAL
segments (each segment containing a single "put this 16MB of data into
this file at this position" op).
--
Angular momentum makes the world go 'round.