Re: Table partition with primary key in 11.3

Peter Geoghegan <pg@xxxxxxx> · Fri, 7 Jun 2019 14:13:59 -0700

On Fri, Jun 7, 2019 at 1:22 PM Alvaro Herrera <alvherre@xxxxxxxxxxxxxxx> wrote:
> Somehow we ended up discussing this topic in a rather mistitled thread
> ... oh well :-)  (Nowadays I hesitate to change threads' subject lines,
> because gmail).

You can blame me for that, I think.

> > It occurs to me that we could add a code path to nbtree page splits,
> > that considered removing dropped partition tuples to avert a page
> > split. This would be a bit like the LP_DEAD/kill_prior_tuple thing.
> > Technically the space used by index tuples that point to a dropped
> > partitions wouldn't become reclaimable immediately, but it might not
> > matter with this optimization.
>
> This seems useful on the surface: you drop a partition, and slowly and
> incrementally any index items that point to it are removed by processes
> scanning the index.  You can't rely solely on this, though: as pointed
> out by Robert in the indirect index thread, doing this only means that
> non-scanned parts of the index to retain entries for arbitrary long,
> which is bad.  Also, this adds latency to client-connected processes.

Well, we don't have to rely on index scans to set the LP_DEAD bit in
this case. We probably wouldn't do that at all. Rather, we'd have the
page split code refer to a list of dropped partition numbers as
targets for killing immediately. Maybe we'd hint the number of
distinct partitions represented on the page, to make it a bit faster.

> Because you can't rely on that exclusively, and you want to reuse the
> partition ID eventually, you still need a cleanup process that removes
> those remaining index entries.  This cleanup process is a background
> process, so it doesn't affect latency.  I think it's not a good idea to
> add latency to clients in order to optimize a background process.

Ordinarily I would agree, but we're talking about something that takes
place at the point that we're just about to split the page, that will
probably make the page split unnecessary when we can reclaim as few as
one or two tuples. A page split is already a very expensive thing by
any measure, and something that can rarely be "undone", so avoiding
them entirely is very compelling. Delaying a split will often prevent
it altogether. We're already doing foreground processing, just by
having page splits at all.

Other DB systems that don't do much foreground processing will still
do a certain amount of it if that avoids a split in some cases --
"Modern B-Tree techniques" mentions this, and suggests quite a number
of ways that a split might be averted.

> This way, when a partition is dropped, we have to take the time to scan
> all global indexes; when they've been scanned we can remove the catalog
> entry, and at that point the partition ID becomes available to future
> partitions.

It seems worth recycling partition IDs, but it should be possible to
delay that for a very long time if necessary. Ideally, users wouldn't
have to bother with it when they have really huge global indexes.

> > The nbtree heap TID column and partition number column should probably
> > be a single varwidth column (not two separate columns), that is often
> > no wider than 6 bytes, but can be wider when there are many partitions
> > and/or very large partitions. That will be challenging, but it seems
> > like the right place to solve the problem. I think that I could make
> > that happen. Maybe this same representation could be used for all
> > nbtree indexes, not just global nbtree indexes.
>
> Maybe local nbtree indexes would have a partition ID of length 0, since
> that many bits are necessary to identify which table is pointed to by
> each index item.

Right -- special cases are best avoided here. In general, we'd push as
much of the new complexity as we can into this new TID-like table
identifier, while requiring it to work with our existing requirements
for TIDs, plus certain new requirements for global indexes (and maybe
other new requirements, such as relations that are larger than 35GB).
If the complexity is well-encapsulated, then it probably won't be too
bad. Access methods would have to be okay with varwidth table
identifiers, which is a big change, but they at least shouldn't have
to worry about anything else breaking. They'd probably have a
pg_attribute entry for the varwidth table identifier column, too (it
would be the last column in every nbtree index).

We'd expect a space efficient representation with real world
relations, that at least matches what we get with heap TIDs today.
This isn't quite as hard as it sounds. You don't have to be Claude
Shannon to realize that it's kind of silly to reserve 16 bits for the
offset number component of a TID/ItemPointer. We need to continue to
support offset numbers that go that high, but the implementation would
optimize for the common case where offset numbers are less than 512
(or maybe less than 1024).

--
Peter Geoghegan