Tom Lane wrote:
> Craig Ringer <craig@xxxxxxxxxxxxxxxxxxxxx> writes:
>> So ... is this crazy? Concurrently clustering the table by moving
>> each record *twice*, in batches, with pauses to allow old versions
>> to cease being visible by any live transaction? Or can it actually
>> work?
>
> It seems to me you'd have a pretty horrid bloat problem: at
> completion, the table and all its indexes must be at least twice the
> minimum size,
I was hoping that wouldn't be the case for the table itself, though I
expected the indexes would be pretty messed up and need their own
cleanup process.
I'll try to explain my thinking, as I'm curious about what I've
misunderstood.
The documentation on VACUUM (well, on pg_freespacemap actually) suggests
that vacuum can recover space from pages that contain some free space
but are not wholly free. Is that right?
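At least, that's how I read it. For example, something like this
(table name invented, and assuming the 8.4-style pg_freespacemap
interface) should show per-page free space, including pages that are
only partly empty:

    -- Show free space per page, including partially-full pages:
    SELECT * FROM pg_freespace('mytable') WHERE avail > 0;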
In this theoretical progressive cluster, tuples are moved in chunks from
their original locations to free space later in the table (probably
newly allocated at the end). However, if vacuum can recover partial
pages shouldn't it be marking some of the scratch space and originally
allocated space as free (thus permitting its reuse as scratch space) as
tuples are picked out and moved back to the start of the table in order?
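Just to illustrate the rhythm I have in mind (plain SQL can't actually
direct where the new row versions are placed, so this is only a
sketch; the table and column names are invented):

    -- One step: force new versions of the next batch of rows to be
    -- written, hopefully into free space elsewhere (a HOT update
    -- might keep them on the same page), then let VACUUM mark the
    -- old versions reusable once no transaction can still see them.
    UPDATE mytable SET sort_key = sort_key
    WHERE id IN (SELECT id FROM mytable
                 ORDER BY sort_key
                 LIMIT 1000 OFFSET 0);  -- advance the offset each step
    VACUUM mytable;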
Guesswork:
If the initial order of tuples in the table before clustering starts
is completely random, then early on the table would expand by a full
progressive-cluster chunk size each step, because the tuples being
moved back to the start would come from pages all over the place, and
the space being freed would be very scattered and hard to reclaim.
Later on, though, less new space would have to be allocated, because
more and more of the space allocated earlier to hold moved tuples
would be freed up in useful chunks that could be reused. That would
also permit inserts and updates unrelated to the ongoing progressive
clustering to land inside already allocated space rather than being
appended to the table after all the progressive clustering scratch
space.
So, if I understand vacuum's space reclaiming correctly, then even
starting off with a completely random record order it should expand
the table somewhat for scratch space, but to less than double its
original size. How much less, and how much could be truncated at the
end, would depend on how good vacuum is at finding small holes to
shove new/moved tuples into, and how similar the tuple sizes are.
Right?
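(That'd be easy enough to measure during testing, at least; watching
something like

    -- Watch how much the table actually grows during the process:
    SELECT pg_size_pretty(pg_relation_size('mytable'));

after each step would show how much scratch space really gets
allocated.)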
That assumes that the initial ordering of tuples is in fact random. If
you're re-clustering a table it's probably far from random, and many of
the tuples will already be in roughly the right areas. That should
permit a much smaller allocation of scratch space, since much more of
the data from the scratch area will be copied back and marked as free
for reuse (for new unrelated inserts or for more moved tuples) within
the next few steps of the progressive cluster. Especially if there's a
non-100% fillfactor, it should also be possible to truncate most of
the newly allocated space once the clustering completes, as new
inserts can be put in sensible places within the table rather than
appended at the end.
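For the record, I just mean the ordinary fillfactor storage parameter
here, e.g.:

    -- Keep ~10% of each page free so later inserts/updates can land
    -- near related rows instead of at the end of the table:
    ALTER TABLE mytable SET (fillfactor = 90);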
Now, even if that's right the indexes will presumably be in an awful
state. I've noticed that PostgreSQL has `CREATE INDEX CONCURRENTLY' but
not `REINDEX CONCURRENTLY'. That's not too surprising, as nobody's
trying to use an index while you're initially creating it. If there's no
way to clean up the indexes after an operation like this then it's
probably not worth it ... so, is there any way to clean up / optimize
an index that doesn't require a long exclusive lock? A REINDEX
CONCURRENTLY equivalent?
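The closest thing I can see (and I may be missing a subtlety,
particularly for indexes backing UNIQUE or PRIMARY KEY constraints,
where this won't work as-is) is building a duplicate index
concurrently and swapping it in:

    -- Build a replacement without blocking writes, then swap names.
    CREATE INDEX CONCURRENTLY mytable_col_idx_new ON mytable (col);
    DROP INDEX mytable_col_idx;
    ALTER INDEX mytable_col_idx_new RENAME TO mytable_col_idx;

That still takes a brief exclusive lock for the DROP and RENAME,
though, rather than none at all.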
--
Craig Ringer