Scott Ribe wrote:
>> Wouldn't new / updated tuples just get put in the hole, fairly rapidly
>> un-clustering the table again?
>
> How is that different than putting them in newly-allocated space at the
> end of the table?
It isn't, I just wasn't thinking straight.
This is probably a stupid idea as I'm fairly clueless about Pg's
innards, but I find myself wondering about doing a non-exclusive cluster
in small progressive chunks, in a series of short transactions.
In each transaction, tuples from a smallish contiguous chunk of pages
(beginning at the start of the table and moving onward as the cluster
progresses) are copied to free or newly allocated space later in the
table and given the xid of the transaction. The old versions are marked
as having been obsoleted by this transaction. The transaction then
commits. This step is much like a standalone UPDATE that doesn't actually
change any field values and that selects its tuples by physical location
rather than by value.
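For comparison, an ordinary no-op UPDATE already has a similar on-disk
effect, though with no control over where the new versions end up (the
table, column, and key range below are just placeholders):

  BEGIN;
  -- Assigning a column to itself changes nothing logically, but each matching
  -- row still gets a new tuple version written elsewhere in the table, with
  -- the old version left behind as dead space for VACUUM to reclaim later.
  UPDATE my_table
     SET some_col = some_col
   WHERE cluster_key BETWEEN 1 AND 1000;
  COMMIT;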
Progressive clustering then waits until the last transaction that could
still see the old copies of the just-relocated tuples has finished. Once
the space those tuples occupied becomes reclaimable (i.e. VACUUM could
mark it as free if it were to run), progressive clustering resumes. A new
transaction is begun. It walks the index being clustered on in order and
reads enough tuples into RAM to fill the space just freed. It then writes
those tuples into the freed space (without ever actually marking it as
free, so no other transaction can write to it), giving them the xid of
the writing transaction. The old copies are marked as obsoleted by this
transaction and the transaction commits. Again, it's like an UPDATE, but
combined with a small, selective vacuum that never bothers to update the
free space map because it reuses the space immediately.
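As a very rough sketch of the "wait until nobody can still see the old
copies" step, a driver outside the server could poll for transactions
that predate the move; the timestamp literal is a placeholder for when
the move committed (newer releases also expose backend_xmin in
pg_stat_activity for a more precise test):

  SELECT pid, xact_start, state
    FROM pg_stat_activity
   WHERE xact_start < '2008-03-20 12:00:00'  -- placeholder: time the move committed
     AND pid <> pg_backend_pid();            -- ignore the driver's own session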
Now a small chunk at the start of the table is in order. The process can
begin again with the next chunk of unordered pages. It never needs to
create a huge hole in the table or expand the table massively. If a
fillfactor has been set it can even space the data out, leaving gaps as
it writes the replacement data into the freed chunks and updating the
free space map accordingly.
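(The fillfactor referred to is just the usual storage parameter, set with
a placeholder table name like so:

  ALTER TABLE my_table SET (fillfactor = 90);  -- leave roughly 10% of each page free

though on its own it only affects how new data is packed onto pages.)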
The progressive cluster would also need to run a periodic vacuum (or do
the equivalent internally), restricted so that it never examines pages in
the range the progressive cluster is currently trying to free. Otherwise
the space taken up by the old versions of rows that are now neatly
ordered in the table would never be reclaimed and couldn't be used as
scratch space.
In my ignorance of Pg's innards it seems like all this should work,
though it would basically never finish if there were long-running
transactions touching the table being clustered. The other potential
problem I wonder about is index bloat, since every record in the table is
moved twice over the course of the progressive cluster operation.
Oh: there's actually only a need for one transaction per step:

begin transaction
    move ordered tuples into the just-freed hole
    move tuples from the next block of pages to free space later in the table
commit
wait for all transactions that can see the old versions to complete
repeat
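From the SQL level, one iteration of that loop might look roughly like
the sketch below. This is only the shape of it: plain SQL gives no
control over which pages the new tuple versions land on, and the names
and key range are placeholders:

  BEGIN;
  -- In the proposal this would be two internal moves (ordered tuples into the
  -- just-freed hole, the next chunk out of the way); a self-assignment UPDATE
  -- only approximates the "rewrite these rows somewhere else" part.
  UPDATE my_table
     SET some_col = some_col
   WHERE cluster_key BETWEEN 1001 AND 2000;
  COMMIT;

  -- ...wait, outside SQL, until no open transaction predates that commit...

  VACUUM my_table;  -- reclaim the old versions before starting the next chunk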
So ... is this crazy? Concurrently clustering the table by moving each
record *twice*, in batches, with pauses to allow old versions to stop
being visible to any live transaction? Or could it actually work?
--
Craig Ringer