
Re: clustering without locking

Scott Ribe wrote:
>> Wouldn't new / updated tuples just get put in the hole, fairly rapidly
>> un-clustering the table again?
>
> How is that different from putting them in newly-allocated space at the end
> of the table?

It isn't; I just wasn't thinking straight.

This is probably a stupid idea, as I'm fairly clueless about Pg's innards, but I find myself wondering about doing a non-exclusive cluster in small progressive chunks, in a series of short transactions.

In each transaction, tuples from a smallish contiguous chunk of pages (beginning at the start of the table and moving onward as the cluster progresses) are copied to free or newly allocated space later in the table and given the xid of the transaction. The old versions are marked as having been obsoleted by this transaction, and the transaction commits. This step is much like a standalone UPDATE that doesn't actually change any field values and that uses some unusual tuple-selection criteria.

Progressive clustering then waits until the last transaction that could still see the old copies of the relocated tuples has finished. Once the space those tuples occupied becomes reclaimable (i.e. vacuum could mark it as free if it were to run), progressive clustering resumes. A new transaction is begun. It looks up tuples, in order, via the index being clustered on, and reads enough of them into RAM to fill the space just freed. It then writes those tuples into the freed space (without ever actually marking it as free, so no other transaction could write to it), giving them the xid of the writing transaction. The old copies are marked as obsoleted by this transaction, and the transaction commits. Again, it's like an UPDATE, but combined with a small, selective vacuum that never bothers to update the free space map because it instantly reuses the space.
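The "wait until no transaction can still see the old copies" condition can be sketched like this. A toy model only, not Pg's actual snapshot machinery: it assumes simple monotonically increasing xids with no wraparound, and that the hypothetical `old_versions_reclaimable` helper is handed the set of currently active xids:

```python
def old_versions_reclaimable(relocating_xid, active_xids):
    """True once no active transaction could still see the tuple
    versions obsoleted by relocating_xid, i.e. every transaction
    that started before it committed has finished.

    Toy model: plain integer xids, no wraparound handling."""
    return all(xid > relocating_xid for xid in active_xids)
```

For example, if the relocating transaction had xid 100, the freed space is usable once only younger transactions (101, 105, ...) remain active, but not while xid 99 is still running.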

Now a small chunk of the table, at the start, is in order. The process can begin again with the next chunk of unordered pages. It never needs to create a huge hole in the table or expand the table massively. It can even space the data out, if a fillfactor has been set, by leaving gaps as it writes the replacement data into the freed chunks and updating the free space map accordingly.

The progressive cluster would also need to be able to run a periodic vacuum (or do the equivalent internally), with a restriction preventing it from examining pages in the range the progressive cluster was currently trying to free. Otherwise the space occupied by old versions of rows that are now neatly ordered in the table would never be freed and couldn't be used as scratch space.
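That restricted vacuum might look something like this sketch. Everything here is hypothetical (pages modelled as dicts with a `dead` flag; Pg's real vacuum is nothing this simple); it only illustrates the idea of skipping the protected page range:

```python
def restricted_vacuum(pages, protected):
    """Reclaim dead space on every page except those in `protected`
    (the range the progressive cluster is currently trying to free).

    Hypothetical model: each page is a dict with a boolean 'dead'
    flag standing in for reclaimable old tuple versions."""
    freed = []
    for pageno, page in enumerate(pages):
        if pageno in protected:
            continue  # leave the cluster's working range alone
        if page.get("dead"):
            page["dead"] = False
            freed.append(pageno)
    return freed
```

Here `freed` plays the role of the free space map entries the vacuum would record.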

In my ignorance of Pg's innards it seems like all this should work, though it'd basically never finish if there were long-running transactions touching the table being clustered. The other potential problem I wonder about is index bloat, since every record in the table is moved twice over the course of the progressive cluster operation.

Oh: There's actually only a need for one transaction per step:

begin transaction
move ordered tuples into just-freed hole
move tuples from next block of pages to free space later in table
commit
wait for all transactions that can see the old versions to complete
repeat
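The loop above can be sketched as a toy simulation. This is plain Python over a list of keys (None = free slot), with no MVCC at all, so the "wait for old versions" step is a no-op; it only demonstrates that moving each record twice, chunk by chunk, does end with the table in order without ever opening a huge hole:

```python
def progressive_cluster(table, chunk=2):
    """Toy model of the proposed progressive cluster: `table` is a
    list of sortable keys, None meaning a free slot. Each step
    evacuates the next chunk to free space later in the table, then
    backfills the freed hole with the lowest remaining keys in
    index order. Returns the final key order."""
    done = 0  # slots [0, done) are already in final clustered order
    while done < len(table):
        hole = range(done, min(done + chunk, len(table)))
        # Step 1: move tuples from the next block of pages to free
        # space later in the table (or newly allocated space at the end).
        for i in hole:
            if table[i] is None:
                continue
            free = next((j for j in range(hole.stop, len(table))
                         if table[j] is None), None)
            if free is None:
                table.append(table[i])  # extend the table
            else:
                table[free] = table[i]
            table[i] = None
        # (A real implementation would wait here until no transaction
        # could still see the old tuple versions.)
        # Step 2: pull the lowest remaining keys, in index order,
        # into the just-freed hole.
        remaining = sorted((k, j) for j, k in enumerate(table)
                           if j >= hole.stop and k is not None)
        for i, (k, j) in zip(hole, remaining):
            table[i] = k
            table[j] = None
        done = hole.stop
    return [k for k in table if k is not None]
```

Running it on a shuffled list, e.g. `progressive_cluster([5, 3, 1, 4, 2])`, yields the keys in sorted order, and along the way the free region never exceeds roughly one chunk plus whatever the table grew by.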

So ... is this crazy? Concurrently clustering the table by moving each record *twice*, in batches, with pauses to allow old versions to cease being visible to any live transaction? Or could it actually work?

--
Craig Ringer

