Re: clustering without locking

Tom Lane wrote:
> Craig Ringer <craig@xxxxxxxxxxxxxxxxxxxxx> writes:
>> Later on, though, less new space would have to be allocated because
>> more and more of the space allocated earlier to hold moved tuples
>> would be freed up in useful chunks that could be reused.
>
> I don't see how that works.  If the minimum size of the table is X
> pages, ISTM that the first pass has to push everything up to pages above
> X.

What I was suggesting was essentially to cluster the table in small parts. Rather than two huge passes (move all tuples to free or newly allocated space at the end of the table; then move them back into the old locations in order), it'd be done as a series of smaller moves.

After moving a chunk out of the way into free/new space at the end of the table, data from later in the table would be copied into the freed space. The space that copy vacated could then be re-used to hold the next chunk that needs to be moved out of the way. (A toy simulation of this loop follows the walkthrough below.)

I'm just trying to understand if it can actually work. Sorry if my attempted explanations are unclear; I'm probably doing a terrible job, and it's probably actually a stupid idea anyway (if nothing else it might just be too slow). Nonetheless I'm curious. Maybe I can explain another way.

Visually:

`0' to `9' : tuples; the desired final cluster order is by face value.
`.' : Dead/obsoleted tuple not yet marked reusable by VACUUM
` ' : free space

Initial state:

---------
584736120
---------

Begin a transaction and move the first chunk out of the way to free its space (2 tuples in this case, but obviously many more in a real case):

-----------
..473612058
-----------

Use that freed space to store the first ordered tuples:

-----------
014736.2.58
-----------

Commit, and once the last transaction to which the "." tuples above are still visible has completed, mark them as free with VACUUM or similar:

-----------
014736 2 58
-----------

Repeat, but now, rather than extending the table, use the freed space to store the next set of tuples that must be moved out of the way:

-----------
01..3642758
-----------
-----------
0123.64.758
-----------
-----------
0123 64 758
-----------

During the next pass, someone inserts `9' after tuples have been moved out to make a hole and others have been moved into the hole, but before the old locations of the moved tuples have been marked as free:

-----------
0123 .46758
-----------
-----------
012345.67.8
-----------
------------
012345.67.89          <- INSERT 9
------------
------------
012345 67 89
------------

You'd never land up with this sort of convenient ordering halfway through in a real case with realistic numbers of tuples, so it'd keep going, doing small chunks each time, until the whole table had been processed.

So, the table does grow, and its final state does contain dead space at the end, but not *too* much of it:

------------
0123456789
------------
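
To make the shuffling concrete, here's a toy, single-threaded Python model of the passes above. Everything in it (the slot list, the free_slot() helper, the CHUNK size) is invented for illustration; it isn't how PostgreSQL manages pages, and the wait-for-visibility step is compressed into an immediate free:

CHUNK = 2  # tuples moved per pass; a real pass would cover many pages

def free_slot(table, start):
    # Find a reusable slot at or after `start`, extending the table
    # only if no freed space is available yet.
    for i in range(start, len(table)):
        if table[i] is None:
            return i
    table.append(None)
    return len(table) - 1

def cluster_in_chunks(table):
    order = sorted(t for t in table if t is not None)
    for pos in range(0, len(order), CHUNK):
        want = order[pos:pos + CHUNK]
        # Pass 1: evict whatever occupies the target slots into free
        # space past the chunk, extending only when none is free.
        for off, key in enumerate(want):
            occupant = table[pos + off]
            if occupant is not None and occupant != key:
                table[pos + off] = None
                table[free_slot(table, pos + len(want))] = occupant
        # Pass 2: move the wanted tuples into the freed slots. Their
        # old slots become scratch space for the next pass (after the
        # VACUUM-and-visibility wait, in the real proposal).
        for off, key in enumerate(want):
            if table[pos + off] != key:
                src = table.index(key)
                table[src] = None
                table[pos + off] = key
        print(''.join('.' if t is None else str(t) for t in table))

table = [5, 8, 4, 7, 3, 6, 1, 2, 0]  # the walkthrough's initial state
cluster_in_chunks(table)

Run against the walkthrough's starting order, it prints 014736.2.58, 0123.64.758 and 012345.67.8 as intermediate states, and ends with the nine tuples in order plus roughly one chunk of free space at the end of the table.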



If "low" values are inserted late in the progressive cluster they can just stay at the end of the table. They'll get moved if the user runs a progressive cluster operation again later. However, since you're ideally doing this with a non-100% fillfactor, they should land up in roughly the right place on initial insert rather than at the end of the table, avoiding the issue (and helping avoid having newly inserted tuples prevent table truncation by vacuum when the progressive clustering finishes).

Say a table containing the range 1-9 was being clustered with a 50% fillfactor and was about halfway through the process:

-------------
 1 2 3 47986
-------------

and someone inserts `0', it should ideally land up in the right place anyway (a bit of reading suggests that insert tries to respect cluster order):

-------------
01 2 3 47986
-------------
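
As a toy Python model of that placement behaviour (insert_clustered and the outward probe from the ideal slot are my inventions for illustration, not PostgreSQL's actual free space map logic):

def insert_clustered(table, key):
    # Ideal slot: just before the first larger live key.
    ideal = next((i for i, t in enumerate(table)
                  if t is not None and t > key), len(table))
    # Probe outward from there for free space left by the fillfactor,
    # extending the table only if every slot is full.
    for dist in range(len(table)):
        for slot in (ideal - dist, ideal + dist):
            if 0 <= slot < len(table) and table[slot] is None:
                table[slot] = key
                return slot
    table.append(key)
    return len(table) - 1

# " 1 2 3 47986" from the 50% fillfactor example above:
table = [None, 1, None, 2, None, 3, None, 4, 7, 9, 8, 6]
insert_clustered(table, 0)  # `0' lands in slot 0
print(''.join(' ' if t is None else str(t) for t in table))  # 01 2 3 47986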


If you're re-clustering a table with a fillfactor set, you may not need to extend it at all, because you can use already-allocated free space to store tuples temporarily while moving them around.

> You can't put any temporary copies in pages <= X because you might
> need that space when it comes time to make the clustering happen.  So
> the table is going to bloat to (at least) 2X pages.

Yes, if you perform the whole process in a single huge chunk. What I was hoping was that it was possible to do it in a series of smaller operations instead, avoiding the need for such a huge amount of bloat at the end of the table.

Say you're clustering in 10 steps. You need X*0.1 worth of scratch space, created by extending the table if there isn't already appropriate free space late in the table. Assuming you can reclaim all the space you've used for temporary copies of tuples after each pass, your ideal final table size is X*1.1 + (space used by new inserts), of which X*0.1 is free space at the end of the table. That free space probably has scattered rows from concurrent inserts, though if you're using a non-100% fillfactor it might not.
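
Worked through in Python with a made-up page count (the 100,000 figure is purely illustrative):

X = 100_000            # pages in the table being clustered
steps = 10
scratch = X // steps   # extra pages needed as scratch space: X*0.1
peak = X + scratch     # table size during and after the process: X*1.1
print(scratch, peak)   # 10000 110000
print(2 * X)           # 200000: the bloat from a single huge pass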

It seems like it should work, *if* the space used for temporary storage can be efficiently reclaimed for reuse (or for new inserts) after each pass. Even if it can work, though, I don't know whether it'd actually be useful given the effect on indexes and how slow it might land up being.

--
Craig Ringer

