I have a fairly large table that keeps track of data measured by our system. The data is in the form of BLOBs and is only queried in order of timestamp. Because we measure from multiple devices, the data does not necessarily arrive in the database in correct temporal order; in fact, it is entirely possible for corrections to some BLOBs to be entered after measurement. As a result there is considerable churn in the table, but only at the very end, in the last hour or so of data. Older data is normally left alone.

I have noticed that clustering the table improves my query speed considerably, so I would like to keep the table clustered (on the index of the timestamp column, of course). However, I'm wondering what the best approach is, and would appreciate ideas from this list.

So far I have experimented with running some form of clustering every 24 hours (rough SQL sketches of everything described below are in the P.S.). I have tried the following approaches:

* If I use the CLUSTER command, the table is locked for a long time (half an hour or more on some data sets). This is unacceptable, because it also stops all reads from the table, and some of our output processing relies on being able to get at the data in a timely fashion.

* If I instead copy the entire table, as per the instructions in the CLUSTER manual page, the table still appears to be locked: at least, I still cannot read from it during the operation.

I don't really mind the clustering taking a long time, as long as I can still read from the table (I can hold up writes for a long time, but reads are problematic). Is there some way to cluster without blocking reads?

Alternatively, is there some way to do partial clustering? Since 99% of my data set is already properly clustered, and only the last 24 hours or so are not, clustering just that last bit (which is trivial by comparison) would already help a great deal. However, I'm not sure how to achieve this.

I have also considered a more complex scheme that uses two tables to represent the data: one for long-term storage, which won't churn much, and one for short-term storage, which takes all the churn. I could then move data from the short-term table to the long-term table every 24 hours, and use a view to keep the whole visible as a single unit. This does appear to be an overly complex solution to the problem, though.

Any better ideas?

Thanks,

Hans Guijt
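
P.S. For concreteness, here are rough sketches of the approaches above. All names are made up for illustration: a table "measurements" with a timestamp column "ts" and a bytea column "data" holding the BLOBs, plus an index "measurements_ts_idx" on "ts". The plain CLUSTER run is just this (the exact syntax depends on the PostgreSQL version):

    -- older releases:
    CLUSTER measurements_ts_idx ON measurements;
    -- newer releases also accept:
    CLUSTER measurements USING measurements_ts_idx;

Either form holds an exclusive lock on the table for the duration of the rewrite, which is what blocks my readers.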
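
The manual-copy alternative from the CLUSTER documentation is, in outline:

    BEGIN;
    -- rewrite the whole table in timestamp order
    CREATE TABLE measurements_sorted AS
        SELECT * FROM measurements ORDER BY ts;
    -- swap the sorted copy in; DROP takes an exclusive lock
    DROP TABLE measurements;
    ALTER TABLE measurements_sorted RENAME TO measurements;
    -- indexes must be recreated on the new table
    CREATE INDEX measurements_ts_idx ON measurements (ts);
    COMMIT;

I had expected readers to be blocked only during the swap at the end, not during the copy itself, but that is not what I observed.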
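
The partial clustering I have in mind would look something like the sketch below. My doubt is whether the re-inserted rows are guaranteed to end up physically contiguous at the end of the heap, or whether they just land in whatever free space is available:

    BEGIN;
    -- note: now() is fixed for the whole transaction, so both
    -- WHERE clauses below see exactly the same cutoff
    CREATE TEMP TABLE tail AS
        SELECT * FROM measurements
        WHERE ts > now() - interval '24 hours';
    -- remove the churned tail from the main table ...
    DELETE FROM measurements
        WHERE ts > now() - interval '24 hours';
    -- ... and re-insert it in timestamp order
    INSERT INTO measurements
        SELECT * FROM tail ORDER BY ts;
    COMMIT;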
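
Finally, the two-table scheme would be roughly this (again with made-up names, and assuming bytea is an acceptable stand-in for the BLOB column):

    -- long-term table, same layout as the live one
    CREATE TABLE measurements_archive (
        ts   timestamptz NOT NULL,
        data bytea NOT NULL
    );
    CREATE INDEX measurements_archive_ts_idx
        ON measurements_archive (ts);

    -- unified view over settled and churning data
    CREATE VIEW measurements_all AS
        SELECT ts, data FROM measurements_archive
        UNION ALL
        SELECT ts, data FROM measurements;

    -- every 24 hours, move settled rows into the archive in order
    BEGIN;
    INSERT INTO measurements_archive
        SELECT ts, data FROM measurements
        WHERE ts < now() - interval '24 hours'
        ORDER BY ts;
    DELETE FROM measurements
        WHERE ts < now() - interval '24 hours';
    COMMIT;

It works on paper, but it is a lot of machinery for what feels like a simple problem.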