Re: cluster index on a table

Scott Carey <scott@xxxxxxxxxxxxxxxxx> · Thu, 16 Jul 2009 20:07:43 -0700

On 7/16/09 5:27 PM, "Greg Stark" <gsstark@xxxxxxx> wrote:

> On Fri, Jul 17, 2009 at 1:02 AM, Scott Carey<scott@xxxxxxxxxxxxxxxxx> wrote:
>> Indexes would point to a heap page for normal tables and clustered index
>> pages for clustered tables.  When new versions of data come in, it may point
>> to new clustered index pages, just like they currently get modified to point
>> to new heap pages with new data.  A split just means more data is modified.
>> And if multiple versions are still 'live' they point to multiple versions --
>> just as they now have to with the ordinary heap pages.  Page splitting only
>> means that there might be some copies of row versions with the same tid due
>> to a split, but labeling a page with a 'creation' tid to differentiate the
>> pre and post split pages removes the ambiguity.  I suppose that would
>> require some page locking.
> 
> 
> None of this makes much sense to me.
> 
> If you keep the old tuples around when you split a full page where are
> you going to put the new tuples you're inserting?

The new page?
Simplest solution - split into two or more new pages and copy.  Now there
are two copies of all the old stuff, and one copy of the new tuples (in the
new pages).
One could optimize it to only create one new page and copy half the tuples,
depending on where the inserted ones would go, but its probably not a big
win and makes it more complicated.

In the current scheme, when an update occurs and the HOT feature is not
applicable, there is a copy-on-write for that tuple, and AFAICS all the
indexes must point to both those tuples as long as the previous one is still
visible -- so there are two or more index references for the same tuple at
different versions.

In the case I am describing, all the tuples in the page that is split would
now have multiple entries in all indexes (which isn't a cheap thing to do).
An old transaction can see all the index references, but either the tuples
or page could be marked to track visibility and remove the ambiguity.  If
the transaction id/number ('spit id'?) that caused the split and created the
new pages is stored in the new pages, all ambiguity can be resolved for a
unique index.  With a non-unique index there would have to be some tuple
level visibility check or store more information in the index (how does it
work now?).

> 
> Also, are you clear what a tid is? It's the block number plus the
> index to the tuple entry. If you split the page and move half the
> tuples to the new page they'll all have different tids.
> 

I was thinking tid = transaction id based on the context.  A version number
of sorts -- the one that each tuple is marked with for visibility.  But its
essentially the pointer to the tuple.  My mistake.

All tuples in a split page will likely have to be copied, and it will have
to be such that both copies exist until it can be sure the old copy is no
longer visible.  Its essentially the same as the copy on write now but
affects more than the tuple that is updated -- it affects all in the page
when there is a split.
For the actual index with the embedded tuples, that's not so bad since all
the references to these are co-located.  For other indexes, its an expensive
update.  

But, that is what FILLFACTOR and the like are for -- Making sure splits
occur less often, and perhaps even reserving some empty pages in the index
at creation time for future splits.  Maintenance and bloat is the tradeoff
with these things, but the performance payoff for many workloads is huge.

Am I missing something else?  Is it not possible to move tuples that aren't
being updated without something like an exclusive table lock?

I'm definitely not saying that the above would be easy, I just don't yet see
how it is incompatible with Postgres' MVCC.  Hard?  Probably.

> --
> greg
> http://mit.edu/~gsstark/resume.pdf
> 

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance