On 7/16/09 5:27 PM, "Greg Stark" <gsstark@xxxxxxx> wrote: > On Fri, Jul 17, 2009 at 1:02 AM, Scott Carey<scott@xxxxxxxxxxxxxxxxx> wrote: >> Indexes would point to a heap page for normal tables and clustered index >> pages for clustered tables. When new versions of data come in, it may point >> to new clustered index pages, just like they currently get modified to point >> to new heap pages with new data. A split just means more data is modified. >> And if multiple versions are still 'live' they point to multiple versions -- >> just as they now have to with the ordinary heap pages. Page splitting only >> means that there might be some copies of row versions with the same tid due >> to a split, but labeling a page with a 'creation' tid to differentiate the >> pre and post split pages removes the ambiguity. I suppose that would >> require some page locking. > > > None of this makes much sense to me. > > If you keep the old tuples around when you split a full page where are > you going to put the new tuples you're inserting? The new page? Simplest solution - split into two or more new pages and copy. Now there are two copies of all the old stuff, and one copy of the new tuples (in the new pages). One could optimize it to only create one new page and copy half the tuples, depending on where the inserted ones would go, but its probably not a big win and makes it more complicated. In the current scheme, when an update occurs and the HOT feature is not applicable, there is a copy-on-write for that tuple, and AFAICS all the indexes must point to both those tuples as long as the previous one is still visible -- so there are two or more index references for the same tuple at different versions. In the case I am describing, all the tuples in the page that is split would now have multiple entries in all indexes (which isn't a cheap thing to do). An old transaction can see all the index references, but either the tuples or page could be marked to track visibility and remove the ambiguity. If the transaction id/number ('spit id'?) that caused the split and created the new pages is stored in the new pages, all ambiguity can be resolved for a unique index. With a non-unique index there would have to be some tuple level visibility check or store more information in the index (how does it work now?). > > Also, are you clear what a tid is? It's the block number plus the > index to the tuple entry. If you split the page and move half the > tuples to the new page they'll all have different tids. > I was thinking tid = transaction id based on the context. A version number of sorts -- the one that each tuple is marked with for visibility. But its essentially the pointer to the tuple. My mistake. All tuples in a split page will likely have to be copied, and it will have to be such that both copies exist until it can be sure the old copy is no longer visible. Its essentially the same as the copy on write now but affects more than the tuple that is updated -- it affects all in the page when there is a split. For the actual index with the embedded tuples, that's not so bad since all the references to these are co-located. For other indexes, its an expensive update. But, that is what FILLFACTOR and the like are for -- Making sure splits occur less often, and perhaps even reserving some empty pages in the index at creation time for future splits. Maintenance and bloat is the tradeoff with these things, but the performance payoff for many workloads is huge. Am I missing something else? Is it not possible to move tuples that aren't being updated without something like an exclusive table lock? I'm definitely not saying that the above would be easy, I just don't yet see how it is incompatible with Postgres' MVCC. Hard? Probably. > -- > greg > http://mit.edu/~gsstark/resume.pdf > -- Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance