On Wed, Jul 13, 2011 at 10:52 AM, Merlin Moncure <mmoncure@xxxxxxxxx> wrote:
> On Tue, Jul 12, 2011 at 6:15 PM, lars <lhofhansl@xxxxxxxxx> wrote:
>> Back to the first case, here's an strace from the backend doing the select
>> right after the updates.
>> write(13,
>> "f\320\1\0\1\0\0\0\273\0\0\0\0\340\27\22`\32\0\00000002833!000"..., 2400256)
>> = 2400256
>
> On Wed, Jul 13, 2011 at 9:46 AM, Kevin Grittner
> <Kevin.Grittner@xxxxxxxxxxxx> wrote:
>> Code comments indicate that they expect the pruning to be a pretty
>> clear win on multiple reads, although I don't know how much that was
>> benchmarked. Jeff does raise a good point, though -- it seems odd
>> that WAL-logging of this pruning would need to be synchronous. We
>> support asynchronous commits -- why not use that feature
>
> Right -- here are my thoughts. notice the above is writing out 293
> pages. this is suggesting to me that Kevin is right and you've
> identified a pattern where you are aggravating the page cleanup
> facilities of HOT. What threw me off here (and perhaps bears some
> additional investigation) is that early on in the report you were
> claiming an update to an indexed field which effectively disables HOT.

There are a couple of other (very important) things that HOT does, though
they are probably not advertised much. Even for non-HOT updates (which
means either indexed columns were changed or the page ran out of free
space) and for deletes, HOT prunes the dead tuples and marks their line
pointers as DEAD. The page is defragmented and the dead space is
recovered. Each such dead tuple then consumes only four bytes (its line
pointer) in the page until vacuum removes the dead line pointers. That's
why the OP is seeing this behavior even though indexed columns are being
updated.

We made a few adjustments to ensure that a page is not pruned too early.
We track the oldest XID that did any update/delete to the page and attempt
pruning only when RecentXmin is past that XID. We also mark the page as
"full" if a previous update could not find enough free space to do an
in-block update, and we use that hint to decide whether to attempt to
prune the page. Finally, we prune only if we can get the cleanup lock
without blocking.

What might be worth looking at is this condition in pruneheap.c:

    /*
     * We prune when a previous UPDATE failed to find enough space on the page
     * for a new tuple version, or when free space falls below the relation's
     * fill-factor target (but not less than 10%).
     *
     * Checking free space here is questionable since we aren't holding any
     * lock on the buffer; in the worst case we could get a bogus answer. It's
     * unlikely to be *seriously* wrong, though, since reading either pd_lower
     * or pd_upper is probably atomic.  Avoiding taking a lock seems more
     * important than sometimes getting a wrong answer in what is after all
     * just a heuristic estimate.
     */
    minfree = RelationGetTargetPageFreeSpace(relation,
                                             HEAP_DEFAULT_FILLFACTOR);
    minfree = Max(minfree, BLCKSZ / 10);

    if (PageIsFull(page) || PageGetHeapFreeSpace(page) < minfree)
    {

So if the free space in a page falls below the fill-factor target or 10%
of the block size, we try to prune the page. We probably need to revisit
this area and see whether HOT can be tuned even better. One option could
be to estimate how much space we are going to free and carry out the
operation only if it's significant enough to justify the cost. I know we
did quite a bit of benchmarking during HOT development, but the tuning may
still not be perfect for all kinds of workloads, and it probably never
will be.
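For illustration, here is a minimal sketch of that idea against the
snippet above. This is purely hypothetical: estimate_reclaimable_space()
does not exist in the tree, and the 5% threshold is an arbitrary number
picked for the example.

    /*
     * Hypothetical variant: estimate how much space pruning would actually
     * reclaim, and skip the (WAL-logged) prune unless the gain clears some
     * threshold, here an arbitrary 5% of the block size.
     */
    Size        reclaimable = estimate_reclaimable_space(page);  /* hypothetical helper */

    if (PageIsFull(page) ||
        (PageGetHeapFreeSpace(page) < minfree &&
         reclaimable >= BLCKSZ / 20))
    {
        /* proceed with pruning as we do today */
    }

The hard part would be making such an estimate cheap enough that it
doesn't cost more than the pruning it occasionally avoids.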
Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com