On Fri, Jul 7, 2017 at 4:41 PM, Justin Pryzby <pryzby@xxxxxxxxxxxxx> wrote:
> The second change averages separate correlation values of small lists
> of 300 consecutive TIDs, rather than the coarse-granularity/aggregate
> correlation of a small sample of pages across the entire index.
> Postgres' existing sampling is designed to give an even sample across
> all rows. An issue with this coarse-granularity correlation is that
> the index can have a broad correlation (to physical heap location),
> but with many small-scale deviations, which don't show up due to
> sampling a relatively small fraction of a large table; and/or the
> effect of the deviations is insignificant/noise and correlation is
> still computed near 1.
>
> I believe the "large scale" correlation that postgres computes from
> the block sample fails to represent well the small-scale uncorrelated
> reads, which contribute a large number of random reads that are not
> included in the planner's cost.

All of that may well be true, but I've actually come around to the view
that we should treat TID as a first-class part of the keyspace, one that
participates in comparisons as an implicit last column in all cases (not
just when B-Trees are built within tuplesort.c). That would probably
obviate the need for a patch like this entirely, because
pg_stats.correlation would be unaffected by duplicates. (A toy sketch of
this comparison rule appears below.)

I think that this would have a number of advantages, some of which might
be significant. For example, I think it could make LP_DEAD cleanup
within nbtree more effective for some workloads, especially workloads
where it is important for HOT pruning and LP_DEAD cleanup to work in
concert -- locality of access probably matters there. Also, with every
entry in the index guaranteed to be unique, we can imagine VACUUM being
much more selective about killing index entries when the TID array it
uses happens to be very small. With the freeze map stuff that's now in
place, being able to do that matters more than before.

The original Lehman & Yao algorithm is supposed to have unique keys in
all cases, but we don't follow that, in order to make duplicates work;
that's implemented by changing an invariant (see the nbtree README for
details). So, this could simplify some aspects of how binary searches
must work in the face of non-unique keys. I think that there are cases
where many non-HOT UPDATEs have to go through a bunch of duplicate leaf
tuples and do visibility checking on old versions for no real benefit.
With the TID as part of the keyspace, we could instead tell the B-Tree
code to insert a new tuple as part of an UPDATE while using the TID as
part of its insertion scan key, so rummaging through many duplicates is
avoided.

That having been said, it would be hard to do this, for all the reasons
I went into in the thread you referenced [1]. If you're going to treat
TID as part of the keyspace, you have to totally embrace that, which
means that the internal pages need to have heap TIDs too (not just the
pointers to the lower levels of the B-Tree that the IndexTuple's t_tid
field is used for there). Internal pages are where you'd need to
literally append this new, implicit heap TID column as if it were just
another user-visible attribute, since that information isn't stored in
internal pages at all today. Doing that has a cost, which isn't going to
be acceptable if we naively append a heap TID to every internal page
IndexTuple: with a type like int4, you're going to completely bloat
those pages, with big consequences for fan-in.
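To make the "implicit last column" idea concrete, here is a toy sketch
in standalone C (made-up names throughout; this is not nbtree code,
where _bt_compare() is the real entry point). The point is just the
tie-break rule: user key columns compare first, and the heap TID decides
only when they are all equal, so no two index entries ever compare as
equal:

/*
 * Toy model of "TID as an implicit last key column" -- not nbtree code.
 * User key columns compare first; only when they are all equal does the
 * heap TID (block number, then offset) break the tie, giving every
 * index entry a unique position in the keyspace.
 */
#include <stdint.h>
#include <stdio.h>

typedef struct ToyTid
{
    uint32_t    block;      /* heap block number */
    uint16_t    offset;     /* line pointer offset within block */
} ToyTid;

typedef struct ToyIndexTuple
{
    int32_t     key;        /* single int4 key column, for simplicity */
    ToyTid      tid;        /* heap TID, the implicit last column */
} ToyIndexTuple;

static int
toy_tid_compare(const ToyTid *a, const ToyTid *b)
{
    if (a->block != b->block)
        return (a->block < b->block) ? -1 : 1;
    if (a->offset != b->offset)
        return (a->offset < b->offset) ? -1 : 1;
    return 0;
}

/* Compare as (key, tid): the TID breaks ties among duplicate keys */
static int
toy_index_tuple_compare(const ToyIndexTuple *a, const ToyIndexTuple *b)
{
    if (a->key != b->key)
        return (a->key < b->key) ? -1 : 1;
    return toy_tid_compare(&a->tid, &b->tid);
}

int
main(void)
{
    ToyIndexTuple x = {42, {10, 3}};
    ToyIndexTuple y = {42, {57, 1}};

    /* Duplicate keys, distinct TIDs => still a strict total order */
    printf("%d\n", toy_index_tuple_compare(&x, &y));    /* prints -1 */
    return 0;
}

Under that rule, an UPDATE's insertion scan key could carry the new
tuple's TID and descend directly to the right leaf position, which is
how it would avoid rummaging through the duplicates.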
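And to see how suffix truncation would contain the bloat just described,
here is a second toy sketch (again standalone C with made-up names, not
anything from the tree code). When a page splits, the separator key
copied into the parent only needs enough leading attributes to
distinguish the last tuple on the left page from the first tuple on the
right page; attributes past the first one that differs -- including the
implicit heap TID column -- can simply be truncated away:

/*
 * Sketch of suffix truncation for separator keys -- not Postgres code.
 * A separator keeps only the leading attributes needed to distinguish
 * the last tuple on the left half of a split from the first tuple on
 * the right half; everything after the first differing attribute is
 * truncated away, keeping internal pages small.
 */
#include <stdio.h>

#define NKEYATTS 3  /* user key columns; the heap TID would be an implicit 4th */

/*
 * Return how many leading attributes a separator must keep: everything
 * up to and including the first attribute where the two tuples differ.
 */
static int
separator_natts(const int lastleft[], const int firstright[], int natts)
{
    int     keep;

    for (keep = 1; keep <= natts; keep++)
    {
        if (lastleft[keep - 1] != firstright[keep - 1])
            return keep;    /* this attribute already distinguishes them */
    }

    /*
     * All user attributes equal: only then would the separator need the
     * implicit heap TID column appended to stay unique.
     */
    return natts + 1;
}

int
main(void)
{
    int lastleft[NKEYATTS]   = {7, 100, 555};
    int firstright[NKEYATTS] = {7, 200, 111};

    /* They differ on the second attribute: the third (and the TID) truncate away */
    printf("keep %d attribute(s)\n",
           separator_natts(lastleft, firstright, NKEYATTS));   /* keep 2 */
    return 0;
}

Only splits among "would-be duplicates", where every user attribute is
equal, would ever pay for a separator widened by a heap TID.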
So, really, what needs to happen to make this work is that someone needs
to write a suffix truncation patch, which implies that only those cases
that actually benefit from increasing the width of internal page
IndexTuples (those with many "would-be duplicates") pay the cost. This
is a classic technique that I've actually already prototyped, though the
prototype is extremely far from being worth posting here; it was just to
verify my understanding.

I think that I should write and put up for discussion a design document
for various nbtree enhancements: internal page suffix truncation, prefix
compression, and key normalization. I'm probably not going to take any
of these projects on, but it would be useful if there was at least a
little bit of clarity about how they could be implemented. Maybe we
could reach some kind of weak consensus without going to great lengths.
These optimizations are closely intertwined, and the lack of clarity
about how they all fit together is probably holding back an
implementation of any one of them.

[1] https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com#20160524173914.GA11880@xxxxxxxxxxxxx

--
Peter Geoghegan