2011/5/19 Jim Nasby <jim@xxxxxxxxx>:
> On May 19, 2011, at 9:53 AM, Robert Haas wrote:
>> On Wed, May 18, 2011 at 11:00 PM, Greg Smith <greg@xxxxxxxxxxxxxxx> wrote:
>>> Jim Nasby wrote:
>>>> I think the challenge there would be how to define the scope of the
>>>> hot-spot. Is it the last X pages? The last X serial values? Something
>>>> like correlation?
>>>>
>>>> Hmm... it would be interesting if we had average relation access times
>>>> for each stats bucket on a per-column basis; that would give the planner
>>>> a better idea of how much I/O overhead there would be for a given WHERE
>>>> clause.
>>>
>>> You've already given one reasonable first answer to your question here.
>>> If you defined a usage counter for each histogram bucket, and incremented
>>> it each time something from that bucket was touched, that could lead to a
>>> very rough way to determine access distribution. Compute a ratio of the
>>> counts in those buckets, then take an estimate of the total cached
>>> percentage; multiplying the two gives you an idea how much of that
>>> specific bucket might be in memory. It's not perfect, and you'd need to
>>> incorporate some sort of aging method (probably weighted-average based),
>>> but the basic idea could work.
>>
>> Maybe I'm missing something here, but it seems like that would be
>> nightmarishly slow. Every time you read a tuple, you'd have to look
>> at every column of the tuple and determine which histogram bucket it
>> was in (or, presumably, which MCV it is, since those aren't included
>> in working out the histogram buckets). That seems like it would slow
>> down a sequential scan by at least 10x.
>
> You definitely couldn't do it in real time. But you might be able to copy
> the tuple somewhere and have a background process do the analysis.
>
> That said, it might be more productive to know which blocks are available
> in memory and use correlation to guesstimate whether a particular query
> will need hot or cold blocks. Or perhaps we create a different structure
> that lets you track the distribution of each column linearly through the
> table; something more sophisticated than just using correlation... perhaps
> something like indicating which stats bucket was most prevalent in each
> block/range of blocks in a table. That information would allow you to
> estimate exactly which blocks in the table you're likely to need.

Those are very good ideas, and I would keep them in mind for
vacuum/checkpoint tasks: if you are able to tell hot data from cold data,
you can order it within the segments of the relation. But making this work
at the planner level looks hard. I am not opposed to the idea, but I have no
idea how to do it right now.

> --
> Jim C. Nasby, Database Architect               jim@xxxxxxxxx
> 512.569.9461 (cell)                            http://jim.nasby.net

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/ ; PostgreSQL : Expertise, Formation et Support

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
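
As a rough illustration of the per-bucket caching estimate Greg sketches in
the quoted text (a decayed usage counter per histogram bucket, combined with
an overall cached fraction for the relation), the arithmetic could look
something like the sketch below. This is only a back-of-envelope sketch
under those assumptions; the names (RelCacheStats, bucket_hits,
cached_fraction, and so on) are hypothetical and nothing like them exists in
the current statistics code.

    /*
     * Rough sketch of the estimate discussed above: keep a decayed usage
     * counter per histogram bucket, then take each bucket's share of the
     * recent accesses and multiply it by the relation's overall cached
     * fraction to get a crude "how much of this bucket is in memory" score.
     * All structures and names are hypothetical.
     */
    #include <stdio.h>

    #define NBUCKETS 10
    #define DECAY    0.75          /* weighted-average aging per stats cycle */

    typedef struct
    {
        double bucket_hits[NBUCKETS];   /* decayed access counts per bucket */
        double cached_fraction;         /* estimated fraction of rel in memory */
    } RelCacheStats;

    /* Age the existing counters and fold in hits seen since the last cycle. */
    static void
    update_bucket_hits(RelCacheStats *stats, const double *new_hits)
    {
        for (int i = 0; i < NBUCKETS; i++)
            stats->bucket_hits[i] = DECAY * stats->bucket_hits[i] + new_hits[i];
    }

    /* Crude score for how much of one bucket's data might be resident. */
    static double
    bucket_cached_estimate(const RelCacheStats *stats, int bucket)
    {
        double total = 0.0;

        for (int i = 0; i < NBUCKETS; i++)
            total += stats->bucket_hits[i];

        if (total <= 0.0)
            return stats->cached_fraction;  /* no signal: assume the average */

        /* share of recent accesses for this bucket, times overall residency */
        return (stats->bucket_hits[bucket] / total) * stats->cached_fraction;
    }

    int
    main(void)
    {
        RelCacheStats stats = { {0}, 0.40 };   /* assume 40% of the rel cached */
        double hits[NBUCKETS] = { 50, 20, 10, 5, 5, 3, 3, 2, 1, 1 };

        update_bucket_hits(&stats, hits);

        /* a heavily-hit bucket scores much higher than a rarely-hit one */
        printf("bucket 0 score: %.3f\n", bucket_cached_estimate(&stats, 0));
        printf("bucket 9 score: %.3f\n", bucket_cached_estimate(&stats, 9));
        return 0;
    }

Since equi-depth histogram buckets each cover roughly 1/NBUCKETS of the
rows, a planner heuristic might further scale this score by NBUCKETS and
clamp it to 1.0 to read it as a per-bucket residency fraction; the DECAY
factor is just one possible way to implement the weighted-average aging Greg
mentions.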