An approach that works can be found in DB2, and likely elsewhere. The key is that
tablespaces, tables, indexes, and buffers are all attached through the bufferpool
(the DB2 term). A tablespace/bufferpool pairing is defined; tables and indexes are
then assigned to the tablespace (and, implicitly, to the bufferpool). As a result,
one can effectively pin data in memory. This is very useful, but not low-hanging
fruit to implement.

The introduction of rudimentary tablespaces is a first step. I assumed that the
point was to get to a DB2-like structure at some point. Yes?

Robert

---- Original message ----
>Date: Mon, 11 Oct 2010 22:59:28 -0400
>From: pgsql-performance-owner@xxxxxxxxxxxxxx (on behalf of Robert Haas <robertmhaas@xxxxxxxxx>)
>Subject: Re: How does PG know if data is in memory?
>To: Jeremy Harris <jgh@xxxxxxxxxxx>
>Cc: pgsql-performance@xxxxxxxxxxxxxx
>
>On Mon, Oct 4, 2010 at 6:47 PM, Jeremy Harris <jgh@xxxxxxxxxxx> wrote:
>> On 10/04/2010 04:22 AM, Greg Smith wrote:
>>>
>>> I had a brain-storming session on this subject with a few of the hackers
>>> in the community in this area a while back, which I haven't had a chance to
>>> do something with yet (it exists only as a pile of scribbled notes so far).
>>> There's a couple of ways to collect data on what's in the database and OS
>>> cache, and a couple of ways to then expose that data to the optimizer. But
>>> that needs to be done very carefully, almost certainly as only a manual
>>> process at first, because something that's producing cache feedback all of
>>> the time will cause plans to change all the time, too. Where I suspect this
>>> is going is that we may end up tracking various statistics over time, then
>>> periodically providing a way to export a mass of "typical % cached" data
>>> back to the optimizer for use in plan cost estimation purposes.
>>> But the idea of monitoring continuously and always planning based on the
>>> most recent data available has some stability issues, both from a "too many
>>> unpredictable plan changes" and a "bad short-term feedback loop"
>>> perspective, as mentioned by Tom and Kevin already.
>>
>> Why not monitor the distribution of response times, rather than "cached"
>> vs. not?
>>
>> That a) avoids the issue of discovering what was a cache hit, b) deals
>> neatly with multilevel caching, c) feeds directly into cost estimation.
>
>I was hot on doing better cache modeling a year or two ago, but the
>elephant in the room is that it's unclear that it solves any real-world
>problem. The OP is clearly having a problem, but there's not enough
>information in his post to say what is actually causing it, and it's
>probably not caching effects. We get occasional complaints of the form
>"the first time I run this query it's slow, and then after that it's
>fast" but, as Craig Ringer pointed out upthread, not too many. And even
>with respect to the complaints we do get, it's far from clear that the
>cure is any better than the disease. Taking caching effects into account
>could easily result in the first execution being slightly less slow and
>all of the subsequent executions being moderately slow. That would not
>be an improvement for most people. The reports that seem really painful
>to me are the ones where people with really big machines complain of
>needing HOURS for the cache to warm up, and having the system bogged
>down to a standstill until then. But changing the cost model isn't going
>to help them either.
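[As an aside, Jeremy's response-time idea can be made concrete with a toy
sketch. This is not anything in PostgreSQL -- all names and numbers below are
invented for illustration. It tracks a smoothed per-relation fetch time and
maps it onto a warm/cold page-cost scale, so a fully cached relation drifts
toward the cheap cost without ever asking "was that a cache hit?". A small
smoothing factor also damps the feedback loop Greg worries about.]

```python
# Toy sketch (invented names, not PostgreSQL code) of costing from the
# observed response-time distribution instead of from cache-hit detection.
from collections import defaultdict

class ResponseTimeModel:
    """Tracks an exponential moving average of per-page fetch times."""
    def __init__(self, alpha=0.1, default_ms=8.0):
        self.alpha = alpha              # small alpha = heavy damping,
                                        # i.e. fewer sudden plan changes
        self.avg_ms = defaultdict(lambda: default_ms)  # assume cold until seen

    def observe(self, relation, fetch_ms):
        # Blend each new observation into the running average.
        old = self.avg_ms[relation]
        self.avg_ms[relation] = (1 - self.alpha) * old + self.alpha * fetch_ms

    def page_cost(self, relation, cold_ms=8.0, warm_ms=0.05):
        # Map the observed average onto a [warm, cold] cost scale; a relation
        # that keeps answering fast converges toward the "warm" cost.
        avg = self.avg_ms[relation]
        frac = min(max((avg - warm_ms) / (cold_ms - warm_ms), 0.0), 1.0)
        return warm_ms + frac * (cold_ms - warm_ms)

model = ResponseTimeModel()
for _ in range(50):
    model.observe("orders", 0.05)       # repeated fast (cached) fetches
print(model.page_cost("orders"))        # well below the cold cost by now
print(model.page_cost("never_seen"))    # unseen relation still costed cold
```

[This sidesteps multilevel caching entirely: OS cache, shared_buffers, and
controller cache all show up the same way, as short fetch times.]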
>
>--
>Robert Haas
>EnterpriseDB: http://www.enterprisedb.com
>The Enterprise PostgreSQL Company
>
>--
>Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
>To make changes to your subscription:
>http://www.postgresql.org/mailpref/pgsql-performance
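[P.S. For readers unfamiliar with DB2, the tablespace/bufferpool indirection
described at the top of this message can be modeled roughly as below. This is
a toy Python sketch with invented names, not DB2's actual interface; the point
is only the shape of the indirection: table -> tablespace -> bufferpool.]

```python
# Toy model of DB2-style memory pinning: each tablespace is bound to exactly
# one bufferpool, so assigning a table to a tablespace implicitly decides
# which pool of memory its pages compete for.

class BufferPool:
    def __init__(self, name, pages):
        self.name = name
        self.pages = pages          # page slots reserved in memory

class TableSpace:
    def __init__(self, name, bufferpool):
        self.name = name
        self.bufferpool = bufferpool  # the tablespace/bufferpool pairing

class Table:
    def __init__(self, name, tablespace):
        self.name = name
        self.tablespace = tablespace

    def pool(self):
        # A table's pages are cached only in its tablespace's pool.
        return self.tablespace.bufferpool

# A dedicated pool sized to hold a hot lookup table entirely: since no other
# tablespace uses this pool, competing traffic can never evict its pages --
# the table is effectively pinned in memory.
hot_pool = BufferPool("bp_hot", pages=10_000)
hot_ts = TableSpace("ts_hot", hot_pool)
lookup = Table("currency_codes", hot_ts)
```

[The cost, as noted above, is that this is not low-hanging fruit: it requires
per-tablespace buffer management rather than one shared pool.]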