An approach that works can be found in DB2, and likely elsewhere. The key is that
tablespaces, tables, indexes, and buffers are all attached through the bufferpool
(the DB2 term). A tablespace/bufferpool pairing is defined; tables and indexes are
then assigned to the tablespace (and, implicitly, to the bufferpool). As a result,
one can effectively pin data in memory. This is very useful, but not low-hanging
fruit to implement.

The introduction of rudimentary tablespaces is a first step. I assumed that the
point was to get to a DB2-like structure at some point. Yes?

Robert

---- Original message ----
>Date: Mon, 11 Oct 2010 22:59:28 -0400
>From: pgsql-performance-owner@xxxxxxxxxxxxxx (on behalf of Robert Haas <robertmhaas@xxxxxxxxx>)
>Subject: Re: How does PG know if data is in memory?
>To: Jeremy Harris <jgh@xxxxxxxxxxx>
>Cc: pgsql-performance@xxxxxxxxxxxxxx
>
>On Mon, Oct 4, 2010 at 6:47 PM, Jeremy Harris <jgh@xxxxxxxxxxx> wrote:
>> On 10/04/2010 04:22 AM, Greg Smith wrote:
>>>
>>> I had a brain-storming session on this subject with a few of the hackers
>>> in the community in this area a while back, which I haven't had a chance to
>>> do something with yet (it exists only as a pile of scribbled notes so far).
>>> There's a couple of ways to collect data on what's in the database and OS
>>> cache, and a couple of ways to then expose that data to the optimizer. But
>>> that needs to be done very carefully, almost certainly as only a manual
>>> process at first, because something that's producing cache feedback all of
>>> the time will cause plans to change all the time, too. Where I suspect this
>>> is going is that we may end up tracking various statistics over time, then
>>> periodically providing a way to export a mass of "typical % cached" data
>>> back to the optimizer for use in plan cost estimation purposes.
>>> But the idea of monitoring continuously and always planning based on the
>>> most recent data available has some stability issues, both from a "too many
>>> unpredictable plan changes" and a "bad short-term feedback loop"
>>> perspective, as mentioned by Tom and Kevin already.
>>
>> Why not monitor the distribution of response times, rather than "cached"
>> vs. not?
>>
>> That a) avoids the issue of discovering what was a cache hit, b) deals
>> neatly with multilevel caching, c) feeds directly into cost estimation.
>
>I was hot on doing better cache modeling a year or two ago, but the
>elephant in the room is that it's unclear that it solves any real-world
>problem. The OP is clearly having a problem, but there's not enough
>information in his post to say what is actually causing it, and it's
>probably not caching effects. We get occasional complaints of the form
>"the first time I run this query it's slow, and then after that it's
>fast" but, as Craig Ringer pointed out upthread, not too many. And even
>with respect to the complaints we do get, it's far from clear that the
>cure is any better than the disease. Taking caching effects into account
>could easily result in the first execution being slightly less slow and
>all of the subsequent executions being moderately slow. That would not
>be an improvement for most people. The reports that seem really painful
>to me are the ones where people with really big machines complain of
>needing HOURS for the cache to warm up, and having the system bogged
>down to a standstill until then. But changing the cost model isn't going
>to help them either.
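[As an aside, Jeremy's response-time idea can be made concrete with a toy
sketch. This is not anything in PostgreSQL -- all names and numbers below are
invented for illustration. It tracks a smoothed per-relation fetch time and
maps it onto a warm/cold page-cost scale, so a fully cached relation drifts
toward the cheap cost without ever asking "was that a cache hit?". A small
smoothing factor also damps the feedback loop Greg worries about.]

```python
# Toy sketch (invented names, not PostgreSQL code) of costing from the
# observed response-time distribution instead of from cache-hit detection.
from collections import defaultdict

class ResponseTimeModel:
    """Tracks an exponential moving average of per-page fetch times."""
    def __init__(self, alpha=0.1, default_ms=8.0):
        self.alpha = alpha              # small alpha = heavy damping,
                                        # i.e. fewer sudden plan changes
        self.avg_ms = defaultdict(lambda: default_ms)  # assume cold until seen

    def observe(self, relation, fetch_ms):
        # Blend each new observation into the running average.
        old = self.avg_ms[relation]
        self.avg_ms[relation] = (1 - self.alpha) * old + self.alpha * fetch_ms

    def page_cost(self, relation, cold_ms=8.0, warm_ms=0.05):
        # Map the observed average onto a [warm, cold] cost scale; a relation
        # that keeps answering fast converges toward the "warm" cost.
        avg = self.avg_ms[relation]
        frac = min(max((avg - warm_ms) / (cold_ms - warm_ms), 0.0), 1.0)
        return warm_ms + frac * (cold_ms - warm_ms)

model = ResponseTimeModel()
for _ in range(50):
    model.observe("orders", 0.05)       # repeated fast (cached) fetches
print(model.page_cost("orders"))        # well below the cold cost by now
print(model.page_cost("never_seen"))    # unseen relation still costed cold
```

[This sidesteps multilevel caching entirely: OS cache, shared_buffers, and
controller cache all show up the same way, as short fetch times.]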
>
>--
>Robert Haas
>EnterpriseDB: http://www.enterprisedb.com
>The Enterprise PostgreSQL Company
>
>--
>Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
>To make changes to your subscription:
>http://www.postgresql.org/mailpref/pgsql-performance
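[P.S. For readers unfamiliar with DB2, the tablespace/bufferpool indirection
described at the top of this message can be modeled roughly as below. This is
a toy Python sketch with invented names, not DB2's actual interface; the point
is only the shape of the indirection: table -> tablespace -> bufferpool.]

```python
# Toy model of DB2-style memory pinning: each tablespace is bound to exactly
# one bufferpool, so assigning a table to a tablespace implicitly decides
# which pool of memory its pages compete for.

class BufferPool:
    def __init__(self, name, pages):
        self.name = name
        self.pages = pages          # page slots reserved in memory

class TableSpace:
    def __init__(self, name, bufferpool):
        self.name = name
        self.bufferpool = bufferpool  # the tablespace/bufferpool pairing

class Table:
    def __init__(self, name, tablespace):
        self.name = name
        self.tablespace = tablespace

    def pool(self):
        # A table's pages are cached only in its tablespace's pool.
        return self.tablespace.bufferpool

# A dedicated pool sized to hold a hot lookup table entirely: since no other
# tablespace uses this pool, competing traffic can never evict its pages --
# the table is effectively pinned in memory.
hot_pool = BufferPool("bp_hot", pages=10_000)
hot_ts = TableSpace("ts_hot", hot_pool)
lookup = Table("currency_codes", hot_ts)
```

[The cost, as noted above, is that this is not low-hanging fruit: it requires
per-tablespace buffer management rather than one shared pool.]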