Re: Memory usage - indexes

Greg Smith <greg@xxxxxxxxxxxxxxx> · Fri, 24 Sep 2010 14:01:12 -0400

Tobias Brox wrote:
1) Are there any good ways to verify my hypothesis?

You can confim easily whether the contents of the PostgreSQL buffer 
cache contain when you think they do by installing pg_buffercache.  My 
paper and sample samples at 
http://www.pgcon.org/2010/schedule/events/218.en.html go over that.

You can also integrate that with a look at the OS level information by 
using pgfincore:  http://www.pgcon.org/2010/schedule/events/261.en.html

I've found that if shared_buffers is set to a largish size, you can find 
out enough information from look at it to have a decent idea what's 
going on without going to that depth.  But it's available if you want it.

2) I've seen it discussed earlier on this list ... shared_buffers vs
OS caches.  Some claims that it has very little effect to adjust the
size of the shared buffers.  Anyway, isn't it a risk that memory is
wasted because important data is stored both in the OS cache and the
shared buffers?
The risk of overlap is overrated.  What's much more likely to actually 
happen is that you'll have good data in shared_buffers, then run 
something that completely destroys the OS cache (multiple seq scans just 
below the "ring buffer" threshold", multiple large index scans, raging 
VACUUM work).  Having copies of the most important pieces that stay in 
shared_buffers despite the OS cache being demolished is much more 
important to preserving decent performance than the concern about double 
buffering database and OS contents--that only happens on trivial 
workloads where there's not constant churn on the OS cache throwing 
pages out like crazy.

I have easily measurable improvements on client systems increasing 
shared_buffers into the 4GB - 8GB range.  Popular indexes move into 
there, stay there, and only get written out at checkpoint time rather 
than all the time.  However, if you write heavily enough that much of 
this space gets dirty fast, you may not be be able to go that high 
before checkpoint issues start to make such sizes impractical.

What would happen if using almost all the available
memory for shared buffers?  Or turn it down to a bare minimum and let
the OS do almost all the cache handling?

The useful upper limit normally works out to be somewhere between 4GB 
and 1/2 of RAM.  Using minimal values works for some people, 
particularly on Windows, but you can measure that doing so generates far 
more disk I/O activity than using a moderate sized cache by 
instrumenting pg_stat_bgwriter, the way I describe in my talk.

3) We're discussing to drop some overlapping indexes ... i.e. to drop
one out of two indexes looking like this:

some_table(a)
some_table(a,b)

Would the query "select * from some_table where a=?" run slower if we
drop the first index?  Significantly?

Yes, it would run slower, because now it has to sort through blocks in a 
larger index in order to find anything.  How significant that is depends 
on the relative size of the indexes.  To give a simple example, if (a) 
is 1GB, while (a,b) is 2GB, you can expect dropping (a) to halve the 
speed of index lookups.  Fatter indexes just take longer to navigate 
through.

(in our situation I found that the number of distinct b's for each a
is low and that the usage stats on the second index is quite low
compared with the first one, so I think we'll drop the second index).

You are thinking correctly here now.  If the addition of b to the index 
isn't buying you significant increases in selectivity, just get rid of 
it and work only with the index on a instead.

4) We're discussing to drop other indexes.  Does it make sense at all
as long as we're not experiencing problems with inserts/updates?  I
suppose that if the index isn't used it will remain on disk and won't
affect the memory usage ... but what if the index is rarely used ...
wouldn't it be better to do a seqscan on a table that is frequently
accessed and mostly in memory than to consult an index that is stored
on the disk?

Don't speculate; measure the exact usage amount that each usage is being 
used and evaluate them on a case by case basis.  If they're not being 
used, they're just adding overheard in many ways, and you should drop them.

There are a bunch of "find useless index" scripts floating around the 
web (I think I swiped ideas from Robert Treat and Josh Berkus to build 
mine); here's the one I use now:

SELECT
 schemaname as nspname,
 relname,
 indexrelname,
 idx_scan,
 pg_size_pretty(pg_relation_size(i.indexrelid)) AS index_size
FROM
 pg_stat_user_indexes i
 JOIN pg_index USING (indexrelid)
WHERE
 indisunique IS false
ORDER BY idx_scan,pg_relation_size(i.indexrelid) DESC;

Anything that bubbles to the top of that list, you probably want to get 
rid of.  Note that this ignores UNIQUE indexes, which you can't drop 
anyway, but are being used to answer queries.  You might choose to 
include them anyway but just flag them in the output if the goal is to 
see how often they are used.

P.S. You seem busy re-inventing pgstatspack this week:  
http://pgfoundry.org/projects/pgstatspack/ does all of this "take a 
snapshot of the stats and store it in the database for future analysis" 
work for you.  Working on that intead of continuing to hack individual 
storage/retrieve scripts for each statistics counter set would be a 
better contribution to the PostgreSQL community.

--
Greg Smith, 2ndQuadrant US greg@xxxxxxxxxxxxxxx Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
Author, "PostgreSQL 9.0 High Performance"    Pre-ordering at:
https://www.packtpub.com/postgresql-9-0-high-performance/book

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance