Re: Limit of bgwriter_lru_maxpages of max. 1000?

Gerhard Wiesinger <lists@xxxxxxxxxxxxx> · Mon, 5 Oct 2009 20:28:25 +0200 (CEST)

On Mon, 5 Oct 2009, Greg Smith wrote:

On Sun, 4 Oct 2009, Gerhard Wiesinger wrote:

On Fri, 2 Oct 2009, Scott Marlowe wrote:

I found that lowering checkpoint completion target was what helped.
Does that seem counter-intuitive to you?

I set it to 0.0 now.

If you set that to 0.0, the whole checkpoing spreading logic doesn't apply 
like it's supposed to.  I'm not sure what the results you posted mean now. If 
you had it set to 0 and saw a bad spike (which is how I read your message), 
I'd say "yes, that's what happens when you do reduce that parameter, so don't 
do that".  If you meant something else please clarify.

I think the problem is, that it is done on checkpoint time (whether spread 
or not). I should have been already be done by bgwriter.

Thanks for the dtrace example, I suggested we add those checkpoint probes in 
there and someone did, but I hadn't seen anybody use them for anything yet.

I think more probes (e.g. on different writing conditions like writing 
from bgwriter or on a checkpoint) would be interesting here.

Bug1: usage_count is IHMO not consistent

It's a bit hack-ish, but the changes made to support multiple buffer use 
strategies introduced by the "Make large sequential scans and VACUUMs work in 
a limited-size ring" commit are reasonable even if they're not as consistent 
as we'd like.  Those changes were supported by benchmarks proving their 
utility, which always trump theoretical "that shouldn't work better!" claims 
when profiling performance.

Also, they make sense to me, but I've spent a lot of time staring at 
pg_buffercache output to get a feel for what shows up in there under various 
circumstances.  That's where I'd suggest you go if this doesn't seem right to 
you; run some real database tests and use pg_buffercache to see what's inside 
the cache when you're done.  What's in there and what I expected to be in 
there weren't always the same thing, and it's interesting to note how that 
changes as shared_buffers increases.  I consider some time studying that a 
pre-requisite to analyzing performance of this code.

I have analyzed pg_buffercache (query every second, see below) in parallel 
to see what happens but I didn't see expected results in some ways with 
the usage_counts. Therefore I analyzed the code and found IHMO the problem 
with the usage_count and buffer reallocation. Since the code change is 
also new (I think it way 05/2009) it might be that you tested before ...

BTW: Is it possible to get everything in pg_class over all databases as 
admin?

Bug2: Double iteration of buffers
As you can seen in the calling tree below there is double iteration with 
buffers involved. This might be a major performance bottleneck.

Hmmm, this might be a real bug causing scans through the buffer cache to go 
twice as fast as intended.

That's not twice O(2*n)=O(n) that's a factor n*n (outer and inner loop 
iteration) which means overall is O(n^2) which is IHMO too much.

 Since the part you suggest is doubled isn't very 
intensive or called all that often, there's no way it can be a major issue 
though.

It is a major issue since it is O(n^2) and not O(n). E.g. with 2GB share 
buffer we have 262144 blocks and 68719476736 calls which is far too much.

 That's based on knowing what the code does and how much it was 
called, as well as some confidence that if it were really a *major* problem, 
it would have shown up on the extensive benchmarks done on all the code paths 
you're investigating.

The problem might be hidden for the following reasons:
1.) Buffers values are too low that even n^2 is low for today's machines
2.) Code is not often called in that way
3.) backend writes out pages so that the code is never executed
4.) ...

BTW: Are there some tests available how fast a buffer cache hit is and a 
disk cache hit is (not in the buffer cache but in the disk cache)? I'll 
asked, because a lot of locking is involved in the code.

I did some once but didn't find anything particularly interesting about the 
results.  Since you seem to be on a research tear here, it would be helpful 
to have a script to test that out available, I wasn't able to release mine 
and something dtrace based would probably be better than the approach I used 
(I threw a bunch of gettimeofdata calls into the logs and post-processed them 
with a script).

Do you have an where one should set tracepoints inside and outside 
PostgreSQL?

BTW2: Oracle buffercache and background writer strategy is also 
interesting.

As a rule, we don't post links to other database implementation details here, 
as those can have patented design details we'd prefer not to intentionally 
re-implement.  Much of Oracle's design here doesn't apply here anyway, as it 
was done in the era when all of their writes were synchronous.  That required 
them to worry about doing a good job on some things in their background 
writer that we shrug off and let os writes combined with fsync handle 
instead.

Ok, no problem.

Ciao,
Gerhard

--
http://www.wiesinger.com/

SELECT
  CASE
    WHEN datname IS NULL THEN pg_buffercache.reldatabase::text
    ELSE datname
  END AS database,
  CASE
    WHEN spcname IS NULL THEN pg_buffercache.reltablespace::text
    ELSE spcname
  END AS tablespace,
  CASE
    WHEN relname IS NULL THEN pg_buffercache.relfilenode::text
    ELSE relname
  END AS relation,
  CASE
    WHEN relkind IS NULL THEN pg_buffercache.relfilenode::text
    ELSE relkind
  END AS relkind,
  usagecount > 0 AS usagecount_gt_0,
  isdirty,
  MIN(relblocknumber) AS min_blocknumber,
  MAX(relblocknumber) AS max_blocknumber,
  ROUND(AVG(relblocknumber),2) AS avg_blocknumber,
  ROUND(STDDEV(relblocknumber),2) AS stddev_blocknumber,
  COUNT(*) AS count
FROM
  pg_buffercache
LEFT JOIN pg_class      ON pg_buffercache.relfilenode = pg_class.oid
LEFT JOIN pg_tablespace ON pg_buffercache.reltablespace = 
pg_tablespace.oid
LEFT JOIN pg_database   ON pg_buffercache.reldatabase = pg_database.oid
WHERE
  isdirty = true
GROUP BY
  CASE
    WHEN datname IS NULL THEN pg_buffercache.reldatabase::text
    ELSE datname
  END,
  CASE
    WHEN spcname IS NULL THEN pg_buffercache.reltablespace::text
    ELSE spcname
  END,
  CASE
    WHEN relkind IS NULL THEN pg_buffercache.relfilenode::text
    ELSE relkind
  END,
  CASE
    WHEN relname IS NULL THEN pg_buffercache.relfilenode::text
    ELSE relname
  END,
  usagecount > 0,
  isdirty
--HAVING relkind = 'r'
ORDER BY
  database,
  tablespace,
  relkind DESC,
  relation,
  usagecount > 0,
  isdirty
;

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general