> > no table was ever large enough that 256k buffers would ever be filled
> > by the process of vacuuming a single table.
>
> Not 256K buffers--256K, 32 buffers.

Ok.

> > In addition, when I say "constantly" above I mean that the count
> > increases even between successive SELECTs (of the stat table) with
> > only a second or two in between.
>
> Writes to the database when only doing read operations are usually
> related to hint bits: http://wiki.postgresql.org/wiki/Hint_Bits

Sorry, I didn't mean to imply read-only operations (I did read the hint
bits information a while back, though). What I meant was that while I was
constantly generating insert/delete/update activity, I was selecting the
bgwriter stats with only a second or two in between. The intent was to
convey that the count of backend-written pages was increasing
systematically and constantly (a few hundred per handful of seconds), in
spite of there being no long-running vacuum and the buffer cache not
being close to full.

> > On this topic btw, was it considered to allow the administrator to
> > specify a fixed-size margin to use when applying the JIT policy?
>
> Right now, there's no way to know exactly what's in the buffer cache
> without scanning the individual buffers, which requires locking their
> headers so you can see them consistently. No one process can get the
> big picture without doing something intrusive like that, and on a busy
> system the overhead of collecting more data to know exactly how far
> ahead the cleaning is can drag down overall performance. A lot can
> happen while the background writer is sleeping.

Understood.

> One next-generation design which has been sketched out but not even
> prototyped would take cleaned buffers and add them to the internal list
> of buffers that are free, which right now is usually empty on the theory
> that cached data is always more useful than a reserved buffer. If you
> developed a reasonable model for how many buffers you needed and padded
> that appropriately, that's the easiest way (given the rest of the buffer
> manager code) to get close to ensuring there aren't any backend writes.
> Because you've got the OS buffering writes anyway in most cases, it's
> hard to pin down whether that actually improved worst-case latency
> though. And moving in that direction always seems to reduce average
> throughput even in write-heavy benchmarks.

Ok.

> The important thing to remember is that the underlying OS has its own
> read and write caching mechanisms here, and unless the PostgreSQL ones
> are measurably better than those you might as well let the OS manage
> the problem instead.

The problem, though, is that while the OS may be good in the common cases
it is designed for, it can have specific behaviors that are directly
counter-productive when your goals do not line up with the commonly
designed-for use case (in particular, if you care a lot about latency and
not necessarily about absolute maximum throughput).

For example, in Linux until recently, if not still, there is the 1024
per-inode limit on the number of buffers written as a result of expiry.
This means that when PostgreSQL does its fsync(), you may end up having a
lot more to write out than would have been the case had the centisecs
expiry been enforced, regardless of whether PostgreSQL was tuned to write
dirty pages out sufficiently aggressively. If the amount built up exceeds
the capacity of the RAID controller cache... I had a case where I suspect
this was exacerbating the situation.
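To make this more concrete, here is a minimal sketch (not the actual
script) of the kind of sampling described earlier -- selecting the
bgwriter stats every couple of seconds -- extended to also watch the
kernel's dirty-page counter. It assumes Python with psycopg2, Linux's
/proc/meminfo, and the pg_stat_bgwriter columns as of 8.3; the connection
string and interval are placeholders, and the FORCE_SYNC flag corresponds
to the manual 'sync' workaround described next.

# Hedged sketch: sample pg_stat_bgwriter deltas next to the kernel's
# Dirty counter, to see how backend writes line up with OS writeback
# pressure.  Assumes psycopg2 and Linux; DSN/interval are placeholders.
import os
import time

import psycopg2

DSN = "dbname=postgres"   # placeholder connection string
INTERVAL = 2              # seconds between samples
FORCE_SYNC = False        # True: also force writeback each interval

STATS_SQL = """
    SELECT buffers_checkpoint, buffers_clean, buffers_backend
      FROM pg_stat_bgwriter
"""

def dirty_kb():
    """Return the kernel's Dirty counter from /proc/meminfo, in kB."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("Dirty:"):
                return int(line.split()[1])
    return 0

def main():
    conn = psycopg2.connect(DSN)
    conn.autocommit = True
    cur = conn.cursor()
    prev = None
    while True:
        cur.execute(STATS_SQL)
        ckpt, clean, backend = cur.fetchone()
        if prev is not None:
            # Per-interval deltas: pages written at checkpoint, by the
            # background writer, and directly by backends.
            print("checkpoint=%+d bgwriter=%+d backend=%+d dirty=%dkB"
                  % (ckpt - prev[0], clean - prev[1],
                     backend - prev[2], dirty_kb()))
        prev = (ckpt, clean, backend)
        if FORCE_SYNC:
            os.sync()     # Python 3.3+; the manual 'sync' workaround
        time.sleep(INTERVAL)

if __name__ == "__main__":
    main()

Watching the buffers_backend delta next to the kernel's Dirty counter
makes it easier to see whether backend writes coincide with writeback
building up ahead of the next fsync().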
Manually doing a 'sync' on the system every few seconds noticeably helped
(the theory being that it forced page write-outs to happen earlier and in
smaller storms).

> It's easy to demonstrate that's happening when you give a decent amount
> of memory to shared_buffers, it's much harder to prove that's the case
> for an improved write scheduling algorithm. Stepping back a bit, you
> might even consider that one reason PostgreSQL has grown as well as it
> has in scalability is exactly because it's been riding improvements in
> the underlying OS in many of these cases, rather than trying to do all
> the I/O scheduling itself.

Sure. In this case with the backend writes, I am more interested in
understanding better what is happening, and in having better indications
of when backends block on I/O, than in necessarily having a proven
improvement in throughput or latency. It makes it easier to reason about
what is happening when you *do* have a measured performance problem.

Thanks for all the insightful information.

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller@xxxxxxxxxxxx>'
Key retrieval: Send an E-Mail to getpgpkey@xxxxxxxxx
E-Mail: peter.schuller@xxxxxxxxxxxx Web: http://www.scode.org