> > no table was ever large enough that 256k buffers would ever be filled
> > by the process of vacuuming a single table.
>
> Not 256K buffers--256K, 32 buffers.

Ok.

> > In addition, when I say "constantly" above I mean that the count
> > increases even between successive SELECTs (of the stat table) with
> > only a second or two in between.
>
> Writes to the database when only doing read operations are usually
> related to hint bits: http://wiki.postgresql.org/wiki/Hint_Bits

Sorry, I didn't mean to imply read-only operations (I did read the hint
bits information a while back, though). What I meant was that while I was
constantly generating insert/delete/update activity, I was selecting the
bgwriter stats with only a second or two in between. The intent was to
convey that the count of backend-written pages was increasing
systematically and constantly (a few hundred per handful of seconds), in
spite of there being no long-running vacuum and the buffer cache not
being close to full.

> > On this topic btw, was it considered to allow the administrator to
> > specify a fixed-size margin to use when applying the JIT policy?
>
> Right now, there's no way to know exactly what's in the buffer cache
> without scanning the individual buffers, which requires locking their
> headers so you can see them consistently. No one process can get the
> big picture without doing something intrusive like that, and on a busy
> system the overhead of collecting more data to know exactly how far
> ahead the cleaning is can drag down overall performance. A lot can
> happen while the background writer is sleeping.

Understood.

> One next-generation design which has been sketched out but not even
> prototyped would take cleaned buffers and add them to the internal list
> of buffers that are free, which right now is usually empty on the theory
> that cached data is always more useful than a reserved buffer. If you
> developed a reasonable model for how many buffers you needed and padded
> that appropriately, that's the easiest way (given the rest of the buffer
> manager code) to get close to ensuring there aren't any backend writes.
> Because you've got the OS buffering writes anyway in most cases, it's
> hard to pin down whether that actually improved worst-case latency
> though. And moving in that direction always seems to reduce average
> throughput even in write-heavy benchmarks.

Ok.

> The important thing to remember is that the underlying OS has its own
> read and write caching mechanisms here, and unless the PostgreSQL ones
> are measurably better than those you might as well let the OS manage
> the problem instead.

The problem, though, is that while the OS may be good in the common cases
it is designed for, it can have specific behaviors that are directly
counter-productive when your goals do not line up with the commonly
designed-for use case (in particular, if you care a lot about latency and
not necessarily about absolute maximum throughput).

For example, in Linux until recently, if not still, there is the 1024
per-inode limit on the number of buffers written as a result of expiry.
This means that when PostgreSQL does its fsync(), you may end up having a
lot more to write out than would have been the case had the centisecs
expiry been enforced, regardless of whether PostgreSQL was tuned to write
dirty pages out sufficiently aggressively. If the amount built up exceeds
the capacity of the RAID controller cache... I had a case where I suspect
this was exacerbating the situation.
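To make this more concrete, here is a minimal sketch (not the actual
script) of the kind of sampling described earlier -- selecting the
bgwriter stats every couple of seconds -- extended to also watch the
kernel's dirty-page counter. It assumes Python with psycopg2, Linux's
/proc/meminfo, and the pg_stat_bgwriter columns as of 8.3; the connection
string and interval are placeholders, and the FORCE_SYNC flag corresponds
to the manual 'sync' workaround described next.

# Hedged sketch: sample pg_stat_bgwriter deltas next to the kernel's
# Dirty counter, to see how backend writes line up with OS writeback
# pressure.  Assumes psycopg2 and Linux; DSN/interval are placeholders.
import os
import time

import psycopg2

DSN = "dbname=postgres"   # placeholder connection string
INTERVAL = 2              # seconds between samples
FORCE_SYNC = False        # True: also force writeback each interval

STATS_SQL = """
    SELECT buffers_checkpoint, buffers_clean, buffers_backend
      FROM pg_stat_bgwriter
"""

def dirty_kb():
    """Return the kernel's Dirty counter from /proc/meminfo, in kB."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("Dirty:"):
                return int(line.split()[1])
    return 0

def main():
    conn = psycopg2.connect(DSN)
    conn.autocommit = True
    cur = conn.cursor()
    prev = None
    while True:
        cur.execute(STATS_SQL)
        ckpt, clean, backend = cur.fetchone()
        if prev is not None:
            # Per-interval deltas: pages written at checkpoint, by the
            # background writer, and directly by backends.
            print("checkpoint=%+d bgwriter=%+d backend=%+d dirty=%dkB"
                  % (ckpt - prev[0], clean - prev[1],
                     backend - prev[2], dirty_kb()))
        prev = (ckpt, clean, backend)
        if FORCE_SYNC:
            os.sync()     # Python 3.3+; the manual 'sync' workaround
        time.sleep(INTERVAL)

if __name__ == "__main__":
    main()

Watching the buffers_backend delta next to the kernel's Dirty counter
makes it easier to see whether backend writes coincide with writeback
building up ahead of the next fsync().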
Manually doing a 'sync' on the system every few seconds noticeably helped
(the theory being that it forced page write-outs to happen earlier and in
smaller storms).

> It's easy to demonstrate that's happening when you give a decent amount
> of memory to shared_buffers, it's much harder to prove that's the case
> for an improved write scheduling algorithm. Stepping back a bit, you
> might even consider that one reason PostgreSQL has grown as well as it
> has in scalability is exactly because it's been riding improvements in
> the underlying OS in many of these cases, rather than trying to do all
> the I/O scheduling itself.

Sure. In this case with the backend writes, I am more interested in
understanding better what is happening, and in having better indications
of when backends block on I/O, than in necessarily having a proven
improvement in throughput or latency. It makes it easier to reason about
what is happening when you *do* have a measured performance problem.

Thanks for all the insightful information.

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller@xxxxxxxxxxxx>'
Key retrieval: Send an E-Mail to getpgpkey@xxxxxxxxx
E-Mail: peter.schuller@xxxxxxxxxxxx Web: http://www.scode.org