On 2015-10-29 12:54:22 +1100, Dave Chinner wrote: > On Thu, Oct 29, 2015 at 12:23:12AM +0100, Andres Freund wrote: > > The blocking/latency of the fsync doesn't actually matter at all *for > > this callsite*. It's called from a dedicated background process - if > > it's slowed down by a couple seconds it doesn't matter much. > > The problem is that if you have a couple gigabytes of dirty data being > > fsync()ed at once, latency for concurrent reads and writes often goes > > absolutely apeshit. And those concurrent reads and writes might > > actually be latency sensitive. > > Right, but my point is with an async fsync/fdatasync you don't need > this background process - you can just trickle out async fdatasync > calls instead of trckling out calls to sync_file_range(). We don't want to do the checkpointing from normal backends that process user queries, so there has to be a background process anyway. Depending on settings we only do the checkpoints in 5 to 60 minutes intervals (spread over that interval). > > By calling sync_file_range() over small ranges of pages shortly after > > they've been written we make it unlikely (but still possible) that much > > data has to be flushed at fsync() time. > > Right, but you still need the fsync call, whereas with a async fsync > call you don't - when you gather the completion, no further action > needs to be taken on that dirty range. I assume that the actual IOs issued by the async fsync and a plain fsync would be pretty similar. So the problem that an fsync of large amounts of dirty data causes latency increases for other issuers of IO wouldn't be gone, no? > > At the moment using fdatasync() instead of fsync() is a considerable > > performance advantage... If I understand the above proposal correctly, > > it'd allow specifying ranges, is that right? > > Well, the patch I sent doesn't do ranges, but it could easily be > passed in as the iocb has offset/len parameters that are used by > IOCB_CMD_PREAD/PWRITE. That'd be cool. Then we could issue those for asynchronous transaction commits, and to have more wal writes concurrently in progress by the background wal writer. I'll try the patch from 20151028232641.GS8773@dastard and see wether I can make it be advantageous for throughput (for WAL flushing, not the checkpointer process). Wish I had a better storage system, my guess it'll be more advantageous there. We'll see. Greetings, Andres Freund -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>