Hi, On Thu 22-10-15 15:15:55, Andres Freund wrote: > postgres regularly has to checkpoint data to disk to be able to free > data from its journal. We currently use buffered IO and that's not > going to change short term. > > In a busy database this checkpointing process can write out a lot of > data. Currently that frequently leads to massive latency spikes > (c.f. 20140326191113.GF9066@xxxxxxxxxxxxxxxxx) for other processed doing > IO. These happen either when the kernel starts writeback or when, at the > end of the checkpoint, we issue an fsync() on the datafiles. > > One odd issue there is that the kernel tends to do writeback in a very > irregular manner. Even if we write data at a constant rate writeback > very often happens in bulk - not a good idea for preserving > interactivity. > > What we're preparing to do now is to regularly issue > sync_file_range(SYNC_FILE_RANGE_WRITE) on a few blocks shortly after > we've written them to to the OS. That way there's not too much dirty > data in the page cache, so writeback won't cause latency spikes, and the > fsync at the end doesn't have to write much if anything. > > That improves things a lot. > > But I still see latency spikes that shouldn't be there given the amount > of IO. I'm wondering if that is related to the fact that > SYNC_FILE_RANGE_WRITE ends up doing __filemap_fdatawrite_range with > WB_SYNC_ALL specified. Given the the documentation for > SYNC_FILE_RANGE_WRITE I did not expect that: > * SYNC_FILE_RANGE_WRITE: start writeout of all dirty pages in the range which > * are not presently under writeout. This is an asynchronous flush-to-disk > * operation. Not suitable for data integrity operations. > > If I followed the code correctly - not a sure thing at all - that means > bios are submitted with WRITE_SYNC specified. Not really what's needed > in this case. > > Now I think the docs are somewhat clear that SYNC_FILE_RANGE_WRITE isn't > there for data integrity, but it might be that people rely on in > nonetheless. so I'm loathe to suggest changing that. But I do wonder if > there's a way non-integrity writeback triggering could be exposed to > userspace. A new fadvise flags seems like a good way to do that - > POSIX_FADV_DONTNEED actually does non-integrity writeback, but also does > other things, so it's not suitable for us. You are absolutely correct that sync_file_range() should issue writeback as WB_SYNC_NONE and not wait for current writeback in progress. That was an oversight introduced by commit ee53a891f474 (mm: do_sync_mapping_range integrity fix) which changed do_sync_mapping_range() to use WB_SYNC_ALL because it had other users which relied WB_SYNC_ALL semantics. Later that got copied over to the current sync_file_range() implementation. I think we should just revert to the very explicitely documented behavior of sync_file_range(). I'll send a patch for that. Thanks for report. Honza -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html