Hi, postgres regularly has to checkpoint data to disk to be able to free data from its journal. We currently use buffered IO and that's not going to change short term. In a busy database this checkpointing process can write out a lot of data. Currently that frequently leads to massive latency spikes (c.f. 20140326191113.GF9066@xxxxxxxxxxxxxxxxx) for other processed doing IO. These happen either when the kernel starts writeback or when, at the end of the checkpoint, we issue an fsync() on the datafiles. One odd issue there is that the kernel tends to do writeback in a very irregular manner. Even if we write data at a constant rate writeback very often happens in bulk - not a good idea for preserving interactivity. What we're preparing to do now is to regularly issue sync_file_range(SYNC_FILE_RANGE_WRITE) on a few blocks shortly after we've written them to to the OS. That way there's not too much dirty data in the page cache, so writeback won't cause latency spikes, and the fsync at the end doesn't have to write much if anything. That improves things a lot. But I still see latency spikes that shouldn't be there given the amount of IO. I'm wondering if that is related to the fact that SYNC_FILE_RANGE_WRITE ends up doing __filemap_fdatawrite_range with WB_SYNC_ALL specified. Given the the documentation for SYNC_FILE_RANGE_WRITE I did not expect that: * SYNC_FILE_RANGE_WRITE: start writeout of all dirty pages in the range which * are not presently under writeout. This is an asynchronous flush-to-disk * operation. Not suitable for data integrity operations. If I followed the code correctly - not a sure thing at all - that means bios are submitted with WRITE_SYNC specified. Not really what's needed in this case. Now I think the docs are somewhat clear that SYNC_FILE_RANGE_WRITE isn't there for data integrity, but it might be that people rely on in nonetheless. so I'm loathe to suggest changing that. But I do wonder if there's a way non-integrity writeback triggering could be exposed to userspace. A new fadvise flags seems like a good way to do that - POSIX_FADV_DONTNEED actually does non-integrity writeback, but also does other things, so it's not suitable for us. Greetings, Andres Freund -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html