Re: Triggering non-integrity writeback from userspace

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



  Hi,

On Thu 22-10-15 15:15:55, Andres Freund wrote:
> postgres regularly has to checkpoint data to disk to be able to free
> data from its journal. We currently use buffered IO and that's not
> going to change short term.
> 
> In a busy database this checkpointing process can write out a lot of
> data. Currently that frequently leads to massive latency spikes
> (c.f. 20140326191113.GF9066@xxxxxxxxxxxxxxxxx) for other processed doing
> IO. These happen either when the kernel starts writeback or when, at the
> end of the checkpoint, we issue an fsync() on the datafiles.
> 
> One odd issue there is that the kernel tends to do writeback in a very
> irregular manner. Even if we write data at a constant rate writeback
> very often happens in bulk - not a good idea for preserving
> interactivity.
> 
> What we're preparing to do now is to regularly issue
> sync_file_range(SYNC_FILE_RANGE_WRITE) on a few blocks shortly after
> we've written them to to the OS. That way there's not too much dirty
> data in the page cache, so writeback won't cause latency spikes, and the
> fsync at the end doesn't have to write much if anything.
> 
> That improves things a lot.
> 
> But I still see latency spikes that shouldn't be there given the amount
> of IO. I'm wondering if that is related to the fact that
> SYNC_FILE_RANGE_WRITE ends up doing __filemap_fdatawrite_range with
> WB_SYNC_ALL specified. Given the the documentation for
> SYNC_FILE_RANGE_WRITE I did not expect that:
>  * SYNC_FILE_RANGE_WRITE: start writeout of all dirty pages in the range which
>  * are not presently under writeout.  This is an asynchronous flush-to-disk
>  * operation.  Not suitable for data integrity operations.
> 
> If I followed the code correctly - not a sure thing at all - that means
> bios are submitted with WRITE_SYNC specified. Not really what's needed
> in this case.
> 
> Now I think the docs are somewhat clear that SYNC_FILE_RANGE_WRITE isn't
> there for data integrity, but it might be that people rely on in
> nonetheless. so I'm loathe to suggest changing that. But I do wonder if
> there's a way non-integrity writeback triggering could be exposed to
> userspace. A new fadvise flags seems like a good way to do that -
> POSIX_FADV_DONTNEED actually does non-integrity writeback, but also does
> other things, so it's not suitable for us.

You are absolutely correct that sync_file_range() should issue writeback as
WB_SYNC_NONE and not wait for current writeback in progress. That was an
oversight introduced by commit ee53a891f474 (mm: do_sync_mapping_range
integrity fix) which changed do_sync_mapping_range() to use WB_SYNC_ALL
because it had other users which relied WB_SYNC_ALL semantics. Later that
got copied over to the current sync_file_range() implementation.

I think we should just revert to the very explicitely documented behavior
of sync_file_range(). I'll send a patch for that. Thanks for report.

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>



[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux]     [Linux OMAP]     [Linux MIPS]     [ECOS]     [Asterisk Internet PBX]     [Linux API]