On Tue, 26 Feb 2008 15:07:45 +0000 Jamie Lokier <jamie@xxxxxxxxxxxxx> wrote: > SYNC_FILE_RANGE_WRITE scans all pages in the range, looking for dirty > pages which aren't already queued for write-out. It marks those with > a "write-out" flag, and starts write I/Os at some unspecified time in > the near future; it can be assumed writes for all the pages will > complete eventually if there's no errors. When I/O completes on a > page, it cleans the page and also clears the write-out flag. > > SYNC_FILE_RANGE_WAIT_AFTER waits until all pages in the range don't > have the "write-out" flag set. > > SYNC_FILE_RANGE_WAIT_BEFORE does the same wait, but before marking > pages for write-out. I don't actually see the point in this. Isn't a > preceding call with SYNC_FILE_RANGE_WAIT_AFTER equivalent, making > BEFORE a redundant flag? Consider the case of pages which are dirty but are already under writeout. ie: someone redirtied the page after someone started writing the page out. For these pages the kernel needs to a) wait for the current writeout to complete b) start new writeout c) wait for that writeout to complete. those are the three stages of sync_file_range(). They are independently selectable and various combinations provide various results. The reason for providing b) only (SYNC_FILE_RANGE_WRITE) is so that userspace can get as much data into the queue as possible, to permit the kernel to optimise IO scheduling better. If you perform a) and b) together (SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE) then you are guaranteed that all data which was dirty when sync_file_range() executed will be sent into the queue, but you won't get as much data into the queue if the kernel encounters dirty, under-writeout pages. This is especially hurtful if you're trying to feed a lot of little segments into the queue. In that case perhaps userspace should do an asynchrnous pass (SYNC_FILE_RANGE_WRITE) to stuff as much data as poss into the queue, then a SYNC_FILE_RANGE_WAIT_AFTER pass then a SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER pass to clean up any stragglers. WHich mode is best very much depends on the application's file dirtying patterns. One would have to experiment with it, and tuning of sync_file_range() usage would occur alongside tuning of the application's write() design. It's an interesting problem, with potentially high payback. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html