Re: Triggering non-integrity writeback from userspace

Andres Freund <andres@xxxxxxxxxxx> · Wed, 28 Oct 2015 10:27:52 +0100

Hi,

Thanks for looking into this.

On 2015-10-25 08:39:12 +1100, Dave Chinner wrote:
> WB_SYNC_ALL is simply a method of saying "writeback all dirty pages
> and don't skip any". That's part of a data integrity operation, but
> it's not what results in data integrity being provided. It may cause
> some latencies caused by blocking on locks or in the request queues,
> so that's what I'd be looking for.

It also means we'll wait for more:

int write_cache_pages(struct address_space *mapping,
		      struct writeback_control *wbc, writepage_t writepage,
		      void *data)
{
...
	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
		tag = PAGECACHE_TAG_TOWRITE;
	else
		tag = PAGECACHE_TAG_DIRTY;
...
			if (PageWriteback(page)) {
				if (wbc->sync_mode != WB_SYNC_NONE)
					wait_on_page_writeback(page);
				else
					goto continue_unlock;
			}

> i.e. if the request queues are full, SYNC_FILE_RANGE_WRITE will
> block until all the IO it has been requested to write has been
> submitted to the request queues. Put simply: the IO is asynchronous
> in that we don't wait for completion, but the IO submission is still
> synchronous.

That's desirable in our case, because it puts a limit on how much
outstanding IO there can be.

> Data integrity operations require related file metadata (e.g. block
> allocation transactions) to be forced to the journal/disk, and a
> device cache flush issued to ensure the data is on stable storage.
> SYNC_FILE_RANGE_WRITE does neither of these things, and hence while
> the IO might be the same pattern as a data integrity operation, it
> does not provide such guarantees.

Which is desired here - the actual integrity is still going to be done
via fsync(). The idea of using SYNC_FILE_RANGE_WRITE beforehand is that
the fsync() will only have to do very little work. The language in
sync_file_range(2) doesn't inspire enough confidence for using it as an
actual integrity operation :/
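
A minimal sketch of that pattern, for illustration only: the helper name
write_with_early_writeback(), the filler data, and the buffer and chunk
sizes are all made up here and are not taken from any real code base. It
starts writeback on each chunk with SYNC_FILE_RANGE_WRITE as soon as the
chunk has been written, treats that call purely as a hint, and relies on
the final fsync() alone for integrity.

```c
#define _GNU_SOURCE		/* needed for sync_file_range() */
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: write `total` bytes of filler to `path`, asking
 * the kernel to start writeback every `chunk` bytes, then fsync().
 * Returns 0 on success, -1 on failure.
 */
int write_with_early_writeback(const char *path, size_t total, size_t chunk)
{
	char buf[8192];
	off_t written = 0;	/* bytes handed to write() so far */
	off_t flushed = 0;	/* bytes already passed to sync_file_range() */
	int fd;

	assert(chunk > 0);
	memset(buf, 'x', sizeof(buf));

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;

	while ((size_t) written < total) {
		size_t want = total - (size_t) written;
		ssize_t n;

		if (want > sizeof(buf))
			want = sizeof(buf);
		n = write(fd, buf, want);
		if (n <= 0)
			goto fail;
		written += n;

		if ((size_t) (written - flushed) >= chunk) {
			/*
			 * Only a hint: this may block while the IO is being
			 * submitted, but it does not wait for completion and
			 * gives no durability guarantee, so errors are ignored.
			 */
			(void) sync_file_range(fd, flushed, written - flushed,
					       SYNC_FILE_RANGE_WRITE);
			flushed = written;
		}
	}

	/* The actual integrity operation; ideally little is left to write. */
	if (fsync(fd) != 0)
		goto fail;
	return close(fd);

fail:
	close(fd);
	return -1;
}
```

Calling it as write_with_early_writeback("/tmp/example.dat", 1 << 20,
256 * 1024) would write 1MB and request writeback every 256kB; the path
and sizes are again only placeholders.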

> > If I followed the code correctly - not a sure thing at all - that means
> > bios are submitted with WRITE_SYNC specified. Not really what's needed
> > in this case.
>
> That just allows the IO scheduler to classify them differently to
> bulk background writeback.

It also influences which writes are merged and which are not, at least
if I understand elv_rq_merge_ok() and the callbacks it calls correctly.

> You don't want to do writeback from the syscall, right? i.e. you'd
> like to expire the inode behind the fd, and schedule background
> writeback to run on it immediately?

Yes, that's exactly what we want. Blocking when a process has issued too
many writes is fine, though.

Greetings,

Andres Freund
