Re: [PATCH v7 0/5] vfs: Non-blocking buffered fs read (page cache only)

Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> · Mon, 30 Mar 2015 13:26:25 -0700

On Mon, 30 Mar 2015 00:36:04 -0700 Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:

> On Fri, Mar 27, 2015 at 08:58:54AM -0700, Jeremy Allison wrote:
> > The problem with the above is that we can't tell the difference
> > between pread2() returning a short read because the pages are not
> > in cache, or because someone truncated the file. So we need some
> > way to differentiate this.
> 
> Is a race vs truncate really that time critical that you can't
> wait for the thread pool to do the second read to notice it?
> 
> > My preference from userspace would be for pread2() to return
> > EAGAIN if *all* the data requested is not available (where
> > 'all' can be less than the size requested if the file has
> > been truncated in the meantime).
> 
> That is easily implementable, but I can see that for example web apps
> would be happy to get as much as possible.  So if Samba can be ok
> with short reads and only detecting the truncated case in the slow
> path that would make life simpler.  Otherwise we might indeed need two
> flags.

The problem is that many applications (including samba!) want
all-or-nothing behaviour, and preadv2() cannot provide it.  By the time
preadv2() discovers a not-present page, it has already copied bulk data
out to userspace.

To fix this, preadv2() would need to take two passes across the pages,
pinning them in between and somehow blocking out truncate.  That's a
big change.

With the current preadv2(), applications would have to do

	nr_read = preadv2(..., offset, len, ...);
	if (nr_read == len)
		process data;
	else
		punt(offset + nr_read, len - nr_read);

and the worker thread will later have to splice together the initial
data and the later-arriving data, probably on another CPU, probably
after the initial data has gone cache-cold.
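For concreteness, the short-read pattern above might look roughly like this
in C.  This is only a sketch under assumptions: it uses the preadv2()
wrapper and the RWF_NOWAIT flag as found in later Linux kernels and glibc
(the flag in this patch series may be named differently), and
struct split_read and try_fast_read() are hypothetical names introduced
here for illustration.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>      /* open(), for callers */
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_NOWAIT
#define RWF_NOWAIT 0x00000008	/* assumed value, used only if libc lacks it */
#endif

/* Hypothetical record of what the fast path did and what is left over. */
struct split_read {
	size_t fast_bytes;	/* copied straight from the page cache */
	off_t  punt_offset;	/* where the worker thread must resume */
	size_t punt_len;	/* bytes still owed by the worker thread */
};

/*
 * Try a non-blocking read of [offset, offset + len) into buf.
 * Returns 0 on success (possibly with punt_len > 0), -1 on a hard error.
 */
static int try_fast_read(int fd, void *buf, size_t len, off_t offset,
			 struct split_read *out)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	ssize_t nr_read = preadv2(fd, &iov, 1, offset, RWF_NOWAIT);

	if (nr_read < 0) {
		/* Nothing cached, or the flag is unsupported: punt it all. */
		if (errno != EAGAIN && errno != EOPNOTSUPP && errno != ENOSYS)
			return -1;
		nr_read = 0;
	}

	out->fast_bytes = (size_t)nr_read;
	out->punt_offset = offset + nr_read;
	out->punt_len = len - (size_t)nr_read;
	return 0;
}
```

When punt_len is non-zero the worker thread has to fill buf + fast_bytes
starting at punt_offset, which is the splice-together step described above.
Note that a short read caused by truncation looks identical here; only the
worker's second read can tell the two cases apart.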

A cleaner solution is

	if (fincore(fd, NULL, offset, len) == len) {
		preadv(..., offset, len);
		process data;
	} else {
		punt(offset, len);
	}

This way all the data gets copied in a single hit and is cache-hot when
userspace processes it.
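Since fincore() is only a proposal, the all-or-nothing flow can be tried
out today by emulating it with mmap() + mincore().  The sketch below makes
several assumptions: cached_bytes() and read_if_cached() are hypothetical
names, the return value (contiguous resident bytes starting at offset) is
just one guess at what fincore(fd, NULL, offset, len) would report, and the
emulation is far heavier than a dedicated syscall would be, so it shows the
control flow and says nothing about cost.

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>      /* open(), for callers */
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for fincore(fd, NULL, offset, len).
 * Returns the number of contiguous page-cache-resident bytes starting at
 * offset (at most len), or -1 on error.  Assumes offset >= 0.  Recent
 * kernels only report real residency for files the caller owns or can
 * write; otherwise every page is reported as resident.
 */
static ssize_t cached_bytes(int fd, off_t offset, size_t len)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	size_t lead = (size_t)(offset % (off_t)page);	/* bytes before offset in its page */
	off_t start = offset - (off_t)lead;		/* mmap() needs page alignment */
	size_t span = lead + len;
	size_t npages = (span + page - 1) / page;
	size_t i = 0, resident;
	unsigned char *vec;
	void *map;

	if (len == 0)
		return 0;
	vec = malloc(npages);
	if (!vec)
		return -1;
	map = mmap(NULL, span, PROT_READ, MAP_SHARED, fd, start);
	if (map == MAP_FAILED) {
		free(vec);
		return -1;
	}
	if (mincore(map, span, vec) != 0) {
		munmap(map, span);
		free(vec);
		return -1;
	}
	while (i < npages && (vec[i] & 1))	/* low bit set: page is resident */
		i++;
	munmap(map, span);
	free(vec);

	resident = i * page;			/* resident bytes counted from start */
	if (resident <= lead)
		return 0;
	resident -= lead;
	return (ssize_t)(resident < len ? resident : len);
}

/*
 * All-or-nothing read: returns bytes read on the fast path, 0 if the
 * caller should punt (offset, len) to a worker thread, -1 on error.
 */
static ssize_t read_if_cached(int fd, void *buf, size_t len, off_t offset)
{
	ssize_t cached = cached_bytes(fd, offset, len);

	if (cached < 0)
		return -1;
	if ((size_t)cached != len)
		return 0;
	/* May still block if a page is reclaimed between the two calls. */
	return pread(fd, buf, len, offset);
}
```

A caller that gets 0 back hands the whole (offset, len) request to the
thread pool, so nothing has to be stitched together afterwards.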

Comparing fincore()+pread() to preadv2():

pros:

a) fincore() may be used to provide both all-or-nothing and
   part-read-ok behaviour cleanly and with optimum cache behaviour.

b) fincore() doesn't add overhead, complexity and stack depth to
   core pagecache read() code.  Nor does it expand VFS data structures.

c) with a non-NULL second argument, fincore provides the
   mincore()-style page map.

cons:

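d) fincore()+pread() is more expensive: every read costs a second
   syscall and a second walk of the pagecache

e) the pread() that follows fincore() will very occasionally block,
   if a page is reclaimed between the two calls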

Tradeoffs are involved.  To decide on the best path we should examine
d).  I expect that the overhead will be significant for small reads but
not significant for medium and large reads.  Needs quantifying.

And I don't believe that e) will be a problem in the real world.  It's
a significant increase in worst-case latency and a negligible increase
in average latency.  I've asked at least three times for someone to
explain why this is unacceptable and no explanation has been provided.