Re: [PATCH v3 6/6] NFSD: Repeal and replace the READ_PLUS implementation

Chuck Lever III <chuck.lever@xxxxxxxxxx> · Wed, 20 Jul 2022 01:26:13 +0000

> On Jul 19, 2022, at 6:44 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> 
> On Tue, Jul 19, 2022 at 04:46:50PM -0400, Anna Schumaker wrote:
>> On Sun, Jul 17, 2022 at 9:16 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>> 
>>> On Fri, Jul 15, 2022 at 07:08:13PM +0000, Chuck Lever III wrote:
>>>>> On Jul 15, 2022, at 2:44 PM, Anna Schumaker <anna@xxxxxxxxxx> wrote:
>>>>> 
>>>>> From: Anna Schumaker <Anna.Schumaker@xxxxxxxxxx>
>>>>> 
>>>>> Rather than relying on the underlying filesystem to tell us where hole
>>>>> and data segments are through vfs_llseek(), let's instead do the hole
>>>>> compression ourselves. This has a few advantages over the old
>>>>> implementation:
>>>>> 
>>>>> 1) A single call to the underlying filesystem through nfsd_readv() means
>>>>>  the file can't change from underneath us in the middle of encoding.
>>> 
>>> Hi Anna,
>>> 
>>> I'm assuming you mean the vfs_llseek(SEEK_HOLE) call at the start
>>> of nfsd4_encode_read_plus_data() that is used to trim the data that
>>> has already been read out of the file?
>> 
>> There is also the vfs_llseek(SEEK_DATA) call at the start of
>> nfsd4_encode_read_plus_hole(). They are used to determine the length
>> of the current hole or data segment.
>> 
>>> 
>>> What's the problem with racing with a hole punch here? All it does
>>> is shorten the read data returned to match the new hole, so all it's
>>> doing is making the returned data "more correct".
>> 
>> The problem is we call vfs_llseek() potentially many times when
>> encoding a single reply to READ_PLUS. nfsd4_encode_read_plus() has a
>> loop where we alternate between hole and data segments until we've
>> encoded the requested number of bytes. My attempts at locking the file
>> have resulted in a deadlock since vfs_llseek() also locks the file, so
>> the file could change from underneath us during each iteration of the
>> loop.
> 
> So the problem being solved is that the current encoding is not
> atomic, rather than trying to avoid any computational overhead of
> multiple vfs_llseek calls (which are largely just the same extent
> lookups as we do during the read call)?

Reviewing [1] and [2] I don't find any remarks about atomicity
guarantees. If a client needs an uncontended view of a file's
data, it's expected to fence other accessors via a OPEN(deny)
or LOCK operation, or serialize the requests itself.

> The implementation just seems backwards to me - rather than reading
> data and then trying to work out where the holes are, I suspect it
> should be working out where the holes are and then reading the data.
> This is how the IO path in filesystems work, so it would seem like a
> no-brainer to try to leverage the infrastructure we already have to
> do that.
> 
> The information is there and we have infrastructure that exposes it
> to the IO path, it's just *underneath* the page cache and the page
> cache destroys the information that it used to build the data it
> returns to the NFSD.
> 
> IOWs, it seems to me that what READ_PLUS really wants is a "sparse
> read operation" from the filesystem rather than the current "read
> that fills holes with zeroes". i.e. a read operation that sets an
> iocb flag like RWF_SPARSE_READ to tell the filesystem to trim the
> read to just the ranges that contain data.
> 
> That way the read populates the page cache over a single contiguous
> range of data and returns with the {offset, len} that spans the
> range that is read and mapped. The caller can then read that region
> out of the page cache and mark all the non-data regions as holes in
> whatever manner they need to.
> 
> The iomap infrastructure that XFS and other filesystems use provide
> this exact "map only what contains data" capability - an iomap tells
> the page cache exactly what underlies the data range (hole, data,
> unwritten extents, etc) in an efficient manner, so it wouldn't be a
> huge stretch just to limit read IO ranges to those that contain only
> DATA extents.
> 
> At this point READ_PLUS then just needs to iterate doing sparse
> reads and recording the ranges that return data as vector of some
> kind that is then passes to the encoding function to encode it as
> a sparse READ_PLUS data range....

The iomap approach seems sensible to me and covers the two basic
usage scenarios:

- Large sparse files, where we want to conserve both network
  bandwidth and client (and intermediate) cache occupancy.
  These are best served by exposing data and holes.

- Everyday files that are relatively small and generally will
  continue few, if any, holes. These are best served by using
  a splice read (where possible) -- that is, READ_PLUS in this
  case should work exactly like READ.

My impression of client implementations is that, a priori,
a client does not know whether a file contains holes or not,
but will probably always use READ_PLUS and let the server
make the choice for it.

Now how does the server make that choice? Is there an attribute
bit that indicates when a file should be treated as sparse? Can
we assume that immutable files (or compressed files) should
always be treated as sparse? Alternately, the server might use
the file's data : hole ratio.

--
Chuck Lever

[1] https://datatracker.ietf.org/doc/html/rfc7862#section-6

[2] https://datatracker.ietf.org/doc/html/rfc5661#section-18.22