Re: [PATCH v3 6/6] NFSD: Repeal and replace the READ_PLUS implementation

> On Jul 17, 2022, at 9:15 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> 
> On Fri, Jul 15, 2022 at 07:08:13PM +0000, Chuck Lever III wrote:
>>> On Jul 15, 2022, at 2:44 PM, Anna Schumaker <anna@xxxxxxxxxx> wrote:
>>> 
>>> +nfsd4_encode_read_plus_segments(struct nfsd4_compoundres *resp,
>>> +				struct nfsd4_read *read,
>>> +				unsigned int *segments, u32 *eof)
>>> {
>>> -	struct file *file = read->rd_nf->nf_file;
>>> -	loff_t data_pos = vfs_llseek(file, read->rd_offset, SEEK_DATA);
>>> -	loff_t f_size = i_size_read(file_inode(file));
>>> -	unsigned long count;
>>> -	__be32 *p;
>>> +	struct xdr_stream *xdr = resp->xdr;
>>> +	unsigned int bufpos = xdr->buf->len;
>>> +	u64 offset = read->rd_offset;
>>> +	struct read_plus_segment segment;
>>> +	enum data_content4 pagetype;
>>> +	unsigned long maxcount;
>>> +	unsigned int pagenum = 0;
>>> +	unsigned int pagelen;
>>> +	char *vpage, *p;
>>> +	__be32 nfserr;
>>> 
>>> -	if (data_pos == -ENXIO)
>>> -		data_pos = f_size;
>>> -	else if (data_pos <= read->rd_offset || (data_pos < f_size && data_pos % PAGE_SIZE))
>>> -		return nfsd4_encode_read_plus_data(resp, read, maxcount, eof, &f_size);
>>> -	count = data_pos - read->rd_offset;
>>> -
>>> -	/* Content type, offset, byte count */
>>> -	p = xdr_reserve_space(resp->xdr, 4 + 8 + 8);
>>> -	if (!p)
>>> +	/* enough space for a HOLE segment before we switch to the pages */
>>> +	if (!xdr_reserve_space(xdr, 5 * XDR_UNIT))
>>> 		return nfserr_resource;
>>> +	xdr_commit_encode(xdr);
>>> 
>>> -	*p++ = htonl(NFS4_CONTENT_HOLE);
>>> -	p = xdr_encode_hyper(p, read->rd_offset);
>>> -	p = xdr_encode_hyper(p, count);
>>> +	maxcount = min_t(unsigned long, read->rd_length,
>>> +			 (xdr->buf->buflen - xdr->buf->len));
>>> 
>>> -	*eof = (read->rd_offset + count) >= f_size;
>>> -	*maxcount = min_t(unsigned long, count, *maxcount);
>>> +	nfserr = nfsd4_read_plus_readv(resp, read, &maxcount, eof);
>>> +	if (nfserr)
>>> +		return nfserr;
>>> +
>>> +	while (maxcount > 0) {
>>> +		vpage = xdr_buf_nth_page_address(xdr->buf, pagenum, &pagelen);
>>> +		pagelen = min_t(unsigned int, pagelen, maxcount);
>>> +		if (!vpage || pagelen == 0)
>>> +			break;
>>> +		p = memchr_inv(vpage, 0, pagelen);
>> 
>> I'm still not happy about touching every byte in each READ_PLUS
>> payload. I think even though the rest of this work is merge-ready,
>> this is a brute-force mechanism that's OK for a proof of concept
>> but not appropriate for production-ready code.
> 
> Seems like a step backwards as it defeats the benefit of zero-copy
> read IO paths on the server side....

Tom Haynes' vision for READ_PLUS was to eventually replace the
legacy READ operation. That means READ_PLUS(CONTENT_DATA) needs
to be as fast and efficient as plain READ. (It would be great
to use splice reads for CONTENT_DATA if we can!)
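
To make the comparison concrete: the plain READ encoder already
chooses between the zero-copy splice path and a readv-style copy,
roughly like the sketch below. This is paraphrased from memory of
nfsd4_encode_read() in fs/nfsd/nfs4xdr.c, so the exact code in the
tree may differ; it is only meant to show the dispatch we would want
READ_PLUS(CONTENT_DATA) to preserve.

	/* Paraphrased sketch, not verbatim kernel code. */
	if (file->f_op->splice_read &&
	    test_bit(RQ_SPLICE_OK, &resp->rqstp->rq_flags))
		nfserr = nfsd4_encode_splice_read(resp, read, file, maxcount);
	else
		nfserr = nfsd4_encode_readv(resp, read, file, maxcount);

A CONTENT_DATA encoder that always bounces the payload through the
xdr_buf pages gives up that splice path entirely.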

But I also thought the purpose of READ_PLUS was to help clients
preserve unallocated extents in files during copy operations.
An unallocated extent is not the same as an allocated extent
that has zeroes written into it. IIUC this new logic does not
distinguish between those two cases at all. (And please correct
me if this is really not the goal of READ_PLUS).

I would like to retain precise detection of unallocated extents
in files. Maybe SEEK_HOLE/SEEK_DATA is not how to do that, but
my feeling is that checking for zero bytes is definitely not
the way to do it.
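
As a concrete illustration of the difference (a user-space sketch
only, not server code): lseek(2) with SEEK_HOLE asks the filesystem
about its own mapping, so an allocated extent that has had zeroes
written into it is still reported as data, while an unallocated
extent shows up as a hole. A memchr_inv()-style scan of the payload
cannot make that distinction.

	#define _GNU_SOURCE		/* for SEEK_HOLE */
	#include <unistd.h>

	/* Return 1 if @offset lies in an unallocated extent (a hole),
	 * 0 if it is backed by allocated blocks, -1 on error.
	 */
	static int offset_is_hole(int fd, off_t offset)
	{
		off_t next_hole = lseek(fd, offset, SEEK_HOLE);

		if (next_hole == (off_t)-1)
			return -1;	/* e.g. ENXIO at or past EOF */
		return next_hole == offset;
	}

(The in-kernel equivalent is vfs_llseek(), as in the code this patch
removes above.)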


>> I've cc'd linux-fsdevel to see if we can get some more ideas going
>> and move this forward.
>> 
>> Another thought I had was to support returning only one or two
>> segments per reply. One CONTENT segment, one HOLE segment, or one
>> of each. Would that be enough to prevent the issues around file
>> updates racing with the construction of the reply?
> 
> Before I can make any sort of useful suggestion, I need to have it
> explained to me why we care if the underlying file mapping has
> changed between the read of the data and the SEEK_HOLE trim check,
> because it's not at all clear to me what problem this change is
> actually solving.

The cover letter for this series calls out generic/091 and
generic/263, noting that both fail with NFSv4.2. I've tried to
reproduce those failures here, but both tests passed.

Anna, can you provide a root-cause analysis of what is failing
in your testing? Maybe a reproducer for us to kick around?
I'm guessing you might be encountering races because your
usual test environment is virtual, so we need to understand
how timing affects the results.


--
Chuck Lever
