> On Oct 28, 2022, at 11:51 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> 
> On Fri, 2022-10-28 at 15:29 +0000, Chuck Lever III wrote:
>> 
>>> On Oct 28, 2022, at 11:05 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>>> 
>>> On Fri, 2022-10-28 at 13:16 +0000, Chuck Lever III wrote:
>>>> 
>>>>> On Oct 27, 2022, at 5:52 PM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>>>>> 
>>>>> When a GC entry gets added to the LRU, kick off SYNC_NONE writeback
>>>>> so that we can be ready to close it out when the time comes.
>>>> 
>>>> For a large file, a background flush still has to walk the file's
>>>> pages to see if they are dirty, and that consumes time, CPU, and
>>>> memory bandwidth. We're talking hundreds of microseconds for a
>>>> large file.
>>>> 
>>>> Then the final flush does all that again.
>>>> 
>>>> Basically, two (or more!) passes through the file for exactly the
>>>> same amount of work. Is there any measured improvement in latency
>>>> or throughput?
>>>> 
>>>> And then... for a GC file, no-one is waiting on data persistence
>>>> during nfsd_file_put(), so I'm not sure what is gained by taking
>>>> control of the flushing process away from the underlying filesystem.
>>>> 
>>>> Remind me why the filecache is flushing? Shouldn't NFSD rely on
>>>> COMMIT operations for that? (It's not obvious reading the code;
>>>> maybe there should be a documenting comment somewhere that
>>>> explains this arrangement.)
>>> 
>>> Fair point. I was trying to replicate the behaviors introduced in
>>> these patches:
>>> 
>>> b6669305d35a nfsd: Reduce the number of calls to nfsd_file_gc()
>>> 6b8a94332ee4 nfsd: Fix a write performance regression
>>> 
>>> AFAICT, the fsync is there to catch writeback errors so that we can
>>> reset the write verifiers. The rationale for that is described here:
>>> 
>>> 055b24a8f230 nfsd: Don't garbage collect files that might contain write errors
>> 
>> Yes, I've been confused about this since then :-)
>> 
>> So, the patch description says:
>> 
>> If a file may contain unstable writes that can error out, then we want
>> to avoid garbage collecting the struct nfsd_file that may be
>> tracking those errors.
>> 
>> That doesn't explain why that's a problem; it just says what we plan
>> to do about it.
>> 
>>> The problem with not calling vfs_fsync is that we might miss writeback
>>> errors. The nfsd_file could get reaped before a v3 COMMIT ever comes
>>> in. nfsd would eventually reopen the file, but it could miss seeing
>>> the error if it got opened locally in the interim.
>> 
>> That helps. So we're surfacing writeback errors for local writers?
> 
> Well, for non-v3 writers anyway. I suppose you could hit the same
> scenario with a mixed v3 and v4 workload if you were unlucky enough,
> or a mixed v3 and ksmbd workload, etc...
> 
>> I guess I would like this flushing to interfere as little as possible
>> with the server's happy zone, since it's not something clients need
>> to wait for, and an error is exceptionally rare.
>> 
>> But also, we can't let writeback errors hold onto a bunch of memory
>> indefinitely. How much nfsd_file and page cache memory might be
>> pinned by a writeback error, and for how long?
> 
> You mean if we were to stop trying to fsync out when closing? We don't
> keep files in the cache indefinitely, even if they have writeback
> errors.
> 
> In general, the kernel attempts to write things out, and if it fails
> it sets a writeback error in the mapping and marks the pages clean.
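
Right -- and just to make sure I'm reading that the same way you do,
here is the error-latching model as I understand it, boiled down to a
sketch. wb_err_model_sketch() is a made-up function, not anything in
nfsd; the filemap_* and vfs_fsync() calls are the stock kernel APIs:

#include <linux/fs.h>

/*
 * wb_err_model_sketch(): toy illustration only, not nfsd code. Failed
 * background writeback records its error in the mapping's errseq_t and
 * marks the pages clean, so the error outlives the dirty data.
 */
static int wb_err_model_sketch(struct file *file)
{
        struct address_space *mapping = file->f_mapping;
        errseq_t since = filemap_sample_wb_err(mapping);
        int err;

        /* Start WB_SYNC_NONE writeback: does not wait for the I/O. */
        filemap_flush(mapping);

        /*
         * Suppose that writeback eventually fails. The error is
         * latched in mapping->wb_err and the pages go clean, so a
         * later scan of the mapping finds nothing dirty left to retry.
         */

        /* One way to notice the failure: check against our sample. */
        err = filemap_check_wb_err(mapping, since);
        if (err)
                return err;

        /*
         * The usual way: a data-integrity sync. Filesystems that use
         * the file_write_and_wait_range() helpers advance the struct
         * file's error cursor here and return the latched error.
         */
        return vfs_fsync(file, 1);
}

If that's accurate, then the final sync at reap time really is the
last chance nfsd has to observe an error from writeback it kicked off
earlier.
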
> So if we're talking about files that are no longer being used (since
> they're being GC'ed), we only block reaping them for as long as
> writeback is in progress.
> 
> Once writeback ends and it's eligible for reaping, we'll call
> vfs_fsync a final time, grab the error, and reset the write verifier
> when it's non-zero.
> 
> If we stop doing fsyncs, then that model sort of breaks down. I'm not
> clear on what you'd like to do instead.

I'm not clear either. I think I just have some hand-wavy requirements.

I think keeping the flushes in the nfsd threads and away from
single-threaded garbage collection makes sense. Keep I/O in nfsd
context, not in the filecache garbage collector. I'm not sure that's
guaranteed if the garbage collection thread does an nfsd_file_put()
that flushes.

And, we need to ensure that an nfsd_file isn't pinned forever -- the
flush has to make forward progress so that the nfsd_file is eventually
released. Otherwise, writeback errors become a DoS vector.

But, back to the topic of this patch: my own experiments with
background syncing showed that it introduces significant overhead and
it wasn't really worth the trouble. Basically, on intensive workloads,
the garbage collector must not stall or live-lock because it's walking
through millions of pages trying to figure out which ones are dirty.


>>> I'm not sure we need to worry about that so much for v4 though.
>>> Maybe we should just do this for GC files?
>> 
>> I'm not caffeinated yet. Why is it not a problem for v4? Is it
>> because an open or delegation stateid will prevent the nfsd_file
>> from going away?
> 
> Yeah, more or less.
> 
> I think that for an error to be lost with v4, it would require the
> client to have an application access pattern that would expose it to
> that possibility on a local filesystem as well. I don't think we have
> any obligation to do more there.
> 
> Maybe that's a false assumption though.
> 
>> Sorry for the noise. It's all a little subtle.
> 
> Very subtle. The more we can get this fleshed out into comments the
> better, so I welcome the questions.
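
Agreed. For the archive, here is the sequence I understand this patch
to set up, reduced to a sketch. The two helpers below are made up for
illustration only; the real code is nfsd_file_flush() and
nfsd_file_fsync() in the diff that follows:

#include <linux/fs.h>

/*
 * Step 1, when the last reference is put and the nfsd_file lands on
 * the LRU: kick off WB_SYNC_NONE writeback. This starts I/O on dirty
 * pages but does not wait for it and does not report errors.
 */
static void sketch_kick_writeback(struct file *file)
{
        if (file->f_mapping->nrpages)
                filemap_flush(file->f_mapping);
}

/*
 * Step 2, when the garbage collector finally frees the nfsd_file: a
 * data-integrity sync that waits for writeback and returns any latched
 * error. A non-zero return here is where nfsd resets its write
 * verifier so that clients resend their unstable writes.
 */
static int sketch_final_sync(struct file *file)
{
        return vfs_fsync(file, 1);
}

My earlier concern was that step 1 still has to walk the file's pages
to find the dirty ones, and step 2 then walks them again.
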
> 
>> 
>>>>> Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx>
>>>>> ---
>>>>> fs/nfsd/filecache.c | 37 +++++++++++++++++++++++++++++++------
>>>>> 1 file changed, 31 insertions(+), 6 deletions(-)
>>>>> 
>>>>> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
>>>>> index d2bbded805d4..491d3d9a1870 100644
>>>>> --- a/fs/nfsd/filecache.c
>>>>> +++ b/fs/nfsd/filecache.c
>>>>> @@ -316,7 +316,7 @@ nfsd_file_alloc(struct nfsd_file_lookup_key *key, unsigned int may)
>>>>>  }
>>>>> 
>>>>>  static void
>>>>> -nfsd_file_flush(struct nfsd_file *nf)
>>>>> +nfsd_file_fsync(struct nfsd_file *nf)
>>>>>  {
>>>>>          struct file *file = nf->nf_file;
>>>>> 
>>>>> @@ -327,6 +327,22 @@ nfsd_file_flush(struct nfsd_file *nf)
>>>>>          nfsd_reset_write_verifier(net_generic(nf->nf_net, nfsd_net_id));
>>>>>  }
>>>>> 
>>>>> +static void
>>>>> +nfsd_file_flush(struct nfsd_file *nf)
>>>>> +{
>>>>> +        struct file *file = nf->nf_file;
>>>>> +        unsigned long nrpages;
>>>>> +
>>>>> +        if (!file || !(file->f_mode & FMODE_WRITE))
>>>>> +                return;
>>>>> +
>>>>> +        nrpages = file->f_mapping->nrpages;
>>>>> +        if (nrpages) {
>>>>> +                this_cpu_add(nfsd_file_pages_flushed, nrpages);
>>>>> +                filemap_flush(file->f_mapping);
>>>>> +        }
>>>>> +}
>>>>> +
>>>>>  static void
>>>>>  nfsd_file_free(struct nfsd_file *nf)
>>>>>  {
>>>>> @@ -337,7 +353,7 @@ nfsd_file_free(struct nfsd_file *nf)
>>>>>          this_cpu_inc(nfsd_file_releases);
>>>>>          this_cpu_add(nfsd_file_total_age, age);
>>>>> 
>>>>> -        nfsd_file_flush(nf);
>>>>> +        nfsd_file_fsync(nf);
>>>>> 
>>>>>          if (nf->nf_mark)
>>>>>                  nfsd_file_mark_put(nf->nf_mark);
>>>>> @@ -500,12 +516,21 @@ nfsd_file_put(struct nfsd_file *nf)
>>>>> 
>>>>>          if (test_bit(NFSD_FILE_GC, &nf->nf_flags)) {
>>>>>                  /*
>>>>> -                 * If this is the last reference (nf_ref == 1), then transfer
>>>>> -                 * it to the LRU. If the add to the LRU fails, just put it as
>>>>> -                 * usual.
>>>>> +                 * If this is the last reference (nf_ref == 1), then try
>>>>> +                 * to transfer it to the LRU.
>>>>> +                 */
>>>>> +                if (refcount_dec_not_one(&nf->nf_ref))
>>>>> +                        return;
>>>>> +
>>>>> +                /*
>>>>> +                 * If the add to the list succeeds, try to kick off SYNC_NONE
>>>>> +                 * writeback. If the add fails, then just fall through to
>>>>> +                 * decrement as usual.
>>>> 
>>>> These comments simply repeat what the code does, so they seem
>>>> redundant to me. Could they instead explain why?
>>>> 
>>>>>                  */
>>>>> -                if (refcount_dec_not_one(&nf->nf_ref) || nfsd_file_lru_add(nf))
>>>>> +                if (nfsd_file_lru_add(nf)) {
>>>>> +                        nfsd_file_flush(nf);
>>>>>                          return;
>>>>> +                }
>>>>>          }
>>>>>          __nfsd_file_put(nf);
>>>>>  }
>>>>> -- 
>>>>> 2.37.3
> 
> -- 
> Jeff Layton <jlayton@xxxxxxxxxx>

-- 
Chuck Lever