> On Oct 28, 2022, at 1:43 PM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> 
> On Fri, 2022-10-28 at 17:21 +0000, Chuck Lever III wrote:
>> 
>>> On Oct 28, 2022, at 11:51 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>>> 
>>> On Fri, 2022-10-28 at 15:29 +0000, Chuck Lever III wrote:
>>>> 
>>>>> On Oct 28, 2022, at 11:05 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>>>>> 
>>>>> The problem with not calling vfs_fsync is that we might miss
>>>>> writeback errors. The nfsd_file could get reaped before a v3 COMMIT
>>>>> ever comes in. nfsd would eventually reopen the file, but it could
>>>>> miss seeing the error if it got opened locally in the interim.
>>>> 
>>>> That helps. So we're surfacing writeback errors for local writers?
>>> 
>>> Well, for non-v3 writers anyway. I suppose you could hit the same
>>> scenario with a mixed v3 and v4 workload if you were unlucky enough,
>>> or a mixed v3 and ksmbd workload, etc...
>>> 
>>>> I guess I would like this flushing to interfere as little as possible
>>>> with the server's happy zone, since it's not something clients need
>>>> to wait for, and an error is exceptionally rare.
>>>> 
>>>> But also, we can't let writeback errors hold onto a bunch of memory
>>>> indefinitely. How much nfsd_file and page cache memory might be
>>>> pinned by a writeback error, and for how long?
>>> 
>>> You mean if we were to stop trying to fsync out when closing? We don't
>>> keep files in the cache indefinitely, even if they have writeback
>>> errors.
>>> 
>>> In general, the kernel attempts to write things out, and if that
>>> fails, it sets a writeback error in the mapping and marks the pages
>>> clean. So if we're talking about files that are no longer being used
>>> (since they're being GC'ed), we only block reaping them for as long
>>> as writeback is in progress.
>>> 
>>> Once writeback ends and the file is eligible for reaping, we call
>>> vfs_fsync a final time, grab the error, and reset the write verifier
>>> when it's non-zero.
>>> 
>>> If we stop doing fsyncs, then that model sort of breaks down. I'm not
>>> clear on what you'd like to do instead.
>> 
>> I'm not clear either. I think I just have some hand-wavy requirements.
>> 
>> I think keeping the flushes in the nfsd threads and away from single-
>> threaded garbage collection makes sense. Keep I/O in nfsd context, not
>> in the filecache garbage collector. I'm not sure that's guaranteed
>> if the garbage collection thread does an nfsd_file_put() that flushes.
> 
> The garbage collector doesn't call nfsd_file_put directly, though it
> will call nfsd_file_free, which now does a vfs_fsync.

OK, I thought I saw some email fly by that suggested using
nfsd_file_put in the garbage collector.

But the vfs_fsync you mention can still trigger I/O and wait for it in
the GC thread (it's not a SYNC_NONE flush). That's rare, but I'd
rather not have even that possibility if we can avoid it.
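To spell out the difference I'm after, here's a rough and untested
sketch. The helper names are invented; I'm assuming only the nf_file
field as declared in fs/nfsd/filecache.h:

#include <linux/fs.h>
#include "filecache.h"

/*
 * What the GC path gets today: fdatasync semantics, which can
 * issue I/O and then wait for it to complete, stalling the
 * single-threaded garbage collector.
 */
static int nf_flush_and_wait(struct nfsd_file *nf)
{
	return vfs_fsync(nf->nf_file, 1);
}

/*
 * What I'd rather rely on outside of nfsd thread context:
 * filemap_flush() starts WB_SYNC_NONE writeback and returns
 * without waiting for it to complete.
 */
static void nf_flush_nowait(struct nfsd_file *nf)
{
	filemap_flush(nf->nf_file->f_mapping);
}

If the nowait flavor has already run by the time an item is reaped,
the waiting flavor should usually find nothing left to write.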
>> But, back to the topic of this patch: my own experiments with
>> background syncing showed that it introduces significant overhead and
>> it wasn't really worth the trouble. Basically, on intensive workloads,
>> the garbage collector must not stall or live-lock because it's walking
>> through millions of pages trying to figure out which ones are dirty.
> 
> If this is what you want, then kicking off SYNC_NONE writeback when we
> put it on the LRU is the right thing to do.
> 
> We want to ensure that when the thing is reaped from the LRU, the
> final vfs_fsync has nothing left to write back. The penultimate put
> that adds it to the LRU is almost certainly going to come in the
> context of an nfsd thread, so kicking off background writeback at
> that point seems reasonable.

IIUC the opposing idea is to do a synchronous writeback in
nfsd_file_put and then nothing in nfsd_file_free. Why isn't that
adequate to achieve the same result?

Thinking aloud:

- Suppose a client does some UNSTABLE NFSv3 WRITEs to a file
- The client then waits long enough for the nfsd_file to get aged out
  of the filecache
- A local writer on the server triggers a writeback error on the file
- The error is cleared by other activity
- The client sends a COMMIT

Wouldn't the server miss the chance to bump its write verifier in
that case? (I've tried to spell this sequence out in errseq terms at
the end of this message.)

> Only files that aren't touched again get reaped off the LRU eventually,
> so there should be no danger of nfsd redirtying it again.

At the risk of rat-holing... IIUC the only case we should care about
is when there are outstanding NFSv3 WRITEs that haven't been COMMITed.
It seems like NFSv3-specific code, and not the filecache, should deal
with that case and leave nfsd_file_put/free out of it...? Again, I
have no clear idea how that would work; I'm just thinking about the
layering here.

> By the time we
> get to reaping it, everything should be written back and the inode will
> be ready to close with little delay.
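Here is how I believe the "thinking aloud" sequence above maps onto
the errseq_t machinery in lib/errseq.c. Untested and illustrative
only; the function is invented, and the two cursors stand in for the
f_wb_err fields of the local writer's and nfsd's struct files:

#include <linux/errno.h>
#include <linux/errseq.h>

static void verifier_miss_illustration(void)
{
	errseq_t wb_err = 0;		/* stands in for mapping->wb_err */
	errseq_t local_since = 0;	/* local writer opened before the error */
	errseq_t nfsd_since;

	/* The local writer triggers a writeback error... */
	errseq_set(&wb_err, -EIO);

	/* ...and then observes it, which marks the error as seen */
	errseq_check_and_advance(&wb_err, &local_since);	/* returns -EIO */

	/* The COMMIT arrives; nfsd has to re-open, so it samples only now */
	nfsd_since = errseq_sample(&wb_err);

	/*
	 * Because the error was already marked seen, the sample above is
	 * current, so this check reports nothing. nfsd never learns that
	 * it should bump the write verifier.
	 */
	errseq_check(&wb_err, nfsd_since);			/* returns 0 */
}

--
Chuck Lever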