> On Oct 28, 2022, at 11:51 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> 
> On Fri, 2022-10-28 at 15:29 +0000, Chuck Lever III wrote:
>> 
>>> On Oct 28, 2022, at 11:05 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>>> 
>>> On Fri, 2022-10-28 at 13:16 +0000, Chuck Lever III wrote:
>>>> 
>>>>> On Oct 27, 2022, at 5:52 PM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>>>>> 
>>>>> When a GC entry gets added to the LRU, kick off SYNC_NONE writeback
>>>>> so that we can be ready to close it out when the time comes.
>>>> 
>>>> For a large file, a background flush still has to walk the file's
>>>> pages to see if they are dirty, and that consumes time, CPU, and
>>>> memory bandwidth. We're talking hundreds of microseconds for a
>>>> large file.
>>>> 
>>>> Then the final flush does all that again.
>>>> 
>>>> Basically, two (or more!) passes through the file for exactly the
>>>> same amount of work. Is there any measured improvement in latency
>>>> or throughput?
>>>> 
>>>> And then... for a GC file, no-one is waiting on data persistence
>>>> during nfsd_file_put(), so I'm not sure what is gained by taking
>>>> control of the flushing process away from the underlying filesystem.
>>>> 
>>>> Remind me why the filecache is flushing? Shouldn't NFSD rely on
>>>> COMMIT operations for that? (It's not obvious reading the code;
>>>> maybe there should be a documenting comment somewhere that
>>>> explains this arrangement.)
>>> 
>>> Fair point. I was trying to replicate the behaviors introduced in
>>> these patches:
>>> 
>>> b6669305d35a nfsd: Reduce the number of calls to nfsd_file_gc()
>>> 6b8a94332ee4 nfsd: Fix a write performance regression
>>> 
>>> AFAICT, the fsync is there to catch writeback errors so that we can
>>> reset the write verifiers. The rationale for that is described here:
>>> 
>>> 055b24a8f230 nfsd: Don't garbage collect files that might contain write errors
>> 
>> Yes, I've been confused about this since then :-)
>> 
>> So, the patch description says:
>> 
>> If a file may contain unstable writes that can error out, then we want
>> to avoid garbage collecting the struct nfsd_file that may be
>> tracking those errors.
>> 
>> That doesn't explain why that's a problem; it just says what we plan
>> to do about it.
>> 
>>> The problem with not calling vfs_fsync is that we might miss writeback
>>> errors. The nfsd_file could get reaped before a v3 COMMIT ever comes
>>> in. nfsd would eventually reopen the file, but it could miss seeing
>>> the error if it got opened locally in the interim.
>> 
>> That helps. So we're surfacing writeback errors for local writers?
> 
> Well, for non-v3 writers anyway. I suppose you could hit the same
> scenario with a mixed v3 and v4 workload if you were unlucky enough,
> or a mixed v3 and ksmbd workload, etc...
> 
>> I guess I would like this flushing to interfere as little as possible
>> with the server's happy zone, since it's not something clients need
>> to wait for, and an error is exceptionally rare.
>> 
>> But also, we can't let writeback errors hold onto a bunch of memory
>> indefinitely. How much nfsd_file and page cache memory might be
>> pinned by a writeback error, and for how long?
> 
> You mean if we were to stop trying to fsync out when closing? We don't
> keep files in the cache indefinitely, even if they have writeback
> errors.
> 
> In general, the kernel attempts to write things out, and if it fails
> it sets a writeback error in the mapping and marks the pages clean.
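
Right -- and just to make sure I'm reading that the same way you do,
here is the error-latching model as I understand it, boiled down to a
sketch. wb_err_model_sketch() is a made-up function, not anything in
nfsd; the filemap_* and vfs_fsync() calls are the stock kernel APIs:

#include <linux/fs.h>

/*
 * wb_err_model_sketch(): toy illustration only, not nfsd code. Failed
 * background writeback records its error in the mapping's errseq_t and
 * marks the pages clean, so the error outlives the dirty data.
 */
static int wb_err_model_sketch(struct file *file)
{
        struct address_space *mapping = file->f_mapping;
        errseq_t since = filemap_sample_wb_err(mapping);
        int err;

        /* Start WB_SYNC_NONE writeback: does not wait for the I/O. */
        filemap_flush(mapping);

        /*
         * Suppose that writeback eventually fails. The error is
         * latched in mapping->wb_err and the pages go clean, so a
         * later scan of the mapping finds nothing dirty left to retry.
         */

        /* One way to notice the failure: check against our sample. */
        err = filemap_check_wb_err(mapping, since);
        if (err)
                return err;

        /*
         * The usual way: a data-integrity sync. Filesystems that use
         * the file_write_and_wait_range() helpers advance the struct
         * file's error cursor here and return the latched error.
         */
        return vfs_fsync(file, 1);
}

If that's accurate, then the final sync at reap time really is the
last chance nfsd has to observe an error from writeback it kicked off
earlier.
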
> So if we're talking about files that are no longer being used (since
> they're being GC'ed), we only block reaping them for as long as
> writeback is in progress.
> 
> Once writeback ends and it's eligible for reaping, we'll call
> vfs_fsync a final time, grab the error, and reset the write verifier
> when it's non-zero.
> 
> If we stop doing fsyncs, then that model sort of breaks down. I'm not
> clear on what you'd like to do instead.

I'm not clear either. I think I just have some hand-wavy requirements.

I think keeping the flushes in the nfsd threads and away from
single-threaded garbage collection makes sense. Keep I/O in nfsd
context, not in the filecache garbage collector. I'm not sure that's
guaranteed if the garbage collection thread does an nfsd_file_put()
that flushes.

And, we need to ensure that an nfsd_file isn't pinned forever -- the
flush has to make forward progress so that the nfsd_file is eventually
released. Otherwise, writeback errors become a DoS vector.

But, back to the topic of this patch: my own experiments with
background syncing showed that it introduces significant overhead and
it wasn't really worth the trouble. Basically, on intensive workloads,
the garbage collector must not stall or live-lock because it's walking
through millions of pages trying to figure out which ones are dirty.


>>> I'm not sure we need to worry about that so much for v4 though.
>>> Maybe we should just do this for GC files?
>> 
>> I'm not caffeinated yet. Why is it not a problem for v4? Is it
>> because an open or delegation stateid will prevent the nfsd_file
>> from going away?
> 
> Yeah, more or less.
> 
> I think that for an error to be lost with v4, it would require the
> client to have an application access pattern that would expose it to
> that possibility on a local filesystem as well. I don't think we have
> any obligation to do more there.
> 
> Maybe that's a false assumption though.
> 
>> Sorry for the noise. It's all a little subtle.
> 
> Very subtle. The more we can get this fleshed out into comments the
> better, so I welcome the questions.
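
Agreed. For the archive, here is the sequence I understand this patch
to set up, reduced to a sketch. The two helpers below are made up for
illustration only; the real code is nfsd_file_flush() and
nfsd_file_fsync() in the diff that follows:

#include <linux/fs.h>

/*
 * Step 1, when the last reference is put and the nfsd_file lands on
 * the LRU: kick off WB_SYNC_NONE writeback. This starts I/O on dirty
 * pages but does not wait for it and does not report errors.
 */
static void sketch_kick_writeback(struct file *file)
{
        if (file->f_mapping->nrpages)
                filemap_flush(file->f_mapping);
}

/*
 * Step 2, when the garbage collector finally frees the nfsd_file: a
 * data-integrity sync that waits for writeback and returns any latched
 * error. A non-zero return here is where nfsd resets its write
 * verifier so that clients resend their unstable writes.
 */
static int sketch_final_sync(struct file *file)
{
        return vfs_fsync(file, 1);
}

My earlier concern was that step 1 still has to walk the file's pages
to find the dirty ones, and step 2 then walks them again.
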
> 
>> 
>>>>> Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx>
>>>>> ---
>>>>> fs/nfsd/filecache.c | 37 +++++++++++++++++++++++++++++++------
>>>>> 1 file changed, 31 insertions(+), 6 deletions(-)
>>>>> 
>>>>> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
>>>>> index d2bbded805d4..491d3d9a1870 100644
>>>>> --- a/fs/nfsd/filecache.c
>>>>> +++ b/fs/nfsd/filecache.c
>>>>> @@ -316,7 +316,7 @@ nfsd_file_alloc(struct nfsd_file_lookup_key *key, unsigned int may)
>>>>>  }
>>>>> 
>>>>>  static void
>>>>> -nfsd_file_flush(struct nfsd_file *nf)
>>>>> +nfsd_file_fsync(struct nfsd_file *nf)
>>>>>  {
>>>>>          struct file *file = nf->nf_file;
>>>>> 
>>>>> @@ -327,6 +327,22 @@ nfsd_file_flush(struct nfsd_file *nf)
>>>>>          nfsd_reset_write_verifier(net_generic(nf->nf_net, nfsd_net_id));
>>>>>  }
>>>>> 
>>>>> +static void
>>>>> +nfsd_file_flush(struct nfsd_file *nf)
>>>>> +{
>>>>> +        struct file *file = nf->nf_file;
>>>>> +        unsigned long nrpages;
>>>>> +
>>>>> +        if (!file || !(file->f_mode & FMODE_WRITE))
>>>>> +                return;
>>>>> +
>>>>> +        nrpages = file->f_mapping->nrpages;
>>>>> +        if (nrpages) {
>>>>> +                this_cpu_add(nfsd_file_pages_flushed, nrpages);
>>>>> +                filemap_flush(file->f_mapping);
>>>>> +        }
>>>>> +}
>>>>> +
>>>>>  static void
>>>>>  nfsd_file_free(struct nfsd_file *nf)
>>>>>  {
>>>>> @@ -337,7 +353,7 @@ nfsd_file_free(struct nfsd_file *nf)
>>>>>          this_cpu_inc(nfsd_file_releases);
>>>>>          this_cpu_add(nfsd_file_total_age, age);
>>>>> 
>>>>> -        nfsd_file_flush(nf);
>>>>> +        nfsd_file_fsync(nf);
>>>>> 
>>>>>          if (nf->nf_mark)
>>>>>                  nfsd_file_mark_put(nf->nf_mark);
>>>>> @@ -500,12 +516,21 @@ nfsd_file_put(struct nfsd_file *nf)
>>>>> 
>>>>>          if (test_bit(NFSD_FILE_GC, &nf->nf_flags)) {
>>>>>                  /*
>>>>> -                 * If this is the last reference (nf_ref == 1), then transfer
>>>>> -                 * it to the LRU. If the add to the LRU fails, just put it as
>>>>> -                 * usual.
>>>>> +                 * If this is the last reference (nf_ref == 1), then try
>>>>> +                 * to transfer it to the LRU.
>>>>> +                 */
>>>>> +                if (refcount_dec_not_one(&nf->nf_ref))
>>>>> +                        return;
>>>>> +
>>>>> +                /*
>>>>> +                 * If the add to the list succeeds, try to kick off SYNC_NONE
>>>>> +                 * writeback. If the add fails, then just fall through to
>>>>> +                 * decrement as usual.
>>>> 
>>>> These comments simply repeat what the code does, so they seem
>>>> redundant to me. Could they instead explain why?
>>>> 
>>>>>                  */
>>>>> -                if (refcount_dec_not_one(&nf->nf_ref) || nfsd_file_lru_add(nf))
>>>>> +                if (nfsd_file_lru_add(nf)) {
>>>>> +                        nfsd_file_flush(nf);
>>>>>                          return;
>>>>> +                }
>>>>>          }
>>>>>          __nfsd_file_put(nf);
>>>>>  }
>>>>> -- 
>>>>> 2.37.3
> 
> -- 
> Jeff Layton <jlayton@xxxxxxxxxx>

-- 
Chuck Lever