> On Mar 31, 2022, at 10:20 AM, Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote: > > > >> On Mar 31, 2022, at 9:09 AM, Jan Kara <jack@xxxxxxx> wrote: >> >> On Wed 30-03-22 22:02:39, Trond Myklebust wrote: >>> On Wed, 2022-03-30 at 17:56 +0000, Chuck Lever III wrote: >>>> >>>> >>>>> On Mar 30, 2022, at 12:19 PM, Jan Kara <jack@xxxxxxx> wrote: >>>>> >>>>> On Wed 30-03-22 15:38:38, Chuck Lever III wrote: >>>>>>> On Mar 30, 2022, at 11:03 AM, Trond Myklebust >>>>>>> <trondmy@xxxxxxxxxxxxxxx> wrote: >>>>>>> >>>>>>> On Wed, 2022-03-30 at 12:34 +0200, Jan Kara wrote: >>>>>>>> Hello, >>>>>>>> >>>>>>>> during our performance testing we have noticed that commit >>>>>>>> b6669305d35a >>>>>>>> ("nfsd: Reduce the number of calls to nfsd_file_gc()") has >>>>>>>> introduced >>>>>>>> a >>>>>>>> performance regression when a client does random buffered >>>>>>>> writes. The >>>>>>>> workload on NFS client is fio running 4 processed doing >>>>>>>> random >>>>>>>> buffered writes to 4 >>>>>>>> different files and the files are large enough to hit dirty >>>>>>>> limits >>>>>>>> and >>>>>>>> force writeback from the client. In particular the invocation >>>>>>>> is >>>>>>>> like: >>>>>>>> >>>>>>>> fio --direct=0 --ioengine=sync --thread --directory=/mnt/mnt1 >>>>>>>> -- >>>>>>>> invalidate=1 --group_reporting=1 --runtime=300 -- >>>>>>>> fallocate=posix -- >>>>>>>> ramp_time=10 --name=RandomReads-128000-4k-4 --new_group -- >>>>>>>> rw=randwrite --size=4000m --numjobs=4 --bs=4k -- >>>>>>>> filename_format=FioWorkloads.\$jobnum --end_fsync=1 >>>>>>>> >>>>>>>> The reason why commit b6669305d35a regresses performance is >>>>>>>> the >>>>>>>> filemap_flush() call it adds into nfsd_file_put(). Before >>>>>>>> this commit >>>>>>>> writeback on the server happened from nfsd_commit() code >>>>>>>> resulting in >>>>>>>> rather long semisequential streams of 4k writes. After commit >>>>>>>> b6669305d35a >>>>>>>> all the writeback happens from filemap_flush() calls >>>>>>>> resulting in >>>>>>>> much >>>>>>>> longer average seek distance (IO from different files is more >>>>>>>> interleaved) >>>>>>>> and about 16-20% regression in the achieved writeback >>>>>>>> throughput when >>>>>>>> the >>>>>>>> backing store is rotational storage. >>>>>>>> >>>>>>>> I think the filemap_flush() from nfsd_file_put() is indeed >>>>>>>> rather >>>>>>>> aggressive and I think we'd be better off to just leave >>>>>>>> writeback to >>>>>>>> either >>>>>>>> nfsd_commit() or standard dirty page cleaning happening on >>>>>>>> the >>>>>>>> system. I >>>>>>>> assume the rationale for the filemap_flush() call was to make >>>>>>>> it more >>>>>>>> likely the file can be evicted during the garbage collection >>>>>>>> run? Was >>>>>>>> there >>>>>>>> any particular problem leading to addition of this call or >>>>>>>> was it >>>>>>>> just "it >>>>>>>> seemed like a good idea" thing? >>>>>>>> >>>>>>>> Thanks in advance for ideas. >>>>>>>> >>>>>>>> >>>>>>>> Honza >>>>>>> >>>>>>> It was mainly introduced to reduce the amount of work that >>>>>>> nfsd_file_free() needs to do. In particular when re-exporting >>>>>>> NFS, the >>>>>>> call to filp_close() can be expensive because it synchronously >>>>>>> flushes >>>>>>> out dirty pages. That again means that some of the calls to >>>>>>> nfsd_file_dispose_list() can end up being very expensive >>>>>>> (particularly >>>>>>> the ones run by the garbage collector itself). >>>>>> >>>>>> The "no regressions" rule suggests that some kind of action needs >>>>>> to be taken. I don't have a sense of whether Jan's workload or >>>>>> NFS >>>>>> re-export is the more common use case, however. >>>>>> >>>>>> I can see that filp_close() on a file on an NFS mount could be >>>>>> costly if that file has dirty pages, due to the NFS client's >>>>>> "flush on close" semantic. >>>>>> >>>>>> Trond, are there alternatives to flushing in the nfsd_file_put() >>>>>> path? I'm willing to entertain bug fix patches rather than a >>>>>> mechanical revert of b6669305d35a. >>>>> >>>>> Yeah, I don't think we need to rush fixing this with a revert. >>>> >>>> Sorry I wasn't clear: I would prefer to apply a bug fix over >>>> sending a revert commit, and I do not have enough information >>>> yet to make that choice. Waiting a bit is OK with me. >>>> >>>> >>>>> Also because >>>>> just removing the filemap_flush() from nfsd_file_put() would keep >>>>> other >>>>> benefits of that commit while fixing the regression AFAIU. But I >>>>> think >>>>> making flushing less aggressive is desirable because as I wrote in >>>>> my other >>>>> reply, currently it is preventing pagecache from accumulating >>>>> enough dirty >>>>> data for a good IO pattern. >>>> >>>> I might even go so far as to say that a small write workload >>>> isn't especially good for solid state storage either. >>>> >>>> I know Trond is trying to address NFS re-export performance, but >>>> there appear to be some palpable effects outside of that narrow >>>> use case that need to be considered. Thus a server-side fix, >>>> rather than a fix in the NFS client used to do the re-export, >>>> seems appropriate to consider. >>> >>> Turns out it is not just the NFS client that is the problem. It is >>> rather that we need in general to be able to detect flush errors and >>> either report them directly (through commit) or we need to change the >>> boot verifier to force clients to resend the unstable writes. >>> >>> Hence, I think we're looking at something like this: >> >> Thanks for the fix! I've run the patch through some testing and your fix >> indeed restores the good IO pattern and returns the performance back to >> original levels. So feel free to add: >> >> Tested-by: Jan Kara <jack@xxxxxxx> > > Excellent. I'll queue this up in the NFSD tree for 5.17-rc. Belay that. 5.18-rc. >> >> Honza >> >>> >>> 8<-------------------------------------------------------------------- >>> From c0c89267f303432c8f5e490ea9b075856e4be79d Mon Sep 17 00:00:00 2001 >>> From: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx> >>> Date: Wed, 30 Mar 2022 16:55:38 -0400 >>> Subject: [PATCH] nfsd: Fix a write performance regression >>> >>> The call to filemap_flush() in nfsd_file_put() is there to ensure that >>> we clear out any writes belonging to a NFSv3 client relatively quickly >>> and avoid situations where the file can't be evicted by the garbage >>> collector. It also ensures that we detect write errors quickly. >>> >>> The problem is this causes a regression in performance for some >>> workloads. >>> >>> So try to improve matters by deferring writeback until we're ready to >>> close the file, and need to detect errors so that we can force the >>> client to resend. >>> >>> Signed-off-by: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx> >>> --- >>> fs/nfsd/filecache.c | 18 +++++++++++++++--- >>> 1 file changed, 15 insertions(+), 3 deletions(-) >>> >>> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c >>> index 8bc807c5fea4..9578a6317709 100644 >>> --- a/fs/nfsd/filecache.c >>> +++ b/fs/nfsd/filecache.c >>> @@ -235,6 +235,13 @@ nfsd_file_check_write_error(struct nfsd_file *nf) >>> return filemap_check_wb_err(file->f_mapping, READ_ONCE(file->f_wb_err)); >>> } >>> >>> +static void >>> +nfsd_file_flush(struct nfsd_file *nf) >>> +{ >>> + if (nf->nf_file && vfs_fsync(nf->nf_file, 1) != 0) >>> + nfsd_reset_write_verifier(net_generic(nf->nf_net, nfsd_net_id)); >>> +} >>> + >>> static void >>> nfsd_file_do_unhash(struct nfsd_file *nf) >>> { >>> @@ -302,11 +309,14 @@ nfsd_file_put(struct nfsd_file *nf) >>> return; >>> } >>> >>> - filemap_flush(nf->nf_file->f_mapping); >>> is_hashed = test_bit(NFSD_FILE_HASHED, &nf->nf_flags) != 0; >>> - nfsd_file_put_noref(nf); >>> - if (is_hashed) >>> + if (!is_hashed) { >>> + nfsd_file_flush(nf); >>> + nfsd_file_put_noref(nf); >>> + } else { >>> + nfsd_file_put_noref(nf); >>> nfsd_file_schedule_laundrette(); >>> + } >>> if (atomic_long_read(&nfsd_filecache_count) >= NFSD_FILE_LRU_LIMIT) >>> nfsd_file_gc(); >>> } >>> @@ -327,6 +337,7 @@ nfsd_file_dispose_list(struct list_head *dispose) >>> while(!list_empty(dispose)) { >>> nf = list_first_entry(dispose, struct nfsd_file, nf_lru); >>> list_del(&nf->nf_lru); >>> + nfsd_file_flush(nf); >>> nfsd_file_put_noref(nf); >>> } >>> } >>> @@ -340,6 +351,7 @@ nfsd_file_dispose_list_sync(struct list_head *dispose) >>> while(!list_empty(dispose)) { >>> nf = list_first_entry(dispose, struct nfsd_file, nf_lru); >>> list_del(&nf->nf_lru); >>> + nfsd_file_flush(nf); >>> if (!refcount_dec_and_test(&nf->nf_ref)) >>> continue; >>> if (nfsd_file_free(nf)) >>> -- >>> 2.35.1 >>> >>> >>> >>> -- >>> Trond Myklebust >>> Linux NFS client maintainer, Hammerspace >>> trond.myklebust@xxxxxxxxxxxxxxx >>> >>> >> -- >> Jan Kara <jack@xxxxxxxx> >> SUSE Labs, CR > > -- > Chuck Lever -- Chuck Lever