On Thu, 13 Jun 2024, Jan Kara wrote: > Commit 6df25e58532b ("nfs: remove reliance on bdi congestion") > introduced NFS-private solution for limiting number of writes > outstanding against a particular server. Unlike previous bdi congestion > this algorithm actually works and limits number of outstanding writeback > pages to nfs_congestion_kb which scales with amount of client's memory > and is capped at 256 MB. As a result some workloads such as random > buffered writes over NFS got slower (from ~170 MB/s to ~126 MB/s). The > fio command to reproduce is: > > fio --direct=0 --ioengine=sync --thread --invalidate=1 --group_reporting=1 > --runtime=300 --fallocate=posix --ramp_time=10 --new_group --rw=randwrite > --size=64256m --numjobs=4 --bs=4k --fsync_on_close=1 --end_fsync=1 > > This happens because the client sends ~256 MB worth of dirty pages to > the server and any further background writeback request is ignored until > the number of writeback pages gets below the threshold of 192 MB. By the > time this happens and clients decides to trigger another round of > writeback, the server often has no pages to write and the disk is idle. > > To fix this problem and make the client react faster to eased congestion > of the server by blocking waiting for congestion to resolve instead of > aborting writeback. This improves the random 4k buffered write > throughput to 184 MB/s. > > Signed-off-by: Jan Kara <jack@xxxxxxx> > --- > fs/nfs/client.c | 1 + > fs/nfs/write.c | 12 ++++++++---- > include/linux/nfs_fs_sb.h | 1 + > 3 files changed, 10 insertions(+), 4 deletions(-) > > diff --git a/fs/nfs/client.c b/fs/nfs/client.c > index 3b252dceebf5..8286edd6062d 100644 > --- a/fs/nfs/client.c > +++ b/fs/nfs/client.c > @@ -994,6 +994,7 @@ struct nfs_server *nfs_alloc_server(void) > > server->change_attr_type = NFS4_CHANGE_TYPE_IS_UNDEFINED; > > + init_waitqueue_head(&server->write_congestion_wait); > atomic_long_set(&server->writeback, 0); > > ida_init(&server->openowner_id); > diff --git a/fs/nfs/write.c b/fs/nfs/write.c > index c6255d7edd3c..21a5f1e90859 100644 > --- a/fs/nfs/write.c > +++ b/fs/nfs/write.c > @@ -423,8 +423,10 @@ static void nfs_folio_end_writeback(struct folio *folio) > > folio_end_writeback(folio); > if (atomic_long_dec_return(&nfss->writeback) < > - NFS_CONGESTION_OFF_THRESH) > + NFS_CONGESTION_OFF_THRESH) { > nfss->write_congested = 0; > + wake_up_all(&nfss->write_congestion_wait); > + } > } > > static void nfs_page_end_writeback(struct nfs_page *req) > @@ -698,12 +700,14 @@ int nfs_writepages(struct address_space *mapping, struct writeback_control *wbc) > struct nfs_pageio_descriptor pgio; > struct nfs_io_completion *ioc = NULL; > unsigned int mntflags = NFS_SERVER(inode)->flags; > + struct nfs_server *nfss = NFS_SERVER(inode); > int priority = 0; > int err; > > - if (wbc->sync_mode == WB_SYNC_NONE && > - NFS_SERVER(inode)->write_congested) > - return 0; > + /* Wait with writeback until write congestion eases */ > + if (wbc->sync_mode == WB_SYNC_NONE && nfss->write_congested) > + wait_event(nfss->write_congestion_wait, > + nfss->write_congested == 0); If there is a network problem or the server reboot, this could wait indefinitely. Maybe wait_event_idle() ?? Is this only called from the writeback thread that is dedicated to this fs? If so that should be fine. If it can be reached from a user processes, then wait_event_killable() might be needed. Thanks, NeilBrown > > nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGES); > > diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h > index 92de074e63b9..38a128796a76 100644 > --- a/include/linux/nfs_fs_sb.h > +++ b/include/linux/nfs_fs_sb.h > @@ -140,6 +140,7 @@ struct nfs_server { > struct rpc_clnt * client_acl; /* ACL RPC client handle */ > struct nlm_host *nlm_host; /* NLM client handle */ > struct nfs_iostats __percpu *io_stats; /* I/O statistics */ > + wait_queue_head_t write_congestion_wait; /* wait until write congestion eases */ > atomic_long_t writeback; /* number of writeback pages */ > unsigned int write_congested;/* flag set when writeback gets too high */ > unsigned int flags; /* various flags */ > -- > 2.35.3 > >