Re: [PATCH] nfs: add 'noextend' option for lock-less 'lost writes' prevention

On Tue, 2024-06-18 at 22:44 -0700, Christoph Hellwig wrote:
> On Tue, Jun 18, 2024 at 06:33:13PM +0300, Dan Aloni wrote:
> > --- a/fs/nfs/write.c
> > +++ b/fs/nfs/write.c
> > @@ -1315,7 +1315,10 @@ static int nfs_can_extend_write(struct file *file, struct folio *folio,
> >  	struct file_lock_context *flctx = locks_inode_context(inode);
> >  	struct file_lock *fl;
> >  	int ret;
> > +	unsigned int mntflags = NFS_SERVER(inode)->flags;
> >  
> > +	if (mntflags & NFS_MOUNT_NO_EXTEND)
> > +		return 0;
> >  	if (file->f_flags & O_DSYNC)
> >  		return 0;
> >  	if (!nfs_folio_write_uptodate(folio, pagelen))
> 
> I find the logic in nfs_update_folio to extend the write to the
> entire folio rather weird, and especially bad with the larger folio
> support I just added.
> 
> It makes the client write more (and with large page sizes or large
> folios, potentially a lot more) than what the application asked for.
> 
> The comment above nfs_can_extend_write suggests it is done to avoid
> "fragmentation".  My immediate reaction assumed that would be about
> file system fragmentation, which seems odd given that I'd expect
> servers to either log data, in which case this just increases write
> amplification for no good reason, or use something like the Linux
> page cache, in which case it would be entirely pointless.

If you have a workload that does something like a 10-byte write, then
leaves a hole of 20 bytes, then another 10-byte write, ... then that
workload will produce a train of 10-byte write RPC calls. That ends up
being incredibly slow for obvious reasons: you are forcing the server
to process a load of 10-byte-long RPC calls, all of which are
contending for the inode lock for the same file.

If the client knows that the holes are just that, or it knows the data
that was previously written in that area (because the folio is up to
date), then it can consolidate all those 10-byte writes into a single
1MB write. So we end up compressing ~35000 RPC calls into one. Why is
that not a good thing?
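
For illustration, a minimal user-space sketch of such a workload might
look like the following (the path, record size and stride here are
hypothetical, chosen so that ~35000 strided 10-byte writes cover about
1MB of file):

	#include <sys/types.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[10];
		off_t off;
		int fd;

		fd = open("/mnt/nfs/testfile", O_CREAT | O_WRONLY, 0644);
		if (fd < 0) {
			perror("open");
			return 1;
		}
		memset(buf, 'x', sizeof(buf));
		/* 10-byte record, 20-byte hole, repeat: a 30-byte stride */
		for (off = 0; off < (1 << 20); off += 30) {
			if (pwrite(fd, buf, sizeof(buf), off) !=
			    (ssize_t)sizeof(buf)) {
				perror("pwrite");
				break;
			}
		}
		close(fd);
		return 0;
	}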

> 
> But when following git blame over a few rounds of fixes (that all
> narrow down the scope of this optimization because it caused
> problems) the "fragmentation" eventually becomes:
> 
> 	/* If we're not using byte range locks, and we know the page
> 	 * is entirely in cache, it may be more efficient to avoid
> 	 * fragmenting write requests.
> 	 */
> 
> Which to me suggests it is about struct nfs_page and the on-the-wire
> RPCs.  In which case the merging in nfs_try_to_update_request that
> merges consecutive I/O should take care of all the interesting cases.
> 
> In other words:  I strongly suspect everyone is better off if this
> extending write behavior is removed or at least not the default.
> 

The merging in nfs_try_to_update_request() is not sufficient to fix
pathologies like the above example, no.
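
To make that concrete: nfs_try_to_update_request() only grows an
existing nfs_page request when the new write overlaps or abuts it.
Roughly (a paraphrase of the check in fs/nfs/write.c, not the exact
code):

	rqend = req->wb_offset + req->wb_bytes;
	/* Non-contiguous writes cannot be merged into the pending
	 * request; it has to be flushed out instead. */
	if (offset > rqend || end < req->wb_offset)
		goto out_flushme;

A 10-byte write at offset 30 neither overlaps nor abuts a pending
request covering bytes 0-9, so each strided write still goes to the
wire as its own RPC.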

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx





