Re: [PATCH] nfs: add 'noextend' option for lock-less 'lost writes' prevention

Christoph Hellwig <hch@xxxxxxxxxxxxx> · Tue, 18 Jun 2024 22:44:45 -0700

On Tue, Jun 18, 2024 at 06:33:13PM +0300, Dan Aloni wrote:
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -1315,7 +1315,10 @@ static int nfs_can_extend_write(struct file *file, struct folio *folio,
>  	struct file_lock_context *flctx = locks_inode_context(inode);
>  	struct file_lock *fl;
>  	int ret;
> +	unsigned int mntflags = NFS_SERVER(inode)->flags;
>  
> +	if (mntflags & NFS_MOUNT_NO_EXTEND)
> +		return 0;
>  	if (file->f_flags & O_DSYNC)
>  		return 0;
>  	if (!nfs_folio_write_uptodate(folio, pagelen))

I find the logic in nfs_update_folio to extend the write to the entire
folio rather weird, and especially bad with the larger folio support I
just added.

It makes the client write more (and with large page sizes or large
folios) potentially a lot more than what the application asked for.

The comment above nfs_can_extend_write suggest it is done to avoid
"fragmentation".  My immediate reaction assumed that would be about file
system fragmentation, which seems odd given that I'd expect servers to
either log data, in which case this just increases write amplification
for no good reason, or use something like the Linux page cache in which
case it would be entirely pointless.

But when following git blame over a few rounds of fixes (that all narrow
down the scope of this optimization because it caused problems) the
"fragmentation" eventually becomes:

	/* If we're not using byte range locks, and we know the page
	 * is entirely in cache, it may be more efficient to avoid
	 * fragmenting write requests.
	 */

Which to me suggests it is about struct nfs_page and the on-the-wire
RPCs.  In which case the merging in nfs_try_to_update_request that
merges consecutive I/O should take care of all the interesting cases.

In other words:  I strongly suspect everyone is better off if this
extending write behavior is removed or at least not the default.