Re: [PATCH] nfs: add 'noextend' option for lock-less 'lost writes' prevention

On Wed, 2024-06-19 at 09:32 -0400, Trond Myklebust wrote:
> On Tue, 2024-06-18 at 22:44 -0700, Christoph Hellwig wrote:
> > On Tue, Jun 18, 2024 at 06:33:13PM +0300, Dan Aloni wrote:
> > > --- a/fs/nfs/write.c
> > > +++ b/fs/nfs/write.c
> > > @@ -1315,7 +1315,10 @@ static int nfs_can_extend_write(struct file *file, struct folio *folio,
> > >  	struct file_lock_context *flctx = locks_inode_context(inode);
> > >  	struct file_lock *fl;
> > >  	int ret;
> > > +	unsigned int mntflags = NFS_SERVER(inode)->flags;
> > >  
> > > +	if (mntflags & NFS_MOUNT_NO_EXTEND)
> > > +		return 0;
> > >  	if (file->f_flags & O_DSYNC)
> > >  		return 0;
> > >  	if (!nfs_folio_write_uptodate(folio, pagelen))
> > 
> > I find the logic in nfs_update_folio to extend the write to the
> > entire folio rather weird, and especially bad with the larger folio
> > support I just added.
> > 
> > It makes the client write more (and with large page sizes or large
> > folios, potentially a lot more) than what the application asked for.
> > 
> > The comment above nfs_can_extend_write suggests it is done to avoid
> > "fragmentation".  My immediate reaction assumed that would be about
> > file system fragmentation, which seems odd given that I'd expect
> > servers to either log data, in which case this just increases write
> > amplification for no good reason, or use something like the Linux
> > page cache, in which case it would be entirely pointless.
> 
> If you have a workload that does something like a 10 byte write, then
> leaves a hole of 20 bytes, then another 10 byte write, ... then that
> workload will produce a train of 10 byte write RPC calls. That ends up
> being incredibly slow for obvious reasons: you are forcing the server
> to process a load of 10 byte long RPC calls, all of which are
> contending for the inode lock for the same file.
> 
> If the client knows that the holes are just that, or it knows the data
> that was previously written in that area (because the folio is up to
> date), then it can consolidate all those 10-byte writes into one 1MB
> write. So we end up compressing ~35000 RPC calls into one. Why is that
> not a good thing?
> 

BTW: this is not just a theoretical thing. Look at the way that glibc
handles a size-extending fallocate() on filesystems that don't have
native support: it writes a byte of data on every 4k boundary. That's
not quite as dramatic as my 10 byte example above, but it still reduces
the number of required write RPC calls by a factor of 256.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx





