Re: [LSF/MM/BPF TOPIC] Implementing the NFS v4.2 WRITE_SAME operation: VFS or NFS ioctl() ?

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Wed, 15 Jan 2025 10:20:02 -0800

On Wed, Jan 15, 2025 at 04:19:28PM +0000, Matthew Wilcox wrote:
> On Wed, Jan 15, 2025 at 10:31:51AM -0500, Chuck Lever wrote:
> > On 1/15/25 10:06 AM, Matthew Wilcox wrote:
> > > On Tue, Jan 14, 2025 at 04:38:03PM -0500, Anna Schumaker wrote:
> > > > I've seen a few requests for implementing the NFS v4.2 WRITE_SAME [1] operation over the last few months [2][3] to accelerate writing patterns of data on the server, so it's been in the back of my mind for a future project. I'll need to write some code somewhere so NFS & NFSD can handle this request. I could keep any implementation internal to NFS / NFSD, but I'd like to find out if local filesystems would find this sort of feature useful and if I should put it in the VFS instead.
> > > 
> > > I think we need more information.  I read over the [2] and [3] threads
> > > and the spec.  It _seems like_ the intent in the spec is to expose the
> > > underlying SCSI WRITE SAME command over NFS, but at least one other
> > > response in this thread has been to design an all-singing, all-dancing
> > > superset that can write arbitrary sized blocks to arbitrary locations
> > > in every file on every filesystem, and I think we're going to design
> > > ourselves into an awful implementation if we do that.
> > > 
> > > Can we confirm with the people who actually want to use this that all
> > > they really want is to be able to do WRITE SAME as if they were on a
> > > local disc, and then we can implement that in a matter of weeks instead
> > > of taking a trip via Uranus?
> > 
> > IME it's been very difficult to get such requesters to provide the
> > detail we need to build to their requirements. Providing them with a
> > limited prototype and letting them comment is likely the fastest way to
> > converge on something useful. Press the Easy Button, then evolve.
> > 
> > Trond has suggested starting with clone_file_range, providing it with a
> > pattern and then having the VFS or file system fill exponentially larger
> > segments of the file by replicating that pattern. The question is
> > whether to let consumers simply use that API as it is, or whether we
> > should provide some kind of generic infrastructure on top of it that
> > provides segment replication.
> > 
> > With my NFSD hat on, I would prefer to have the file version of "write
> > same" implemented outside of the NFS stack so that other consumers can
> > benefit from using the very same implementation. NFSD (and the NFS
> > client) should simply act as a conduit for these requests via the
> > NFSv4.2 WRITE_SAME operation.
> > 
> > I kinda like Dave's ideas too. Enabling offload will be critical to
> > making this feature efficient and thus valuable.
> 
> So I have some experience with designing an API like this one which may
> prove either relevant or misleading.
> 
> We have bzero() and memset().  If you want to fill with a larger pattern
> than a single byte, POSIX does not provide.  Various people have proposed
> extensions, eg
> https://github.com/ajkaijanaho/publib/blob/master/strutil/memfill.c
> 
> But what people really want is the ability to use the x86 rep
> stosw/stosl/stosq instructions.  And so in Linux we now have
> memset16/memset32/memset64/memset_l/memset_p which will map to one
> of those instructions.  Sure, we could implement memfill() and then
> special-case 2/4/8 byte implementations, but nobody actually wants to
> use that.
> 
> 
> So what API actually makes sense to provide?  I suggest an ioctl,
> implemented at the VFS layer:
> 
> struct write_same {
> 	loff_t pos;	/* Where to start writing */

You probably need at least a:

	u64 count;	/* Number of bytes to write */

Since I think the point is that you write buf[len] to the file/disk over
and over again until count bytes have been written, correct?

> 	size_t len;	/* Length of memory pointed to by buf */

(and maybe call this buflen)

--D

> 	char *buf;	/* Pattern to fill with */
> };
> 
> ioctl(fd, FIWRITESAME, struct write_same *arg)
> 
> 'pos' must be block size aligned.
> 'len' must be a power of two, or 0.  If 0, fill with zeroes.
> If len is shorter than the block size of the file, the kernel
> replicates the pattern in 'buf' within the single block.  If len
> is larger than block size, we're doing a multi-block WRITE_SAME.
> 
> We can implement this for block devices and any filesystem that
> cares.  The kernel will have to shoot down any page cache, just
> like for PUNCH_HOLE and similar.
> 
> 
> For a prototype, we can implement this in the NFS client, then hoist it
> to the VFS once the users have actually agreed this serves their needs.
>
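
To make the semantics discussed above concrete, here is a rough userspace
sketch, not a proposed implementation. It assumes the quoted struct gains a
'count' field and that 'len' is renamed 'buflen'; FIWRITESAME does not exist,
so write_same_emulated() is a hypothetical stand-in that uses plain pwrite().
It replicates the pattern into a staging buffer by doubling (the "exponentially
larger segments" idea) and then writes until 'count' bytes are out. It does not
check block alignment of 'pos', and a 'count' that is not a multiple of
'buflen' simply truncates the last copy of the pattern; both are guesses about
what the real interface would want.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical layout: the quoted struct plus 'count', 'len' -> 'buflen'. */
struct write_same {
	off_t pos;		/* Where to start writing (loff_t in the kernel) */
	uint64_t count;		/* Number of bytes to write */
	size_t buflen;		/* Length of memory pointed to by buf */
	const char *buf;	/* Pattern to fill with */
};

/* Userspace emulation; returns 0 or a negative errno, kernel style. */
static int write_same_emulated(int fd, const struct write_same *ws)
{
	const size_t target = 64 * 1024;	/* arbitrary staging size */
	size_t patlen = ws->buflen;
	size_t chunk, filled;
	uint64_t done = 0;
	char *stage;

	/* 'buflen' must be a power of two, or 0 meaning "fill with zeroes". */
	if (patlen & (patlen - 1))
		return -EINVAL;
	if (patlen && !ws->buf)
		return -EINVAL;

	/* Both sizes are powers of two, so chunk is a multiple of patlen. */
	chunk = patlen > target ? patlen : target;
	stage = calloc(1, chunk);
	if (!stage)
		return -ENOMEM;

	if (patlen) {
		/* Seed one copy, then double the filled region each pass. */
		memcpy(stage, ws->buf, patlen);
		for (filled = patlen; filled < chunk; filled *= 2)
			memcpy(stage + filled, stage, filled);
		assert(filled == chunk);
	}

	while (done < ws->count) {
		/* Resume at the right pattern phase after a short write. */
		size_t off = done % chunk;
		size_t n = chunk - off;
		ssize_t ret;

		if (n > ws->count - done)
			n = ws->count - done;
		ret = pwrite(fd, stage + off, n, ws->pos + (off_t)done);
		if (ret < 0) {
			int err = errno;

			if (err == EINTR)
				continue;
			free(stage);
			return -err;
		}
		if (ret == 0) {
			free(stage);
			return -EIO;
		}
		done += (uint64_t)ret;
	}

	free(stage);
	return 0;
}
```

A caller would fill in pos/count/buflen/buf and pass an open, writable fd. A
real implementation would presumably hand this to the block layer or the
filesystem for offload rather than pushing the data through pwrite().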