Re: [LSF/MM/BPF TOPIC] Implementing the NFS v4.2 WRITE_SAME operation: VFS or NFS ioctl() ?

On Wed, Jan 15, 2025 at 10:31:51AM -0500, Chuck Lever wrote:
> On 1/15/25 10:06 AM, Matthew Wilcox wrote:
> > On Tue, Jan 14, 2025 at 04:38:03PM -0500, Anna Schumaker wrote:
> > > I've seen a few requests for implementing the NFS v4.2 WRITE_SAME [1]
> > > operation over the last few months [2][3] to accelerate writing
> > > patterns of data on the server, so it's been in the back of my mind
> > > for a future project. I'll need to write some code somewhere so NFS &
> > > NFSD can handle this request. I could keep any implementation
> > > internal to NFS / NFSD, but I'd like to find out if local filesystems
> > > would find this sort of feature useful and if I should put it in the
> > > VFS instead.
> > 
> > I think we need more information.  I read over the [2] and [3] threads
> > and the spec.  It _seems like_ the intent in the spec is to expose the
> > underlying SCSI WRITE SAME command over NFS, but at least one other
> > response in this thread has been to design an all-singing, all-dancing
> > superset that can write arbitrary sized blocks to arbitrary locations
> > in every file on every filesystem, and I think we're going to design
> > ourselves into an awful implementation if we do that.
> > 
> > Can we confirm with the people who actually want to use this that all
> > they really want is to be able to do WRITE SAME as if they were on a
> > local disc, and then we can implement that in a matter of weeks instead
> > of taking a trip via Uranus.
> 
> IME it's been very difficult to get such requesters to provide the
> detail we need to build to their requirements. Providing them with a
> limited prototype and letting them comment is likely the fastest way to
> converge on something useful. Press the Easy Button, then evolve.
> 
> Trond has suggested starting with clone_file_range, providing it with a
> pattern and then have the VFS or file system fill exponentially larger
> segments of the file by replicating that pattern. The question is
> whether to let consumers simply use that API as it is, or whether we
> should provide some kind of generic infrastructure on top of it that
> handles the segment replication?
> 
> With my NFSD hat on, I would prefer to have the file version of "write
> same" implemented outside of the NFS stack so that other consumers can
> benefit from using the very same implementation. NFSD (and the NFS
> client) should simply act as a conduit for these requests via the
> NFSv4.2 WRITE_SAME operation.
> 
> I kinda like Dave's ideas too. Enabling offload will be critical to
> making this feature efficient and thus valuable.

So I have some experience with designing an API like this one, which may
prove either relevant or misleading.

We have bzero() and memset().  If you want to fill with a larger pattern
than a single byte, POSIX does not provide one.  Various people have
proposed extensions, e.g.
https://github.com/ajkaijanaho/publib/blob/master/strutil/memfill.c

But what people really want is the ability to use the x86 rep
stosw/stosl/stosq instructions.  And so in Linux we now have
memset16/memset32/memset64/memset_l/memset_p, which map to one of those
hardware instructions where the architecture provides them.  Sure, we
could implement a generic memfill() and then special-case the 2/4/8 byte
sizes, but nobody actually wants to use that.
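
To make that concrete, a generic memfill() along those lines might look
something like the sketch below (kernel-flavoured, illustration only, and
assuming count is a multiple of patlen); the 2/4/8 byte cases just
forward to the existing memset16/32/64 helpers, which is exactly why the
wrapper buys us so little:

/*
 * Sketch only, not a proposal: a memfill() that special-cases the
 * pattern sizes people actually care about.
 */
#include <linux/string.h>
#include <linux/types.h>

static void *memfill(void *dst, const void *pattern, size_t patlen,
		     size_t count)
{
	char *p = dst;
	size_t i;

	switch (patlen) {
	case 1:
		return memset(dst, *(const u8 *)pattern, count);
	case 2:
		return memset16(dst, *(const u16 *)pattern, count / 2);
	case 4:
		return memset32(dst, *(const u32 *)pattern, count / 4);
	case 8:
		return memset64(dst, *(const u64 *)pattern, count / 8);
	default:
		/* Slow path: replicate the pattern byte by byte. */
		for (i = 0; i < count; i++)
			p[i] = ((const char *)pattern)[i % patlen];
		return dst;
	}
}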


So what API actually makes sense to provide?  I suggest an ioctl,
implemented at the VFS layer:

struct write_same {
	loff_t pos;	/* Where to start writing */
	size_t len;	/* Length of memory pointed to by buf */
	char *buf;	/* Pattern to fill with */
};

ioctl(fd, FIWRITESAME, struct write_same *arg)

'pos' must be block size aligned.
'len' must be a power of two, or 0.  If 0, fill with zeroes.
If len is shorter than the block size of the file, the kernel
replicates the pattern in 'buf' within the single block.  If len
is larger than block size, we're doing a multi-block WRITE_SAME.
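
To illustrate the calling convention, a userspace consumer might look
roughly like this.  Only the struct and the semantics above are the
actual proposal; the ioctl number below is a placeholder invented for
the sketch:

/* Userspace sketch.  FIWRITESAME does not exist yet; the number is a
 * placeholder for illustration, not an allocated ioctl code. */
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <linux/ioctl.h>

struct write_same {
	loff_t pos;	/* Where to start writing */
	size_t len;	/* Length of memory pointed to by buf */
	char *buf;	/* Pattern to fill with */
};

#define FIWRITESAME	_IOW('X', 0x2c, struct write_same)	/* placeholder */

/* Fill the block at offset 0 with a repeating 8-byte pattern. */
static int fill_first_block(int fd)
{
	uint64_t pattern = 0xdeadbeefdeadbeefULL;
	struct write_same ws = {
		.pos = 0,			/* block size aligned */
		.len = sizeof(pattern),		/* power of two, < block size */
		.buf = (char *)&pattern,
	};

	return ioctl(fd, FIWRITESAME, &ws);
}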

We can implement this for block devices and any filesystem that
cares.  The kernel will have to shoot down any page cache, just
like for PUNCH_HOLE and similar.
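
On the kernel side, the generic part is mostly argument checking plus
that cache shoot-down.  Very roughly, and with every name below
illustrative rather than an existing interface (the range end is also
hand-waved, since the struct above doesn't carry a separate fill
length):

/* Sketch only: checks a generic FIWRITESAME handler might do before
 * calling into the filesystem or block device. */
#include <linux/fs.h>
#include <linux/kernel.h>
#include <linux/log2.h>
#include <linux/mm.h>

static int write_same_checks(struct file *file, loff_t pos, size_t len)
{
	struct inode *inode = file_inode(file);
	unsigned int blocksize = i_blocksize(inode);

	if (!IS_ALIGNED(pos, blocksize))
		return -EINVAL;		/* pos must be block size aligned */
	if (len && !is_power_of_2(len))
		return -EINVAL;		/* len is a power of two, or 0 */

	/*
	 * Shoot down cached pages over the affected range, with the same
	 * ordering care as hole punching; one block shown here for brevity.
	 */
	truncate_pagecache_range(inode, pos, pos + blocksize - 1);

	return 0;	/* ...then hand off to the filesystem's implementation */
}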


For a prototype, we can implement this in the NFS client, then hoist it
to the VFS once the users have actually agreed this serves their needs.



