Re: [LSF/MM/BPF TOPIC] Implementing the NFS v4.2 WRITE_SAME operation: VFS or NFS ioctl() ?

Chuck Lever <chuck.lever@xxxxxxxxxx> · Wed, 15 Jan 2025 13:43:14 -0500

On 1/15/25 11:19 AM, Matthew Wilcox wrote:
On Wed, Jan 15, 2025 at 10:31:51AM -0500, Chuck Lever wrote:
On 1/15/25 10:06 AM, Matthew Wilcox wrote:
On Tue, Jan 14, 2025 at 04:38:03PM -0500, Anna Schumaker wrote:
I've seen a few requests for implementing the NFS v4.2 WRITE_SAME [1] operation over the last few months [2][3] to accelerate writing patterns of data on the server, so it's been in the back of my mind for a future project. I'll need to write some code somewhere so NFS & NFSD can handle this request. I could keep any implementation internal to NFS / NFSD, but I'd like to find out if local filesystems would find this sort of feature useful and if I should put it in the VFS instead.

I think we need more information.  I read over the [2] and [3] threads
and the spec.  It _seems like_ the intent in the spec is to expose the
underlying SCSI WRITE SAME command over NFS, but at least one other
response in this thread has been to design an all-singing, all-dancing
superset that can write arbitrary sized blocks to arbitrary locations
in every file on every filesystem, and I think we're going to design
ourselves into an awful implementation if we do that.

Can we confirm with the people who actually want to use this that all
they really want is to be able to do WRITE SAME as if they were on a
local disc, and then we can implement that in a matter of weeks instead
of taking a trip via Uranus?

IME it's been very difficult to get such requesters to provide the
detail we need to build to their requirements. Providing them with a
limited prototype and letting them comment is likely the fastest way to
converge on something useful. Press the Easy Button, then evolve.

Trond has suggested starting with clone_file_range, providing it with a
pattern and then have the VFS or file system fill exponentially larger
segments of the file by replicating that pattern. The question is
whether to let consumers simply use that API as it is, or whether to
provide some generic infrastructure on top of it that handles segment
replication.
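
To make the "exponentially larger segments" idea concrete, here is a minimal
user-space sketch of replication by doubling. It works on a memory buffer,
with memcpy() standing in for whatever range-copy primitive the VFS or file
system would supply. The helper name fill_by_doubling() is hypothetical and
is not taken from any existing kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Fill dst[0..total) with repetitions of pat[0..pat_len) by doubling.
 * Hypothetical helper for illustration only; memcpy() stands in for a
 * file-range copy. Returns the number of copy operations issued.
 */
size_t fill_by_doubling(unsigned char *dst, size_t total,
			const unsigned char *pat, size_t pat_len)
{
	size_t filled, copies = 0;

	assert(pat_len > 0);
	if (total == 0)
		return 0;

	/* Seed: write the pattern once (truncated if the region is smaller). */
	filled = pat_len < total ? pat_len : total;
	memcpy(dst, pat, filled);
	copies++;

	/* Each pass copies the already-filled prefix, doubling its length. */
	while (filled < total) {
		size_t left = total - filled;
		size_t n = filled < left ? filled : left;

		/* Source [0, n) and target [filled, filled + n) never overlap. */
		memcpy(dst + filled, dst, n);
		filled += n;
		copies++;
	}
	return copies;
}
```

The filled prefix is always a whole number of patterns until the final pass,
so each copy continues the pattern in phase. The number of copies grows with
the logarithm of the region size rather than linearly.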

With my NFSD hat on, I would prefer to have the file version of "write
same" implemented outside of the NFS stack so that other consumers can
benefit from using the very same implementation. NFSD (and the NFS
client) should simply act as a conduit for these requests via the
NFSv4.2 WRITE_SAME operation.

I kinda like Dave's ideas too. Enabling offload will be critical to
making this feature efficient and thus valuable.

So I have some experience with designing an API like this one which may
prove either relevant or misleading.

We have bzero() and memset().  If you want to fill with a larger pattern
than a single byte, POSIX does not provide.  Various people have proposed
extensions, eg
https://github.com/ajkaijanaho/publib/blob/master/strutil/memfill.c

But what people really want is the ability to use the x86 rep
stosw/stosl/stosq instructions.  And so in Linux we now have
memset16/memset32/memset64/memset_l/memset_p which will map to one
of those hardware calls.  Sure, we could implement memfill() and then
specialcase 2/4/8 byte implementations, but nobody actually wants to
use that.
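
As a rough illustration of the contrast, the sketch below puts a generic
arbitrary-size fill next to a fixed-width fill. Both are user-space stand-ins
with hypothetical names (memfill_sketch, fill32_sketch). They are neither the
publib code linked above nor the kernel's memset32(), although fill32_sketch
follows the same convention of taking a count of elements rather than bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Generic fill: repeat an arbitrary-size pattern; the tail may be partial. */
void *memfill_sketch(void *dst, size_t dst_len, const void *pat, size_t pat_len)
{
	unsigned char *p = dst;

	assert(dst != NULL);
	if (pat_len == 0)
		return dst;
	while (dst_len >= pat_len) {
		memcpy(p, pat, pat_len);
		p += pat_len;
		dst_len -= pat_len;
	}
	memcpy(p, pat, dst_len);	/* leading part of the pattern, may be 0 bytes */
	return dst;
}

/*
 * Fixed-width fill: a plain store loop over 32-bit values, the shape
 * that arch code can implement with a single string-store instruction.
 */
uint32_t *fill32_sketch(uint32_t *dst, uint32_t value, size_t count)
{
	for (size_t i = 0; i < count; i++)
		dst[i] = value;
	return dst;
}
```

For a 4-byte pattern both produce the same bytes. The fixed-width version
has no per-iteration length handling and no partial tail to worry about.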

So what API actually makes sense to provide?  I suggest an ioctl,
implemented at the VFS layer:

struct write_same {
	loff_t pos;	/* Where to start writing */
	size_t len;	/* Length of memory pointed to by buf */
	char *buf;	/* Pattern to fill with */
};

ioctl(fd, FIWRITESAME, struct write_same *arg)

This might be a controversial opinion, but a new ioctl() seems OK to me.

'pos' must be block size aligned.
'len' must be a power of two, or 0.  If 0, fill with zeroes.
If len is shorter than the block size of the file, the kernel
replicates the pattern in 'buf' within the single block.  If len
is larger than block size, we're doing a multi-block WRITE_SAME.
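
The rules above can be sketched as a user-space emulation. FIWRITESAME does
not exist, so nothing here calls it. write_same_emulate() is a hypothetical
helper that applies the same checks and then uses pwrite(). Two assumptions:
off_t stands in for loff_t, and a pattern longer than the block size is
written exactly once, covering len bytes.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Mirrors the proposed struct; off_t stands in for loff_t in user space. */
struct write_same {
	off_t pos;	/* Where to start writing */
	size_t len;	/* Length of memory pointed to by buf */
	char *buf;	/* Pattern to fill with */
};

static int is_pow2(size_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Hypothetical emulation; returns 0 on success or a negative errno value. */
int write_same_emulate(int fd, const struct write_same *ws, size_t blksize)
{
	size_t span, off;
	char *block;
	int ret = 0;

	assert(ws != NULL);
	/* 'pos' must be block size aligned. */
	if (!is_pow2(blksize) || ws->pos < 0 || (ws->pos & (off_t)(blksize - 1)))
		return -EINVAL;
	/* 'len' must be a power of two, or 0. */
	if (ws->len != 0 && (!is_pow2(ws->len) || ws->buf == NULL))
		return -EINVAL;

	/* Short or empty patterns cover one block; longer ones cover len bytes. */
	span = ws->len > blksize ? ws->len : blksize;
	block = calloc(1, span);	/* len == 0 leaves the block zero-filled */
	if (!block)
		return -ENOMEM;
	for (off = 0; ws->len != 0 && off < span; off += ws->len)
		memcpy(block + off, ws->buf, ws->len);

	for (off = 0; off < span; ) {
		ssize_t n = pwrite(fd, block + off, span - off, ws->pos + (off_t)off);

		if (n < 0 && errno == EINTR)
			continue;
		if (n <= 0) {
			ret = n < 0 ? -errno : -EIO;
			break;
		}
		off += (size_t)n;
	}
	free(block);
	return ret;
}
```

Because both the block size and len are powers of two, the pattern always
divides the span evenly and no partial copy is needed. That is part of what
the alignment restrictions buy.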

NFS WRITE_SAME has no alignment restrictions that I'm aware of. Also, I
think it allows the pattern to comb through a file, writing, say, every
other byte, and leaving the unwritten bytes unchanged.

Win32-API has a similar facility with no alignment restrictions and the
ability to comb; in addition it does not seem to set a limit on the size
of the pattern.
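
To show what "combing" means here, the sketch below places a pattern at a
fixed offset inside each of a series of blocks and leaves every other byte
untouched. It operates on a memory region. The descriptor struct comb_fill
and the function comb_apply() are hypothetical simplifications, not the
NFSv4.2 wire format.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified descriptor; not the NFSv4.2 wire format. */
struct comb_fill {
	size_t offset;		/* byte offset of the first block */
	size_t block_size;	/* stride between blocks */
	size_t block_count;	/* number of blocks to touch */
	size_t pat_off;		/* where the pattern sits inside each block */
	const void *pat;	/* pattern bytes */
	size_t pat_len;		/* pattern length in bytes */
};

/* Returns 0 on success, -1 if the request does not fit the region. */
int comb_apply(unsigned char *region, size_t region_len,
	       const struct comb_fill *c)
{
	assert(region != NULL && c != NULL);

	/* The pattern must fit inside one block. */
	if (c->pat_off > c->block_size ||
	    c->pat_len > c->block_size - c->pat_off)
		return -1;
	/* All blocks must fit inside the region (overflow-safe check). */
	if (c->offset > region_len)
		return -1;
	if (c->block_count != 0 &&
	    (region_len - c->offset) / c->block_count < c->block_size)
		return -1;
	if (c->pat_len == 0)
		return 0;

	/* Write only the pattern bytes; the rest of each block is unchanged. */
	for (size_t i = 0; i < c->block_count; i++)
		memcpy(region + c->offset + i * c->block_size + c->pat_off,
		       c->pat, c->pat_len);
	return 0;
}
```

With block_size = 2, pat_off = 0 and a one-byte pattern, this writes every
other byte. A strictly block-aligned, whole-block interface cannot express
that case.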

So, if we start with a simple struct write_same, I would say we want to
provide some API extensibility guarantees, or simply agree that this
form of the API will exist only in prototype.

Fwiw, use cases here are typically databases that want to quickly
initialize files that will store tables. The head and tail of each
ADB (application data block) are sentinels for detecting torn writes,
and the middle segment is typically zeroes or a poison pattern.

We can implement this for block devices and any filesystem that
cares.  The kernel will have to shoot down any page cache, just
like for PUNCH_HOLE and similar.

For a prototype, we can implement this in the NFS client, then hoist it
to the VFS once the users have actually agreed this serves their needs.

To be clear, NFSD also needs to handle WRITE_SAME. Would the
prototype server handle that using clone_file_range?

--
Chuck Lever