On Wed, Jan 15, 2025 at 04:19:28PM +0000, Matthew Wilcox wrote:
> On Wed, Jan 15, 2025 at 10:31:51AM -0500, Chuck Lever wrote:
> > On 1/15/25 10:06 AM, Matthew Wilcox wrote:
> > > On Tue, Jan 14, 2025 at 04:38:03PM -0500, Anna Schumaker wrote:
> > > > I've seen a few requests for implementing the NFS v4.2 WRITE_SAME [1] operation over the last few months [2][3] to accelerate writing patterns of data on the server, so it's been in the back of my mind for a future project. I'll need to write some code somewhere so NFS & NFSD can handle this request. I could keep any implementation internal to NFS / NFSD, but I'd like to find out if local filesystems would find this sort of feature useful and if I should put it in the VFS instead.
> > >
> > > I think we need more information.  I read over the [2] and [3] threads
> > > and the spec.  It _seems like_ the intent in the spec is to expose the
> > > underlying SCSI WRITE SAME command over NFS, but at least one other
> > > response in this thread has been to design an all-singing, all-dancing
> > > superset that can write arbitrary sized blocks to arbitrary locations
> > > in every file on every filesystem, and I think we're going to design
> > > ourselves into an awful implementation if we do that.
> > >
> > > Can we confirm with the people who actually want to use this that all
> > > they really want is to be able to do WRITE SAME as if they were on a
> > > local disc, and then we can implement that in a matter of weeks instead
> > > of taking a trip via Uranus.
> >
> > IME it's been very difficult to get such requesters to provide the
> > detail we need to build to their requirements. Providing them with a
> > limited prototype and letting them comment is likely the fastest way to
> > converge on something useful. Press the Easy Button, then evolve.
> >
> > Trond has suggested starting with clone_file_range, providing it with a
> > pattern and then have the VFS or file system fill exponentially larger
> > segments of the file by replicating that pattern. The question is
> > whether to let consumers simply use that API as it is, or shall we
> > provide some kind of generic infrastructure over that that provides
> > segment replication?
> >
> > With my NFSD hat on, I would prefer to have the file version of "write
> > same" implemented outside of the NFS stack so that other consumers can
> > benefit from using the very same implementation. NFSD (and the NFS
> > client) should simply act as a conduit for these requests via the
> > NFSv4.2 WRITE_SAME operation.
> >
> > I kinda like Dave's ideas too. Enabling offload will be critical to
> > making this feature efficient and thus valuable.
>
> So I have some experience with designing an API like this one which may
> prove either relevant or misleading.
>
> We have bzero() and memset().  If you want to fill with a larger pattern
> than a single byte, POSIX does not provide.  Various people have proposed
> extensions, eg
> https://github.com/ajkaijanaho/publib/blob/master/strutil/memfill.c
>
> But what people really want is the ability to use the x86 rep
> movsw/movsl/movsq instructions.  And so in Linux we now have
> memset16/memset32/memset64/memset_l/memset_p which will map to one
> of those hardware calls.  Sure, we could implement memfill() and then
> specialcase 2/4/8 byte implementations, but nobody actually wants to
> use that.
>
>
> So what API actually makes sense to provide?  I suggest an ioctl,
> implemented at the VFS layer:
>
> struct write_same {
> 	loff_t pos;	/* Where to start writing */

You probably need at least a:

	u64 count;	/* Number of bytes to write */

Since I think the point is that you write buf[len] to the file/disk over
and over again until count bytes have been written, correct?

> 	size_t len;	/* Length of memory pointed to by buf */

(and maybe call this buflen)

--D

> 	char *buf;	/* Pattern to fill with */
> };
>
> ioctl(fd, FIWRITESAME, struct write_same *arg)
>
> 'pos' must be block size aligned.
> 'len' must be a power of two, or 0.  If 0, fill with zeroes.
> If len is shorter than the block size of the file, the kernel
> replicates the pattern in 'buf' within the single block.  If len
> is larger than block size, we're doing a multi-block WRITE_SAME.
>
> We can implement this for block devices and any filesystem that
> cares.  The kernel will have to shoot down any page cache, just
> like for PUNCH_HOLE and similar.
>
>
> For a prototype, we can implement this in the NFS client, then hoist it
> to the VFS once the users have actually agreed this serves their needs.
>
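
To make the semantics under discussion concrete, here is a minimal userspace sketch of the fill behaviour, not kernel code and not a proposed implementation. It assumes the struct gains the count field and the buflen rename suggested above. The helper names (write_same_check, write_same_emulate) are hypothetical, int64_t stands in for loff_t, and the fill goes into a memory buffer standing in for the file range that starts at pos. Truncating the final copy when count is not a multiple of buflen is also an assumption, since the thread does not say what should happen there.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical userspace mirror of the proposed ioctl argument. */
struct write_same {
	int64_t pos;		/* Where to start writing (loff_t in the kernel) */
	uint64_t count;		/* Number of bytes to write (suggested addition) */
	size_t buflen;		/* Length of memory pointed to by buf */
	const char *buf;	/* Pattern to fill with */
};

/* Check the stated constraints: pos is block size aligned and
 * buflen is a power of two, or 0. */
static int write_same_check(const struct write_same *ws, size_t blksize)
{
	if (blksize == 0 || ws->pos < 0 || (uint64_t)ws->pos % blksize)
		return -EINVAL;
	if (ws->buflen & (ws->buflen - 1))	/* 0 and powers of two pass */
		return -EINVAL;
	if (ws->buflen && !ws->buf)
		return -EINVAL;
	return 0;
}

/* Fill dst[0..count) the way the file range at pos would be filled.
 * The caller must provide at least count bytes at dst. */
static int write_same_emulate(const struct write_same *ws, size_t blksize,
			      char *dst)
{
	uint64_t done = 0;
	int ret = write_same_check(ws, blksize);

	if (ret)
		return ret;

	if (ws->buflen == 0) {		/* buflen == 0 means fill with zeroes */
		memset(dst, 0, (size_t)ws->count);
		return 0;
	}

	while (done < ws->count) {
		size_t n = ws->buflen;

		/* Assumption: a trailing partial pattern is truncated. */
		if (ws->count - done < n)
			n = (size_t)(ws->count - done);
		memcpy(dst + done, ws->buf, n);
		done += n;
	}
	assert(done == ws->count);
	return 0;
}
```

With a 4096-byte block size, pos = 4096, buflen = 4, buf = "abcd" and count = 10, the sketch fills the destination with "abcdabcdab". A buflen of 3 or a pos of 100 is rejected with -EINVAL.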