Re: [LSF/MM/BPF TOPIC] Implementing the NFS v4.2 WRITE_SAME operation: VFS or NFS ioctl() ?

Chuck Lever <chuck.lever@xxxxxxxxxx> · Thu, 16 Jan 2025 08:59:19 -0500

On 1/16/25 8:37 AM, Theodore Ts'o wrote:
On Wed, Jan 15, 2025 at 09:42:29PM -0800, Christoph Hellwig wrote:
On Wed, Jan 15, 2025 at 10:14:56AM +1100, Dave Chinner wrote:
How closely does this match to the block device WRITE_SAME
(SCSI/NVMe) commands? I note there is a reference to this in the
RFC, but there are no details given.

There is no write same in NVMe.  In one of the few wise choices in
NVMe the protocol only does a write zeroes for zeroing instead of the
overly complex write same.  And no one has complained about that so
far.

It should be noted that there is currently a patch proposing to add
support to fallocate for the operation FALLOC_FL_WRITE_ZEROES:

https://lore.kernel.org/all/20250115114637.2705887-1-yi.zhang@xxxxxxxxxxxxxxx/

For those use cases where this is all the user requires, perhaps this
is something that Linux's nfs4 client should consider implementing?

I've seen one or two other mentions of "let's make the NFS client do
such and such" in this thread.

To be clear: The proposal includes client and server implementation of
the NFSv4.2 WRITE_SAME operation. This is not a client-only thing.

In fact, the most recent requester mentioned only a server
implementation because they have a client that already implements
WRITE_SAME and want this feature in NFSD.

In any case I'd suggest that interested file system developers comment
on this patch series.

Personally, I have no interest in using or implementing a
WRITE_SAME operation which implements the all-singing, all-dancing
WRITE_SAME as envisioned by the SCSI and NFSv4.2 specifications.

I think we need to consider a weak generic implementation that resides
in the VFS or a library for file systems that choose not to implement
it themselves.
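
To make "weak generic implementation" concrete, here is a rough
user-space sketch of the idea, not proposed kernel code. The name
write_same_fallback(), its signature, and the 1 MiB batch limit are all
made up for illustration. It only repeats a caller-supplied pattern with
pwrite() and does not try to cover everything the NFSv4.2 operation
describes.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not an existing kernel or libc interface):
 * write `count` back-to-back copies of `pattern` starting at `offset`.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int write_same_fallback(int fd, off_t offset, const void *pattern,
                               size_t pattern_len, uint64_t count)
{
    enum { MAX_BATCH_BYTES = 1 << 20 };  /* arbitrary staging-buffer cap */

    if (pattern == NULL || pattern_len == 0 ||
        pattern_len > MAX_BATCH_BYTES) {
        errno = EINVAL;
        return -1;
    }
    if (count == 0)
        return 0;

    /* Stage as many copies as fit in one batch to cut syscall count. */
    size_t per_batch = MAX_BATCH_BYTES / pattern_len;
    assert(per_batch >= 1);
    if (per_batch > count)
        per_batch = (size_t)count;

    char *buf = malloc(per_batch * pattern_len);
    if (buf == NULL)
        return -1;
    for (size_t i = 0; i < per_batch; i++)
        memcpy(buf + i * pattern_len, pattern, pattern_len);

    int ret = 0;
    while (count > 0) {
        size_t n = count < per_batch ? (size_t)count : per_batch;
        size_t len = n * pattern_len;
        size_t done = 0;

        /* pwrite() may be short or interrupted; loop until the batch lands. */
        while (done < len) {
            ssize_t w = pwrite(fd, buf + done, len - done, offset);
            if (w < 0 && errno == EINTR)
                continue;
            if (w <= 0) {
                if (w == 0)
                    errno = EIO;
                ret = -1;
                goto out;
            }
            done += (size_t)w;
            offset += w;
        }
        count -= n;
    }
out:
    {
        int saved = errno;  /* keep the error code across free() */
        free(buf);
        errno = saved;
    }
    return ret;
}
```

A caller initializing 4 KiB database blocks would pass the prebuilt
block image as the pattern and the number of blocks as count. A file
system with a smarter native path would simply not use the fallback.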

I will also note that many Cloud vendors (AWS, GCE, Azure) are moving
to using NVMe instead of SCSI, especially for the higher performance
VM and software-defined block devices.  So, I would suspect that a
customer would have to wave a **very** large amount of money under my
employer's nose before this would be something that would be funded by
$WORK for block-based file systems (and even then, it appears that
NVMe is so much better at higher performance storage, such that I'm
not sure how many customers would really be all that interested).

But hey, if someone knows of some AI-related workload that needs to
write the same non-zero block a very large number of times, let me
know.  :-)

See my previous reply in this thread: WRITE_SAME has a long-standing
existing use case in the database world. The NFSv4.2 WRITE_SAME
operation was designed around this use case.

You remember database workloads, right? ;-)

--
Chuck Lever