Re: Needed: ADB (WRITE_SAME) support in Linux nfsd

Dan Shelton <dan.f.shelton@xxxxxxxxx> · Fri, 10 Jan 2025 01:00:58 +0100

On Tue, 7 Jan 2025 at 17:56, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>
> On 1/7/25 10:36 AM, Takeshi Nishimura wrote:
> > On Tue, Jan 7, 2025 at 4:10 PM Anna Schumaker <anna.schumaker@xxxxxxxxxx> wrote:
> >>
> >> Hi Takeshi,
> >>
> >> On 1/6/25 6:56 PM, Takeshi Nishimura wrote:
> >>> Dear list,
> >>>
> >>> how can we get ADB (WRITE_SAME) support in (Debian) Linux nfsd, and an
> >>> ioctl() in Linux nfsd client to use it?
> >>
> >> Thanks for the request! Just so you're aware of the process, this email list is for upstream Linux kernel development. If we decide to go ahead with adding WRITE_SAME support it'll be up to Debian later to enable it (that part is out of our hands, and isn't up to us).
> >
> > I assume WRITE_SAME will not have a separate build flag, right?
> >
> >>
> >>>
> >>> We have a set of custom "big data" applications which could greatly
> >>> benefit from such an acceleration ABI, both for implementing "zero
> >>> data" (fill blocks with 0 bytes), and fill blocks with identical data
> >>> patterns, without sending the same pattern over and over again over
> >>> the network wire.
> >>
> >> Having said that, I'm not opposed to implementing WRITE_SAME. I wonder if we could somehow use it to build support for fallocate's FALLOC_FL_ZERO_RANGE flag at the same time.
> >
> > No, I am asking really for WRITE_SAME support to write identical data
> > to multiple locations. Like https://linux.die.net/man/8/sg_write_same
> > Writing zero bytes is just a subset, and not what we need. WRITE_SAME
> > is intended as "big data" and database accelerator function.
> >
> >>
> >> I'm also wondering if there would be any advantage to local filesystems if this were to be implemented as a generic system call, rather than as an NFS-specific ioctl(), since some storage devices have a WRITE_SAME operation that could be used for acceleration. But I haven't convinced myself either way yet.
> >
> > Getting a new, generic syscall in Linux takes 3-5 years on average. By
> > then our project will be finished, or renewed with new funding, but
> > all without getting a boost from WRITE_SAME support in NFS-
>
> For comparison:
>
> Adding WRITE_SAME to the Linux NFS client and server implementation is
> on the same order of time -- a year (or perhaps less), then getting it
> into Debian stable will be more than 1 year, probably 2 or 3 (at a
> guess).
>
> A better approach would be for your team to implement what they need,
> use it for your project (ie, custom build your kernels), then contribute
> it to upstream so others can use it too. That would demonstrate there is
> real user demand for this facility, and your code will have gained some
> miles in production.

How should this work? The Linux NFS subsystem has become so incredibly
complex that only a few people can actually work on it.
So "implement it yourself" is basically saying "it will never happen".

>
> You could hire a consultant to implement it for you on a time frame of
> your choosing.

Could you please send me a list of qualified people? We've tried tech
recruiters in NYC, but the results were not good: the candidates were
either so absurdly expensive that just using Windows with SMB3.1 is a
cheaper option, or plainly had no idea what they were talking about.

> In addition, NFSD is responsible only for the network protocol. The
> local file system implementations have to handle the heavy lifting.
> It's not clear to me what infrastructure is already available in Linux
> file systems; that will take some research. (I think that is what
> Anna was hinting at).

No, this thinking is wrong. The main bottleneck is the network, or
better, the overhead of sending repeated data (pattern fill for big
data, zero fill and 0xff/0xdd fill for databases) over the wire.
WRITE_SAME avoids that overhead and reduces the network traffic
DRAMATICALLY (factor 70 with SMB3.1).
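
As a rough illustration of why the wire overhead dominates, the sketch
below compares payload bytes for a plain pattern fill against a single
WRITE_SAME-style request. The function names, the 64-byte request
header and the example sizes are assumptions for illustration only, not
measurements and not the actual on-the-wire encoding. It ignores
per-request headers for the plain writes and does not try to reproduce
the factor 70 seen with SMB3.1.

```python
def wire_bytes_plain(pattern_len, repeat_count):
    """Payload bytes when the client sends every copy of the pattern."""
    return pattern_len * repeat_count


def wire_bytes_write_same(pattern_len, header_len=64):
    """Payload bytes when the client sends the pattern once.

    header_len is a hypothetical fixed cost for offset, repeat count
    and similar fields; the real size depends on the protocol.
    """
    return header_len + pattern_len


if __name__ == "__main__":
    pattern_len = 4096    # one 4 KiB block (assumed)
    repeat_count = 1024   # fill 4 MiB with it (assumed)
    plain = wire_bytes_plain(pattern_len, repeat_count)
    same = wire_bytes_write_same(pattern_len)
    print(f"plain writes: {plain} bytes")
    print(f"write-same:   {same} bytes")
    print(f"reduction:    {plain / same:.1f}x")
```

The reduction grows with the repeat count, because the WRITE_SAME-style
request stays the same size no matter how large the filled range is.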

So tacking WRITE_SAME on as an ioctl() on the client side, and expanding
it as a loop over write() in nfsd, would be reasonable as the first
implementation of WRITE_SAME.
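
A minimal userspace sketch of that expansion step, assuming the server
has already received one pattern plus a repeat count from the client.
The helper name write_same() and its parameters are hypothetical; this
is not nfsd code and not an existing kernel interface, it only shows
the "loop over write()" idea using pwrite() on an ordinary file
descriptor.

```python
import os
import tempfile


def write_same(fd, offset, pattern, count):
    """Write `pattern` `count` times back to back, starting at `offset`.

    Hypothetical helper: stands in for what a server could do after
    receiving a single pattern and a repeat count.
    Returns the total number of bytes written.
    """
    if not pattern:
        raise ValueError("pattern must not be empty")
    total = 0
    for i in range(count):
        pos = offset + i * len(pattern)
        view = memoryview(pattern)
        # pwrite() may write fewer bytes than requested, so retry
        # until the whole pattern has been written at this position.
        while view:
            n = os.pwrite(fd, view, pos)
            view = view[n:]
            pos += n
            total += n
    return total


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        # Fill 3 blocks of 4 bytes with 0xdd, starting at offset 8.
        written = write_same(f.fileno(), 8, b"\xdd" * 4, 3)
        print(written, os.pread(f.fileno(), 32, 0).hex())
```

Only the pattern and the count would have to cross the network; the
repetition happens next to the storage. A later version could replace
the loop with whatever acceleration the local file system offers.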

What IMO is not reasonable is to say we have to add a super-API which
covers all filesystems and all use cases, and somehow even connects to
sg_write_same(8) too, and all that in a single patch.
That would really take a year, and really would involve everyone at
kernel.org, becoming an F-35-like job generator for everyone.
-- 
Dan Shelton - Cluster Specialist Win/Lin/Bsd