Re: Needed: ADB (WRITE_SAME) support in Linux nfsd

Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> · Fri, 10 Jan 2025 01:04:00 +0000

On Tue, 2025-01-07 at 11:55 -0500, Chuck Lever wrote:
> On 1/7/25 10:36 AM, Takeshi Nishimura wrote:
> > On Tue, Jan 7, 2025 at 4:10 PM Anna Schumaker
> > <anna.schumaker@xxxxxxxxxx> wrote:
> > >
> > > Hi Takeshi,
> > >
> > > On 1/6/25 6:56 PM, Takeshi Nishimura wrote:
> > > > Dear list,
> > > >
> > > > how can we get ADB (WRITE_SAME) support in (Debian) Linux nfsd,
> > > > and an
> > > > ioctl() in Linux nfsd client to use it?
> > >
> > > Thanks for the request! Just so you're aware of the process, this
> > > email list is for upstream Linux kernel development. If we decide
> > > to go ahead with adding WRITE_SAME support it'll be up to Debian
> > > later to enable it (that part is out of our hands, and isn't up
> > > to us).
> >
> > I assume WRITE_SAME will not have a separate build flag, right?
> >
> > >
> > > >
> > > > We have a set of custom "big data" applications which could
> > > > greatly
> > > > benefit from such an acceleration ABI, both for implementing
> > > > "zero
> > > > data" (fill blocks with 0 bytes), and fill blocks with
> > > > identical data
> > > > patterns, without sending the same pattern over and over again
> > > > over
> > > > the network wire.
> > >
> > > Having said that, I'm not opposed to implementing WRITE_SAME. I
> > > wonder if we could somehow use it to build support for
> > > fallocate's FALLOC_FL_ZERO_RANGE flag at the same time.
> >
> > No, I am asking really for WRITE_SAME support to write identical
> > data
> > to multiple locations. Like
> > https://linux.die.net/man/8/sg_write_same
> > Writing zero bytes is just a subset, and not what we need.
> > WRITE_SAME
> > is intended as "big data" and database accelerator function.
> >
> > >
> > > I'm also wondering if there would be any advantage to local
> > > filesystems if this were to be implemented as a generic system
> > > call, rather than as an NFS-specific ioctl(), since some storage
> > > devices have a WRITE_SAME operation that could be used for
> > > acceleration. But I haven't convinced myself either way yet.
> >
> > Getting a new, generic syscall in Linux takes 3-5 years on average.
> > By
> > then our project will be finished, or renewed with new funding, but
> > all without getting a boost from WRITE_SAME support in NFS-
>
> For comparison:
>
> Adding WRITE_SAME to the Linux NFS client and server implementation
> is
> on the same order of time -- a year (or perhaps less), then getting
> it
> into Debian stable will be more than 1 year, probably 2 or 3 (at a
> guess).
>
> A better approach would be for your team to implement what they need,
> use it for your project (ie, custom build your kernels), then
> contribute
> it to upstream so others can use it too. That would demonstrate there
> is
> real user demand for this facility, and your code will have gained
> some
> miles in production.
>
> You could hire a consultant to implement it for you on a time frame
> that
> is your choosing.
>
> Upstream prioritizes economy of maintenance over code velocity;
> meaning,
> how quickly a feature can be prototyped and productized is less
> important to us than how much the feature will cost us to maintain in
> the long run.
>
> With my NFSD co-maintainer hat on: I would accept a WRITE_SAME
> implementation, but it would have to come with tests -- pynfs and
> xfstests are the usual test harnesses that can accommodate those.
>
> In addition, NFSD is responsible only for the network protocol. The
> local file system implementations have to handle the heavy lifting.
> It's not clear to me what infrastructure is already available in
> Linux
> file systems; that will take some research. (I think that is what
> Anna was hinting at).
>

This functionality should be possible to implement using the
clone_range ioctl() (FICLONERANGE), either on the server or on the
client.

Yes, you'll have to use multiple clone_range calls, but you can use a
geometric series to do it efficiently (i.e. write the pattern once,
then clone 1*pattern, clone 2*pattern, clone 4*pattern, etc., so the
filled region doubles with each call).
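
As a rough illustration of the doubling scheme, here is a minimal Python
sketch. The helper names (doubling_plan, simulate, clone_fill) are made up
for this example and are not an existing API. It assumes a Linux filesystem
with reflink support (e.g. XFS or btrfs), and that the pattern length and
total length are multiples of the filesystem block size, since
FICLONERANGE generally rejects unaligned ranges. simulate() only applies
the same plan to an in-memory buffer so the arithmetic can be checked
without such a filesystem.

```python
import fcntl
import os
import struct

# _IOW(0x94, 13, struct file_clone_range) from <linux/fs.h>
FICLONERANGE = 0x4020940D


def doubling_plan(pattern_len, total_len):
    """Return (src_offset, dst_offset, length) steps that extend an
    already-written pattern at offset 0 to total_len bytes, cloning the
    filled prefix onto the space right after it on each step."""
    if pattern_len <= 0:
        raise ValueError("pattern_len must be positive")
    steps = []
    filled = pattern_len
    while filled < total_len:
        # Clone as much of the filled prefix as still fits.
        length = min(filled, total_len - filled)
        steps.append((0, filled, length))
        filled += length
    return steps


def simulate(pattern, total_len):
    """Apply the plan to a bytearray instead of a file (no ioctl)."""
    if total_len < len(pattern):
        raise ValueError("total_len is shorter than the pattern")
    buf = bytearray(total_len)
    buf[:len(pattern)] = pattern
    for src, dst, length in doubling_plan(len(pattern), total_len):
        buf[dst:dst + length] = buf[src:src + length]
    return bytes(buf)


def clone_fill(fd, pattern, total_len):
    """Write the pattern once, then grow it with FICLONERANGE calls.
    Assumption: fd is opened read/write on a reflink-capable filesystem
    and all offsets/lengths are block-aligned."""
    os.pwrite(fd, pattern, 0)
    for src, dst, length in doubling_plan(len(pattern), total_len):
        # struct file_clone_range { s64 src_fd; u64 src_offset;
        #                           u64 src_length; u64 dest_offset; }
        arg = struct.pack("=qQQQ", fd, src, length, dst)
        fcntl.ioctl(fd, FICLONERANGE, arg)
```

The number of clone calls grows with log2(total_len / pattern_len) rather
than linearly, which is the point of the geometric series.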

It's not hard to do, and the advantage is that it can work for all
filesystems that implement clone_range. You'd not be limited to just
using NFS with a special WRITE_SAME ioctl. Furthermore, doing it this
way is space-efficient on most filesystems.

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx