Re: Needed: ADB (WRITE_SAME) support in Linux nfsd

Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> · Fri, 10 Jan 2025 18:40:34 +0000

On Fri, 2025-01-10 at 06:14 +0100, Takeshi Nishimura wrote:
> On Fri, Jan 10, 2025 at 2:04 AM Trond Myklebust
> <trondmy@xxxxxxxxxxxxxxx> wrote:
> > 
> > On Tue, 2025-01-07 at 11:55 -0500, Chuck Lever wrote:
> > > On 1/7/25 10:36 AM, Takeshi Nishimura wrote:
> > > > On Tue, Jan 7, 2025 at 4:10 PM Anna Schumaker
> > > > <anna.schumaker@xxxxxxxxxx> wrote:
> > > > > 
> > > > > Hi Takeshi,
> > > > > 
> > > > > On 1/6/25 6:56 PM, Takeshi Nishimura wrote:
> > > > > > Dear list,
> > > > > > 
> > > > > > how can we get ADB (WRITE_SAME) support in (Debian) Linux nfsd,
> > > > > > and an ioctl() in Linux nfsd client to use it?
> > > > > 
> > > > > Thanks for the request! Just so you're aware of the process, this
> > > > > email list is for upstream Linux kernel development. If we decide
> > > > > to go ahead with adding WRITE_SAME support it'll be up to Debian
> > > > > later to enable it (that part is out of our hands, and isn't up
> > > > > to us).
> > > > 
> > > > I assume WRITE_SAME will not have a separate build flag, right?
> > > > 
> > > > > 
> > > > > > 
> > > > > > We have a set of custom "big data" applications which could greatly
> > > > > > benefit from such an acceleration ABI, both for implementing "zero
> > > > > > data" (fill blocks with 0 bytes), and fill blocks with identical data
> > > > > > patterns, without sending the same pattern over and over again over
> > > > > > the network wire.
> > > > > 
> > > > > Having said that, I'm not opposed to implementing WRITE_SAME. I
> > > > > wonder if we could somehow use it to build support for
> > > > > fallocate's FALLOC_FL_ZERO_RANGE flag at the same time.
> > > > 
> > > > No, I am asking really for WRITE_SAME support to write identical data
> > > > to multiple locations. Like
> > > > https://linux.die.net/man/8/sg_write_same
> > > > Writing zero bytes is just a subset, and not what we need. WRITE_SAME
> > > > is intended as "big data" and database accelerator function.
> > > > 
> > > > > 
> > > > > I'm also wondering if there would be any advantage to local
> > > > > filesystems if this were to be implemented as a generic system
> > > > > call, rather than as an NFS-specific ioctl(), since some storage
> > > > > devices have a WRITE_SAME operation that could be used for
> > > > > acceleration. But I haven't convinced myself either way yet.
> > > > 
> > > > Getting a new, generic syscall in Linux takes 3-5 years on average. By
> > > > then our project will be finished, or renewed with new funding, but
> > > > all without getting a boost from WRITE_SAME support in NFS-
> > > 
> > > For comparison:
> > > 
> > > Adding WRITE_SAME to the Linux NFS client and server implementation is
> > > on the same order of time -- a year (or perhaps less), then getting it
> > > into Debian stable will be more than 1 year, probably 2 or 3 (at a
> > > guess).
> > > 
> > > A better approach would be for your team to implement what they need,
> > > use it for your project (ie, custom build your kernels), then contribute
> > > it to upstream so others can use it too. That would demonstrate there is
> > > real user demand for this facility, and your code will have gained some
> > > miles in production.
> > > 
> > > You could hire a consultant to implement it for you on a time frame that
> > > is your choosing.
> > > 
> > > Upstream prioritizes economy of maintenance over code velocity; meaning,
> > > how quickly a feature can be prototyped and productized is less
> > > important to us than how much the feature will cost us to maintain in
> > > the long run.
> > > 
> > > With my NFSD co-maintainer hat on: I would accept a WRITE_SAME
> > > implementation, but it would have to come with tests -- pynfs and
> > > xfstests are the usual test harnesses that can accommodate those.
> > > 
> > > In addition, NFSD is responsible only for the network protocol. The
> > > local file system implementations have to handle the heavy lifting.
> > > It's not clear to me what infrastructure is already available in Linux
> > > file systems; that will take some research. (I think that is what
> > > Anna was hinting at).
> > > 
> > 
> > This functionality should be possible to implement using the
> > clone_range ioctl() on the server or on the client for that matter.
> > 
> > Yes, you'll have to use multiple clone_range calls, but you can use a
> > geometric series to do it efficiently (i.e. write pattern, clone
> > pattern, clone 2*pattern, clone 4*pattern, etc....).
> > 
> > It's not hard to do, and the advantage is that it can work for all
> > filesystems that implement clone_range. You'd not be limited to just
> > using NFS with a special WRITE_SAME ioctl. Furthermore, doing it this
> > way is space-efficient on most filesystems.
> > 
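For illustration, the geometric series quoted above could be sketched
roughly as follows. This is only a sketch under assumptions: the names
plan_fill and fill_by_cloning are hypothetical, the pattern length and
region start are assumed to be multiples of the filesystem block size
(clone_range generally requires aligned ranges), the file is assumed to
live on a filesystem that supports FICLONERANGE, and short writes are
not handled.

```python
import struct

# FICLONERANGE from <linux/fs.h>: _IOW(0x94, 13, struct file_clone_range).
FICLONERANGE = 0x4020940D


def plan_fill(pattern_len, total_len):
    """Return the clone steps, as (src_off, dst_off, length) tuples relative
    to the start of the region, that extend one written copy of the pattern
    until total_len bytes are covered."""
    if pattern_len <= 0 or total_len < pattern_len:
        raise ValueError("need 0 < pattern_len <= total_len")
    steps = []
    filled = pattern_len  # the first copy is written normally, not cloned
    while filled < total_len:
        # Clone everything filled so far (doubling), or just the remaining tail.
        length = min(filled, total_len - filled)
        steps.append((0, filled, length))
        filled += length
    return steps


def fill_by_cloning(fd, start, pattern, total_len):
    """Hypothetical helper: write the pattern once at `start`, then issue one
    FICLONERANGE ioctl per planned step within the same file (assumption:
    the filesystem accepts same-file clones of non-overlapping ranges)."""
    import fcntl  # Linux-only path, imported lazily
    import os

    os.pwrite(fd, pattern, start)
    for src, dst, length in plan_fill(len(pattern), total_len):
        # struct file_clone_range { s64 src_fd; u64 src_offset;
        #                           u64 src_length; u64 dest_offset; }
        arg = struct.pack("=qQQQ", fd, start + src, length, start + dst)
        fcntl.ioctl(fd, FICLONERANGE, arg)
```

The number of clone calls grows with the logarithm of the region size
rather than linearly, since each step doubles the filled prefix. Nothing
in this sequence is atomic with respect to other writers.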
> 
> What will happen if someone else writes into the same location while
> the geometric series is running?
> Should WRITE_SAME not be atomic, or at least protect against other
> writes destroying the data?

No. There is no requirement or promise of atomicity for WRITE_SAME in
RFC 7862, nor is there any requirement that the server lock out or deny
other writes.

In fact, section 15.12.3 of the spec states explicitly that partial
completion may occur, just like for an ordinary WRITE. It also states
that the NFS client should use locking if it requires serialisation
w.r.t. other I/O operations. Finally, it describes the asynchronous
behaviour, and how the client can track progress in the case of large
WRITE_SAME requests that cannot be handled quickly enough to be
synchronous.

So if you require atomicity, then you need to look somewhere else.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx