Re: Needed: ADB (WRITE_SAME) support in Linux nfsd

On Fri, Jan 10, 2025 at 2:04 AM Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>
> On Tue, 2025-01-07 at 11:55 -0500, Chuck Lever wrote:
> > On 1/7/25 10:36 AM, Takeshi Nishimura wrote:
> > > On Tue, Jan 7, 2025 at 4:10 PM Anna Schumaker
> > > <anna.schumaker@xxxxxxxxxx> wrote:
> > > >
> > > > Hi Takeshi,
> > > >
> > > > On 1/6/25 6:56 PM, Takeshi Nishimura wrote:
> > > > > Dear list,
> > > > >
> > > > > how can we get ADB (WRITE_SAME) support in (Debian) Linux nfsd,
> > > > > and an
> > > > > > ioctl() in Linux nfsd client to use it?
> > > >
> > > > Thanks for the request! Just so you're aware of the process, this
> > > > email list is for upstream Linux kernel development. If we decide
> > > > to go ahead with adding WRITE_SAME support it'll be up to Debian
> > > > later to enable it (that part is out of our hands, and isn't up
> > > > to us).
> > >
> > > I assume WRITE_SAME will not have a separate build flag, right?
> > >
> > > >
> > > > >
> > > > > We have a set of custom "big data" applications which could
> > > > > greatly
> > > > > benefit from such an acceleration ABI, both for implementing
> > > > > "zero
> > > > > data" (fill blocks with 0 bytes), and fill blocks with
> > > > > identical data
> > > > > patterns, without sending the same pattern over and over again
> > > > > over
> > > > > the network wire.
> > > >
> > > > Having said that, I'm not opposed to implementing WRITE_SAME. I
> > > > wonder if we could somehow use it to build support for
> > > > fallocate's FALLOC_FL_ZERO_RANGE flag at the same time.
> > >
> > > No, I am asking really for WRITE_SAME support to write identical
> > > data
> > > to multiple locations. Like
> > > https://linux.die.net/man/8/sg_write_same
> > > Writing zero bytes is just a subset, and not what we need.
> > > WRITE_SAME
> > > is intended as "big data" and database accelerator function.
> > >
> > > >
> > > > I'm also wondering if there would be any advantage to local
> > > > filesystems if this were to be implemented as a generic system
> > > > call, rather than as an NFS-specific ioctl(), since some storage
> > > > devices have a WRITE_SAME operation that could be used for
> > > > acceleration. But I haven't convinced myself either way yet.
> > >
> > > Getting a new, generic syscall in Linux takes 3-5 years on average.
> > > By
> > > then our project will be finished, or renewed with new funding, but
> > > all without getting a boost from WRITE_SAME support in NFS-
> >
> > For comparison:
> >
> > Adding WRITE_SAME to the Linux NFS client and server implementation
> > is
> > on the same order of time -- a year (or perhaps less), then getting
> > it
> > into Debian stable will be more than 1 year, probably 2 or 3 (at a
> > guess).
> >
> > A better approach would be for your team to implement what they need,
> > use it for your project (i.e., custom-build your kernels), then
> > contribute
> > it to upstream so others can use it too. That would demonstrate there
> > is
> > real user demand for this facility, and your code will have gained
> > some
> > miles in production.
> >
> > You could hire a consultant to implement it for you on a time frame
> > of your choosing.
> >
> > Upstream prioritizes economy of maintenance over code velocity;
> > meaning,
> > how quickly a feature can be prototyped and productized is less
> > important to us than how much the feature will cost us to maintain in
> > the long run.
> >
> > With my NFSD co-maintainer hat on: I would accept a WRITE_SAME
> > implementation, but it would have to come with tests -- pynfs and
> > xfstests are the usual test harnesses that can accommodate those.
> >
> > In addition, NFSD is responsible only for the network protocol. The
> > local file system implementations have to handle the heavy lifting.
> > It's not clear to me what infrastructure is already available in
> > Linux
> > file systems; that will take some research. (I think that is what
> > Anna was hinting at).
> >
>
> This functionality should be possible to implement using the
> clone_range ioctl() on the server or on the client for that matter.
>
> Yes, you'll have to use multiple clone_range calls, but you can use a
> geometric series to do it efficiently (i.e. write pattern, clone
> pattern, clone 2*pattern, clone 4*pattern, etc....).
>
> It's not hard to do, and the advantage is that it can work for all
> filesystems that implement clone_range. You'd not be limited to just
> using NFS with a special WRITE_SAME ioctl. Furthermore, doing it this
> way is space-efficient on most filesystems.
>

What will happen if someone else writes into the same location while
the geometric series is running?
Should WRITE_SAME not be atomic, or at least protect against other
writes destroying the data?
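
To make the concern concrete, here is a minimal sketch of the
geometric-series fill as I understand it. It is only an illustration under
stated assumptions, not a proposed implementation: it uses
copy_file_range() as a stand-in for the clone_range ioctl (which, as far as
I know, needs a reflink-capable filesystem and block-aligned ranges), and
the names fill_pattern, _copy_within and demo_fill are made up for this
example.

```python
import os
import tempfile


def _copy_within(fd, src, dst, count):
    """Copy up to count bytes from offset src to offset dst in one file."""
    try:
        # Stand-in for clone_range: the kernel may satisfy this with a
        # reflink where the filesystem supports it, otherwise it copies.
        copied = os.copy_file_range(fd, fd, count, src, dst)
        if copied > 0:
            return copied
    except (AttributeError, OSError):
        pass
    # Fallback: plain read and write through user space.
    data = os.pread(fd, count, src)
    return os.pwrite(fd, data, dst)


def fill_pattern(fd, pattern, total_len):
    """Fill bytes [0, total_len) of fd with repeated copies of pattern."""
    if not pattern or total_len <= 0:
        return
    plen = len(pattern)

    # Step 1: write the pattern once (cut short if total_len is smaller).
    seed = pattern[:total_len]
    filled = 0
    while filled < len(seed):
        filled += os.pwrite(fd, seed[filled:], filled)

    # Step 2: copy the already-filled prefix onto the end, so the filled
    # region roughly doubles per call (pattern, 2*pattern, 4*pattern, ...).
    while filled < total_len:
        # Start at the pattern phase that belongs at offset `filled`, so a
        # short copy cannot misalign the pattern on the next round.
        src = filled % plen
        chunk = min(filled - src, total_len - filled)
        copied = _copy_within(fd, src, filled, chunk)
        if copied <= 0:
            raise OSError("no progress while copying")
        filled += copied


def demo_fill(pattern, total_len):
    """Fill a temporary file and return its first total_len bytes."""
    with tempfile.TemporaryFile() as f:
        fd = f.fileno()
        fill_pattern(fd, pattern, total_len)
        return os.pread(fd, total_len, 0)
```

For example, demo_fill(b"abc", 1000) needs about ten copy calls instead of
several hundred pattern writes. Note that every call in the loop is a
separate system call and nothing in the sketch locks the range, so another
writer can change the source or destination bytes between two steps, and
the damage is then copied forward by the later steps.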
-- 
Internationalization&localization dev / 大阪大学
Takeshi Nishimura <takeshi.nishimura.linux@xxxxxxxxx>




