On Wed, Mar 19, 2025 at 12:37:29PM -0600, Jens Axboe wrote:
> On 3/19/25 11:45 AM, Joe Damato wrote:
> > On Wed, Mar 19, 2025 at 11:20:50AM -0600, Jens Axboe wrote:
> >> On 3/19/25 11:04 AM, Joe Damato wrote:
> >>> On Wed, Mar 19, 2025 at 10:07:27AM -0600, Jens Axboe wrote:
> >>>> On 3/19/25 9:32 AM, Joe Damato wrote:
> >>>>> On Wed, Mar 19, 2025 at 01:04:48AM -0700, Christoph Hellwig wrote:
> >>>>>> On Wed, Mar 19, 2025 at 12:15:11AM +0000, Joe Damato wrote:
> >>>>>>> One way to fix this is to add zerocopy notifications to sendfile similar
> >>>>>>> to how MSG_ZEROCOPY works with sendmsg. This is possible thanks to the
> >>>>>>> extensive work done by Pavel [1].
> >>>>>>
> >>>>>> What is a "zerocopy notification"
> >>>>>
> >>>>> See the docs on MSG_ZEROCOPY [1], but in short when a user app calls
> >>>>> sendmsg and passes MSG_ZEROCOPY, a completion notification is added
> >>>>> to the error queue. The user app can poll for these to find out when
> >>>>> the TX has completed and the buffer it passed to the kernel can be
> >>>>> overwritten.
> >>>>>
> >>>>> My series provides the same functionality via splice and sendfile2.
> >>>>>
> >>>>> [1]: https://www.kernel.org/doc/html/v6.13/networking/msg_zerocopy.html
> >>>>>
> >>>>>> and why aren't you simply plugging this into io_uring and generate
> >>>>>> a CQE so that it works like all other asynchronous operations?
> >>>>>
> >>>>> I linked to the iouring work that Pavel did in the cover letter.
> >>>>> Please take a look.
> >>>>>
> >>>>> That work refactored the internals of how zerocopy completion
> >>>>> notifications are wired up, allowing other pieces of code to use the
> >>>>> same infrastructure and extend it, if needed.
> >>>>>
> >>>>> My series is using the same internals that iouring (and others) use
> >>>>> to generate zerocopy completion notifications. Unlike iouring,
> >>>>> though, I don't need a fully customized implementation with a new
> >>>>> user API for harvesting completion events; I can use the existing
> >>>>> mechanism already in the kernel that user apps already use for
> >>>>> sendmsg (the error queue, as explained above and in the
> >>>>> MSG_ZEROCOPY documentation).
> >>>>
> >>>> The error queue is arguably a work-around for _not_ having a delivery
> >>>> mechanism that works with a sync syscall in the first place. The main
> >>>> question here imho would be "why add a whole new syscall etc when
> >>>> there's already an existing way to accomplish this, with
> >>>> free-to-reuse notifications". If the answer is "because splice", then it
> >>>> would seem saner to plumb up those bits only. Would be much simpler
> >>>> too...
> >>>
> >>> I may be misunderstanding your comment, but my response would be:
> >>>
> >>> There are existing apps which use sendfile today unsafely and
> >>> it would be very nice to have a safe sendfile equivalent. Converting
> >>> existing apps to using iouring (if I understood your suggestion?)
> >>> would be significantly more work compared to calling sendfile2 and
> >>> adding code to check the error queue.
> >>
> >> It's really not, if you just want to use it as a sync kind of thing. If
> >> you want to have multiple things in flight etc, yeah it could be more
> >> work, you'd also get better performance that way. And you could use
> >> things like registered buffers for either of them, which again would
> >> likely make it more efficient.
> >
> > I haven't argued that performance would be better using sendfile2
> > compared to iouring, just that existing apps which already use
> > sendfile (but do so unsafely) would probably be more likely to use a
> > safe alternative with existing examples of how to harvest completion
> > notifications vs something more complex, like wrapping iouring.
>
> Sure and I get that, just not sure it'd be worth doing on the kernel
> side for such (fairly) weak reasoning. The performance benefit is just a
> side note in that if you did do it this way, you'd potentially be able
> to run it more efficiently too. And regardless what people do or use
> now, they are generally always interested in that aspect.

Fair enough.

> >> If you just use it as a sync thing, it'd be pretty trivial to just wrap
> >> a my_sendfile_foo() in a submit_and_wait operation, which issues and
> >> waits on the completion in a single syscall. And if you want to wait on
> >> the notification too, you could even do that in the same syscall and
> >> wait on 2 CQEs. That'd be a downright trivial way to provide a sync way
> >> of doing the same thing.
> >
> > I don't disagree; I just don't know if app developers:
> > a.) know that this is possible to do, and
> > b.) know how to do it
>
> Writing that wrapper would be not even a screenful of code. Yes maybe
> they don't know how to do it now, but it's _really_ trivial to do. It'd
> take me roughly 1 min to do that, would be happy to help out with that
> side so it could go into a commit or man page or whatever.

I'd never be opposed to more documentation ;)

> > In general: it does seem a bit odd to me that there isn't a safe
> > sendfile syscall in Linux that uses existing completion notification
> > mechanisms.
>
> Pretty natural, I think. sendfile(2) predates that by quite a bit, and
> the last real change to sendfile was using splice underneath. Which I
> did, and that was probably almost 20 years ago at this point...
>
> I do think it makes sense to have a sendfile that's both fast and
> efficient, and can be used sanely with buffer reuse without relying on
> odd heuristics.
Just trying to tie this together in my head -- are you saying that you
think the kernel internals of sendfile could be changed in a different
way, or that this is a userland problem (and they should use the
io_uring wrapper you suggested above)?

> >>> I would also argue that there are likely user apps out there that
> >>> use both sendmsg MSG_ZEROCOPY for certain writes (for data in
> >>> memory) and also use sendfile (for data on disk). One example would
> >>> be a reverse proxy that might write HTTP headers to clients via
> >>> sendmsg but transmit the response body with sendfile.
> >>>
> >>> For those apps, the code to check the error queue already exists for
> >>> sendmsg + MSG_ZEROCOPY, so swapping in sendfile2 seems like an easy
> >>> way to ensure safe sendfile usage.
> >>
> >> Sure that is certainly possible. I didn't say that wasn't the case,
> >> rather that the error queue approach is a work-around in the first place
> >> for not having some kind of async notification mechanism for when it's
> >> free to reuse.
> >
> > Of course, I certainly agree that the error queue is a work-around.
> > But it works, apps use it, and it's fairly well known. I don't see any
> > reason, other than historical context, why sendmsg can use this
> > mechanism, splice can, but sendfile shouldn't?
>
> My argument would be the same as for other features - if you can do it
> simpler this other way, why not consider that? The end result would be
> the same, you can do fast sendfile() with sane buffer reuse. But the
> kernel side would be simpler, which is always a main goal for
> those of us that have to maintain it.
>
> Just adding sendfile2() works in the sense that it's an easier drop-in
> replacement for an app, though the error queue side does mean it needs
> to change anyway - it's not just replacing one syscall with another. And
> if we want to be lazy, sure that's fine. I just don't think it's the
> best way to do it when we literally have a mechanism that's designed for
> this and works with reuse already with normal send zc (and receive side
> too, in the next kernel).

It seems like you've answered the question I asked above and that you
are suggesting there might be a better and simpler sendfile2 kernel-side
implementation that doesn't rely on splice internals at all. Am I
following you?

If so, I'll drop the sendfile2 stuff from this series and stick with the
splice changes only, if you are (at a high level) OK with the idea of
adding a flag for this to splice.

In the meantime, I'll take a few more reads through the iouring code to
see if I can work out how sendfile2 might be built on top of that
instead of splice in the kernel.

Thank you very much for your time, feedback, and attention,
Joe