On Wed, Mar 19, 2025 at 12:37:29PM -0600, Jens Axboe wrote:
> On 3/19/25 11:45 AM, Joe Damato wrote:
> > On Wed, Mar 19, 2025 at 11:20:50AM -0600, Jens Axboe wrote:
> >> On 3/19/25 11:04 AM, Joe Damato wrote:
> >>> On Wed, Mar 19, 2025 at 10:07:27AM -0600, Jens Axboe wrote:
> >>>> On 3/19/25 9:32 AM, Joe Damato wrote:
> >>>>> On Wed, Mar 19, 2025 at 01:04:48AM -0700, Christoph Hellwig wrote:
> >>>>>> On Wed, Mar 19, 2025 at 12:15:11AM +0000, Joe Damato wrote:
> >>>>>>> One way to fix this is to add zerocopy notifications to sendfile similar
> >>>>>>> to how MSG_ZEROCOPY works with sendmsg. This is possible thanks to the
> >>>>>>> extensive work done by Pavel [1].
> >>>>>>
> >>>>>> What is a "zerocopy notification"
> >>>>>
> >>>>> See the docs on MSG_ZEROCOPY [1], but in short when a user app calls
> >>>>> sendmsg and passes MSG_ZEROCOPY, a completion notification is added
> >>>>> to the error queue. The user app can poll for these to find out when
> >>>>> the TX has completed and the buffer it passed to the kernel can be
> >>>>> overwritten.
> >>>>>
> >>>>> My series provides the same functionality via splice and sendfile2.
> >>>>>
> >>>>> [1]: https://www.kernel.org/doc/html/v6.13/networking/msg_zerocopy.html
> >>>>>
> >>>>>> and why aren't you simply plugging this into io_uring and generate
> >>>>>> a CQE so that it works like all other asynchronous operations?
> >>>>>
> >>>>> I linked to the iouring work that Pavel did in the cover letter.
> >>>>> Please take a look.
> >>>>>
> >>>>> That work refactored the internals of how zerocopy completion
> >>>>> notifications are wired up, allowing other pieces of code to use the
> >>>>> same infrastructure and extend it, if needed.
> >>>>>
> >>>>> My series is using the same internals that iouring (and others) use
> >>>>> to generate zerocopy completion notifications. Unlike iouring,
> >>>>> though, I don't need a fully customized implementation with a new
> >>>>> user API for harvesting completion events; I can use the existing
> >>>>> mechanism already in the kernel that user apps already use for
> >>>>> sendmsg (the error queue, as explained above and in the
> >>>>> MSG_ZEROCOPY documentation).
> >>>>
> >>>> The error queue is arguably a work-around for _not_ having a delivery
> >>>> mechanism that works with a sync syscall in the first place. The main
> >>>> question here imho would be "why add a whole new syscall etc when
> >>>> there's already an existing way to accomplish this, with
> >>>> free-to-reuse notifications". If the answer is "because splice", then it
> >>>> would seem saner to plumb up those bits only. Would be much simpler
> >>>> too...
> >>>
> >>> I may be misunderstanding your comment, but my response would be:
> >>>
> >>> There are existing apps which use sendfile today unsafely and
> >>> it would be very nice to have a safe sendfile equivalent. Converting
> >>> existing apps to using iouring (if I understood your suggestion?)
> >>> would be significantly more work compared to calling sendfile2 and
> >>> adding code to check the error queue.
> >>
> >> It's really not, if you just want to use it as a sync kind of thing. If
> >> you want to have multiple things in flight etc, yeah it could be more
> >> work, you'd also get better performance that way. And you could use
> >> things like registered buffers for either of them, which again would
> >> likely make it more efficient.
> >
> > I haven't argued that performance would be better using sendfile2
> > compared to iouring, just that existing apps which already use
> > sendfile (but do so unsafely) would probably be more likely to use a
> > safe alternative with existing examples of how to harvest completion
> > notifications vs something more complex, like wrapping iouring.
>
> Sure and I get that, just not sure it'd be worth doing on the kernel
> side for such (fairly) weak reasoning. The performance benefit is just a
> side note in that if you did do it this way, you'd potentially be able
> to run it more efficiently too. And regardless what people do or use
> now, they are generally always interested in that aspect.

Fair enough.

> >> If you just use it as a sync thing, it'd be pretty trivial to just wrap
> >> a my_sendfile_foo() in a submit_and_wait operation, which issues and
> >> waits on the completion in a single syscall. And if you want to wait on
> >> the notification too, you could even do that in the same syscall and
> >> wait on 2 CQEs. That'd be a downright trivial way to provide a sync way
> >> of doing the same thing.
> >
> > I don't disagree; I just don't know if app developers:
> > a.) know that this is possible to do, and
> > b.) know how to do it
>
> Writing that wrapper would be not even a screenful of code. Yes maybe
> they don't know how to do it now, but it's _really_ trivial to do. It'd
> take me roughly 1 min to do that, would be happy to help out with that
> side so it could go into a commit or man page or whatever.

I'd never be opposed to more documentation ;)

> > In general: it does seem a bit odd to me that there isn't a safe
> > sendfile syscall in Linux that uses existing completion notification
> > mechanisms.
>
> Pretty natural, I think. sendfile(2) predates that by quite a bit, and
> the last real change to sendfile was using splice underneath. Which I
> did, and that was probably almost 20 years ago at this point...
>
> I do think it makes sense to have a sendfile that's both fast and
> efficient, and can be used sanely with buffer reuse without relying on
> odd heuristics.
Just trying to tie this together in my head -- are you saying that you
think the kernel internals of sendfile could be changed in a different
way, or that this is a userland problem (and they should use the
io_uring wrapper you suggested above)?

> >>> I would also argue that there are likely user apps out there that
> >>> use both sendmsg MSG_ZEROCOPY for certain writes (for data in
> >>> memory) and also use sendfile (for data on disk). One example would
> >>> be a reverse proxy that might write HTTP headers to clients via
> >>> sendmsg but transmit the response body with sendfile.
> >>>
> >>> For those apps, the code to check the error queue already exists for
> >>> sendmsg + MSG_ZEROCOPY, so swapping in sendfile2 seems like an easy
> >>> way to ensure safe sendfile usage.
> >>
> >> Sure that is certainly possible. I didn't say that wasn't the case,
> >> rather that the error queue approach is a work-around in the first place
> >> for not having some kind of async notification mechanism for when it's
> >> free to reuse.
> >
> > Of course, I certainly agree that the error queue is a work-around.
> > But it works, apps use it, and it's fairly well known. I don't see any
> > reason, other than historical context, why sendmsg can use this
> > mechanism, splice can, but sendfile shouldn't?
>
> My argument would be the same as for other features - if you can do it
> simpler this other way, why not consider that? The end result would be
> the same, you can do fast sendfile() with sane buffer reuse. But the
> kernel side would be simpler, which is always a main goal for
> those of us that have to maintain it.
>
> Just adding sendfile2() works in the sense that it's an easier drop-in
> replacement for an app, though the error queue side does mean it needs
> to change anyway - it's not just replacing one syscall with another. And
> if we want to be lazy, sure that's fine. I just don't think it's the
> best way to do it when we literally have a mechanism that's designed for
> this and works with reuse already with normal send zc (and receive side
> too, in the next kernel).

It seems like you've answered the question I asked above and that you
are suggesting there might be a better and simpler sendfile2 kernel-side
implementation that doesn't rely on splice internals at all. Am I
following you?

If so, I'll drop the sendfile2 stuff from this series and stick with the
splice changes only, if you are (at a high level) OK with the idea of
adding a flag for this to splice.

In the meantime, I'll take a few more reads through the iouring code to
see if I can work out how sendfile2 might be built on top of that
instead of splice in the kernel.

Thank you very much for your time, feedback, and attention,
Joe