On 6/11/21 8:46 PM, Olivier Langlois wrote:
> the man page says the following:
>
> If succesful, the resulting CQE will have IORING_CQE_F_BUFFER set
> in the flags part of the struct, and the upper IORING_CQE_BUFFER_SHIFT
> bits will contain the ID of the selected buffers.
>
> in order to respect this contract, the buffer is stored back in case of
> an error. There are several reasons to do that:
>
> 1. doing otherwise is counter-intuitive and error-prone (I cannot think
> of a single example of a syscall failing and still require the user to
> free the allocated resources). Especially when the documention
> explicitly mention that this is the behavior to expect.
>
> 2. it is inefficient because the buffer is unneeded since there is no
> data to transfer back to the user and the buffer will need to be
> returned back to io_uring to avoid a leak.
>
> Signed-off-by: Olivier Langlois <olivier@xxxxxxxxxxxxxx>
> ---
>  fs/io_uring.c | 97 +++++++++++++++++++++++++++++++++------------------
>  1 file changed, 64 insertions(+), 33 deletions(-)
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 42380ed563c4..502d7cd81a8c 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
[...]
> +static unsigned int io_put_kbuf(struct io_kiocb *req, struct io_buffer *kbuf,
> +				u16 bgid, long res, unsigned int issue_flags)
>  {
>  	unsigned int cflags;
> +	struct io_ring_ctx *ctx = req->ctx;
> +	struct io_buffer *head;
> +	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
>
>  	cflags = kbuf->bid << IORING_CQE_BUFFER_SHIFT;
>  	cflags |= IORING_CQE_F_BUFFER;
>  	req->flags &= ~REQ_F_BUFFER_SELECTED;
> -	kfree(kbuf);
> +
> +	/*
> +	 * Theoritically, res == 0 could be included as well but that would
> +	 * break the contract established in the man page saying that
> +	 * a buffer is returned if the operation is successful.
> +	 */
> +	if (unlikely(res < 0)) {
> +		io_ring_submit_lock(ctx, !force_nonblock);

io_complete_rw() is called from an IRQ context, so it can't sleep/wait.

> +
> +		lockdep_assert_held(&ctx->uring_lock);
> +
> +		head = xa_load(&ctx->io_buffers, bgid);
> +		if (head) {
> +			list_add_tail(&kbuf->list, &head->list);
> +			cflags = 0;
> +		} else {
> +			INIT_LIST_HEAD(&kbuf->list);
> +			if (!xa_insert(&ctx->io_buffers, bgid, kbuf, GFP_KERNEL))
> +				cflags = 0;
> +		}
> +		io_ring_submit_unlock(ctx, !force_nonblock);
> +	}
> +	if (cflags)
> +		kfree(kbuf);
>  	return cflags;
>  }
>
[...]
> -static inline void io_rw_done(struct kiocb *kiocb, ssize_t ret)
> +static inline void io_rw_done(struct kiocb *kiocb, ssize_t ret,
> +			      unsigned int issue_flags)
>  {
>  	switch (ret) {
>  	case -EIOCBQUEUED:
> @@ -2728,7 +2775,7 @@ static inline void io_rw_done(struct kiocb *kiocb, ssize_t ret)
>  		ret = -EINTR;
>  		fallthrough;
>  	default:
> -		kiocb->ki_complete(kiocb, ret, 0);
> +		kiocb->ki_complete(kiocb, ret, issue_flags);

Don't remember what the second argument of .ki_complete is for, but it
definitely should not be used for issue_flags. E.g. because we get two
different meanings for it depending on the context -- either a result
from the block layer or issue_flags.

-- 
Pavel Begunkov
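
For context on the IRQ-context objection above: in the fs/io_uring.c of
that era, io_ring_submit_lock() is essentially a conditional mutex_lock()
on ctx->uring_lock. The sketch below is reconstructed from memory of
roughly v5.12/v5.13 and may differ in detail from the exact tree the patch
targets, but it shows why calling it from the completion path is a
problem: taking a mutex can sleep, which is not allowed in the softirq/IRQ
context the block layer may invoke io_complete_rw() from.

static void io_ring_submit_lock(struct io_ring_ctx *ctx, bool needs_lock)
{
	/*
	 * Inline submissions already hold uring_lock from the syscall
	 * path; only the detached async-worker case takes it here.
	 */
	if (needs_lock)
		mutex_lock(&ctx->uring_lock);	/* may sleep */
}

static void io_ring_submit_unlock(struct io_ring_ctx *ctx, bool needs_lock)
{
	if (needs_lock)
		mutex_unlock(&ctx->uring_lock);
}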
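
On the .ki_complete point: the callback's second parameter is a second
result value supplied by whoever completes the kiocb, not a flags word.
The excerpt below is again from memory of that kernel generation (exact
fields and helper names such as __io_complete_rw may differ by version);
it illustrates the ambiguity: with the patch, io_complete_rw() would
receive issue_flags when called via io_rw_done(), but a block-layer value
when called for an async completion.

/* include/linux/fs.h */
struct kiocb {
	...
	void (*ki_complete)(struct kiocb *iocb, long ret, long ret2);
	...
};

/* fs/io_uring.c: completion callback installed for async reads/writes */
static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
{
	struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw.kiocb);

	/* res2 is whatever the completing layer passed, not issue_flags */
	__io_complete_rw(req, res, res2, 0);
}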