On Tue, Feb 11, 2025 at 09:11:10PM -0700, Keith Busch wrote:
> On Wed, Feb 12, 2025 at 10:49:15AM +0800, Ming Lei wrote:
> > On Mon, Feb 10, 2025 at 04:56:44PM -0800, Keith Busch wrote:
> > > From: Keith Busch <kbusch@xxxxxxxxxx>
> > >
> > > Provide new operations for the user to request mapping an active request
> > > to an io uring instance's buf_table. The user has to provide the index
> > > it wants to install the buffer.
> > >
> > > A reference count is taken on the request to ensure it can't be
> > > completed while it is active in a ring's buf_table.
> > >
> > > Signed-off-by: Keith Busch <kbusch@xxxxxxxxxx>
> > > ---
> > >  drivers/block/ublk_drv.c      | 145 +++++++++++++++++++++++++---------
> > >  include/uapi/linux/ublk_cmd.h |   4 +
> > >  2 files changed, 113 insertions(+), 36 deletions(-)
> > >
> > > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > > index 529085181f355..ccfda7b2c24da 100644
> > > --- a/drivers/block/ublk_drv.c
> > > +++ b/drivers/block/ublk_drv.c
> > > @@ -51,6 +51,9 @@
> > >  /* private ioctl command mirror */
> > >  #define UBLK_CMD_DEL_DEV_ASYNC _IOC_NR(UBLK_U_CMD_DEL_DEV_ASYNC)
> > >
> > > +#define UBLK_IO_REGISTER_IO_BUF _IOC_NR(UBLK_U_IO_REGISTER_IO_BUF)
> > > +#define UBLK_IO_UNREGISTER_IO_BUF _IOC_NR(UBLK_U_IO_UNREGISTER_IO_BUF)
> >
> > UBLK_IO_REGISTER_IO_BUF command may be completed, and buffer isn't used
> > by RW_FIXED yet in the following cases:
> >
> > - application doesn't submit any RW_FIXED consumer OP
> >
> > - io_uring_enter() only issued UBLK_IO_REGISTER_IO_BUF, and the other
> > OPs can't be issued because of out of resource
> >
> > ...
> >
> > Then io_uring_enter() returns, and the application is panic or killed,
> > how to avoid buffer leak?
>
> The death of the uring that registered the node tears down the table
> that it's registered with, which releases its reference. All good.

OK, looks like I missed the point.
io_sqe_buffers_unregister() is called from io_ring_ctx_free(), which is
when the registered buffer can be released.

However, it may still cause a use-after-free on a request that has already
been failed from io_uring_try_cancel_uring_cmd(). Please see the following
code path:

io_uring_try_cancel_requests
    io_uring_try_cancel_uring_cmd
        ublk_uring_cmd_cancel_fn
            ublk_abort_requests
                ublk_abort_queue
                    __ublk_fail_req
                        ublk_put_req_ref

The above race needs to be covered.

>
> > It need to deal with in io_uring cancel code for calling ->release() if
> > the kbuffer node isn't released.
>
> There should be no situation here where it isn't released after its use
> is completed. Either the resource was gracefully unregistered or the
> ring close while it was still active, but either one drops its
> reference.
>
> > UBLK_IO_UNREGISTER_IO_BUF still need to call ->release() if the node
> > buffer isn't used.
>
> Only once the last reference is dropped. Which should happen no matter
> which way the node is freed.
>
> > > +static void ublk_io_release(void *priv)
> > > +{
> > > +	struct request *rq = priv;
> > > +	struct ublk_queue *ubq = rq->mq_hctx->driver_data;
> > > +
> > > +	ublk_put_req_ref(ubq, rq);
> > > +}
> >
> > It isn't enough to just get & put request reference here between registering
> > buffer and freeing the registered node buf, because the same reference can be
> > dropped from ublk_commit_completion() which is from queueing
> > UBLK_IO_COMMIT_AND_FETCH_REQ, and buggy app may queue this command multiple
> > times for freeing the request.
> >
> > One solution is to not allow request completion until the ->release() is
> > returned.
>
> Double completions are tricky because the same request id can be reused
> pretty quickly and there's no immediate way to tell if the 2nd
> completion is a double or a genuine completion of the reused request.
>
> We have rotating sequence numbers in the nvme driver to try to detect a
> similar situation.
> So far it hasn't revealed any real bugs as far as I
> know. This feels like the other side screwed up and that's their fault.

This is not the same as nvme, where the controller won't run DMA on the
buffer after the 1st completion.

Here the ublk request buffer has been leased to io_uring for running
read_fixed/write_fixed; in the meantime it is freed and reused by the
kernel for other purposes.

As I mentioned, this can be solved by not allowing the IO command to
complete while the buffer is leased to io_uring.

Thanks,
Ming
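
For illustration, the rule proposed above (refuse to complete the IO
command while its buffer is leased to io_uring) can be modelled in
userspace. This is only a sketch under stated assumptions: struct
model_req and the model_*() helpers are hypothetical names, not ublk
driver code, and a real fix would need proper atomics and locking.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace model, NOT the ublk driver's real structures. */
struct model_req {
	int refs;        /* references currently held on the request */
	bool buf_leased; /* buffer is installed in a ring's buf_table */
	bool completed;  /* request has been completed already */
};

/* Models buffer registration: take a reference and mark the lease. */
static int model_register_buf(struct model_req *rq)
{
	if (rq->completed || rq->buf_leased)
		return -EINVAL;
	rq->refs++;
	rq->buf_leased = true;
	return 0;
}

/* Models the ->release() callback: end the lease, drop its reference. */
static void model_release_buf(struct model_req *rq)
{
	rq->buf_leased = false;
	rq->refs--;
}

/*
 * Models a commit from userspace. While the buffer is leased the commit
 * is rejected, so repeated commits cannot drop the lease's reference.
 */
static int model_commit(struct model_req *rq)
{
	if (rq->buf_leased)
		return -EBUSY;
	if (rq->completed)
		return -EINVAL;
	rq->completed = true;
	rq->refs--;
	return 0;
}
```

In this model, a commit issued between model_register_buf() and
model_release_buf() returns -EBUSY and leaves the request untouched, so
a buggy app queueing the commit command repeatedly cannot free a request
whose buffer is still in a ring's buf_table.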