On 2/12/25 2:54 PM, Caleb Sander wrote:
> On Wed, Feb 12, 2025 at 12:55 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
>>
>> On 2/12/25 1:45 PM, Caleb Sander Mateos wrote:
>>> In our application issuing NVMe passthru commands, we have observed
>>> nvme_uring_cmd fields being corrupted between when userspace initializes
>>> the io_uring SQE and when nvme_uring_cmd_io() processes it.
>>>
>>> We hypothesized that the uring_cmd's were executing asynchronously after
>>> the io_uring_enter() syscall returned, yet were still reading the SQE in
>>> the userspace-mapped SQ. Since io_uring_enter() had already incremented
>>> the SQ head index, userspace reused the SQ slot for a new SQE once the
>>> SQ wrapped around to it.
>>>
>>> We confirmed this hypothesis by "poisoning" all SQEs up to the SQ head
>>> index in userspace upon return from io_uring_enter(). By overwriting the
>>> nvme_uring_cmd nsid field with a known garbage value, we were able to
>>> trigger the err message in nvme_validate_passthru_nsid(), which logged
>>> the garbage nsid value.
>>>
>>> The issue is caused by commit 5eff57fa9f3a ("io_uring/uring_cmd: defer
>>> SQE copying until it's needed"). With this commit reverted, the poisoned
>>> values in the SQEs are no longer seen by nvme_uring_cmd_io().
>>>
>>> Prior to the commit, each uring_cmd SQE was unconditionally memcpy()ed
>>> to async_data at prep time. The commit moved this memcpy() to 2 cases
>>> when the request goes async:
>>> - If REQ_F_FORCE_ASYNC is set to force the initial issue to go async
>>> - If ->uring_cmd() returns -EAGAIN in the initial non-blocking issue
>>>
>>> This patch set fixes a bug in the EAGAIN case where the uring_cmd's sqe
>>> pointer is not updated to point to async_data after the memcpy(),
>>> as it correctly is in the REQ_F_FORCE_ASYNC case.
>>>
>>> However, uring_cmd's can be issued async in other cases not enumerated
>>> by 5eff57fa9f3a, also leading to SQE corruption. These include requests
>>> besides the first in a linked chain, which are only issued once prior
>>> requests complete. Requests waiting for a drain to complete would also
>>> be initially issued async.
>>>
>>> While it's probably possible for io_uring_cmd_prep_setup() to check for
>>> each of these cases and avoid deferring the SQE memcpy(), we feel it
>>> might be safer to revert 5eff57fa9f3a to avoid the corruption risk.
>>> As discussed recently in regard to the ublk zero-copy patches[1], new
>>> async paths added in the future could break these delicate assumptions.
>>
>> I don't think it's particularly delicate - did you manage to catch the
>> case queueing a request for async execution where the sqe wasn't already
>> copied? I did take a quick look after our out-of-band conversation, and
>> the only missing bit I immediately spotted is using SQPOLL. But I don't
>> think you're using that, right? And in any case, lifetime of SQEs with
>> SQPOLL is the duration of the request anyway, so should not pose any
>> risk of overwriting SQEs. But I do think the code should copy for that
>> case too, just to avoid it being a harder-to-use thing than it should
>> be.
>
> Yes, even with the EAGAIN case fixed, nvme_validate_passthru_nsid() is
> still catching the poisoned nsids. However, the log lines now come
> from the application thread rather than the iou-wrk thread.
> Indeed, we are not using SQPOLL. But we are using IOSQE_SQE_GROUP from
> Ming's SQE group patch set[1]. Like IOSQE_IO_LINK, subsequent
> operations in a group are issued only once the first completes. The
> first operation in the group is a UBLK_IO_PROVIDE_IO_BUF from Ming's
> other patch set[2], which should complete synchronously. The
> completion should be processed in __io_submit_flush_completions() ->
> io_free_batch_list() and queue the remaining grouped operations to be
> issued in task work. And all the pending task work should be processed
> during io_uring_enter()'s return to userspace.
> But some NVMe passthru operations must be going async because they are
> observing the poisoned values the application thread writes into the
> io_uring SQEs after io_uring_enter() returns. We don't yet have an
> explanation for how they end up being issued async; ftrace shows that
> in the typical case, all the NVMe passthru operations in the group are
> issued during task work before io_uring_enter() returns to userspace.
> Perhaps a pending signal could short-circuit the task work processing?
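
As a rough illustration of the poisoning approach described above (not
the reporter's actual code), something like the following liburing-based
sketch could be used to scribble over every SQE slot the kernel has
already consumed. It assumes liburing 2.3+ and a ring created with
IORING_SETUP_SQE128 for NVMe passthru; poison_consumed_sqes() and the
0xa5 pattern are made up for illustration:

#include <string.h>
#include <liburing.h>

/* Poison every SQE slot the kernel has consumed since 'old_head' (the
 * value of *ring->sq.khead recorded before the last io_uring_submit()).
 * liburing fills the SQ array 1:1, so SQ index == SQE index here.
 */
static void poison_consumed_sqes(struct io_uring *ring, unsigned old_head)
{
	struct io_uring_sq *sq = &ring->sq;
	unsigned head = io_uring_smp_load_acquire(sq->khead);
	unsigned i;

	for (i = old_head; i != head; i++) {
		/* with IORING_SETUP_SQE128, each slot is two io_uring_sqe wide */
		struct io_uring_sqe *sqe = &sq->sqes[(i & sq->ring_mask) << 1];

		/* scribble over the 80-byte passthru command area (sqe->cmd),
		 * which holds the nvme_uring_cmd, including its nsid */
		memset(sqe->cmd, 0xa5, 80);
	}
}

Any request still reading its SQE out of the shared ring after this
point would then observe the 0xa5 pattern, e.g. as the garbage nsid
rejected by nvme_validate_passthru_nsid().
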
I think it'd be a good idea to move testing to the newer patchset, as
the sqe grouping was never really fully vetted or included. It's not
impossible that it's a side effect of that, even though it very well
could be a generic issue that exists already without those patches. In
any case, that'd move us all to a closer-to-upstream test base, without
having older patches that aren't upstream in the mix.

How are the rings set up for what you're testing? I'm specifically
thinking of the task_work related flags. Is it using DEFER_TASKRUN, or
is it using the older signal-style task_work?

-- 
Jens Axboe
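
For reference, the two ring setups being asked about would look roughly
like this with liburing; the IORING_SETUP_* flags are the real ones, but
setup_ring() and its defer_tw parameter are only for illustration:

#include <stdbool.h>
#include <liburing.h>

/* Illustrative only: with DEFER_TASKRUN, task_work runs when the
 * (single) submitting task enters io_uring_enter(); without it,
 * task_work is delivered via the older signal-style (TWA_SIGNAL)
 * notification. SQE128/CQE32 are required for NVMe passthru either way.
 */
static int setup_ring(struct io_uring *ring, unsigned entries, bool defer_tw)
{
	struct io_uring_params p = { };

	p.flags = IORING_SETUP_SQE128 | IORING_SETUP_CQE32;
	if (defer_tw)
		p.flags |= IORING_SETUP_DEFER_TASKRUN |
			   IORING_SETUP_SINGLE_ISSUER;

	return io_uring_queue_init_params(entries, ring, &p);
}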