On 1/18/22 8:13 AM, Florian Fischer wrote:
> Hello,
>
> during our research on entangling io_uring and parallel runtime systems one of our test
> cases results in situations where an `IORING_OP_ASYNC_CANCEL` request can not find (-ENOENT)
> or not cancel (-EALREADY) a previously submitted read of an event file descriptor.
> However, the previously submitted read also never generates a CQE.
> We now wonder if this is a bug in the kernel, or, at least in the case of -EALREADY, works as intended.
> Our current architecture expects that a request eventually creates a CQE when canceled.
>
>
> # Reproducer pseudo-code:
>
> create N eventfds
> create N threads
>
> thread_function:
>   create thread-private io_uring queue pair
>   for (i = 0, i < ITERATIONS, i++)
>     submit read from eventfd n
>     submit read from eventfd (n + 1) % N
>     submit write to eventfd (n + 2) % N
>     await completions until the write completion was reaped
>     submit cancel requests for the two read requests
>     await all outstanding requests (minus a possible already completed read request)
>
> Note that:
> - Each eventfd is read twice but only written once.
> - The read requests are canceled independently of their state.
> - There are five io_uring requests per loop iteration
>
>
> # Expectation
>
> Each of the five submitted requests should be completed:
> * Write is always successful because writing to an eventfd only blocks
>   if the counter reaches 0xfffffffffffffffe and we add only 1 in each iteration.
>   Furthermore the read from the file descriptor resets the counter to 0.
> * The cancel requests are always completed with different return values
>   dependent on the state of the read request to cancel.
> * The read requests should always be completed either because some data is available
>   to read or because they are canceled.
>
>
> # Observation:
>
> Sometimes threads block in io_uring_enter forever because one read request
> is never completed and the cancel of such read returned with -ENOENT or -EALREADY.
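The eventfd behavior that the expectation above relies on can be checked
without io_uring. The following is a minimal sketch, not part of the attached
reproducer: the function name check_eventfd_semantics() is made up for
illustration, and EFD_NONBLOCK is used only so that a read which would block
returns EAGAIN instead of hanging.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Hypothetical helper, not taken from the reproducer.
 * Returns 0 if the eventfd behaves as the expectation assumes,
 * or a negative step number identifying the first check that failed.
 */
int check_eventfd_semantics(void)
{
    uint64_t val = 0;
    const uint64_t one = 1;
    int ret = 0;

    /* EFD_NONBLOCK turns "would block" into EAGAIN so the sketch cannot hang. */
    int fd = eventfd(0, EFD_NONBLOCK);
    if (fd < 0)
        return -1;

    /* Counter is 0: a read has nothing to return. */
    if (read(fd, &val, sizeof(val)) != -1 || errno != EAGAIN)
        ret = -2;
    /* Each write adds 1 to the counter and reports 8 bytes written. */
    else if (write(fd, &one, sizeof(one)) != 8 ||
             write(fd, &one, sizeof(one)) != 8)
        ret = -3;
    /* A single read returns the accumulated count ... */
    else if (read(fd, &val, sizeof(val)) != 8 || val != 2)
        ret = -4;
    /* ... and resets the counter to 0, so the next read would block again. */
    else if (read(fd, &val, sizeof(val)) != -1 || errno != EAGAIN)
        ret = -5;

    close(fd);
    return ret;
}
```

Calling check_eventfd_semantics() from a main() and checking for a return
value of 0 is enough to run it. Since one read drains the whole counter, an
eventfd that is read twice but written once can satisfy at most one of the
two reads, which is why the other read depends on the cancel.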
>
> A C program to reproduce this situation is attached.
> It contains the essence of the previously mentioned test case with instructions
> how to compile and execute it.
>
> The following log excerpt was generated using a version of the reproducer
> where each write adds 0 to the eventfd count and thus does not complete read requests.
> This means all read requests should be canceled and all cancel requests should either
> return with 0 (the request was found and canceled) or -EALREADY (the read is already
> in execution and should be interrupted).
>
> 0 Prepared read request (evfd: 0, tag: 1)
> 0 Submitted 1 requests -> 1 inflight
> 0 Prepared read request (evfd: 1, tag: 2)
> 0 Submitted 1 requests -> 2 inflight
> 0 Prepared write request (evfd: 2)
> 0 Submitted 1 requests -> 3 inflight
> 0 Collect write completion: 8
> 0 Prepared cancel request for 1
> 0 Prepared cancel request for 2
> 0 Submitted 2 requests -> 4 inflight
> 0 Collect read 1 completion: -125 - Operation canceled
> 0 Collect cancel read 1 completion: 0
> 0 Collect cancel read 2 completion: -2 - No such file or directory
>
> Thread 0 blocks forever because the second read could not be
> canceled (-ENOENT in the last line) but no completion is ever created for it.
>
> The far more common situation with the reproducer and adding 1 to the eventfds in each loop
> is that a request is not canceled and the cancel attempt returned with -EALREADY.
> There is no progress because the writer has already finished its loop and the cancel
> apparently does not really cancel the request.
>
> 1 Starting iteration 996
> 1 Prepared read request (evfd: 1, tag: 1)
> 1 Submitted 1 requests -> 1 inflight
> 1 Prepared read request (evfd: 2, tag: 2)
> 1 Submitted 1 requests -> 2 inflight
> 1 Prepared write request (evfd: 0)
> 1 Submitted 1 requests -> 3 inflight
> 1 Collect write completion: 8
> 1 Prepared cancel request for read 1
> 1 Prepared cancel request for read 2
> 1 Submitted 2 requests -> 4 inflight
> 1 Collect read 1 completion: -125 - Operation canceled
> 1 Collect cancel read 1 completion: 0
> 1 Collect cancel read 2 completion: -114 - Operation already in progress
>
> After reading the io_uring_enter(2) man page an IORING_OP_ASYNC_CANCEL's return value of -EALREADY apparently
> may not cause the request to terminate. At least that is our interpretation of "…res field will contain -EALREADY.
> In this case, the request may or may not terminate."
>
> I could reliably reproduce the behavior on different hardware, linux versions
> from 5.9 to 5.16 as well as liburing versions 0.7 and 2.1.
>
> With linux 5.6 I was not able to reproduce this cancel miss.
>
> So is the situation we see intended behavior of the API or is it a faulty race in the
> io_uring cancel code?
> If it is intended then it becomes really hard to build reliable abstractions
> using io_uring's cancellation.
> We would really like to have the invariant that a canceled io_uring operation eventually
> generates a cqe, either completed or canceled/interrupted.

I took a look at this, and my theory is that the request cancelation ends
up happening right in between when the work item is moved from the work
list to the worker itself. The way the async queue works, the work item
is sitting in a list until it gets assigned to a worker. When that
assignment happens, it's removed from the general work list and then
assigned to the worker itself. There's a small gap there where the work
cannot be found in the general list, and isn't yet findable in the worker
itself either.
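
To make the suspected window concrete, here is a deliberately simplified,
single-threaded toy model of the handoff described above. It is a sketch
under assumptions, not the kernel's io-wq code: struct toy_wq and all toy_*
functions are hypothetical names, the queue holds at most one item, and the
"race" is simulated by calling the cancel lookup between the two handoff
steps.

```c
#include <errno.h>

/* Toy model only: these names are hypothetical, not the kernel's io-wq code. */
struct toy_wq {
    int pending; /* id waiting in the general work list, 0 if empty */
    int running; /* id currently assigned to the worker, 0 if idle */
};

/* Look the request up in both places: general list first, then the worker. */
int toy_cancel(struct toy_wq *wq, int id)
{
    if (wq->pending == id) {
        wq->pending = 0;   /* still queued: remove it, cancel succeeds */
        return 0;
    }
    if (wq->running == id)
        return -EALREADY;  /* already executing on the worker */
    return -ENOENT;        /* found in neither place */
}

/* Handoff step 1: the item is removed from the general work list. */
int toy_dequeue(struct toy_wq *wq)
{
    int id = wq->pending;
    wq->pending = 0;
    return id;
}

/* Handoff step 2: the item becomes visible as the worker's current work. */
void toy_assign(struct toy_wq *wq, int id)
{
    wq->running = id;
}

/* A cancel that arrives between step 1 and step 2. Returns the cancel result. */
int toy_cancel_in_gap(void)
{
    struct toy_wq wq = { .pending = 1, .running = 0 };
    int id = toy_dequeue(&wq);
    int res = toy_cancel(&wq, 1);  /* neither list nor worker holds id 1 */
    toy_assign(&wq, id);           /* the work still starts afterwards */
    return res;
}
```

In this model toy_cancel_in_gap() returns -ENOENT even though the item goes
on to run, while a cancel issued before the dequeue returns 0 and one issued
after the assignment returns -EALREADY. That matches the -ENOENT case from
the first log; it says nothing about whether the real code behaves this way.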

Do you always see -ENOENT from the cancel when you get the hang condition?

I'll play with this a bit and see if we can't close this hole so the work
is always reliably discoverable (and hence can get canceled).

-- 
Jens Axboe