Re: Canceled read requests never completed

Florian Fischer <florian.fl.fischer@xxxxxx> · Wed, 19 Jan 2022 14:42:22 +0100

On 18.01.2022 19:32, Jens Axboe wrote:
> On 1/18/22 4:36 PM, Jens Axboe wrote:
> > On 1/18/22 1:05 PM, Florian Fischer wrote:
> >>>> After reading the io_uring_enter(2) man page a IORING_OP_ASYNC_CANCEL's return value of -EALREADY apparently
> >>>> may not cause the request to terminate. At least that is our interpretation of "…res field will contain -EALREADY.
> >>>> In this case, the request may or may not terminate."
> >>>
> >>> I took a look at this, and my theory is that the request cancelation
> >>> ends up happening right in between when the work item is moved between
> >>> the work list and to the worker itself. The way the async queue works,
> >>> the work item is sitting in a list until it gets assigned by a worker.
> >>> When that assignment happens, it's removed from the general work list
> >>> and then assigned to the worker itself. There's a small gap there where
> >>> the work cannot be found in the general list, and isn't yet findable in
> >>> the worker itself either.
> >>>
> >>> Do you always see -ENOENT from the cancel when you get the hang
> >>> condition?
> >>
> >> No we also and actually more commonly observe cancel returning
> >> -EALREADY and the canceled read request never gets completed.
> >>
> >> As shown in the log snippet I included below.
> > 
> > I think there are a couple of different cases here. Can you try the
> > below patch? It's against current -git.
> 
> Cleaned it up and split it into functional bits, end result is here:
> 
> https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-5.17

Thanks. I have build and tested your patches.

The most common error we observed (read -> read -> write -> 2x cancel)
is no longer reproducible and our originally test case works flawless :)

Nor could I reproduce any hangs with cancel returning -ENOENT.

But I still can reliably reproduce stuck threads when not incrementing the evfd
count and thus never completing reads due to available data to read.
(read -> read -> write (do not increment evfd count) -> 2x cancel)

I further reduced the attached C program to reproduce the above problem.
The code is also available now at our gitlab [1].

The following log output was created with a less 'minimal' version still including
log functionality:

  75 Collect write completion: 8
  75 Collect cancel read 1 completion: 0
  75 Collect cancel read 2 completion: -114
  75 Collect read 1 completion: -125
  75 Collect read 2 completion: -4

  75 Collect write completion: 8
  75 Collect cancel read 1 completion: 0
  75 Collect cancel read 2 completion: -114
  75 Collect read 1 completion: -125
  thread 75 stuck here

The scenario seams extremely artificial but non or less I think it should
work regardless of its usefulness.

Flo Fischer

[1]: https://gitlab.cs.fau.de/aj46ezos/io_uring-cancel/-/tree/minimal-write0
/**
 * io_uring cancel miss reproducer
 * Copyright 2022 Florian Fischer <f.fischer@xxxxxxxxx>
 *
 * io_urings's IORING_OP_ASYNC_CANCEL sometimes does not cancel inflight requests.
 *
 * Tested systems:
 * - Xeon E7-4830 Debian 11
 *   - liburing 2.1
 *   - reproducable on: linux-block/io_uring-5.17
 *
 * Building the reproducer:
 *   $ gcc -Werror -Wall -g -O3 io_uring-cancel.c -o io_uring-cancel -pthread -luring
 *
 * Steps to reproduce on the above system:
 * 1. Compile the reproducer
 * 2. Run the reproducer in a loop
 *   $ for i in $(seq 100); do echo $i; ./io_uring-cancel; done
 *
 *   (I need < ~30 runs to reproduce a stuck thread with 96 threads)
 *
 * Explanation:
 * The reproducer opens #CPUs event file descriptors and starts #CPUs threads.
 * Each thread creates an io_uring and enters a loop.
 * In each iteration the thread submits a read request from evfds[thread_id].
 * Submits a second read request from evfds[(thread_id + 1) % #CPUs].
 * And issues a write request for evfds[(thread_id + 2) % #CPUs] but the write
 * request does not increase the eventfds count thus it will not complete any read.
 * The write request is awaited.
 * Then two cancel requests for the reads are submitted and the outstanding
 * requests are awaited.
 *
 * The reproducer gets stuck because a cancel request returning with -EALREADY
 * did not properly cancel the read the read request is never completed.
 *
 * WhenpPassing a non-zero value to the reproducer as write increment
 *   $ ./io_uring-cancel 1
 * I could not longer reproduce stuck threads.
 */
#define _GNU_SOURCE
#include <assert.h>
#include <err.h>
#include <errno.h>
#include <liburing.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/eventfd.h>
#include <sys/sysinfo.h>
#include <unistd.h>

#define IORING_ENTRIES 2
#define ITERATIONS 1000

_Thread_local size_t thread_id;

static pthread_t* threads;
static int* evfds;
static unsigned nthreads;
static unsigned write_increment = 0;

static struct io_uring_sqe* checked_get_sqe(struct io_uring* ring) {
	struct io_uring_sqe* sqe = io_uring_get_sqe(ring);
	assert(sqe);
	return sqe;
}

static void prep_evfd_read(struct io_uring* ring, int evfd, uint64_t* buf, uintptr_t tag) {
	struct io_uring_sqe* sqe = checked_get_sqe(ring);

	io_uring_prep_read(sqe, evfds[evfd], buf, sizeof(*buf), 0);
	io_uring_sqe_set_data(sqe, (void*)tag);
}

static void prep_evfd_write(struct io_uring* ring, int evfd, uint64_t* buf) {
	struct io_uring_sqe* sqe = checked_get_sqe(ring);

	io_uring_prep_write(sqe, evfds[evfd], buf, sizeof(*buf), 0);
}

static void prep_cancel(struct io_uring* ring, uintptr_t tag) {
	struct io_uring_sqe* sqe = checked_get_sqe(ring);

	io_uring_prep_cancel(sqe, (void*)tag, 0);
}

static void checked_submit_n(struct io_uring* ring, int n) {
	int res = io_uring_submit(ring);
	if (res < 0) err(EXIT_FAILURE, "io_uring_submit failed");
	if (res != n) errx(EXIT_FAILURE, "io_uring_submit submitted less sqes than preprared");
}

static void checked_submit_one(struct io_uring* ring) {
	checked_submit_n(ring, 1);
}

static void collect_n_completions(struct io_uring* ring, int n) {
	struct io_uring_cqe* cqe;
	for (int i = 0; i < n; ++i) {
		int res = io_uring_wait_cqe(ring, &cqe);
		if (res) err(EXIT_FAILURE, "io_uring_wait_cqe failed");
		io_uring_cqe_seen(ring, cqe);
	}
}

void* thread_func(void* arg) {
	thread_id = (uintptr_t)(arg);

	struct io_uring ring;
	int res = io_uring_queue_init(IORING_ENTRIES, &ring, 0);
	if (res) err(EXIT_FAILURE, "io_uring_queue_init failed");

	for (unsigned i = 0; i < ITERATIONS; ++i) {
		uint64_t rbuf = 0;
		uint64_t wbuf = write_increment;

		prep_evfd_read(&ring, thread_id, &rbuf, 1);
		checked_submit_one(&ring);

		prep_evfd_read(&ring, (thread_id + 1) % nthreads, &rbuf, 2);
		checked_submit_one(&ring);

		prep_evfd_write(&ring, (thread_id + 2) % nthreads, &wbuf);
		checked_submit_one(&ring);
		collect_n_completions(&ring, 1);

		prep_cancel(&ring, 1);
		prep_cancel(&ring, 2);
		checked_submit_n(&ring, 2);
		collect_n_completions(&ring, 4);
	}

	return NULL;
}

int main(int argc, char* argv[]) {
	if (argc > 1)
		write_increment = atoi(argv[1]);

	nthreads = get_nprocs();
	// nthreads = 10;
	evfds = malloc(sizeof(int) * nthreads);
	if (!evfds) err(EXIT_FAILURE, "malloc failed");

	for (unsigned i = 0; i < nthreads; ++i) {
		evfds[i] = eventfd(0, 0);
		if(!evfds[i]) err(EXIT_FAILURE, "eventfd failed");
	}

	threads = malloc(sizeof(pthread_t) * nthreads);
	if (!threads) err(EXIT_FAILURE, "malloc failed");

	for (unsigned i = 0; i < nthreads; ++i) {
		errno = pthread_create(&threads[i], NULL, thread_func, (void*)(uintptr_t)i);
		if (errno) err(EXIT_FAILURE, "pthread_create failed");
	}

	for (unsigned i = 0; i < nthreads; ++i) {
		errno = pthread_join(threads[i], NULL);
		if (errno) err(EXIT_FAILURE, "pthread_join failed");
	}

	exit(EXIT_SUCCESS);
}