On 7/3/20 7:52 PM, Jens Axboe wrote: > On 7/3/20 7:13 PM, Andres Freund wrote: >> Hi, >> >> On July 3, 2020 5:48:21 PM PDT, Jens Axboe <axboe@xxxxxxxxx> wrote: >>> On 7/3/20 6:15 PM, Andres Freund wrote: >>>> Hi, >>>> >>>> On 2020-07-03 17:00:49 -0700, Andres Freund wrote: >>>>> I haven't yet fully analyzed the problem, but after updating to >>>>> cdd3bb54332f82295ed90cd0c09c78cd0c0ee822 io_uring using postgres >>> does >>>>> not work reliably anymore. >>>>> >>>>> The symptom is that io_uring_enter(IORING_ENTER_GETEVENTS) isn't >>>>> interrupted by signals anymore. The signal handler executes, but >>>>> afterwards the syscall is restarted. Previously io_uring_enter >>> reliably >>>>> returned EINTR in that case. >>>>> >>>>> Currently postgres relies on signals interrupting io_uring_enter(). >>> We >>>>> probably can find a way to not do so, but it'd not be entirely >>> trivial. >>>>> >>>>> I suspect the issue is >>>>> >>>>> commit ce593a6c480a22acba08795be313c0c6d49dd35d (tag: >>> io_uring-5.8-2020-07-01, linux-block/io_uring-5.8) >>>>> Author: Jens Axboe <axboe@xxxxxxxxx> >>>>> Date: 2020-06-30 12:39:05 -0600 >>>>> >>>>> io_uring: use signal based task_work running >>>>> >>>>> as that appears to have changed the error returned by >>>>> io_uring_enter(GETEVENTS) after having been interrupted by a signal >>> from >>>>> EINTR to ERESTARTSYS. >>>>> >>>>> >>>>> I'll check to make sure that the issue doesn't exist before the >>> above >>>>> commit. >>>> >>>> Indeed, on cd77006e01b3198c75fb7819b3d0ff89709539bb the PG issue >>> doesn't >>>> exist, which pretty much confirms that the above commit is the issue. >>>> >>>> What was the reason for changing EINTR to ERESTARTSYS in the above >>>> commit? I assume trying to avoid returning spurious EINTRs to >>> userland? >>> >>> Yeah, for when it's running task_work. I wonder if something like the >>> below will do the trick? >>> >>> >>> diff --git a/fs/io_uring.c b/fs/io_uring.c >>> index 700644a016a7..0efa73d78451 100644 >>> --- a/fs/io_uring.c >>> +++ b/fs/io_uring.c >>> @@ -6197,11 +6197,11 @@ static int io_cqring_wait(struct io_ring_ctx >>> *ctx, int min_events, >>> do { >>> prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, >>> TASK_INTERRUPTIBLE); >>> - /* make sure we run task_work before checking for signals */ >>> - if (current->task_works) >>> - task_work_run(); >>> if (signal_pending(current)) { >>> - ret = -ERESTARTSYS; >>> + if (current->task_works) >>> + ret = -ERESTARTSYS; >>> + else >>> + ret = -EINTR; >>> break; >>> } >>> if (io_should_wake(&iowq, false)) >>> @@ -6210,7 +6210,7 @@ static int io_cqring_wait(struct io_ring_ctx >>> *ctx, int min_events, >>> } while (1); >>> finish_wait(&ctx->wait, &iowq.wq); >>> >>> - restore_saved_sigmask_unless(ret == -ERESTARTSYS); >>> + restore_saved_sigmask_unless(ret == -EINTR); >>> >>> return READ_ONCE(rings->cq.head) == READ_ONCE(rings->cq.tail) ? ret : >>> 0; >>> } >> >> I'll try in a bit. Suspect however that there'd be trouble if there >> were both an actual signal and task work pending? > > Yes, I have that worry too. We'd really need to check if we have an > actual signal pending - if we do, we still do -EINTR. If not, then we > just do -ERESTARTSYS and restart the system call after task_work has > been completed. Half-assed approach below, I suspect this won't _really_ > work without appropriate locking. Which would be unfortunate. > > Either that, or we'd need to know if an actual signal was delivered when > we get re-entered due to returning -ERESTARTSYS. If it was just > task_work being run, then we're fine. But if an actual signal was > pending, then we'd need to return -EINTR. > > CC'ing Oleg to see if he has any good ideas here. This might be simpler: diff --git a/fs/io_uring.c b/fs/io_uring.c index 700644a016a7..8f54fd5085bf 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -6197,11 +6197,11 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, do { prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE); - /* make sure we run task_work before checking for signals */ - if (current->task_works) - task_work_run(); if (signal_pending(current)) { - ret = -ERESTARTSYS; + if (!sigisemptyset(¤t->pending.signal)) + ret = -EINTR; + else + ret = -ERESTARTSYS; break; } if (io_should_wake(&iowq, false)) @@ -6210,7 +6210,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, } while (1); finish_wait(&ctx->wait, &iowq.wq); - restore_saved_sigmask_unless(ret == -ERESTARTSYS); + restore_saved_sigmask_unless(ret == -EINTR); return READ_ONCE(rings->cq.head) == READ_ONCE(rings->cq.tail) ? ret : 0; } -- Jens Axboe