Re: signals not reliably interrupting io_uring_enter anymore

Jens Axboe <axboe@xxxxxxxxx> · Sat, 4 Jul 2020 08:55:39 -0600

On 7/3/20 8:56 PM, Jens Axboe wrote:
> On 7/3/20 8:08 PM, Jens Axboe wrote:
>> On 7/3/20 7:52 PM, Jens Axboe wrote:
>>> On 7/3/20 7:13 PM, Andres Freund wrote:
>>>> Hi, 
>>>>
>>>> On July 3, 2020 5:48:21 PM PDT, Jens Axboe <axboe@xxxxxxxxx> wrote:
>>>>> On 7/3/20 6:15 PM, Andres Freund wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 2020-07-03 17:00:49 -0700, Andres Freund wrote:
>>>>>>> I haven't yet fully analyzed the problem, but after updating to
>>>>>>> cdd3bb54332f82295ed90cd0c09c78cd0c0ee822 io_uring using postgres
>>>>> does
>>>>>>> not work reliably anymore.
>>>>>>>
>>>>>>> The symptom is that io_uring_enter(IORING_ENTER_GETEVENTS) isn't
>>>>>>> interrupted by signals anymore. The signal handler executes, but
>>>>>>> afterwards the syscall is restarted. Previously io_uring_enter
>>>>> reliably
>>>>>>> returned EINTR in that case.
>>>>>>>
>>>>>>> Currently postgres relies on signals interrupting io_uring_enter().
>>>>> We
>>>>>>> probably can find a way to not do so, but it'd not be entirely
>>>>> trivial.
>>>>>>>
>>>>>>> I suspect the issue is
>>>>>>>
>>>>>>> commit ce593a6c480a22acba08795be313c0c6d49dd35d (tag:
>>>>> io_uring-5.8-2020-07-01, linux-block/io_uring-5.8)
>>>>>>> Author: Jens Axboe <axboe@xxxxxxxxx>
>>>>>>> Date:   2020-06-30 12:39:05 -0600
>>>>>>>
>>>>>>>     io_uring: use signal based task_work running
>>>>>>>
>>>>>>> as that appears to have changed the error returned by
>>>>>>> io_uring_enter(GETEVENTS) after having been interrupted by a signal
>>>>> from
>>>>>>> EINTR to ERESTARTSYS.
>>>>>>>
>>>>>>>
>>>>>>> I'll check to make sure that the issue doesn't exist before the
>>>>> above
>>>>>>> commit.
>>>>>>
>>>>>> Indeed, on cd77006e01b3198c75fb7819b3d0ff89709539bb the PG issue
>>>>> doesn't
>>>>>> exist, which pretty much confirms that the above commit is the issue.
>>>>>>
>>>>>> What was the reason for changing EINTR to ERESTARTSYS in the above
>>>>>> commit? I assume trying to avoid returning spurious EINTRs to
>>>>> userland?
>>>>>
>>>>> Yeah, for when it's running task_work. I wonder if something like the
>>>>> below will do the trick?
>>>>>
>>>>>
>>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>>> index 700644a016a7..0efa73d78451 100644
>>>>> --- a/fs/io_uring.c
>>>>> +++ b/fs/io_uring.c
>>>>> @@ -6197,11 +6197,11 @@ static int io_cqring_wait(struct io_ring_ctx
>>>>> *ctx, int min_events,
>>>>> 	do {
>>>>> 		prepare_to_wait_exclusive(&ctx->wait, &iowq.wq,
>>>>> 						TASK_INTERRUPTIBLE);
>>>>> -		/* make sure we run task_work before checking for signals */
>>>>> -		if (current->task_works)
>>>>> -			task_work_run();
>>>>> 		if (signal_pending(current)) {
>>>>> -			ret = -ERESTARTSYS;
>>>>> +			if (current->task_works)
>>>>> +				ret = -ERESTARTSYS;
>>>>> +			else
>>>>> +				ret = -EINTR;
>>>>> 			break;
>>>>> 		}
>>>>> 		if (io_should_wake(&iowq, false))
>>>>> @@ -6210,7 +6210,7 @@ static int io_cqring_wait(struct io_ring_ctx
>>>>> *ctx, int min_events,
>>>>> 	} while (1);
>>>>> 	finish_wait(&ctx->wait, &iowq.wq);
>>>>>
>>>>> -	restore_saved_sigmask_unless(ret == -ERESTARTSYS);
>>>>> +	restore_saved_sigmask_unless(ret == -EINTR);
>>>>>
>>>>> 	return READ_ONCE(rings->cq.head) == READ_ONCE(rings->cq.tail) ? ret :
>>>>> 0;
>>>>> }
>>>>
>>>> I'll try in a bit. Suspect however that there'd be trouble if there
>>>> were both an actual signal and task work pending?
>>>
>>> Yes, I have that worry too. We'd really need to check if we have an
>>> actual signal pending - if we do, we still do -EINTR. If not, then we
>>> just do -ERESTARTSYS and restart the system call after task_work has
>>> been completed. Half-assed approach below, I suspect this won't _really_
>>> work without appropriate locking. Which would be unfortunate.
>>>
>>> Either that, or we'd need to know if an actual signal was delivered when
>>> we get re-entered due to returning -ERESTARTSYS. If it was just
>>> task_work being run, then we're fine. But if an actual signal was
>>> pending, then we'd need to return -EINTR.
>>>
>>> CC'ing Oleg to see if he has any good ideas here.
>>
>> This might be simpler:
> 
> Or... That's it for today, I'll check in after the weekend.

This tests out fine for me, and it avoids TWA_SIGNAL if we're not using
an eventfd.

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 700644a016a7..d37d7ea5ebe5 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4072,14 +4072,22 @@ struct io_poll_table {
 	int error;
 };
 
-static int io_req_task_work_add(struct io_kiocb *req, struct callback_head *cb,
-				int notify)
+static int io_req_task_work_add(struct io_kiocb *req, struct callback_head *cb)
 {
 	struct task_struct *tsk = req->task;
-	int ret;
+	struct io_ring_ctx *ctx = req->ctx;
+	int ret, notify = TWA_RESUME;
 
-	if (req->ctx->flags & IORING_SETUP_SQPOLL)
+	/*
+	 * SQPOLL kernel thread doesn't need notification, just a wakeup.
+	 * If we're not using an eventfd, then TWA_RESUME is always fine,
+	 * as we won't have dependencies between request completions for
+	 * other kernel wait conditions.
+	 */
+	if (ctx->flags & IORING_SETUP_SQPOLL)
 		notify = 0;
+	else if (ctx->cq_ev_fd)
+		notify = TWA_SIGNAL;
 
 	ret = task_work_add(tsk, cb, notify);
 	if (!ret)
@@ -4110,7 +4118,7 @@ static int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *poll,
 	 * of executing it. We can't safely execute it anyway, as we may not
 	 * have the needed state needed for it anyway.
 	 */
-	ret = io_req_task_work_add(req, &req->task_work, TWA_SIGNAL);
+	ret = io_req_task_work_add(req, &req->task_work);
 	if (unlikely(ret)) {
 		WRITE_ONCE(poll->canceled, true);
 		tsk = io_wq_get_task(req->ctx->io_wq);
@@ -6201,7 +6209,14 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 		if (current->task_works)
 			task_work_run();
 		if (signal_pending(current)) {
-			ret = -ERESTARTSYS;
+			if (current->jobctl & JOBCTL_TASK_WORK) {
+				spin_lock_irq(&current->sighand->siglock);
+				current->jobctl &= ~JOBCTL_TASK_WORK;
+				recalc_sigpending();
+				spin_unlock_irq(&current->sighand->siglock);
+				continue;
+			}
+			ret = -EINTR;
 			break;
 		}
 		if (io_should_wake(&iowq, false))
@@ -6210,7 +6225,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 	} while (1);
 	finish_wait(&ctx->wait, &iowq.wq);
 
-	restore_saved_sigmask_unless(ret == -ERESTARTSYS);
+	restore_saved_sigmask_unless(ret == -EINTR);
 
 	return READ_ONCE(rings->cq.head) == READ_ONCE(rings->cq.tail) ? ret : 0;
 }

-- 
Jens Axboe