Re: [PATCH V2] io_uring: fix IO hang in io_wq_put_and_exit from do_exit()

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Sep 08, 2023 at 02:03:11AM +0100, Pavel Begunkov wrote:
> On 9/7/23 16:36, Jens Axboe wrote:
> > On 9/1/23 9:13 AM, Jens Axboe wrote:
> > > 
> > > On Fri, 01 Sep 2023 21:49:16 +0800, Ming Lei wrote:
> > > > io_wq_put_and_exit() is called from do_exit(), but all FIXED_FILE requests
> > > > in io_wq aren't canceled in io_uring_cancel_generic() called from do_exit().
> > > > Meantime io_wq IO code path may share resource with normal iopoll code
> > > > path.
> > > > 
> > > > So if any HIPRI request is submittd via io_wq, this request may not get resouce
> > > > for moving on, given iopoll isn't possible in io_wq_put_and_exit().
> > > > > > [...]
> > > 
> > > Applied, thanks!
> > > 
> > > [1/1] io_uring: fix IO hang in io_wq_put_and_exit from do_exit()
> > >        commit: b484a40dc1f16edb58e5430105a021e1916e6f27
> > 
> > This causes a regression with the test/thread-exit.t test case, as it's
> > canceling requests from other tasks as well. I will drop this patch for
> > now.
> 
> And this one has never hit my mailbox... Anyway, I'm confused with
> the issue:
> 
> 1) request tracking is done only for io_uring polling io_uring, which

request tracking isn't done on FIXED_FILE IO, which is used by t/io_uring.

> shouldn't be the case for t/io_uring, so it's probably unrelated?

So io_uring_try_cancel_requests() won't be called because
tctx_inflight() returns zero.

> 
> 2) In case of iopoll, io-wq only submits a request but doesn't wait/poll
> for it. If io_issue_sqe() returned -EAGAIN or an error, the request is
> considered not yet submitted to block and can be just cancelled normally
> without any dancing like io_iopoll_try_reap_events().

io_issue_sqe() may never return since IO_URING_F_NONBLOCK isn't set
for iopoll, and recently polled IO doesn't imply nowait in commit
2bc057692599 ("block: don't make REQ_POLLED imply REQ_NOWAIT"), this
commit is actually fragile, cause it is easy to cause io hang if
iopoll & submission is done in same context.

This one should be easy to address, either the following change
or revert 2bc057692599 ("block: don't make REQ_POLLED imply
REQ_NOWAIT").

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index ad636954abae..d8419689ad3e 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1930,6 +1930,9 @@ void io_wq_submit_work(struct io_wq_work *work)
                }
        }

+       if ((req->ctx->flags & IORING_SETUP_IOPOLL) && def->iopoll_queue)
+               issue_flags |= IO_URING_F_NONBLOCK;
+
        do {
                ret = io_issue_sqe(req, issue_flags);
                if (ret != -EAGAIN)

> 
> 
> 3) If we condense the code it sounds like it effectively will be
> like this:
> 
> void io_wq_exit_start(struct io_wq *wq)
> {
> 	set_bit(IO_WQ_BIT_EXIT, &wq->state);
> }
> 
> io_uring_cancel_generic()
> {
> 	if (tctx->io_wq)
> 		io_wq_exit_start(tctx->io_wq);
> 	io_uring_clean_tctx(tctx);
> 	...
> }
> 
> We set the flag, interrupt workers (TIF_NOTIFY_SIGNAL + wake up), and
> wait for them. Workers are woken up (or just return), see
> the flag and return. At least that's how it was intended to work.
> 
> What's missing? Racing for IO_WQ_BIT_EXIT? Not breaking on IO_WQ_BIT_EXIT
> correctly? Not honouring / clearing TIF_NOTIFY_SIGNAL?
> 
> I'll try to repro later

After applying your patch of 9256b9371204 ("io_uring: break out of iowq
iopoll on teardown") and the above patch, the issue still can be triggered,
and the worker is keeping to call io_issue_sqe() for ever, and
io_wq_worker_stopped() returns false. So do_exit() isn't called on
t/io_uring pthread yet, meantime I guess either iopoll is terminated or not
get chance to run.


Thanks,
Ming




[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux