On Sat, Jul 20, 2019 at 2:27 AM Jens Axboe <axboe@xxxxxxxxx> wrote:
>
> On 7/18/19 6:54 PM, Zhengyuan Liu wrote:
> >
> > On 7/19/19 12:43 AM, Jens Axboe wrote:
> >> On 7/18/19 9:41 AM, Jens Axboe wrote:
> >>> On 7/18/19 6:44 AM, Zhengyuan Liu wrote:
> >>>> There is a hang issue while using fio to do some basic tests. The
> >>>> issue can be easily reproduced with the script below:
> >>>>
> >>>> while true
> >>>> do
> >>>>     fio --ioengine=io_uring -rw=write -bs=4k -numjobs=1 \
> >>>>         -size=1G -iodepth=64 -name=uring --filename=/dev/zero
> >>>> done
> >>>>
> >>>> After several minutes, maybe more, fio blocks at
> >>>> io_uring_enter->io_cqring_wait, waiting for previously committed
> >>>> sqes to complete, and can't return to user space until we send it a
> >>>> SIGTERM. After getting SIGTERM, fio hangs in
> >>>> io_ring_ctx_wait_and_kill with a backtrace like this:
> >>>>
> >>>> [54133.243816] Call Trace:
> >>>> [54133.243842]  __schedule+0x3a0/0x790
> >>>> [54133.243868]  schedule+0x38/0xa0
> >>>> [54133.243880]  schedule_timeout+0x218/0x3b0
> >>>> [54133.243891]  ? sched_clock+0x9/0x10
> >>>> [54133.243903]  ? wait_for_completion+0xa3/0x130
> >>>> [54133.243916]  ? _raw_spin_unlock_irq+0x2c/0x40
> >>>> [54133.243930]  ? trace_hardirqs_on+0x3f/0xe0
> >>>> [54133.243951]  wait_for_completion+0xab/0x130
> >>>> [54133.243962]  ? wake_up_q+0x70/0x70
> >>>> [54133.243984]  io_ring_ctx_wait_and_kill+0xa0/0x1d0
> >>>> [54133.243998]  io_uring_release+0x20/0x30
> >>>> [54133.244008]  __fput+0xcf/0x270
> >>>> [54133.244029]  ____fput+0xe/0x10
> >>>> [54133.244040]  task_work_run+0x7f/0xa0
> >>>> [54133.244056]  do_exit+0x305/0xc40
> >>>> [54133.244067]  ? get_signal+0x13b/0xbd0
> >>>> [54133.244088]  do_group_exit+0x50/0xd0
> >>>> [54133.244103]  get_signal+0x18d/0xbd0
> >>>> [54133.244112]  ? _raw_spin_unlock_irqrestore+0x36/0x60
> >>>> [54133.244142]  do_signal+0x34/0x720
> >>>> [54133.244171]  ? exit_to_usermode_loop+0x7e/0x130
> >>>> [54133.244190]  exit_to_usermode_loop+0xc0/0x130
> >>>> [54133.244209]  do_syscall_64+0x16b/0x1d0
> >>>> [54133.244221]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >>>>
> >>>> The reason is that we added a req to ctx->pending_async at the very
> >>>> end, but it got no chance to be processed anymore. How could this
> >>>> happen?
> >>>>
> >>>> fio#cpu0                          wq#cpu1
> >>>>
> >>>> io_add_to_prev_work               io_sq_wq_submit_work
> >>>>
> >>>>   atomic_read() <<< 1
> >>>>
> >>>>                                     atomic_dec_return() << 1->0
> >>>>                                     list_empty()         <<< true
> >>>>
> >>>>   list_add_tail()
> >>>>   atomic_read() << 0 or 1?
> >>>>
> >>>> As atomic_ops.rst says, atomic_read does not guarantee that the
> >>>> runtime initialization by any other thread is visible yet, so we
> >>>> must take care of that with a proper implicit or explicit memory
> >>>> barrier.
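To make the ordering requirement concrete, here is a standalone userspace
model of the two code paths. The names mirror fs/io_uring.c, but the bodies
are simplified sketches built on C11 atomics (the fence stands in for
smp_mb(), atomic_fetch_sub for atomic_dec_return), not the kernel code; a
single run will not reliably reproduce the race, it only shows where the
barrier belongs. Build with: cc -O2 -pthread race_sketch.c

/*
 * race_sketch.c — userspace model of the ordering bug described above.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_int cnt = 1;           /* async_list->cnt: one worker active */
static atomic_bool list_nonempty;    /* stand-in for async_list->list */

/* fio#cpu0 side: hand a req to the worker that is still running */
static void *submitter(void *arg)
{
	(void)arg;
	/* list_add_tail(&req->list, &list->list) */
	atomic_store_explicit(&list_nonempty, true, memory_order_relaxed);
	/*
	 * The smp_mb() under discussion: order the list insertion above
	 * against the cnt read below. Without it, this thread can still
	 * see cnt == 1 after the worker has dropped it to 0 and found
	 * the list empty, so nobody ever runs the queued req.
	 */
	atomic_thread_fence(memory_order_seq_cst);
	if (atomic_load_explicit(&cnt, memory_order_relaxed) == 0)
		printf("worker gone; submitter must requeue the req itself\n");
	else
		printf("worker still active; req handed off\n");
	return NULL;
}

/* wq#cpu1 side: tail of the worker, as in io_sq_wq_submit_work() */
static void *worker(void *arg)
{
	(void)arg;
	/* atomic_dec_return(&list->cnt): a fully ordered RMW */
	if (atomic_fetch_sub(&cnt, 1) == 1 &&
	    !atomic_load_explicit(&list_nonempty, memory_order_relaxed))
		printf("worker exits: cnt hit 0 with an empty list\n");
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, submitter, NULL);
	pthread_create(&b, NULL, worker, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}
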
> >>> > >>> > >>> diff --git a/fs/io_uring.c b/fs/io_uring.c > >>> index 5ec06e5ba0be..3c2a6f88a6b0 100644 > >>> --- a/fs/io_uring.c > >>> +++ b/fs/io_uring.c > >>> @@ -1881,6 +1881,7 @@ static void io_sq_wq_submit_work(struct work_struct *work) > >>> */ > >>> if (async_list) { > >>> ret = atomic_dec_return(&async_list->cnt); > >>> + smp_mb__after_atomic(); > >>> while (!ret && !list_empty(&async_list->list)) { > >>> spin_lock(&async_list->lock); > >>> atomic_inc(&async_list->cnt); > >>> @@ -1894,6 +1895,7 @@ static void io_sq_wq_submit_work(struct work_struct *work) > >>> goto restart; > >>> } > >>> ret = atomic_dec_return(&async_list->cnt); > >>> + smp_mb__after_atomic(); > >>> } > >>> } > >>> > >>> > >> I don't think this is enough, I actually think your fix is the most > >> appropriate. I will apply it, thank you! > >> > > > > Hi, Jens. > > I have tested you fix and the issue still existed. Actually the > > implementation of atomic_dec_return has been implicitly surrounded > > already by mb() and as I know, smp_mb__after/before_atomic are not > > suitable for atomic_t operation which does not return a value. > > We aren't guaranteed to see the atomic_dec_return() update if it happens > at the same time. So we can either force ordering with the smp_mb(), or > we can do something ala: > > if (!atomic_sub_return(0, &list->cnt)) { > ... > > io_add_to_prev_work() to achieve the same sort of effect. That should > work as well. Yeah, but I'd prefer smp_mb(), since atomic_sub_return(0, &list->cnt) isn't such clear. Thanks. > > -- > Jens Axboe >