Re: [RFC PATCH] io_uring: add a memory barrier before atomic_read

Jackie Liu <liuyun01@xxxxxxxxxx> · Fri, 19 Jul 2019 08:44:27 +0800

> 在 2019年7月19日，00:43，Jens Axboe <axboe@xxxxxxxxx> 写道：
> 
> On 7/18/19 9:41 AM, Jens Axboe wrote:
>> On 7/18/19 6:44 AM, Zhengyuan Liu wrote:
>>> There is a hang issue while using fio to do some basic test. The issue can
>>> been easily reproduced using bellow scripts:
>>> 
>>>          while true
>>>          do
>>>                  fio  --ioengine=io_uring  -rw=write -bs=4k -numjobs=1 \
>>>                       -size=1G -iodepth=64 -name=uring   --filename=/dev/zero
>>>          done
>>> 
>>> After serveral minutes, maybe more, fio would block at
>>> io_uring_enter->io_cqring_wait in order to waiting for previously committed
>>> sqes to be completed and cann't return to user anymore until we send a SIGTERM
>>> to fio. After got SIGTERM, fio turns to hang at io_ring_ctx_wait_and_kill with
>>> a backtrace like this:
>>> 
>>>          [54133.243816] Call Trace:
>>>          [54133.243842]  __schedule+0x3a0/0x790
>>>          [54133.243868]  schedule+0x38/0xa0
>>>          [54133.243880]  schedule_timeout+0x218/0x3b0
>>>          [54133.243891]  ? sched_clock+0x9/0x10
>>>          [54133.243903]  ? wait_for_completion+0xa3/0x130
>>>          [54133.243916]  ? _raw_spin_unlock_irq+0x2c/0x40
>>>          [54133.243930]  ? trace_hardirqs_on+0x3f/0xe0
>>>          [54133.243951]  wait_for_completion+0xab/0x130
>>>          [54133.243962]  ? wake_up_q+0x70/0x70
>>>          [54133.243984]  io_ring_ctx_wait_and_kill+0xa0/0x1d0
>>>          [54133.243998]  io_uring_release+0x20/0x30
>>>          [54133.244008]  __fput+0xcf/0x270
>>>          [54133.244029]  ____fput+0xe/0x10
>>>          [54133.244040]  task_work_run+0x7f/0xa0
>>>          [54133.244056]  do_exit+0x305/0xc40
>>>          [54133.244067]  ? get_signal+0x13b/0xbd0
>>>          [54133.244088]  do_group_exit+0x50/0xd0
>>>          [54133.244103]  get_signal+0x18d/0xbd0
>>>          [54133.244112]  ? _raw_spin_unlock_irqrestore+0x36/0x60
>>>          [54133.244142]  do_signal+0x34/0x720
>>>          [54133.244171]  ? exit_to_usermode_loop+0x7e/0x130
>>>          [54133.244190]  exit_to_usermode_loop+0xc0/0x130
>>>          [54133.244209]  do_syscall_64+0x16b/0x1d0
>>>          [54133.244221]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>> 
>>> The reason is that we had added a req to ctx->pending_async at the very end, but
>>> it got no chance to be processed anymore. How could this be happened?
>>> 
>>>          fio#cpu0                                        wq#cpu1
>>> 
>>>          io_add_to_prev_work                    io_sq_wq_submit_work
>>> 
>>>            atomic_read() <<< 1
>>> 
>>>                                                    atomic_dec_return() << 1->0
>>>                                                    list_empty();    <<< true;
>>> 
>>>            list_add_tail()
>>>            atomic_read() << 0 or 1?
>>> 
>>> As was said in atomic_ops.rst, atomic_read does not guarantee that the runtime
>>> initialization by any other thread is visible yet, so we must take care of that
>>> with a proper implicit or explicit memory barrier;
>> 
>> Thanks for looking at this and finding this issue, it does looks like a problem.
>> But I'm not sure about the fix. Shouldn't we just need an smp_mb__after_atomic()
>> on the atomic_dec_return() side of things? Like the below.
>> 
>> 
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 5ec06e5ba0be..3c2a6f88a6b0 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -1881,6 +1881,7 @@ static void io_sq_wq_submit_work(struct work_struct *work)
>>  	 */
>>  	if (async_list) {
>>  		ret = atomic_dec_return(&async_list->cnt);
>> +		smp_mb__after_atomic();
>>  		while (!ret && !list_empty(&async_list->list)) {
>>  			spin_lock(&async_list->lock);
>>  			atomic_inc(&async_list->cnt);
>> @@ -1894,6 +1895,7 @@ static void io_sq_wq_submit_work(struct work_struct *work)
>>  				goto restart;
>>  			}
>>  			ret = atomic_dec_return(&async_list->cnt);
>> +			smp_mb__after_atomic();
>>  		}
>>  	}
>> 
>> 
> 
> I don't think this is enough, I actually think your fix is the most
> appropriate. I will apply it, thank you!
> 

Actually, although we can passed test use smp_mb(), but in the end we still do not
understand where the race conditions are, could you explain it. If it is said as 
Zhengyuan, because of atomic_read, I think we should only need smp_rmb. but failed.
smp_rmb can't help us pass the test. At the same time, we have tried smp_wmb, failed too.
it seems that only smp_mb works correctly.

Is it because list_add_tail requires smp_wmb and atomic_read requires smp_rmb? 

--
Jackie Liu