On 8/20/21 11:46 PM, Jens Axboe wrote:
> On 8/20/21 4:41 PM, Pavel Begunkov wrote:
>> On 8/20/21 11:30 PM, Jens Axboe wrote:
>>> On 8/20/21 4:28 PM, Pavel Begunkov wrote:
>>>> On 8/20/21 11:09 PM, Jens Axboe wrote:
>>>>> On 8/20/21 3:32 PM, Pavel Begunkov wrote:
>>>>>> On 8/20/21 9:39 PM, Hao Xu wrote:
>>>>>>> On 8/21/21 2:59 AM, Pavel Begunkov wrote:
>>>>>>>> On 8/20/21 7:40 PM, Hao Xu wrote:
>>>>>>>>> compl_nr in ctx_flush_and_put() is not protected by uring_lock; this
>>>>>>>>> may cause problems when it is accessed in parallel.
>>>>>>>>
>>>>>>>> Did you hit any problem? It sounds like it should be fine as is:
>>>>>>>>
>>>>>>>> The trick is that it's only responsible for flushing requests added
>>>>>>>> during the execution of the current call to tctx_task_work(), and those
>>>>>>>> are naturally synchronised with the current task. All other potentially
>>>>>>>> enqueued requests are someone else's responsibility.
>>>>>>>>
>>>>>>>> So, if nobody flushed the requests, we're fine and in sync. If we see
>>>>>>>> 0 there but actually enqueued a request, it means someone
>>>>>>>> flushed it after the request had been added.
>>>>>>>>
>>>>>>>> It probably needs a more formal explanation in terms of happens-before
>>>>>>>> and so on.
>>>>>>>
>>>>>>> I should have put more detail in the commit message. The thing is,
>>>>>>> say compl_nr > 0:
>>>>>>>
>>>>>>> ctx_flush_and_put          other context
>>>>>>>   if (compl_nr)
>>>>>>>                              get mutex
>>>>>>>                              compl_nr > 0
>>>>>>>                              do flush
>>>>>>>                                compl_nr = 0
>>>>>>>                              release mutex
>>>>>>>   get mutex
>>>>>>>   do flush (*)
>>>>>>>   release mutex
>>>>>>>
>>>>>>> At (*) we do a bunch of unnecessary work; moreover, we
>>>>>>
>>>>>> I wouldn't worry about the overhead, that shouldn't be much.
>>>>>>
>>>>>>> call io_cqring_ev_posted(), which I think we shouldn't.
>>>>>>
>>>>>> IMHO, users should expect spurious io_cqring_ev_posted(),
>>>>>> though there were some eventfd users complaining before, so
>>>>>> for them we can do
>>>>>
>>>>> It does sometimes cause issues, see:
>>>>
>>>> I'm used to locking ending up in spurious wakeups. It may be
>>>> different for eventfd, but considering that we batch
>>>> completions and so might be calling it only once per multiple
>>>> CQEs, it shouldn't be.
>>>
>>> The wakeups are fine, it's the eventfd increment that's causing some issues.
>>
>> If userspace doesn't expect that the eventfd count may diverge from the
>> number of posted CQEs, we need something like below. The weird part
>> is that nobody seems to have complained about this one, even though it
>> should be happening pretty often.
>
> That wasn't the issue we ran into; it was more the fact that eventfd
> would indicate that something had been posted when nothing had.
> We don't need eventfd notifications to be == the number of posted events,
> just that if the eventfd notification is incremented, there should be new
> events there.

It's just so commonly mentioned that, for me, expecting spurious
events/wakeups is the default. Do we have it documented anywhere?

-- 
Pavel Begunkov