On 5/27/21 10:32 AM, Marco Elver wrote:
> On Wed, May 26, 2021 at 09:31PM +0100, Pavel Begunkov wrote:
>> On 5/26/21 5:36 PM, Marco Elver wrote:
>>> On Wed, 26 May 2021 at 18:29, Pavel Begunkov <asml.silence@xxxxxxxxx> wrote:
>>>> On 5/26/21 4:52 PM, Marco Elver wrote:
>>>>> Due to some moving around of code, the patch lost the actual fix (using
>>>>> atomically read io_wq) -- so here it is again ... hopefully as intended.
>>>>> :-)
>>>>
>>>> "fortify" damn it... It was synchronised with &ctx->uring_lock
>>>> before, see io_uring_try_cancel_iowq() and io_uring_del_tctx_node(),
>>>> so it should not be cleared before *del_tctx_node().
>>>
>>> Ah, so if I understand right, the property stated by the comment in
>>> io_uring_try_cancel_iowq() was broken, and your patch below would fix
>>> that, right?
>>
>> "io_uring: fortify tctx/io_wq cleanup" broke it, and the diff
>> should fix it.
>>
>>>> The fix should just move it after this sync point. Will you send
>>>> it out as a patch?
>>>
>>> Do you mean your move of the write to io_wq goes on top of the patch I
>>> proposed? (If so, please also leave your Signed-off-by so I can squash
>>> it.)
>>
>> No, only my diff, but you hinted at what has happened, so I would
>> prefer you to take care of patching. If you want, of course.
>>
>> To be entirely fair, assuming that aligned pointer
>> reads can't be torn, I don't see any _real_ problem. But the report
>> is certainly very helpful, and the current state is too wonky, so
>> it should be patched.
>
> In the current version, it is a problem if we end up with a double read,
> as in the current C code. The compiler might of course optimize
> it into one read into a register.

Absolutely agree on that.

> Tangent: I avoid reasoning in terms of compiler optimizations where
> I can. :-) It is a slippery slope if the code in question isn't
> tolerant to data races by design (examples are stats counting, or other
> heuristics -- which is certainly not the case here).
> Therefore, my wish is that we really ought to resolve as many data races
> as we can (and mark intentional ones appropriately), so that we're
> left with only the interesting cases, like the one here. (More
> background if you're interested: https://lwn.net/Articles/816850/)
>
> The problem here, however, has a nicer resolution, as you suggested.
>
>> TL;DR:
>> The synchronisation goes like this: it's usually used by the owner
>> task, and the owner task deletes it, so it is mostly naturally
>> synchronised. An exception is a worker (among others) that accesses
>> it for cancellation purposes, but it uses it only under ->uring_lock,
>> so if removal also takes the lock it should be fine. See
>> io_uring_del_tctx_node() locking.
>
> Did you mean io_uring_del_task_file()? There is no
> io_uring_del_tctx_node().

Ah, yes, that's from patches I sent for next.

-- 
Pavel Begunkov
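
P.S. For readers following along: a minimal standalone sketch of the double-read
hazard discussed above and the single marked-read fix. This is not the actual
io_uring code; READ_ONCE() is given a toy userspace definition, and struct
io_tctx and do_cancel() are hypothetical names that only mirror the shape of
tctx->io_wq and the cancellation call.

/*
 * Sketch only, under the assumptions stated above.
 */
#include <stddef.h>

/* Toy stand-in for the kernel's READ_ONCE(): a single volatile load. */
#define READ_ONCE(x)	(*(const volatile __typeof__(x) *)&(x))

struct io_wq;				/* opaque here */
struct io_tctx {			/* hypothetical, mirrors tctx->io_wq */
	struct io_wq *io_wq;		/* may be set to NULL by the owner task */
};

static void do_cancel(struct io_wq *wq)	/* stand-in for the cancellation call */
{
	(void)wq;
}

/* Racy shape: ->io_wq is loaded twice; the second load may already see NULL. */
static void cancel_racy(struct io_tctx *tctx)
{
	if (tctx->io_wq)		/* load #1 */
		do_cancel(tctx->io_wq);	/* load #2: may observe NULL by now */
}

/* Fixed shape: one marked load into a local, then test and use only the local. */
static void cancel_fixed(struct io_tctx *tctx)
{
	struct io_wq *wq = READ_ONCE(tctx->io_wq);

	if (wq)
		do_cancel(wq);
}

The locking discussed in the thread (clearing io_wq only under ->uring_lock,
after the del_tctx_node/del_task_file sync point) removes the race itself;
the single marked read additionally keeps the reader from ever acting on a
pointer it did not test.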