On 4/9/20 11:50 AM, Oleg Nesterov wrote: > On 04/08, Jens Axboe wrote: >> >> So the question remains, we basically have this: >> >> A B >> task_work_run(tsk) >> task_work_add(tsk, io_poll_task_func()) >> process cbs >> wait_for_completion() >> >> with the last wait needing to flush the work added on the B side, since >> that isn't part of the initial list. > > I don't understand you, even remotely :/ > > maybe you can write some pseudo-code ? Sorry, I guess my explanation skills aren't stellar, or perhaps they assume too much prior knowledge about this. I just wasted about a day on debugging this because I had Peter's patch applied, and I should have reviewed that more carefully. So a lot of the hangs were due to just missing the running of some of the work. > who does wait_for_completion(), a callback? or this "tsk" after it does > task_work_run() ? Who does complete() ? How can this wait_for_completion() > help to flush the work added on the B side? And why do you need to do > something special to flush that work? I'll try to explain this. Each io_uring operation has a request associated with it. Each request grabs a reference to the ctx, and the ctx can't exit before we reach zero references (obviously). When we enter ctx teardown, we wait for us to hit zero references. Each ctx is basically just a file descriptor, which means that we most often end up waiting for zero references off the fput() path. Some requests can spawn task_works work items. If we have ordering issues on the task_works list, we can have the fput() ordered before items that need to complete. These items are what ultimately put the request, so you end up in a situation where you block waiting for requests to complete, but these requests can't complete until the the current work has completed. This is the deadlock. Once I got rid of Peter's patch, I solved this by just making the wait-and-free operation go async. This allows any dependent work to run and complete just fine. Here's the patch: https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.7&id=2baf397719af3a8121fcaba964470950356a4a7a and I could perhaps make this smarter by checking if current->task_works != NULL, but I don't think there's much point to that. The important part is that we complete the fput() task_work without blocking, so the remaining work gets a chance to run. Hope this helps! As mentioned in the commit message, we could have a separate list of task_works for io_uring, which is what I originally proposed. But I can also work around it like in the patch referenced above. -- Jens Axboe