Hi all,

On 6/4/24 3:27 PM, Miklos Szeredi wrote:
> On Tue, 4 Jun 2024 at 03:57, Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
>
>> IIUC, there are two sources that may cause deadlock:
>> 1) the fuse server needs memory allocation when processing FUSE_WRITE
>> requests, which in turn triggers direct memory reclaim, and FUSE
>> writeback then - deadlock here
>
> Yep, see the folio_wait_writeback() call deep in the guts of direct
> reclaim, which sleeps until the PG_writeback flag is cleared. If that
> happens to be triggered by the writeback in question, then that's a
> deadlock.

After diving deep into the direct reclaim code, here are some insights
that may be helpful.

Back when support for fuse writeback was introduced, i.e. commit
3be5a52b30aa ("fuse: support writable mmap") in v2.6.26, direct reclaim
indeed waited unconditionally for the PG_writeback flag to be cleared.
At that time direct reclaim was implemented in a two-stage style:
stage 1) pass over the LRU list to start parallel writeback
asynchronously, and stage 2) synchronously wait for completion of the
writeback previously started.

This two-stage design and the unconditional wait for the PG_writeback
flag to be cleared were removed by commit 41ac199 ("mm: vmscan: do not
stall on writeback during memory compaction") in v3.5.

Though the direct reclaim logic has continued to evolve and the waiting
was added back, the stall now happens only when the reclaim is
triggered from kswapd or a memory cgroup.

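As a rough reading aid, the writeback branch of shrink_folio_list() can
be modeled in userspace as below. This is only a simplified sketch
based on a reading of mm/vmscan.c in recent kernels, not the kernel
code itself. The struct, enum and function names are made up for
illustration, and the boolean fields are assumed stand-ins for
current_is_kswapd(), PG_reclaim, PGDAT_WRITEBACK,
writeback_throttling_sane() and may_enter_fs().

```c
#include <stdbool.h>

/* Hypothetical outcome of meeting a folio that is under writeback. */
enum wb_action {
	WB_SKIP_IMMEDIATE,	/* kswapd: skip the folio, caller may throttle later */
	WB_SKIP_MARK_RECLAIM,	/* set PG_reclaim, skip the folio, no sleeping */
	WB_WAIT,		/* sleep in folio_wait_writeback() */
};

/* Hypothetical snapshot of the state the kernel checks. */
struct wb_ctx {
	bool is_kswapd;		/* stands in for current_is_kswapd() */
	bool folio_reclaim;	/* PG_reclaim already set on the folio */
	bool node_writeback;	/* PGDAT_WRITEBACK set on the node */
	bool throttling_sane;	/* global reclaim, or memcg on cgroup_v2 */
	bool may_enter_fs;	/* allocation context permits FS/IO */
};

static enum wb_action writeback_action(const struct wb_ctx *c)
{
	/* kswapd seeing recycled writeback folios: count and move on. */
	if (c->is_kswapd && c->folio_reclaim && c->node_writeback)
		return WB_SKIP_IMMEDIATE;

	/* Global reclaim, cgroup_v2, first encounter, or no FS/IO allowed. */
	if (c->throttling_sane || !c->folio_reclaim || !c->may_enter_fs)
		return WB_SKIP_MARK_RECLAIM;

	/* Legacy (cgroup_v1) memcg reclaim with FS/IO permitted. */
	return WB_WAIT;
}
```

In this model only a reclaimer with throttling_sane == false, i.e. one
running under a cgroup_v1 memory cgroup limit, can reach WB_WAIT and
sleep on the folio itself; kswapd is only throttled later by its
caller.
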
Specifically, the stall only happens under the following conditions
(see shrink_folio_list() for details):
1) kswapd
2) or a user process under a non-root memory cgroup (actually
   cgroup_v1) with GFP_IO permitted

Thus the potential deadlock does not actually exist (if I'm not wrong)
if:
1) cgroup is not enabled
2) or cgroup_v2 is actually used
3) or (memory cgroup is enabled and attached to cgroup_v1) the fuse
   server actually resides under the root cgroup
4) or (the fuse server resides under a non-root memory cgroup_v1) the
   fuse server advertises itself as a PR_IO_FLUSHER[1]

Then we could consider adding a new feature bit indicating that one of
the above conditions is met and thus the fuse server is safe from the
potential deadlock inside direct reclaim. When this feature bit is
set, the kernel side could bypass the temp page copying when doing
writeback.

As for condition 4 (PR_IO_FLUSHER), there was a concern from
Miklos[2]. I think the new feature bit could be disabled by default,
and enabled only when the fuse server itself guarantees that it is in a
safe distribution condition. Even if it is enabled, either by mistake
or by a malicious fuse server, and thus causes a deadlock, maybe the
sysadmin could still abort the connection through the fusectl abort
knob?

Just some insights and brainstorming here.

[1] https://lore.kernel.org/all/Zl4%2FOAsMiqB4LO0e@xxxxxxxxxxxxxxxxxxx/
[2] https://lore.kernel.org/all/CAJfpegvYpWuTbKOm1hoySHZocY+ki07EzcXBUX8kZx92T8W6uQ@xxxxxxxxxxxxxx/

--
Thanks,
Jingbo