On 10/12/24 7:08 AM, Joanne Koong wrote: > On Fri, Sep 13, 2024 at 1:55 PM Joanne Koong <joannelkoong@xxxxxxxxx> wrote: >> >> On Thu, Sep 12, 2024 at 8:35 PM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote: >>> >>> On 9/13/24 7:18 AM, Joanne Koong wrote: >>>> On Wed, Sep 11, 2024 at 2:32 AM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote: >>>>> >>>>> Hi all, >>>>> >>>>> On 6/4/24 3:27 PM, Miklos Szeredi wrote: >>>>>> On Tue, 4 Jun 2024 at 03:57, Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote: >>>>>> >>>>>>> IIUC, there are two sources that may cause deadlock: >>>>>>> 1) the fuse server needs memory allocation when processing FUSE_WRITE >>>>>>> requests, which in turn triggers direct memory reclaim, and FUSE >>>>>>> writeback then - deadlock here >>>>>> >>>>>> Yep, see the folio_wait_writeback() call deep in the guts of direct >>>>>> reclaim, which sleeps until the PG_writeback flag is cleared. If that >>>>>> happens to be triggered by the writeback in question, then that's a >>>>>> deadlock. >>>>> >>>>> After diving deep into the direct reclaim code, there are some insights >>>>> may be helpful. >>>>> >>>>> Back to the time when the support for fuse writeback is introduced, i.e. >>>>> commit 3be5a52b30aa ("fuse: support writable mmap") since v2.6.26, the >>>>> direct reclaim indeed unconditionally waits for PG_writeback flag being >>>>> cleared. At that time the direct reclaim is implemented in a two-stage >>>>> style, stage 1) pass over the LRU list to start parallel writeback >>>>> asynchronously, and stage 2) synchronously wait for completion of the >>>>> writeback previously started. >>>>> >>>>> This two-stage design and the unconditionally waiting for PG_writeback >>>>> flag being cleared is removed by commit 41ac199 ("mm: vmscan: do not >>>>> stall on writeback during memory compaction") since v3.5. >>>>> >>>>> Though the direct reclaim logic continues to evolve and the waiting is >>>>> added back, now the stall will happen only when the direct reclaim is >>>>> triggered from kswapd or memory cgroup. >>>>> >>>>> Specifically the stall will only happen in following certain conditions >>>>> (see shrink_folio_list() for details): >>>>> 1) kswapd >>>>> 2) or it's a user process under a non-root memory cgroup (actually >>>>> cgroup_v1) with GFP_IO permitted >>>>> >>>>> Thus the potential deadlock does not exist actually (if I'm not wrong) if: >>>>> 1) cgroup is not enabled >>>>> 2) or cgroup_v2 is actually used >>>>> 3) or (memory cgroup is enabled and is attached upon cgroup_v1) the fuse >>>>> server actually resides under the root cgroup >>>>> 4) or (the fuse server resides under a non-root memory cgroup_v1), but >>>>> the fuse server advertises itself as a PR_IO_FLUSHER[1] >>>>> >>>>> >>>>> Then we could considering adding a new feature bit indicating that any >>>>> one of the above condition is met and thus the fuse server is safe from >>>>> the potential deadlock inside direct reclaim. When this feature bit is >>>>> set, the kernel side could bypass the temp page copying when doing >>>>> writeback. >>>>> >>>> >>>> Hi Jingbo, thanks for sharing your analysis of this. >>>> >>>> Having the temp page copying gated on the conditions you mentioned >>>> above seems a bit brittle to me. My understanding is that the mm code >>>> for when it decides to stall or not stall can change anytime in the >>>> future, in which case that seems like it could automatically break our >>>> precondition assumptions. >>> >>> So this is why PR_IO_FLUSHER is introduced here, which is specifically >>> for user space components playing a role in IO stack, e.g. fuse daemon, >>> tcmu/nbd daemon, etc. PR_IO_FLUSHER offers guarantee similar to >>> GFP_NOIO, but for user space components. At least we can rely on the >>> assumption that mm would take PR_IO_FLUSHER into account. >>> >>> The limitation of the PR_IO_FLUSHER approach is that, as pointed by >>> Miklos[1], there may be multiple components or services involved to >>> service the fuse requests, and the kernel side has no effective way to >>> check if all services in the whole chain have set PR_IO_FLUSHER. >>> >> >> Right, so doesn't that still bring us back to the original problem >> where if we gate this on any of the one conditions being enough to >> bypass needing the temp page, if the conditions change anytime in the >> future in the mm code, then this would automatically open up the >> potential deadlock in fuse as a byproduct? That seems a bit brittle to >> me to have this dependency. >> > > Hi Jingbo, > > I had some talks with Josef about this during/after LPC and he came up > with the idea of adding a flag to the 'struct > address_space_operations' to indicate that a folio under writeback > should be skipped during reclaim if it gets to that 3rd case of the > legacy cgroupv1 encountering a folio that has been marked for reclaim. > imo this seems like the most elegant solution and allows us to remove > the temporary folio and rb tree entirely. I sent out a patch for this > https://lore.kernel.org/linux-fsdevel/20241011223434.1307300-1-joannelkoong@xxxxxxxxx/ > that I cc'ed you on, the benchmarks show roughly a 20% improvement in > throughput for 4k block size writes and a 40% improvement for 1M block > size writes. I'm curious to see if this speeds up writeback for you as > well on the workloads you're running. > Yeah I just saw your post this morning. It would be best if the mm folks are happy with the modification to the memory reclaiming code. I haven't tested your patch yet, I will test it later. But at least my previous test of dropping the temp page copying shows ~100% 1M write bandwidth improvement (4GB/s->9GB/s), though it was tested upon Bernd's passthrough_hp [1] which bypasses all buffer copying (i.e. the daemon replies immediately without copying the data buffer). [1] https://github.com/libfuse/libfuse/pull/807 -- Thanks, Jingbo