On Fri, Sep 13, 2024 at 1:55 PM Joanne Koong <joannelkoong@xxxxxxxxx> wrote:
>
> On Thu, Sep 12, 2024 at 8:35 PM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
> >
> > On 9/13/24 7:18 AM, Joanne Koong wrote:
> > > On Wed, Sep 11, 2024 at 2:32 AM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
> > >>
> > >> Hi all,
> > >>
> > >> On 6/4/24 3:27 PM, Miklos Szeredi wrote:
> > >>> On Tue, 4 Jun 2024 at 03:57, Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
> > >>>
> > >>>> IIUC, there are two sources that may cause deadlock:
> > >>>> 1) the fuse server needs memory allocation when processing FUSE_WRITE requests, which in turn triggers direct memory reclaim, and FUSE writeback then - deadlock here
> > >>>
> > >>> Yep, see the folio_wait_writeback() call deep in the guts of direct reclaim, which sleeps until the PG_writeback flag is cleared. If that happens to be triggered by the writeback in question, then that's a deadlock.
> > >>
> > >> After diving deep into the direct reclaim code, there are some insights that may be helpful.
> > >>
> > >> Back when support for fuse writeback was introduced, i.e. commit 3be5a52b30aa ("fuse: support writable mmap") in v2.6.26, direct reclaim indeed unconditionally waited for the PG_writeback flag to be cleared. At that time direct reclaim was implemented in a two-stage style: stage 1) pass over the LRU list and start parallel writeback asynchronously, and stage 2) synchronously wait for completion of the writeback started previously.
> > >>
> > >> This two-stage design, and the unconditional waiting for the PG_writeback flag to be cleared, was removed by commit 41ac199 ("mm: vmscan: do not stall on writeback during memory compaction") in v3.5.
> > >>
> > >> Though the direct reclaim logic has continued to evolve and the waiting was later added back, the stall now happens only when reclaim is triggered from kswapd or from a memory cgroup.
> > >>
> > >> Specifically, the stall will only happen under the following conditions (see shrink_folio_list() for details):
> > >> 1) kswapd
> > >> 2) or it's a user process under a non-root memory cgroup (actually cgroup_v1) with GFP_IO permitted
> > >>
> > >> Thus the potential deadlock does not actually exist (if I'm not wrong) if:
> > >> 1) cgroup is not enabled
> > >> 2) or cgroup_v2 is actually used
> > >> 3) or (memory cgroup is enabled and is attached upon cgroup_v1) the fuse server actually resides under the root cgroup
> > >> 4) or (the fuse server resides under a non-root memory cgroup_v1), but the fuse server advertises itself as a PR_IO_FLUSHER[1]
> > >>
> > >> Then we could consider adding a new feature bit indicating that any one of the above conditions is met and thus the fuse server is safe from the potential deadlock inside direct reclaim. When this feature bit is set, the kernel side could bypass the temp page copying when doing writeback.
> > >>
> > >
> > > Hi Jingbo, thanks for sharing your analysis of this.
> > >
> > > Having the temp page copying gated on the conditions you mentioned above seems a bit brittle to me. My understanding is that the mm code for when it decides to stall or not stall can change anytime in the future, in which case that seems like it could automatically break our precondition assumptions.
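To make the concern above concrete for anyone else following along: the stall logic we keep referring to is the folio_test_writeback() handling near the top of shrink_folio_list() in mm/vmscan.c. Abridged (comments mine; pgdat, sc, stat and folio_list come from the surrounding function), it currently looks roughly like this:

    if (folio_test_writeback(folio)) {
        /* Case 1: kswapd, and the node is already flagged as
         * backed up with writeback -> don't wait, reactivate */
        if (current_is_kswapd() &&
            folio_test_reclaim(folio) &&
            test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
            stat->nr_immediate++;
            goto activate_locked;

        /* Case 2: global reclaim or cgroup v2, or the folio was not
         * previously tagged for reclaim, or the allocation does not
         * allow FS/IO -> tag it and move on without waiting */
        } else if (writeback_throttling_sane(sc) ||
                   !folio_test_reclaim(folio) ||
                   !may_enter_fs(folio, sc->gfp_mask)) {
            folio_set_reclaim(folio);
            stat->nr_writeback++;
            goto activate_locked;

        /* Case 3: legacy cgroup v1 memcg reclaim with FS/IO allowed
         * -> sleep until writeback completes */
        } else {
            folio_unlock(folio);
            folio_wait_writeback(folio);
            /* then go back and try the same folio again */
            list_add_tail(&folio->lru, folio_list);
            continue;
        }
    }

Case 3 is the branch that sleeps on an individual folio (case 1 feeds into separate kswapd throttling), and it is only reachable for legacy cgroup v1 memcg reclaim with an FS/IO-capable allocation, which is essentially the condition list above.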
> >
> > So this is why PR_IO_FLUSHER is introduced here, which is specifically for user space components playing a role in the IO stack, e.g. fuse daemon, tcmu/nbd daemon, etc. PR_IO_FLUSHER offers a guarantee similar to GFP_NOIO, but for user space components. At least we can rely on the assumption that mm would take PR_IO_FLUSHER into account.
> >
> > The limitation of the PR_IO_FLUSHER approach is that, as pointed out by Miklos[1], there may be multiple components or services involved in servicing the fuse requests, and the kernel side has no effective way to check whether all services in the whole chain have set PR_IO_FLUSHER.
> >
>
> Right, so doesn't that still bring us back to the original problem: if we gate this on any one of the conditions being enough to bypass needing the temp page, and the conditions change anytime in the future in the mm code, then this would automatically open up the potential deadlock in fuse as a byproduct? That seems a bit brittle to me to have this dependency.
>

Hi Jingbo,

I had some talks with Josef about this during/after LPC and he came up with the idea of adding a flag to the 'struct address_space_operations' to indicate that a folio under writeback should be skipped during reclaim if it gets to that 3rd case of the legacy cgroupv1 encountering a folio that has been marked for reclaim. imo this seems like the most elegant solution and allows us to remove the temporary folio and rb tree entirely.
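Very roughly, the shape of the idea is something like the sketch below; the helper name is invented purely for illustration, and the actual patch (linked just below) is the authoritative version:

    /* The filesystem (fuse here) tags the mapping when it sets up the
     * inode, via a new aops/mapping bit -- hypothetical name: */
    mapping_set_writeback_no_reclaim_wait(inode->i_mapping);

    /* ...and inside "case 3" of the shrink_folio_list() snippet earlier,
     * reclaim checks that bit before deciding to sleep on the folio: */
    mapping = folio_mapping(folio);
    if (mapping && mapping_writeback_no_reclaim_wait(mapping)) {
        /* Completing this writeback needs the userspace server, which
         * may itself be stuck allocating memory, so waiting here could
         * deadlock.  Behave like case 2 instead: tag the folio and
         * move on. */
        folio_set_reclaim(folio);
        stat->nr_writeback++;
        goto activate_locked;
    }
    folio_unlock(folio);
    folio_wait_writeback(folio);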
I sent out a patch for this, which I cc'ed you on:

https://lore.kernel.org/linux-fsdevel/20241011223434.1307300-1-joannelkoong@xxxxxxxxx/

The benchmarks show roughly a 20% improvement in throughput for 4k block size writes and a 40% improvement for 1M block size writes. I'm curious to see if this speeds up writeback for you as well on the workloads you're running.

Thanks,
Joanne

> The other alternatives seem to be:
> * adding a timer to writeback requests [1] where, if the pages have not been copied out to userspace within a certain amount of time, the handler copies those pages to temporary pages and immediately clears writeback on them. The timer is canceled as soon as the pages are copied out to userspace.
> * (not sure how possible this is) adding some way to tag pages being reclaimed/balanced (I saw your comment below about the ->migrate_folio() call, which I need to look more into)
>
> The timeout option seems like the most promising one. I don't think the code would be that ugly.
>
> Curious to hear your thoughts on this. Are there any other alternatives you think could work here?
>
> [1] https://lore.kernel.org/all/CAJfpegt_mEYOeeTo2bWS3iJfC38t5bf29mzrxK68dhMptrgamg@xxxxxxxxxxxxxx/
>
> > >
> > > Additionally, if I'm understanding it correctly, we also would need to know if the writeback is being triggered from reclaim by kswapd - is there even a way in the kernel to check that?
> >
> > Nope. What I meant in the previous email is that kswapd can get stalled in direct reclaim, while a normal process, e.g. the fuse server, may not get stalled under certain conditions, e.g. when it explicitly advertises PR_IO_FLUSHER.
> >
>
> Gotcha. I just took a look at shrink_folio_list() and now I see the "current_is_kswapd()" check.
>
> > >
> > > I'm wondering if there's some way we could tell if a folio is under reclaim when we're writing it back. I'm not familiar yet with the reclaim code, but my initial thoughts were whether it'd be possible to repurpose the PG_reclaim flag, or perhaps to use the fact that the folio is not on any lru list, as an indication that it's being reclaimed. We could then just use the temp page in those cases, and skip the temp page otherwise.
> >
> > That is a good idea but I'm afraid it doesn't work. Explained below.
> >
> > >
> > > Could you also point me to where in the reclaim code we end up invoking the writeback callback? I see pageout() calls ->writepage() but I'm not seeing where we invoke ->writepages().
> >
> > Yes, the direct reclaim would end up calling ->writepage() to write back the dirty page. ->writepages() is only called from the normal writeback routine, e.g. when triggered from balance_dirty_pages().
> >
> > Also FYI, FUSE has removed ->writepage() since commit e1c420a ("fuse: Remove fuse_writepage"), and now it relies on ->migrate_folio(), i.e. memory compaction, and on the normal writeback routine (triggered from balance_dirty_pages()) when memory is low.
> >
> > Thus I'm afraid the approach of doing the temp page copying only for writeback submitted from the direct reclaim code actually doesn't work. That's because when doing direct reclaim, the process not only waits for completion of the writeback submitted from direct reclaim (e.g. marked with PG_reclaim, by ->writepage), but may also wait for writeback submitted from the normal writeback routine (without PG_reclaim marked, by ->writepages). See commit c3b94f4 ("memcg: further prevent OOM with too many dirty pages").
> >
>
> Thanks for the explanation! This is very helpful. The reliance on ->migrate_folio() for reclaim is the piece I was missing.
>
> >
> > [1] https://lore.kernel.org/all/CAJfpegvYpWuTbKOm1hoySHZocY+ki07EzcXBUX8kZx92T8W6uQ@xxxxxxxxxxxxxx/
> >
> > --
> > Thanks,
> > Jingbo
>
> Thanks,
> Joanne
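P.S. Since PR_IO_FLUSHER comes up a few times above, for anyone following along: a server advertises it with a single prctl() call (CAP_SYS_RESOURCE is required, and the flag exists since kernel v5.6). A minimal sketch:

    #include <stdio.h>
    #include <sys/prctl.h>

    /*
     * Mark the calling thread (e.g. a fuse server worker) as an IO
     * flusher: the kernel then treats its memory allocations roughly
     * like GFP_NOIO ones, so reclaim entered from the server itself
     * won't block on the very writeback the server has to complete.
     */
    static int become_io_flusher(void)
    {
        if (prctl(PR_SET_IO_FLUSHER, 1, 0, 0, 0) == -1) {
            perror("prctl(PR_SET_IO_FLUSHER)");
            return -1;
        }
        return 0;
    }

I believe this applies to the calling thread, so a multithreaded server would want to set it in each worker that can end up allocating memory while servicing requests.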