On Fri, Sep 13, 2024 at 1:55 PM Joanne Koong <joannelkoong@xxxxxxxxx> wrote:
>
> On Thu, Sep 12, 2024 at 8:35 PM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
> >
> > On 9/13/24 7:18 AM, Joanne Koong wrote:
> > > On Wed, Sep 11, 2024 at 2:32 AM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
> > >>
> > >> Hi all,
> > >>
> > >> On 6/4/24 3:27 PM, Miklos Szeredi wrote:
> > >>> On Tue, 4 Jun 2024 at 03:57, Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
> > >>>
> > >>>> IIUC, there are two sources that may cause deadlock:
> > >>>> 1) the fuse server needs memory allocation when processing FUSE_WRITE requests, which in turn triggers direct memory reclaim, and FUSE writeback then - deadlock here
> > >>>
> > >>> Yep, see the folio_wait_writeback() call deep in the guts of direct reclaim, which sleeps until the PG_writeback flag is cleared. If that happens to be triggered by the writeback in question, then that's a deadlock.
> > >>
> > >> After diving deep into the direct reclaim code, there are some insights that may be helpful.
> > >>
> > >> Back when support for fuse writeback was introduced, i.e. commit 3be5a52b30aa ("fuse: support writable mmap") in v2.6.26, direct reclaim indeed unconditionally waited for the PG_writeback flag to be cleared. At that time direct reclaim was implemented in a two-stage style: stage 1) pass over the LRU list and start parallel writeback asynchronously, and stage 2) synchronously wait for completion of the writeback started previously.
> > >>
> > >> This two-stage design, and the unconditional waiting for the PG_writeback flag to be cleared, was removed by commit 41ac199 ("mm: vmscan: do not stall on writeback during memory compaction") in v3.5.
> > >>
> > >> Though the direct reclaim logic has continued to evolve and the waiting was later added back, the stall now happens only when reclaim is triggered from kswapd or from a memory cgroup.
> > >>
> > >> Specifically, the stall will only happen under the following conditions (see shrink_folio_list() for details):
> > >> 1) kswapd
> > >> 2) or it's a user process under a non-root memory cgroup (actually cgroup_v1) with GFP_IO permitted
> > >>
> > >> Thus the potential deadlock does not actually exist (if I'm not wrong) if:
> > >> 1) cgroup is not enabled
> > >> 2) or cgroup_v2 is actually used
> > >> 3) or (memory cgroup is enabled and is attached upon cgroup_v1) the fuse server actually resides under the root cgroup
> > >> 4) or (the fuse server resides under a non-root memory cgroup_v1), but the fuse server advertises itself as a PR_IO_FLUSHER[1]
> > >>
> > >> Then we could consider adding a new feature bit indicating that any one of the above conditions is met and thus the fuse server is safe from the potential deadlock inside direct reclaim. When this feature bit is set, the kernel side could bypass the temp page copying when doing writeback.
> > >>
> > >
> > > Hi Jingbo, thanks for sharing your analysis of this.
> > >
> > > Having the temp page copying gated on the conditions you mentioned above seems a bit brittle to me. My understanding is that the mm code for when it decides to stall or not stall can change anytime in the future, in which case that seems like it could automatically break our precondition assumptions.
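To make the concern above concrete for anyone else following along: the stall logic we keep referring to is the folio_test_writeback() handling near the top of shrink_folio_list() in mm/vmscan.c. Abridged (comments mine; pgdat, sc, stat and folio_list come from the surrounding function), it currently looks roughly like this:

    if (folio_test_writeback(folio)) {
        /* Case 1: kswapd, and the node is already flagged as
         * backed up with writeback -> don't wait, reactivate */
        if (current_is_kswapd() &&
            folio_test_reclaim(folio) &&
            test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
            stat->nr_immediate++;
            goto activate_locked;

        /* Case 2: global reclaim or cgroup v2, or the folio was not
         * previously tagged for reclaim, or the allocation does not
         * allow FS/IO -> tag it and move on without waiting */
        } else if (writeback_throttling_sane(sc) ||
                   !folio_test_reclaim(folio) ||
                   !may_enter_fs(folio, sc->gfp_mask)) {
            folio_set_reclaim(folio);
            stat->nr_writeback++;
            goto activate_locked;

        /* Case 3: legacy cgroup v1 memcg reclaim with FS/IO allowed
         * -> sleep until writeback completes */
        } else {
            folio_unlock(folio);
            folio_wait_writeback(folio);
            /* then go back and try the same folio again */
            list_add_tail(&folio->lru, folio_list);
            continue;
        }
    }

Case 3 is the branch that sleeps on an individual folio (case 1 feeds into separate kswapd throttling), and it is only reachable for legacy cgroup v1 memcg reclaim with an FS/IO-capable allocation, which is essentially the condition list above.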
> >
> > So this is why PR_IO_FLUSHER is introduced here, which is specifically for user space components playing a role in the IO stack, e.g. fuse daemon, tcmu/nbd daemon, etc. PR_IO_FLUSHER offers a guarantee similar to GFP_NOIO, but for user space components. At least we can rely on the assumption that mm would take PR_IO_FLUSHER into account.
> >
> > The limitation of the PR_IO_FLUSHER approach is that, as pointed out by Miklos[1], there may be multiple components or services involved in servicing the fuse requests, and the kernel side has no effective way to check whether all services in the whole chain have set PR_IO_FLUSHER.
> >
>
> Right, so doesn't that still bring us back to the original problem: if we gate this on any one of the conditions being enough to bypass needing the temp page, and the conditions change anytime in the future in the mm code, then this would automatically open up the potential deadlock in fuse as a byproduct? That seems a bit brittle to me to have this dependency.
>

Hi Jingbo,

I had some talks with Josef about this during/after LPC and he came up with the idea of adding a flag to the 'struct address_space_operations' to indicate that a folio under writeback should be skipped during reclaim if it gets to that 3rd case of the legacy cgroupv1 encountering a folio that has been marked for reclaim. imo this seems like the most elegant solution and allows us to remove the temporary folio and rb tree entirely.
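Very roughly, the shape of the idea is something like the sketch below; the helper name is invented purely for illustration, and the actual patch (linked just below) is the authoritative version:

    /* The filesystem (fuse here) tags the mapping when it sets up the
     * inode, via a new aops/mapping bit -- hypothetical name: */
    mapping_set_writeback_no_reclaim_wait(inode->i_mapping);

    /* ...and inside "case 3" of the shrink_folio_list() snippet earlier,
     * reclaim checks that bit before deciding to sleep on the folio: */
    mapping = folio_mapping(folio);
    if (mapping && mapping_writeback_no_reclaim_wait(mapping)) {
        /* Completing this writeback needs the userspace server, which
         * may itself be stuck allocating memory, so waiting here could
         * deadlock.  Behave like case 2 instead: tag the folio and
         * move on. */
        folio_set_reclaim(folio);
        stat->nr_writeback++;
        goto activate_locked;
    }
    folio_unlock(folio);
    folio_wait_writeback(folio);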
I sent out a patch for this, which I cc'ed you on:

https://lore.kernel.org/linux-fsdevel/20241011223434.1307300-1-joannelkoong@xxxxxxxxx/

The benchmarks show roughly a 20% improvement in throughput for 4k block size writes and a 40% improvement for 1M block size writes. I'm curious to see if this speeds up writeback for you as well on the workloads you're running.

Thanks,
Joanne

> The other alternatives seem to be:
> * adding a timer to writeback requests [1] where, if the pages have not been copied out to userspace within a certain amount of time, the handler copies those pages to temporary pages and immediately clears writeback on them. The timer is canceled as soon as the pages are copied out to userspace.
> * (not sure how possible this is) adding some way to tag pages being reclaimed/balanced (I saw your comment below about the ->migrate_folio() call, which I need to look more into)
>
> The timeout option seems like the most promising one. I don't think the code would be that ugly.
>
> Curious to hear your thoughts on this. Are there any other alternatives you think could work here?
>
> [1] https://lore.kernel.org/all/CAJfpegt_mEYOeeTo2bWS3iJfC38t5bf29mzrxK68dhMptrgamg@xxxxxxxxxxxxxx/
>
> > >
> > > Additionally, if I'm understanding it correctly, we also would need to know if the writeback is being triggered from reclaim by kswapd - is there even a way in the kernel to check that?
> >
> > Nope. What I meant in the previous email is that kswapd can get stalled in direct reclaim, while a normal process, e.g. the fuse server, may not get stalled under certain conditions, e.g. when it explicitly advertises PR_IO_FLUSHER.
> >
>
> Gotcha. I just took a look at shrink_folio_list() and now I see the "current_is_kswapd()" check.
>
> > >
> > > I'm wondering if there's some way we could tell if a folio is under reclaim when we're writing it back. I'm not familiar yet with the reclaim code, but my initial thoughts were whether it'd be possible to repurpose the PG_reclaim flag, or perhaps to use the fact that the folio is not on any lru list, as an indication that it's being reclaimed. We could then just use the temp page in those cases, and skip the temp page otherwise.
> >
> > That is a good idea but I'm afraid it doesn't work. Explained below.
> >
> > >
> > > Could you also point me to where in the reclaim code we end up invoking the writeback callback? I see pageout() calls ->writepage() but I'm not seeing where we invoke ->writepages().
> >
> > Yes, the direct reclaim would end up calling ->writepage() to write back the dirty page. ->writepages() is only called from the normal writeback routine, e.g. when triggered from balance_dirty_pages().
> >
> > Also FYI, FUSE has removed ->writepage() since commit e1c420a ("fuse: Remove fuse_writepage"), and now it relies on ->migrate_folio(), i.e. memory compaction, and on the normal writeback routine (triggered from balance_dirty_pages()) when memory is low.
> >
> > Thus I'm afraid the approach of doing the temp page copying only for writeback submitted from the direct reclaim code actually doesn't work. That's because when doing direct reclaim, the process not only waits for completion of the writeback submitted from direct reclaim (e.g. marked with PG_reclaim, by ->writepage), but may also wait for writeback submitted from the normal writeback routine (without PG_reclaim marked, by ->writepages). See commit c3b94f4 ("memcg: further prevent OOM with too many dirty pages").
> >
>
> Thanks for the explanation! This is very helpful. The reliance on ->migrate_folio() for reclaim is the piece I was missing.
>
> >
> > [1] https://lore.kernel.org/all/CAJfpegvYpWuTbKOm1hoySHZocY+ki07EzcXBUX8kZx92T8W6uQ@xxxxxxxxxxxxxx/
> >
> > --
> > Thanks,
> > Jingbo
>
> Thanks,
> Joanne
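P.S. Since PR_IO_FLUSHER comes up a few times above, for anyone following along: a server advertises it with a single prctl() call (CAP_SYS_RESOURCE is required, and the flag exists since kernel v5.6). A minimal sketch:

    #include <stdio.h>
    #include <sys/prctl.h>

    /*
     * Mark the calling thread (e.g. a fuse server worker) as an IO
     * flusher: the kernel then treats its memory allocations roughly
     * like GFP_NOIO ones, so reclaim entered from the server itself
     * won't block on the very writeback the server has to complete.
     */
    static int become_io_flusher(void)
    {
        if (prctl(PR_SET_IO_FLUSHER, 1, 0, 0, 0) == -1) {
            perror("prctl(PR_SET_IO_FLUSHER)");
            return -1;
        }
        return 0;
    }

I believe this applies to the calling thread, so a multithreaded server would want to set it in each worker that can end up allocating memory while servicing requests.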