Re: [HELP] FUSE writeback performance bottleneck

Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> · Fri, 13 Sep 2024 11:35:04 +0800

On 9/13/24 7:18 AM, Joanne Koong wrote:
> On Wed, Sep 11, 2024 at 2:32 AM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
>>
>> Hi all,
>>
>> On 6/4/24 3:27 PM, Miklos Szeredi wrote:
>>> On Tue, 4 Jun 2024 at 03:57, Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
>>>
>>>> IIUC, there are two sources that may cause deadlock:
>>>> 1) the fuse server needs memory allocation when processing FUSE_WRITE
>>>> requests, which in turn triggers direct memory reclaim, and FUSE
>>>> writeback then - deadlock here
>>>
>>> Yep, see the folio_wait_writeback() call deep in the guts of direct
>>> reclaim, which sleeps until the PG_writeback flag is cleared.  If that
>>> happens to be triggered by the writeback in question, then that's a
>>> deadlock.
>>
>> After diving deep into the direct reclaim code, I have some insights
>> that may be helpful.
>>
>> Back to the time when the support for fuse writeback is introduced, i.e.
>> commit 3be5a52b30aa ("fuse: support writable mmap") since v2.6.26, the
>> direct reclaim indeed unconditionally waits for PG_writeback flag being
>> cleared.  At that time the direct reclaim is implemented in a two-stage
>> style, stage 1) pass over the LRU list to start parallel writeback
>> asynchronously, and stage 2) synchronously wait for completion of the
>> writeback previously started.
>>
>> This two-stage design and the unconditionally waiting for PG_writeback
>> flag being cleared is removed by commit 41ac199 ("mm: vmscan: do not
>> stall on writeback during memory compaction") since v3.5.
>>
>> Though the direct reclaim logic continues to evolve and the waiting is
>> added back, now the stall will happen only when the direct reclaim is
>> triggered from kswapd or memory cgroup.
>>
>> Specifically the stall will only happen in following certain conditions
>> (see shrink_folio_list() for details):
>> 1) kswapd
>> 2) or it's a user process under a non-root memory cgroup (actually
>> cgroup_v1) with GFP_IO permitted
>>
>> Thus the potential deadlock does not exist actually (if I'm not wrong) if:
>> 1) cgroup is not enabled
>> 2) or cgroup_v2 is actually used
>> 3) or (memory cgroup is enabled and is attached upon cgroup_v1) the fuse
>> server actually resides under the root cgroup
>> 4) or (the fuse server resides under a non-root memory cgroup_v1), but
>> the fuse server advertises itself as a PR_IO_FLUSHER[1]
>>
>>
>> Then we could consider adding a new feature bit indicating that any
>> one of the above conditions is met and thus the fuse server is safe
>> from the potential deadlock inside direct reclaim.  When this feature
>> bit is set, the kernel side could bypass the temp page copying when
>> doing writeback.
>>
> 
> Hi Jingbo, thanks for sharing your analysis of this.
> 
> Having the temp page copying gated on the conditions you mentioned
> above seems a bit brittle to me. My understanding is that the mm code
> for when it decides to stall or not stall can change anytime in the
> future, in which case that seems like it could automatically break our
> precondition assumptions.

That is why I brought up PR_IO_FLUSHER (set via prctl(PR_SET_IO_FLUSHER)),
which exists specifically for user space components that play a role in
the IO stack, e.g. the fuse daemon, tcmu/nbd daemons, etc.  PR_IO_FLUSHER
offers a guarantee similar to GFP_NOIO, but for user space components.
At least we can rely on the assumption that mm will take PR_IO_FLUSHER
into account.
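
For reference, a minimal sketch of how a fuse daemon might opt in.  It
assumes the flag is set through prctl(2) with PR_SET_IO_FLUSHER (Linux
5.6 or later, and the caller needs CAP_SYS_RESOURCE).  The helper name
mark_self_io_flusher() is made up for illustration and is not part of
libfuse.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Fallback definitions for older userspace headers (values from
 * include/uapi/linux/prctl.h). */
#ifndef PR_SET_IO_FLUSHER
#define PR_SET_IO_FLUSHER 57
#endif

/*
 * Hypothetical helper: mark the calling thread as an IO flusher.
 * Returns 0 on success or a negative errno value on failure, e.g.
 * -EPERM without CAP_SYS_RESOURCE, -EINVAL on kernels before 5.6.
 */
static int mark_self_io_flusher(void)
{
	if (prctl(PR_SET_IO_FLUSHER, 1, 0, 0, 0) == -1)
		return -errno;
	return 0;
}
```

The state is per thread, so a multi-threaded daemon would presumably
call this early, before spawning its worker threads, or in each worker.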

The limitation of the PR_IO_FLUSHER approach is that, as pointed out by
Miklos[1], there may be multiple components or services involved in
servicing the fuse requests, and the kernel side has no effective way to
check whether all services in the whole chain have set PR_IO_FLUSHER.

> Additionally, if I'm understanding it
> correctly, we also would need to know if the writeback is being
> triggered from reclaim by kswapd - is there even a way in the kernel
> to check that?

Nope.  What I meant in the previous email is that kswapd can get stalled
in direct reclaim, while a normal process, e.g. the fuse server, may not
get stalled under certain conditions, e.g. when it explicitly advertises
PR_IO_FLUSHER.
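
To restate the conditions from the previous mail as code, here is a toy
user space predicate.  It is not the actual shrink_folio_list() logic;
the struct, its fields, and both function names are invented for
illustration.  It also assumes that an IO flusher task has __GFP_IO
masked out of its allocations, which is what makes it behave like
GFP_NOIO.

```c
#include <stdbool.h>

/* Hypothetical summary of the context reclaim runs in. */
struct reclaim_ctx {
	bool is_kswapd;        /* reclaim runs in the kswapd thread */
	bool memcg_v1_nonroot; /* task sits in a non-root cgroup_v1 memcg */
	bool gfp_io_allowed;   /* the allocation permits __GFP_IO */
};

/* True if reclaim may stall on a folio that is under writeback. */
static bool may_stall_on_writeback(const struct reclaim_ctx *c)
{
	if (c->is_kswapd)
		return true;
	return c->memcg_v1_nonroot && c->gfp_io_allowed;
}

/* Context of a fuse server thread doing direct reclaim. */
static struct reclaim_ctx fuse_server_ctx(bool memcg_v1_nonroot,
					  bool io_flusher)
{
	struct reclaim_ctx c = {
		.is_kswapd = false,
		.memcg_v1_nonroot = memcg_v1_nonroot,
		/* Assumption: an IO flusher never allocates with __GFP_IO. */
		.gfp_io_allowed = !io_flusher,
	};
	return c;
}
```

With this model, a fuse server in a non-root cgroup_v1 memcg may stall
unless it is an IO flusher, and a server in the root cgroup or under
cgroup_v2 never stalls.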

> 
> I'm wondering if there's some way we could tell if a folio is under
> reclaim when we're writing it back. I'm not familiar yet with the
> reclaim code, but my initial thoughts were whether it'd be possible to
> purpose the PG_reclaim flag or perhaps if the folio is not on any lru
> list, as an indication that it's being reclaimed. We could then just
> use the temp page in those cases, and skip the temp page otherwise.

That is a good idea, but I'm afraid it doesn't work.  Explained below.

> 
> Could you also point me to where in the reclaim code we end up
> invoking the writeback callback? I see pageout() calls ->writepage()
> but I'm not seeing where we invoke ->writepages().

Yes, direct reclaim would end up calling ->writepage() to write back the
dirty page.  ->writepages() is only called in the normal writeback
routine, e.g. when triggered from balance_dirty_pages().

Also FYI FUSE has removed ->writepage() since commit e1c420a ("fuse:
Remove fuse_writepage"), and now it relies on ->migrate_folio(), i.e.
memory compaction, and on the normal writeback routine (triggered from
balance_dirty_pages()) under low memory.

Thus I'm afraid the approach of doing temp page copying only for
writeback from the direct reclaim code doesn't actually work.  That's
because during direct reclaim, the process not only waits for completion
of the writeback submitted from direct reclaim (e.g. marked with
PG_reclaim, by ->writepage), but may also wait for writeback submitted
from the normal writeback routine (without PG_reclaim marked, by
->writepages).  See commit c3b94f4 ("memcg: further prevent OOM with too
many dirty pages").

[1]
https://lore.kernel.org/all/CAJfpegvYpWuTbKOm1hoySHZocY+ki07EzcXBUX8kZx92T8W6uQ@xxxxxxxxxxxxxx/

-- 
Thanks,
Jingbo