Re: [HELP] FUSE writeback performance bottleneck

Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> · Mon, 14 Oct 2024 09:57:30 +0800

On 10/12/24 7:08 AM, Joanne Koong wrote:
> On Fri, Sep 13, 2024 at 1:55 PM Joanne Koong <joannelkoong@xxxxxxxxx> wrote:
>>
>> On Thu, Sep 12, 2024 at 8:35 PM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
>>>
>>> On 9/13/24 7:18 AM, Joanne Koong wrote:
>>>> On Wed, Sep 11, 2024 at 2:32 AM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> On 6/4/24 3:27 PM, Miklos Szeredi wrote:
>>>>>> On Tue, 4 Jun 2024 at 03:57, Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
>>>>>>
>>>>>>> IIUC, there are two sources that may cause deadlock:
>>>>>>> 1) the fuse server needs memory allocation when processing FUSE_WRITE
>>>>>>> requests, which in turn triggers direct memory reclaim, and FUSE
>>>>>>> writeback then - deadlock here
>>>>>>
>>>>>> Yep, see the folio_wait_writeback() call deep in the guts of direct
>>>>>> reclaim, which sleeps until the PG_writeback flag is cleared.  If that
>>>>>> happens to be triggered by the writeback in question, then that's a
>>>>>> deadlock.
>>>>>
>>>>> After diving deep into the direct reclaim code, there are some insights
>>>>> may be helpful.
>>>>>
>>>>> Back to the time when the support for fuse writeback is introduced, i.e.
>>>>> commit 3be5a52b30aa ("fuse: support writable mmap") since v2.6.26, the
>>>>> direct reclaim indeed unconditionally waits for PG_writeback flag being
>>>>> cleared.  At that time the direct reclaim is implemented in a two-stage
>>>>> style, stage 1) pass over the LRU list to start parallel writeback
>>>>> asynchronously, and stage 2) synchronously wait for completion of the
>>>>> writeback previously started.
>>>>>
>>>>> This two-stage design and the unconditionally waiting for PG_writeback
>>>>> flag being cleared is removed by commit 41ac199 ("mm: vmscan: do not
>>>>> stall on writeback during memory compaction") since v3.5.
>>>>>
>>>>> Though the direct reclaim logic continues to evolve and the waiting is
>>>>> added back, now the stall will happen only when the direct reclaim is
>>>>> triggered from kswapd or memory cgroup.
>>>>>
>>>>> Specifically the stall will only happen in following certain conditions
>>>>> (see shrink_folio_list() for details):
>>>>> 1) kswapd
>>>>> 2) or it's a user process under a non-root memory cgroup (actually
>>>>> cgroup_v1) with GFP_IO permitted
>>>>>
>>>>> Thus the potential deadlock does not exist actually (if I'm not wrong) if:
>>>>> 1) cgroup is not enabled
>>>>> 2) or cgroup_v2 is actually used
>>>>> 3) or (memory cgroup is enabled and is attached upon cgroup_v1) the fuse
>>>>> server actually resides under the root cgroup
>>>>> 4) or (the fuse server resides under a non-root memory cgroup_v1), but
>>>>> the fuse server advertises itself as a PR_IO_FLUSHER[1]
>>>>>
>>>>>
>>>>> Then we could considering adding a new feature bit indicating that any
>>>>> one of the above condition is met and thus the fuse server is safe from
>>>>> the potential deadlock inside direct reclaim.  When this feature bit is
>>>>> set, the kernel side could bypass the temp page copying when doing
>>>>> writeback.
>>>>>
>>>>
>>>> Hi Jingbo, thanks for sharing your analysis of this.
>>>>
>>>> Having the temp page copying gated on the conditions you mentioned
>>>> above seems a bit brittle to me. My understanding is that the mm code
>>>> for when it decides to stall or not stall can change anytime in the
>>>> future, in which case that seems like it could automatically break our
>>>> precondition assumptions.
>>>
>>> So this is why PR_IO_FLUSHER is introduced here, which is specifically
>>> for user space components playing a role in IO stack, e.g. fuse daemon,
>>> tcmu/nbd daemon, etc.  PR_IO_FLUSHER offers guarantee similar to
>>> GFP_NOIO, but for user space components.  At least we can rely on the
>>> assumption that mm would take PR_IO_FLUSHER into account.
>>>
>>> The limitation of the PR_IO_FLUSHER approach is that, as pointed by
>>> Miklos[1], there may be multiple components or services involved to
>>> service the fuse requests, and the kernel side has no effective way to
>>> check if all services in the whole chain have set PR_IO_FLUSHER.
>>>
>>
>> Right, so doesn't that still bring us back to the original problem
>> where if we gate this on any of the one conditions being enough to
>> bypass needing the temp page, if the conditions change anytime in the
>> future in the mm code, then this would automatically open up the
>> potential deadlock in fuse as a byproduct? That seems a bit brittle to
>> me to have this dependency.
>>
> 
> Hi Jingbo,
> 
> I had some talks with Josef about this during/after LPC and he came up
> with the idea of adding a flag to the 'struct
> address_space_operations' to indicate that a folio under writeback
> should be skipped during reclaim if it gets to that 3rd case of the
> legacy cgroupv1 encountering a folio that has been marked for reclaim.
> imo this seems like the most elegant solution and allows us to remove
> the temporary folio and rb tree entirely. I sent out a patch for this
> https://lore.kernel.org/linux-fsdevel/20241011223434.1307300-1-joannelkoong@xxxxxxxxx/
> that I cc'ed you on, the benchmarks show roughly a 20% improvement in
> throughput for 4k block size writes and a 40% improvement for 1M block
> size writes. I'm curious to see if this speeds up writeback for you as
> well on the workloads you're running.
> 

Yeah I just saw your post this morning.  It would be best if the mm
folks are happy with the modification to the memory reclaiming code.

I haven't tested your patch yet, I will test it later.  But at least my
previous test of dropping the temp page copying shows ~100% 1M write
bandwidth improvement (4GB/s->9GB/s), though it was tested upon Bernd's
passthrough_hp [1] which bypasses all buffer copying (i.e. the daemon
replies immediately without copying the data buffer).

[1] https://github.com/libfuse/libfuse/pull/807

-- 
Thanks,
Jingbo