[LSF/MM/BPF TOPIC] Removing writeback temp pages in FUSE

Joanne Koong <joannelkoong@xxxxxxxxx> · Mon, 27 Jan 2025 13:44:15 -0800

Hi all,

Recently, there was a long discussion upstream [1] on a patchset that
removes temp pages when handling writeback in FUSE. Temp pages are the
main bottleneck for write performance in FUSE. In local benchmarks,
removing them improved throughput by approximately 20% for 4K block
size writes and by approximately 45% for 1M block size writes.
More information on how FUSE uses temp pages can be found here [2].

In the discussion, there were concerns from mm about untrusted FUSE
servers, whether malicious or buggy, never completing writeback.
Without temp pages, the page cache folio itself stays under writeback
until the server replies, which would impede migration of those pages.

It would be great to continue this discussion at LSF/MM and align on a
solution that removes FUSE temp pages altogether while satisfying mm’s
expectations for page migration. These are the most promising options
so far:

a) Kill untrusted FUSE servers that do not reply to writeback requests
within a certain amount of time (configurable through a sysctl) as a
safeguard for system resources

b) Use unmovable pages for untrusted FUSE servers
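
To make option a) a bit more concrete, here is a rough userspace sketch
of the expiry check it implies. It is not kernel code and not taken
from the patchset: struct wb_request, wb_request_expired() and the
convention that a timeout of 0 means "disabled" are all hypothetical
and only there for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one writeback request sent to a FUSE server. */
struct wb_request {
	uint64_t issued_at_sec;	/* time the request was sent to the server */
	bool completed;		/* set once the server has replied */
};

/*
 * Returns true if the server has held this request for at least
 * timeout_sec seconds without replying. A timeout of 0 is treated as
 * "no timeout configured" (an assumption made for this sketch).
 */
static bool wb_request_expired(const struct wb_request *req,
			       uint64_t now_sec, uint64_t timeout_sec)
{
	assert(now_sec >= req->issued_at_sec);

	if (req->completed || timeout_sec == 0)
		return false;

	return now_sec - req->issued_at_sec >= timeout_sec;
}
```

With a check like this, a periodic scan over outstanding requests could
decide when an untrusted server should be aborted. The timeout value
itself would come from the sysctl mentioned in a).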

If none of these solutions is acceptable, it might also be worth
considering whether mm-side options could sufficiently mitigate this
problem. One potential idea is co-locating FUSE folio allocations in
the same page block so that the worst-case malicious/buggy server
scenario only hampers migration of one page block.

If there is no way to remove temp pages altogether, then it would be
useful to discuss:
a) how skipping temp pages should be gated:
    i) unprivileged servers default to always using temp pages while
privileged servers skip temp pages
    ii) splice defaults to using temp pages and writeback for non-temp
pages gets canceled if migration is initiated
    iii) skip temp pages if a sufficient request timeout is set

b) how to support large FUSE folios for writeback. Currently FUSE uses
an rb tree to track the writeback state of temp pages, but with large
folios this becomes unsustainable if concurrent writebacks happen on
the same page indices but are part of differently sized folios, e.g.
in the following scenario:
    i) writeback on a large folio is issued
    ii) the folio is copied to a tmp folio and writeback is cleared,
we add this writeback request to the rb tree
    iii) the folio in the pagecache is evicted
    iv) another write occurs on a larger range that encompasses the
range in the writeback in i) or on a subset of it
It seems likely that we will need to align on a different data
structure than the rb tree to handle this properly.
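
To illustrate what the scenario above asks of the data structure: step
iv) amounts to asking whether a new page range overlaps any in-flight
writeback range, whatever folio size either was issued with. Below is a
minimal userspace sketch of that query. It is not kernel code, every
name in it (struct wb_tracker, wb_track(), wb_conflicts(), ...) is
hypothetical, and the fixed-size table with a linear scan is only for
clarity, not a proposal for the actual replacement.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_INFLIGHT 64	/* arbitrary table size for this sketch */

/* Hypothetical record of one in-flight writeback, in page indices. */
struct wb_range {
	unsigned long start;	/* first page index covered */
	unsigned long end;	/* last page index covered, inclusive */
	bool in_use;
};

struct wb_tracker {
	struct wb_range slots[MAX_INFLIGHT];
};

/* Two inclusive ranges overlap if each starts at or before the other's end. */
static bool wb_ranges_overlap(unsigned long s1, unsigned long e1,
			      unsigned long s2, unsigned long e2)
{
	return s1 <= e2 && s2 <= e1;
}

/* Record an in-flight writeback; returns false if the table is full. */
static bool wb_track(struct wb_tracker *t,
		     unsigned long start, unsigned long end)
{
	assert(start <= end);

	for (size_t i = 0; i < MAX_INFLIGHT; i++) {
		if (!t->slots[i].in_use) {
			t->slots[i] = (struct wb_range){ start, end, true };
			return true;
		}
	}
	return false;
}

/* Drop the first matching record once the server has replied. */
static void wb_untrack(struct wb_tracker *t,
		       unsigned long start, unsigned long end)
{
	for (size_t i = 0; i < MAX_INFLIGHT; i++) {
		struct wb_range *r = &t->slots[i];

		if (r->in_use && r->start == start && r->end == end) {
			r->in_use = false;
			return;
		}
	}
}

/* True if any in-flight writeback touches a page in [start, end]. */
static bool wb_conflicts(const struct wb_tracker *t,
			 unsigned long start, unsigned long end)
{
	for (size_t i = 0; i < MAX_INFLIGHT; i++) {
		const struct wb_range *r = &t->slots[i];

		if (r->in_use &&
		    wb_ranges_overlap(r->start, r->end, start, end))
			return true;
	}
	return false;
}
```

For example, after wb_track(&t, 16, 31), a later write covering pages
0-63 (the larger encompassing range) and one covering pages 20-23 (the
subset) both report a conflict, while pages 32-47 do not. A real
implementation would need a lookup that is better than linear and that
fits FUSE's locking, which is the part worth discussing.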

Thanks,
Joanne

[1] https://lore.kernel.org/linux-fsdevel/20241122232359.429647-5-joannelkoong@xxxxxxxxx/
[2] https://lore.kernel.org/linux-fsdevel/20241122232359.429647-1-joannelkoong@xxxxxxxxx/