Re: [LSF/MM/BPF TOPIC] Removing writeback temp pages in FUSE

On Mon, 2025-01-27 at 13:44 -0800, Joanne Koong wrote:
> Hi all,
> 
> Recently, there was a long discussion upstream [1] on a patchset that
> removes temp pages when handling writeback in FUSE. Temp pages are the
> main bottleneck for write performance in FUSE and local benchmarks
> showed approximately a 20% and 45% improvement in throughput for 4K
> and 1M block size writes respectively when temp pages were removed.
> More information on how FUSE uses temp pages can be found here [2].
> 
> In the discussion, there were concerns from mm regarding the
> possibility of untrusted malicious or buggy fuse servers never
> completing writeback, which would impede migration for those pages.
> 
> It would be great to continue this discussion at LSF/MM and align on a
> solution that removes FUSE temp pages altogether while satisfying mm’s
> expectations for page migration. These are the most promising options
> so far:
> 
> a) Kill untrusted fuse servers that do not reply to writeback requests
> by a certain amount of time (where that time can be configurable
> through a sysctl) as a safeguard for system resources
> 
> b) Use unmovable pages for untrusted fuse servers
> 
> If there are no acceptable solutions, it might also be worth
> considering whether there could be mm options that could sufficiently
> mitigate this problem. One potential idea is co-locating FUSE folio
> allocations to the same page block so that the worst-case
> malicious/buggy server scenario only hampers migration of one page
> block.
> 
> If there is no way to remove temp pages altogether, then it would be
> useful to discuss:
> a) how skipping temp pages should be gated:
>     i) unprivileged servers default to always using temp pages while
> privileged servers skip temp pages
>     ii) splice defaults to using temp pages and writeback for non-temp
> pages get canceled if migration is initiated
>     iii) skip temp pages if a sufficient enough request timeout is set
> 

We might also consider coupling the above measures with a new limit on
the number of unprivileged FUSE mounts a user is allowed to have. IIUC,
a single unprivileged FUSE mount is only allowed a certain number of
dirty pages, but there is no real cap on the number of mounts that an
unprivileged user can spawn.

A tunable hard cap on the number of mounts allowed per uid would be a
reasonable thing to consider. Most users won't need more than 32 or 64
or so.
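
To make the per-uid cap concrete, here is a rough userspace model of
the accounting, not kernel code. Every name in it
(fuse_max_mounts_per_uid, fuse_mount_charge, fuse_mount_uncharge) is
made up for illustration, and the fixed-size table just stands in for
whatever per-user structure the kernel would really use:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tunable; 64 echoes the "32 or 64 or so" figure above. */
static unsigned int fuse_max_mounts_per_uid = 64;

#define UID_SLOTS 128

/* One slot per uid that currently has at least one FUSE mount. */
struct uid_mounts {
	unsigned int uid;
	unsigned int count;
	bool used;
};

static struct uid_mounts uid_table[UID_SLOTS];

/* Find the slot for uid; optionally claim a free slot if none exists. */
static struct uid_mounts *uid_lookup(unsigned int uid, bool create)
{
	struct uid_mounts *free_slot = NULL;

	for (size_t i = 0; i < UID_SLOTS; i++) {
		if (uid_table[i].used && uid_table[i].uid == uid)
			return &uid_table[i];
		if (!uid_table[i].used && !free_slot)
			free_slot = &uid_table[i];
	}
	if (!create || !free_slot)
		return NULL;

	free_slot->used = true;
	free_slot->uid = uid;
	free_slot->count = 0;
	return free_slot;
}

/* Account one new mount; returns 0 on success, -1 if the cap is hit. */
static int fuse_mount_charge(unsigned int uid)
{
	struct uid_mounts *m = uid_lookup(uid, true);

	if (!m)
		return -1;		/* table full in this toy model */
	if (m->count >= fuse_max_mounts_per_uid) {
		if (m->count == 0)
			m->used = false;	/* don't leak an empty slot */
		return -1;
	}
	m->count++;
	return 0;
}

/* Drop one mount at unmount time and free the slot when it empties. */
static void fuse_mount_uncharge(unsigned int uid)
{
	struct uid_mounts *m = uid_lookup(uid, false);

	if (m && m->count > 0 && --m->count == 0)
		m->used = false;
}
```

The point is only that the charge happens at mount time and fails
hard once the limit is reached, so the per-mount dirty page limit is
multiplied by a bounded number of mounts.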

> b) how to support large FUSE folios for writeback. Currently FUSE uses
> an rb tree to track writeback state of temp pages but with large
> folios, this gets unsustainable if concurrent writebacks happen on the
> same page indices but are part of different sized folios, eg the
> following scenario
>       i)  writeback on a large folio is issued
>      ii) the folio is copied to a tmp folio and writeback is cleared,
> we add this writeback request to the rb tree
>      iii) the folio in the pagecache is evicted
>      iv) another write occurs on a larger range that encompasses the
> range in the writeback in i) or on a subset of it
> It seems likely that we will need to align on another data structure
> instead of the rb tree to sufficiently handle this.
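
Whatever structure ends up replacing the rb tree, the query it has to
answer in the scenario above is whether a new write's page range
intersects any in-flight writeback range, regardless of the folio
sizes involved. A minimal sketch of that check follows; the names
(wb_range, wb_conflicts) are hypothetical and the linear scan is only
a placeholder for the real data structure:

```c
#include <stdbool.h>
#include <stddef.h>

/* Inclusive range of page indices covered by one writeback request. */
struct wb_range {
	unsigned long first;
	unsigned long last;
};

/* Two inclusive ranges intersect iff each starts before the other ends. */
static bool wb_ranges_overlap(const struct wb_range *a,
			      const struct wb_range *b)
{
	return a->first <= b->last && b->first <= a->last;
}

/*
 * Does the new write touch any page that is still under writeback?
 * This holds whether the new range encompasses an in-flight range or
 * is a subset of one, which are the two cases in step iv) above.
 */
static bool wb_conflicts(const struct wb_range *inflight, size_t n,
			 const struct wb_range *write)
{
	for (size_t i = 0; i < n; i++) {
		if (wb_ranges_overlap(&inflight[i], write))
			return true;
	}
	return false;
}
```

With one in-flight request covering pages 4-7, a later write to pages
0-15 and a later write to pages 5-6 both conflict, while a write to
pages 8-11 does not.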
> 
> 
> Thanks,
> Joanne
> 
> [1] https://lore.kernel.org/linux-fsdevel/20241122232359.429647-5-joannelkoong@xxxxxxxxx/
> [2] https://lore.kernel.org/linux-fsdevel/20241122232359.429647-1-joannelkoong@xxxxxxxxx/


Miklos has a good point about reads being a problem too. In fact, it
might be simpler to start by dealing with reads.

While limiting what we can do with FUSE is all well and good, I wonder
too if we might be able to allow pages to be migrated while reads or
writeback is going on.

Once we submit a request to the backing store, we usually have to wait
a while for the result. During that time, the kernel doesn't usually
touch the page. In some of those cases, might we be able to migrate
pages during that quiescent window?
-- 
Jeff Layton <jlayton@xxxxxxxxxx>




