Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi,

On 12/20/24 12:41 AM, David Hildenbrand wrote:
> On 19.12.24 17:40, Shakeel Butt wrote:
>> On Thu, Dec 19, 2024 at 05:29:08PM +0100, David Hildenbrand wrote:
>> [...]
>>>>
>>>> If you check the code just above this patch, this
>>>> mapping_writeback_indeterminate() check only happen for pages under
>>>> writeback which is a temp state. Anyways, fuse folios should not be
>>>> unmovable for their lifetime but only while under writeback which is
>>>> same for all fs.
>>>
>>> But there, writeback is expected to be a temporary thing, not possibly:
>>> "AS_WRITEBACK_INDETERMINATE", that is a BIG difference.
>>>
>>> I'll have to NACK anything that violates ZONE_MOVABLE / ALLOC_CMA
>>> guarantees, and unfortunately, it sounds like this is the case here,
>>> unless
>>> I am missing something important.
>>>
>>
>> It might just be the name "AS_WRITEBACK_INDETERMINATE" is causing
>> the confusion. The writeback state is not indefinite. A proper fuse fs,
>> like anyother fs, should handle writeback pages appropriately. These
>> additional checks and skips are for (I think) untrusted fuse servers.
> 
> Can unprivileged user space provoke this case?
> 

There are some details on the initial problem that FUSE community wants
to fix [1].

In summary, a non-malicious fuse daemon may need to allocate some memory
when processing a FUSE_WRITE request (initiated from the writeback
routine), in which case memory reclaim and compaction is triggered when
allocating memory, which in turn leads to waiting on the writeback of
**FUSE** dirty pages (which itself waits for the fuse daemon to handle
it) - a deadlock here.

The current FUSE implementation fixes this by introducing "temp page" in
the writeback routine for FUSE.  In short, a temporary page (allocated
from ZONE_UNMOVABLE) is allocated for each dirty page cache needs to be
written back.  The content is copied from the original page cache to the
temporary page.  And then the original page cache (to writeback,
allocated from ZONE_MOVABLE) clears PG_writeback bit immediately, so
that the fuse daemon won't possibly stuck in deadlock waiting for the
writeback of FUSE page cache.  Instead, the actual writeback work is
done upon the cloned temporary page then.

Thus there are actually two pages for each FUSE page cache, one is the
original FUSE page cache (in ZONE_MOVABLE) and the other is the
temporary page (in ZONE_UNMOVABLE).

- For the original page cache, it will clear PG_writeback bit very
quickly in the writeback routine and won't block the memory direct
reclaim and compaction at all
- As for the temporary page, in the normal case, the fuse server will
complete FUSE_WRITE request as expected, and thus the temporary page
will get freed soon.

However FUSE supports unprivileged mount, in which case the fuse daemon
is run and mounted by an unprivileged user.  Thus the backend fuse
daemon may be malicious (started by an unprivileged user) and refuses to
process any FUSE requests.  Thus in the worst case, these temporary
pages will never complete writeback and get pinned in ZONE_UNMOVABLE
forever. (One thing worth noting is that, once the fuse daemon gets
killed, the whole FUSE filesystem will be aborted, all inflight FUSE
requests are flushed, and all the temporary pages will be freed then)


What this patchset does is to drop the temporary page design in the FUSE
writeback routine, while this patch is introduced to avoid the above
mentioned deadlock for a *sane* FUSE daemon in memory compaction after
dropping the temp page design.

Currently the FUSE writeback pages (i.e. FUSE page cache) is allocated
from GFP_HIGHUSER_MOVABLE, which is consistent with other filesystems.

In the normal case (the FUSE is backed by a well-behaved FUSE daemon),
the page cache will be completed in a reasonable manner and it won't
affect the usability of ZONE_MOVABLE.

While in the worst case (a malicious FUSE daemon run by an unprivileged
user), these page cache in ZONE_MOVABLE can be pinned there indefinitely.

We can argue that in the current implementation (without this patch
series), ZONE_UNMOVABLE can also grow larger and larger, and pin quite
many memory usage (correct me if I'm wrong) in the worst case.  In this
degree this patch doesn't make things even worse.  Besides FUSE enables
strictlimit feature by default, in which each FUSE filesystem can
consume at most 1% of global vm.dirty_background_thresh before write
throttle is triggered.


[1]
https://lore.kernel.org/all/8eec0912-7a6c-4387-b9be-6718f438a111@xxxxxxxxxxxxxxxxx/


-- 
Thanks,
Jingbo




[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux