Hi, On 12/20/24 12:41 AM, David Hildenbrand wrote: > On 19.12.24 17:40, Shakeel Butt wrote: >> On Thu, Dec 19, 2024 at 05:29:08PM +0100, David Hildenbrand wrote: >> [...] >>>> >>>> If you check the code just above this patch, this >>>> mapping_writeback_indeterminate() check only happen for pages under >>>> writeback which is a temp state. Anyways, fuse folios should not be >>>> unmovable for their lifetime but only while under writeback which is >>>> same for all fs. >>> >>> But there, writeback is expected to be a temporary thing, not possibly: >>> "AS_WRITEBACK_INDETERMINATE", that is a BIG difference. >>> >>> I'll have to NACK anything that violates ZONE_MOVABLE / ALLOC_CMA >>> guarantees, and unfortunately, it sounds like this is the case here, >>> unless >>> I am missing something important. >>> >> >> It might just be the name "AS_WRITEBACK_INDETERMINATE" is causing >> the confusion. The writeback state is not indefinite. A proper fuse fs, >> like anyother fs, should handle writeback pages appropriately. These >> additional checks and skips are for (I think) untrusted fuse servers. > > Can unprivileged user space provoke this case? > There are some details on the initial problem that FUSE community wants to fix [1]. In summary, a non-malicious fuse daemon may need to allocate some memory when processing a FUSE_WRITE request (initiated from the writeback routine), in which case memory reclaim and compaction is triggered when allocating memory, which in turn leads to waiting on the writeback of **FUSE** dirty pages (which itself waits for the fuse daemon to handle it) - a deadlock here. The current FUSE implementation fixes this by introducing "temp page" in the writeback routine for FUSE. In short, a temporary page (allocated from ZONE_UNMOVABLE) is allocated for each dirty page cache needs to be written back. The content is copied from the original page cache to the temporary page. And then the original page cache (to writeback, allocated from ZONE_MOVABLE) clears PG_writeback bit immediately, so that the fuse daemon won't possibly stuck in deadlock waiting for the writeback of FUSE page cache. Instead, the actual writeback work is done upon the cloned temporary page then. Thus there are actually two pages for each FUSE page cache, one is the original FUSE page cache (in ZONE_MOVABLE) and the other is the temporary page (in ZONE_UNMOVABLE). - For the original page cache, it will clear PG_writeback bit very quickly in the writeback routine and won't block the memory direct reclaim and compaction at all - As for the temporary page, in the normal case, the fuse server will complete FUSE_WRITE request as expected, and thus the temporary page will get freed soon. However FUSE supports unprivileged mount, in which case the fuse daemon is run and mounted by an unprivileged user. Thus the backend fuse daemon may be malicious (started by an unprivileged user) and refuses to process any FUSE requests. Thus in the worst case, these temporary pages will never complete writeback and get pinned in ZONE_UNMOVABLE forever. (One thing worth noting is that, once the fuse daemon gets killed, the whole FUSE filesystem will be aborted, all inflight FUSE requests are flushed, and all the temporary pages will be freed then) What this patchset does is to drop the temporary page design in the FUSE writeback routine, while this patch is introduced to avoid the above mentioned deadlock for a *sane* FUSE daemon in memory compaction after dropping the temp page design. Currently the FUSE writeback pages (i.e. FUSE page cache) is allocated from GFP_HIGHUSER_MOVABLE, which is consistent with other filesystems. In the normal case (the FUSE is backed by a well-behaved FUSE daemon), the page cache will be completed in a reasonable manner and it won't affect the usability of ZONE_MOVABLE. While in the worst case (a malicious FUSE daemon run by an unprivileged user), these page cache in ZONE_MOVABLE can be pinned there indefinitely. We can argue that in the current implementation (without this patch series), ZONE_UNMOVABLE can also grow larger and larger, and pin quite many memory usage (correct me if I'm wrong) in the worst case. In this degree this patch doesn't make things even worse. Besides FUSE enables strictlimit feature by default, in which each FUSE filesystem can consume at most 1% of global vm.dirty_background_thresh before write throttle is triggered. [1] https://lore.kernel.org/all/8eec0912-7a6c-4387-b9be-6718f438a111@xxxxxxxxxxxxxxxxx/ -- Thanks, Jingbo