On Mon, Dec 30, 2024 at 10:38:16AM -0800, Joanne Koong wrote:
> On Mon, Dec 30, 2024 at 2:16 AM David Hildenbrand <david@xxxxxxxxxx> wrote:

Thanks David for the response.

> >
> > >> BTW, I just looked at NFS out of interest, in particular
> > >> nfs_page_async_flush(), and I spot some logic about re-dirtying pages +
> > >> canceling writeback. IIUC, there are default timeouts for UDP and TCP,
> > >> whereby the TCP default one seems to be around 60s (* retrans?), and the
> > >> privileged user that mounts it can set higher ones. I guess one could run
> > >> into similar writeback issues?
> > >
> >
> > Hi,
> >
> > sorry for the late reply.
> >
> > > Yes, I think so.
> > >
> > >>
> > >> So I wonder why we never required AS_WRITEBACK_INDETERMINATE for nfs?
> > >
> > > I feel like INDETERMINATE in the name is the main cause of confusion.
> >
> > We are adding logic that says "unconditionally, never wait on writeback
> > for these folios, not even any sync migration". That's the main problem
> > I have.
> >
> > Your explanation below is helpful. Because ...
> >
> > > So, let me explain why it is required (but later I will tell you how it
> > > can be avoided). The FUSE thread which is actively handling writeback of
> > > a given folio can cause memory allocation either through syscall or page
> > > fault. That memory allocation can trigger global reclaim synchronously
> > > and in cgroup-v1, that FUSE thread can wait on the writeback on the same
> > > folio whose writeback it is supposed to end, causing a deadlock. So,
> > > AS_WRITEBACK_INDETERMINATE is used to just avoid this deadlock.
> > >
> > > The in-kernel fs avoid this situation through the use of GFP_NOFS
> > > allocations. The userspace fs can also use a similar approach which is
> > > prctl(PR_SET_IO_FLUSHER, 1) to avoid this situation.
> > > However I have been told that it is hard to use as it is a per-thread
> > > flag and has to be set for all the threads handling writeback, which
> > > can be error prone if the threadpool is dynamic. Second, it is very
> > > coarse such that all the allocations from those threads (e.g. page
> > > faults) become NOFS, which makes userspace very unreliable on a highly
> > > utilized machine as NOFS cannot reclaim potentially a lot of memory
> > > and cannot trigger oom-kill.
> > >
> >
> > ... now I understand that we want to prevent a deadlock in one specific
> > scenario only?
> >
> > What sounds plausible for me is:
> >
> > a) Make this only affect the actual deadlock path: sync migration
> >    during compaction. Communicate it either using some "context"
> >    information or with a new MIGRATE_SYNC_COMPACTION.
> > b) Call it sth. like AS_WRITEBACK_MIGHT_DEADLOCK_ON_RECLAIM to express
> >    that very deadlock problem.
> > c) Leave all other sync migration users alone for now
>
> The deadlock path is separate from sync migration. The deadlock arises
> from a corner case where cgroupv1 reclaim waits on a folio under
> writeback where that writeback itself is blocked on reclaim.
>

Joanne, let's drop the patch to migrate.c completely, rename the flag to
something like what David is suggesting, and handle it only in the
reclaim path.

> >
> > Would that prevent the deadlock? Even *better* would be to be able to
> > ask the fs if starting writeback on a specific folio could deadlock.
> > Because in most cases, as I understand, we'll not actually run into the
> > deadlock and would just want to wait for writeback to just complete
> > (esp. compaction).
> >
> > (I still think having folios under writeback for a long time might be a
> > problem, but that's indeed something to sort out separately in the
> > future, because I suspect NFS has similar issues. We'd want to "wait
> > with timeout" and e.g., cancel writeback during memory
> > offlining/alloc_cma ...)
Thanks David and yes, let's handle the folios under writeback issue
separately.

>
> I'm looking back at some of the discussions in v2 [1] and I'm still
> not clear on how memory fragmentation for non-movable pages differs
> from memory fragmentation from movable pages and whether one is worse
> than the other.

I think the fragmentation due to movable pages becoming unmovable is
worse, as that situation is unexpected and the kernel can waste a lot of
CPU trying to defrag the block containing those folios. For non-movable
blocks, the kernel will not even try to defrag. Now we can have a
situation where almost all memory is backed by non-movable blocks and
higher-order allocations start failing even when there is enough free
memory. For such situations, either the system needs to be restarted (or
the workloads restarted if they are the cause of the high non-movable
memory) or the admin needs to set up ZONE_MOVABLE, where non-movable
allocations don't go.

> Currently fuse uses movable temp pages (allocated with
> gfp flags GFP_NOFS | __GFP_HIGHMEM), and these can run into the same
> issue where a buggy/malicious server may never complete writeback.

So, these temp pages are not an issue for fragmenting the movable
blocks, but if there is no limit on temp pages, the whole system can
become non-movable (there is a case where movable blocks on
non-ZONE_MOVABLE can be converted into non-movable blocks under low
memory). ZONE_MOVABLE will avoid such a scenario, but tuning the right
size of ZONE_MOVABLE is not easy.

> This has the same effect of fragmenting memory and has a worse memory
> cost to the system in terms of memory used. With not having temp pages
> though, now in this scenario, pages allocated in a movable page block
> can't be compacted and that memory is fragmented.
> My (basic and maybe
> incorrect) understanding is that memory gets allocated through a buddy
> allocator and movable vs nonmovable pages get allocated to
> corresponding blocks that match their type, but there's no other
> difference otherwise. Is this understanding correct? Or is there some
> substantial difference between fragmentation for movable vs nonmovable
> blocks?

The main difference is the fallback of a high-order allocation, which
can trigger compaction or background compaction through kcompactd. The
kernel will only try to defrag the movable blocks.
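
To illustrate the point about higher-order allocations failing even when
there is enough free memory, here is a toy userspace model. It is only a
sketch and not kernel code: struct toy_zone, TOY_NR_PAGES and the toy_*
helpers are all made-up names, and the "zone" is just 16 page slots. The
only thing borrowed from the buddy allocator is the rule that an order-N
allocation needs 2^N contiguous free pages at a 2^N-aligned index.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_PAGES 16 /* hypothetical tiny "zone", for illustration only */

/* One entry per page frame: true means the page is in use (pinned). */
struct toy_zone {
	bool used[TOY_NR_PAGES];
};

/* Count free pages, regardless of where they sit. */
static size_t toy_free_pages(const struct toy_zone *z)
{
	size_t n = 0;

	for (size_t i = 0; i < TOY_NR_PAGES; i++)
		if (!z->used[i])
			n++;
	return n;
}

/*
 * Buddy-style check: an order-N request needs 2^N contiguous free pages
 * starting at an index aligned to 2^N.
 */
static bool toy_can_alloc(const struct toy_zone *z, unsigned int order)
{
	size_t span = (size_t)1 << order;

	assert(span <= TOY_NR_PAGES);

	for (size_t start = 0; start + span <= TOY_NR_PAGES; start += span) {
		bool all_free = true;

		for (size_t i = start; i < start + span; i++) {
			if (z->used[i]) {
				all_free = false;
				break;
			}
		}
		if (all_free)
			return true;
	}
	return false;
}

/* Pin one page in every group of four; 12 of the 16 pages stay free. */
static void toy_scatter_pinned(struct toy_zone *z)
{
	for (size_t i = 0; i < TOY_NR_PAGES; i++)
		z->used[i] = (i % 4 == 0);
}
```

After toy_scatter_pinned(), three quarters of the toy zone is free, yet
an order-2 request cannot be satisfied because every aligned group of
four pages contains one pinned page. Migrating those pinned pages
together is what compaction would do for movable blocks; the model
deliberately does not attempt that, which corresponds to the case where
the pages cannot be moved.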