On Mon, Dec 30, 2024 at 12:04 PM Shakeel Butt <shakeel.butt@xxxxxxxxx> wrote:
>
> On Mon, Dec 30, 2024 at 10:38:16AM -0800, Joanne Koong wrote:
> > On Mon, Dec 30, 2024 at 2:16 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >
> > Thanks David for the response.
> >
> > > >> BTW, I just looked at NFS out of interest, in particular
> > > >> nfs_page_async_flush(), and I spot some logic about re-dirtying pages +
> > > >> canceling writeback. IIUC, there are default timeouts for UDP and TCP,
> > > >> whereby the TCP default one seems to be around 60s (* retrans?), and the
> > > >> privileged user that mounts it can set higher ones. I guess one could run
> > > >> into similar writeback issues?
> > > >
> > > Hi,
> > > sorry for the late reply.
> > >
> > > > Yes, I think so.
> > > >
> > > >>
> > > >> So I wonder why we never required AS_WRITEBACK_INDETERMINATE for nfs?
> > > >
> > > > I feel like INDETERMINATE in the name is the main cause of confusion.
> > >
> > > We are adding logic that says "unconditionally, never wait on writeback
> > > for these folios, not even any sync migration". That's the main problem
> > > I have.
> > >
> > > Your explanation below is helpful. Because ...
> > >
> > > > So, let me explain why it is required (but later I will tell you how it
> > > > can be avoided). The FUSE thread which is actively handling writeback of
> > > > a given folio can cause a memory allocation, either through a syscall or
> > > > a page fault. That memory allocation can trigger global reclaim
> > > > synchronously, and in cgroup-v1 that FUSE thread can wait on the
> > > > writeback of the same folio whose writeback it is supposed to end,
> > > > causing a deadlock. So, AS_WRITEBACK_INDETERMINATE is used just to avoid
> > > > this deadlock.
> > > >
> > > > The in-kernel filesystems avoid this situation through the use of
> > > > GFP_NOFS allocations. The userspace fs can also use a similar approach,
> > > > which is prctl(PR_SET_IO_FLUSHER, 1), to avoid this situation.
> > > > However, I have been told that it is hard to use, as it is a
> > > > per-thread flag and has to be set for all the threads handling
> > > > writeback, which can be error prone if the threadpool is dynamic.
> > > > Second, it is very coarse, such that all the allocations from those
> > > > threads (e.g. page faults) become NOFS, which makes userspace very
> > > > unreliable on a highly utilized machine, as NOFS can not reclaim
> > > > potentially a lot of memory and can not trigger the oom-killer.
> > >
> > > ... now I understand that we want to prevent a deadlock in one specific
> > > scenario only?
> > >
> > > What sounds plausible to me is:
> > >
> > > a) Make this only affect the actual deadlock path: sync migration
> > >    during compaction. Communicate it either using some "context"
> > >    information or with a new MIGRATE_SYNC_COMPACTION.
> > > b) Call it sth. like AS_WRITEBACK_MIGHT_DEADLOCK_ON_RECLAIM to express
> > >    that very deadlock problem.
> > > c) Leave all other sync migration users alone for now.
> >
> > The deadlock path is separate from sync migration. The deadlock arises
> > from a corner case where cgroup-v1 reclaim waits on a folio under
> > writeback where that writeback itself is blocked on reclaim.
>
> Joanne, let's drop the patch to migrate.c completely, rename the flag to
> something like what David is suggesting, and only handle it in the
> reclaim path.
>
> > > Would that prevent the deadlock? Even *better* would be to be able to
> > > ask the fs if starting writeback on a specific folio could deadlock.
> > > Because in most cases, as I understand, we'll not actually run into the
> > > deadlock and would just want to wait for writeback to complete
> > > (esp. compaction).
> > >
> > > (I still think having folios under writeback for a long time might be a
> > > problem, but that's indeed something to sort out separately in the
> > > future, because I suspect NFS has similar issues.
> > > We'd want to "wait with timeout" and e.g., cancel writeback during
> > > memory offlining/alloc_cma ...)
>
> Thanks David, and yes, let's handle the folios-under-writeback issue
> separately.
>
> > I'm looking back at some of the discussions in v2 [1] and I'm still
> > not clear on how memory fragmentation for non-movable pages differs
> > from memory fragmentation for movable pages, and whether one is worse
> > than the other.
>
> I think the fragmentation due to movable pages becoming unmovable is
> worse, as that situation is unexpected and the kernel can waste a lot of
> CPU trying to defrag the block containing those folios. For non-movable
> blocks, the kernel will not even try to defrag. Now we can have a
> situation where almost all memory is backed by non-movable blocks and
> higher order allocations start failing even when there is enough free
> memory. For such situations, either the system needs to be restarted (or
> the workloads restarted, if they are the cause of the high non-movable
> memory usage) or the admin needs to set up ZONE_MOVABLE, where
> non-movable allocations don't go.

Thanks for the explanations. The reason I ask is because I'm trying to
figure out whether having a timed wait or retry mechanism, instead of
skipping migration, would be a viable solution: when attempting the
migration of folios with the AS_WRITEBACK_INDETERMINATE flag that are
under writeback, wait on folio writeback for a certain amount of time,
then skip the migration if no progress has been made and the folio is
still under writeback.

There are two cases for fuse folios under writeback (for folios not
under writeback, migration will work as is):

a) Normal case: the server is not malicious or buggy, and writeback is
completed in a timely manner.
For this case, migration would be successful and there'd be no
difference here between having no temp pages vs. temp pages.

b) The server is malicious or buggy: e.g. the server never completes
writeback.

With no temp pages: the folio under writeback prevents a memory block
(not sure how big this usually is?) from being compacted, leading to
memory fragmentation.

With temp pages: fuse allocates a non-movable page for every page it
needs to write back, which worsens memory usage, and these pages will
never get freed since the server never finishes writeback on them. The
non-movable pages could also fragment memory blocks, as in the scenario
with no temp pages.

Is the b) case with no temp pages worse for memory health than the
scenario with temp pages?

For the CPU usage issue (e.g. the kernel keeps trying to defrag blocks
containing these problematic folios), it seems like this could
potentially be mitigated by marking these blocks as uncompactable?

Thanks,
Joanne

> > Currently fuse uses movable temp pages (allocated with
> > gfp flags GFP_NOFS | __GFP_HIGHMEM), and these can run into the same
> > issue where a buggy/malicious server may never complete writeback.
>
> So, these temp pages are not an issue for fragmenting the movable
> blocks, but if there is no limit on temp pages, the whole system can
> become non-movable (there is a case where movable blocks on
> non-ZONE_MOVABLE can be converted into non-movable blocks under low
> memory). ZONE_MOVABLE will avoid such a scenario, but tuning the right
> size of ZONE_MOVABLE is not easy.
>
> > This has the same effect of fragmenting memory and has a worse memory
> > cost to the system in terms of memory used. With no temp pages,
> > though, in this scenario pages allocated in a movable page block
> > can't be compacted and that memory is fragmented.
> > My (basic and maybe incorrect) understanding is that memory gets
> > allocated through a buddy allocator, and movable vs. non-movable pages
> > get allocated to corresponding blocks that match their type, but
> > there's no other difference otherwise. Is this understanding correct?
> > Or is there some substantial difference between fragmentation for
> > movable vs. non-movable blocks?
>
> The main difference is in the fallback for high order allocations,
> which can trigger compaction directly or background compaction through
> kcompactd. The kernel will only try to defrag the movable blocks.
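To make the timed-wait idea from my reply above a bit more concrete, here is a kernel-style pseudocode sketch. This is hypothetical: folio_test_writeback() and folio_wait_writeback() are real APIs, but folio_wait_writeback_timeout() and mapping_writeback_indeterminate() are made-up names for illustration, and the timeout value is arbitrary.

```c
/* Hypothetical sketch, not real kernel code: bound the wait before
 * giving up on migrating a folio whose writeback may never complete. */
static int migrate_folio_bounded_wait(struct folio *folio)
{
	unsigned long timeout = msecs_to_jiffies(5000);	/* arbitrary cap */

	if (folio_test_writeback(folio)) {
		if (!mapping_writeback_indeterminate(folio->mapping)) {
			/* Trusted writeback path: wait as today. */
			folio_wait_writeback(folio);
		} else if (!folio_wait_writeback_timeout(folio, timeout)) {
			/* No progress within the cap: a buggy/malicious
			 * server may never complete writeback, so skip
			 * migrating this folio rather than block forever. */
			return -EBUSY;
		}
	}
	/* ... proceed with the normal migration path ... */
	return 0;
}
```

In the normal case (a) this behaves like the current wait; only in case (b) does it give up after the cap, trading a bounded stall for leaving that block fragmented.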