On Thu Jan 2, 2025 at 2:59 PM EST, Joanne Koong wrote:
> On Mon, Dec 30, 2024 at 12:04 PM Shakeel Butt <shakeel.butt@xxxxxxxxx> wrote:
> >
> > On Mon, Dec 30, 2024 at 10:38:16AM -0800, Joanne Koong wrote:
> > > On Mon, Dec 30, 2024 at 2:16 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
> > >
> > > Thanks David for the response.
> > >
> > > > >> BTW, I just looked at NFS out of interest, in particular
> > > > >> nfs_page_async_flush(), and I spot some logic about re-dirtying pages +
> > > > >> canceling writeback. IIUC, there are default timeouts for UDP and TCP,
> > > > >> whereby the TCP default one seems to be around 60s (* retrans?), and the
> > > > >> privileged user that mounts it can set higher ones. I guess one could run
> > > > >> into similar writeback issues?
> > > >
> > > > Hi,
> > > >
> > > > sorry for the late reply.
> > > >
> > > > > Yes, I think so.
> > > > >
> > > > >> So I wonder why we never required AS_WRITEBACK_INDETERMINATE for nfs?
> > > > >
> > > > > I feel like INDETERMINATE in the name is the main cause of confusion.
> > > >
> > > > We are adding logic that says "unconditionally, never wait on writeback
> > > > for these folios, not even any sync migration". That's the main problem
> > > > I have.
> > > >
> > > > Your explanation below is helpful. Because ...
> > > >
> > > > > So, let me explain why it is required (but later I will tell you how it
> > > > > can be avoided). The FUSE thread which is actively handling writeback of
> > > > > a given folio can cause memory allocation either through a syscall or a
> > > > > page fault. That memory allocation can trigger global reclaim
> > > > > synchronously, and in cgroup-v1, that FUSE thread can wait on the
> > > > > writeback of the same folio whose writeback it is supposed to end,
> > > > > causing a deadlock. So, AS_WRITEBACK_INDETERMINATE is used to just
> > > > > avoid this deadlock.
> > > > > The in-kernel fs avoid this situation through the use of GFP_NOFS
> > > > > allocations. The userspace fs can also use a similar approach, which is
> > > > > prctl(PR_SET_IO_FLUSHER, 1), to avoid this situation. However I have been
> > > > > told that it is hard to use, as it is a per-thread flag and has to be set
> > > > > for all the threads handling writeback, which can be error prone if the
> > > > > threadpool is dynamic. Second, it is very coarse, such that all the
> > > > > allocations from those threads (e.g. page faults) become NOFS, which
> > > > > makes userspace very unreliable on a highly utilized machine, as NOFS
> > > > > can not reclaim potentially a lot of memory and can not trigger oom-kill.
> > > >
> > > > ... now I understand that we want to prevent a deadlock in one specific
> > > > scenario only?
> > > >
> > > > What sounds plausible for me is:
> > > >
> > > > a) Make this only affect the actual deadlock path: sync migration
> > > >    during compaction. Communicate it either using some "context"
> > > >    information or with a new MIGRATE_SYNC_COMPACTION.
> > > > b) Call it sth. like AS_WRITEBACK_MIGHT_DEADLOCK_ON_RECLAIM to express
> > > >    that very deadlock problem.
> > > > c) Leave all other sync migration users alone for now
> > >
> > > The deadlock path is separate from sync migration. The deadlock arises
> > > from a corner case where cgroupv1 reclaim waits on a folio under
> > > writeback where that writeback itself is blocked on reclaim.
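(For reference, since it keeps coming up: the prctl() Shakeel mentions above
is used roughly as in the minimal, untested sketch below. The helper name is
mine; as noted above, it has to be called in every server thread that can
complete writeback, which is what makes it error prone for dynamic thread
pools.)

#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SET_IO_FLUSHER
#define PR_SET_IO_FLUSHER 57        /* available since Linux 5.6 */
#endif

/* Call from each fuse server thread that handles writeback; requires
 * CAP_SYS_RESOURCE. Memory allocations made by this thread (including
 * its page faults) then avoid recursing into fs/IO reclaim, so the
 * thread cannot end up waiting on the very writeback it must finish. */
static int mark_thread_as_io_flusher(void)
{
        if (prctl(PR_SET_IO_FLUSHER, 1, 0, 0, 0) != 0) {
                perror("prctl(PR_SET_IO_FLUSHER)");
                return -1;
        }
        return 0;
}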
> >
> > Joanne, let's drop the patch to migrate.c completely and let's rename
> > the flag to something like what David is suggesting, and only handle it
> > in the reclaim path.
> >
> > > > Would that prevent the deadlock? Even *better* would be to be able to
> > > > ask the fs if starting writeback on a specific folio could deadlock.
> > > > Because in most cases, as I understand, we'll not actually run into the
> > > > deadlock and would just want to wait for writeback to complete
> > > > (esp. compaction).
> > > >
> > > > (I still think having folios under writeback for a long time might be a
> > > > problem, but that's indeed something to sort out separately in the
> > > > future, because I suspect NFS has similar issues. We'd want to "wait
> > > > with timeout" and e.g., cancel writeback during memory
> > > > offlining/alloc_cma ...)
> >
> > Thanks David, and yes, let's handle the folios under writeback issue
> > separately.
> >
> > > I'm looking back at some of the discussions in v2 [1] and I'm still
> > > not clear on how memory fragmentation for non-movable pages differs
> > > from memory fragmentation for movable pages, and whether one is worse
> > > than the other.
> >
> > I think the fragmentation due to movable pages becoming unmovable is
> > worse, as that situation is unexpected and the kernel can waste a lot of
> > CPU trying to defrag the blocks containing those folios. For non-movable
> > blocks, the kernel will not even try to defrag. Now we can have a
> > situation where almost all memory is backed by non-movable blocks and
> > higher order allocations start failing even when there is enough free
> > memory. For such situations, either the system needs to be restarted (or
> > the workloads restarted, if they are the cause of the high non-movable
> > memory usage) or the admin needs to set up ZONE_MOVABLE, where
> > non-movable allocations don't go.
>
> Thanks for the explanations.
>
> The reason I ask is because I'm trying to figure out if a timed wait or
> retry mechanism, instead of skipping migration, would be a viable
> solution: when attempting migration of folios with the
> as_writeback_indeterminate flag that are under writeback, it'll wait on
> folio writeback for a certain amount of time and then skip the migration
> if no progress has been made and the folio is still under writeback.
>
> There are two cases for fuse folios under writeback (for folios not
> under writeback, migration will work as is):
>
> a) normal case: the server is not malicious or buggy, and writeback is
> completed in a timely manner.
> For this case, migration would be successful and there'd be no
> difference between having no temp pages vs temp pages.
>
> b) the server is malicious or buggy:
> e.g. the server never completes writeback
>
> With no temp pages:
> The folio under writeback prevents a memory block (not sure how big
> this usually is?) from being compacted, leading to memory
> fragmentation.

It is called a pageblock. Its size is usually the same as a PMD THP
(e.g., 2MB on x86_64). With no temp pages, folios can spread across
multiple pageblocks, fragmenting all of them.
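(Side note: you can see both numbers on a live system in
/proc/pagetypeinfo, which reports the pageblock size and how many
pageblocks each zone has per migratetype. A rough, untested reader; the
file requires root to read on recent kernels:)

#include <stdio.h>
#include <string.h>

int main(void)
{
        FILE *f = fopen("/proc/pagetypeinfo", "r");
        char line[512];
        int in_counts = 0;

        if (!f) {
                perror("/proc/pagetypeinfo");
                return 1;
        }
        while (fgets(line, sizeof(line), f)) {
                /* e.g. "Page block order: 9" and "Pages per block: 512"
                 * on x86_64, i.e. 512 * 4KB = 2MB, the PMD THP size. */
                if (strstr(line, "Page block order") ||
                    strstr(line, "Pages per block"))
                        fputs(line, stdout);
                /* Trailing table: pageblock counts per migratetype
                 * (Unmovable, Movable, Reclaimable, ...) for each zone. */
                if (strstr(line, "Number of blocks type"))
                        in_counts = 1;
                if (in_counts)
                        fputs(line, stdout);
        }
        fclose(f);
        return 0;
}

On a typical x86_64 system the Unmovable column is much smaller than the
Movable one, which matters for the comparison below.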
> With temp pages:
> fuse allocates a non-movable page for every page it needs to write
> back, which worsens memory usage, and these pages will never get freed
> since the server never finishes writeback on them. The non-movable
> pages could also fragment memory blocks like in the scenario with no
> temp pages.

Since the temp pages all come from MIGRATE_UNMOVABLE pageblocks, which
are much fewer, the fragmentation is much more limited.

> Is the b) case with no temp pages worse for memory health than the
> scenario with temp pages? For the CPU usage issue (e.g. the kernel keeps
> trying to defrag blocks containing these problematic folios), it seems
> like this could potentially be mitigated by marking these blocks as
> uncompactable?

With no temp pages, folios under writeback can potentially fragment many
more pageblocks, if not all of them, compared to with temp pages, because
MIGRATE_UNMOVABLE pageblocks are used for unmovable page allocations,
like kernel data allocations, and are supposed to be much fewer than
MIGRATE_MOVABLE pageblocks in the system.

> Thanks,
> Joanne
>
> > > Currently fuse uses movable temp pages (allocated with
> > > gfp flags GFP_NOFS | __GFP_HIGHMEM), and these can run into the same
> > > issue where a buggy/malicious server may never complete writeback.
> >
> > So, these temp pages are not an issue for fragmenting the movable
> > blocks, but if there is no limit on temp pages, the whole system can
> > become non-movable (there is a case where movable blocks on
> > non-ZONE_MOVABLE can be converted into non-movable blocks under low
> > memory). ZONE_MOVABLE will avoid such a scenario, but tuning the right
> > size of ZONE_MOVABLE is not easy.
> >
> > > This has the same effect of fragmenting memory and has a worse memory
> > > cost to the system in terms of memory used. Without temp pages
> > > though, now in this scenario, pages allocated in a movable pageblock
> > > can't be compacted and that memory is fragmented. My (basic and maybe
> > > incorrect) understanding is that memory gets allocated through a buddy
> > > allocator and movable vs non-movable pages get allocated to
> > > corresponding blocks that match their type, but there's no other
> > > difference otherwise. Is this understanding correct? Or is there some
> > > substantial difference between fragmentation for movable vs
> > > non-movable blocks?
> >
> > The main difference is the fallback path of high order allocations,
> > which can trigger direct compaction or background compaction through
> > kcompactd. The kernel will only try to defrag the movable blocks.
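(For anyone not familiar with the temp page scheme discussed above, it
looks roughly like the sketch below, modeled on fuse_writepage_locked()
in fs/fuse/file.c. It is heavily simplified and illustrative only; the
function name and error handling are mine, not the literal fuse code.)

static int fuse_writepage_with_tmp_page(struct page *page)
{
        struct page *tmp_page;

        tmp_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
        if (!tmp_page)
                return -ENOMEM;

        /* Snapshot the dirty data: page -> tmp_page. */
        copy_highpage(tmp_page, page);

        /*
         * Queue tmp_page in the write request sent to the userspace
         * server; it is only freed once the server completes the write
         * (or never, if the server is buggy or malicious).
         */

        /* The original page is immediately reclaimable/movable again. */
        end_page_writeback(page);
        return 0;
}

Removing the copy gets rid of that per-page unmovable allocation, but
leaves the original, movable folio under writeback for as long as the
server takes, which is exactly the trade-off discussed above.

--
Best Regards,
Yan, Zi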