On Sun, Dec 22, 2024 at 10:34 AM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
>
> On Sat, Dec 21, 2024 at 3:21 PM Michal Hocko <mhocko@xxxxxxxx> wrote:
> >
> > On Fri 20-12-24 19:52:16, Yafang Shao wrote:
> > > On Fri, Dec 20, 2024 at 6:23 PM Michal Hocko <mhocko@xxxxxxxx> wrote:
> > > >
> > > > On Sun 15-12-24 15:34:13, Yafang Shao wrote:
> > > > > Implementation Options
> > > > > ----------------------
> > > > >
> > > > > - Solution A: Allow file caches on the unevictable list to become
> > > > > reclaimable.
> > > > > This approach would require significant refactoring of the page reclaim
> > > > > logic.
> > > > >
> > > > > - Solution B: Prevent file caches from being moved to the unevictable list
> > > > > during mlock and ignore the VM_LOCKED flag during page reclaim.
> > > > > This is a more straightforward solution and is the one we have chosen.
> > > > > If the file caches are reclaimed from the download-proxy's memcg and
> > > > > subsequently accessed by tasks in the application’s memcg, a filemap
> > > > > fault will occur. A new file cache will be faulted in, charged to the
> > > > > application’s memcg, and locked there.
> > > >
> > > > Both options are silently breaking userspace because a non failing mlock
> > > > doesn't give guarantees it is supposed to AFAICS.
> > >
> > > It does not bypass the mlock mechanism; rather, it defers the actual
> > > locking operation to the page fault path. Could you clarify what you
> > > mean by "a non-failing mlock"? From what I can see, mlock can indeed
> > > fail if there isn’t sufficient memory available. With this change, we
> > > are simply shifting the potential failure point to the page fault path
> > > instead.
> >
> > Your change will cause mlocked pages (as mlock syscall returns success)
> > to be reclaimable later on. That breaks the basic mlock contract.
>
> AFAICS, the mlock() behavior was originally designed with only a
> single root memory cgroup in mind.
> In other words, when mlock() was
> introduced, all locked pages were confined to the same memcg.
>
> However, this changed with the introduction of memcg support. Now,
> mlock() can lock pages that belong to a different memcg than the
> current task. This behavior is not explicitly defined in the mlock()
> documentation, which could lead to confusion.
>
> To clarify, I propose updating the mlock() documentation as follows:
>
> When memcg is enabled, the page being locked might reside in a
> different memcg than the current task. In such cases, the page might
> be reclaimed if mlock() is not permitted in its original memcg. If the
> locked page is reclaimed, it could be faulted back into the current
> task's memcg and then locked again.
>
> Additionally, encountering a single page fault during this process
> should be acceptable to most users. If your application cannot
> tolerate even a single page fault, you likely wouldn’t enable memcg in
> the first place.
>

If you insist on not allowing a single page fault, there is an
alternative approach, though it may require significantly more complex
handling.

- Option C: Reparent the mlocked page to a common ancestor

  Consider the following hierarchy:

        A
       / \
      B   C

  If B is mlocking a page in C, we can reparent that mlocked page to A,
  essentially making A the new parent for the mlocked page.

          A
         / \
        B   C
       / \   \
      D   E   F

  In this example, if D is mlocking a page in F, we will reparent the
  mlocked page to A.

  - Benefits: No user-visible cgroup file setting:
    This approach avoids introducing or modifying cgroup settings that
    could be visible or configurable by users.

  - Downsides: Increased complexity:
    This option requires significantly more work in terms of managing
    the reparenting process.

--
Regards
Yafang
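
P.S. To make Option C a bit more concrete, here is a minimal sketch of the common-ancestor lookup it depends on. It is illustrative Python, not kernel code: the `Cgroup` class and the `common_ancestor()` helper are hypothetical names, and actually recharging the page to the ancestor is not modeled.

```python
class Cgroup:
    """Hypothetical stand-in for a memcg node; not a kernel structure."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

    def ancestors(self):
        """Yield this cgroup, then each parent up to the root."""
        node = self
        while node is not None:
            yield node
            node = node.parent


def common_ancestor(locker, owner):
    """Return the nearest cgroup that contains both locker and owner."""
    seen = {id(node) for node in locker.ancestors()}
    for node in owner.ancestors():
        if id(node) in seen:
            return node
    return None  # only reachable if the two cgroups are in different trees


# The second hierarchy from the Option C example.
a = Cgroup("A")
b = Cgroup("B", a)
c = Cgroup("C", a)
d = Cgroup("D", b)
e = Cgroup("E", b)
f = Cgroup("F", c)

print(common_ancestor(b, c).name)  # A
print(common_ancestor(d, f).name)  # A
```

Picking the nearest common ancestor is an assumption of this sketch; both cases in the Option C example resolve to A either way.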