On Sat, Dec 21, 2024 at 3:21 PM Michal Hocko <mhocko@xxxxxxxx> wrote: > > On Fri 20-12-24 19:52:16, Yafang Shao wrote: > > On Fri, Dec 20, 2024 at 6:23 PM Michal Hocko <mhocko@xxxxxxxx> wrote: > > > > > > On Sun 15-12-24 15:34:13, Yafang Shao wrote: > > > > Implementation Options > > > > ---------------------- > > > > > > > > - Solution A: Allow file caches on the unevictable list to become > > > > reclaimable. > > > > This approach would require significant refactoring of the page reclaim > > > > logic. > > > > > > > > - Solution B: Prevent file caches from being moved to the unevictable list > > > > during mlock and ignore the VM_LOCKED flag during page reclaim. > > > > This is a more straightforward solution and is the one we have chosen. > > > > If the file caches are reclaimed from the download-proxy's memcg and > > > > subsequently accessed by tasks in the application’s memcg, a filemap > > > > fault will occur. A new file cache will be faulted in, charged to the > > > > application’s memcg, and locked there. > > > > > > Both options are silently breaking userspace because a non failing mlock > > > doesn't give guarantees it is supposed to AFAICS. > > > > It does not bypass the mlock mechanism; rather, it defers the actual > > locking operation to the page fault path. Could you clarify what you > > mean by "a non-failing mlock"? From what I can see, mlock can indeed > > fail if there isn’t sufficient memory available. With this change, we > > are simply shifting the potential failure point to the page fault path > > instead. > > Your change will cause mlocked pages (as mlock syscall returns success) > to be reclaimable later on. That breaks the basic mlock contract. AFAICS, the mlock() behavior was originally designed with only a single root memory cgroup in mind. In other words, when mlock() was introduced, all locked pages were confined to the same memcg. However, this changed with the introduction of memcg support. Now, mlock() can lock pages that belong to a different memcg than the current task. This behavior is not explicitly defined in the mlock() documentation, which could lead to confusion. To clarify, I propose updating the mlock() documentation as follows: When memcg is enabled, the page being locked might reside in a different memcg than the current task. In such cases, the page might be reclaimed if mlock() is not permitted in its original memcg. If the locked page is reclaimed, it could be faulted back into the current task's memcg and then locked again. Additionally, encountering a single page fault during this process should be acceptable to most users. If your application cannot tolerate even a single page fault, you likely wouldn’t enable memcg in the first place. -- Regards Yafang