Re: [RFC PATCH 0/2] memcg: add nomlock to avoid folios being mlocked in a memcg

On Sun 22-12-24 10:34:12, Yafang Shao wrote:
> On Sat, Dec 21, 2024 at 3:21 PM Michal Hocko <mhocko@xxxxxxxx> wrote:
> >
> > On Fri 20-12-24 19:52:16, Yafang Shao wrote:
> > > On Fri, Dec 20, 2024 at 6:23 PM Michal Hocko <mhocko@xxxxxxxx> wrote:
> > > >
> > > > On Sun 15-12-24 15:34:13, Yafang Shao wrote:
> > > > > Implementation Options
> > > > > ----------------------
> > > > >
> > > > > - Solution A: Allow file caches on the unevictable list to become
> > > > >   reclaimable.
> > > > >   This approach would require significant refactoring of the page reclaim
> > > > >   logic.
> > > > >
> > > > > - Solution B: Prevent file caches from being moved to the unevictable list
> > > > >   during mlock and ignore the VM_LOCKED flag during page reclaim.
> > > > >   This is a more straightforward solution and is the one we have chosen.
> > > > >   If the file caches are reclaimed from the download-proxy's memcg and
> > > > >   subsequently accessed by tasks in the application’s memcg, a filemap
> > > > >   fault will occur. A new file cache will be faulted in, charged to the
> > > > >   application’s memcg, and locked there.
> > > >
> > > > Both options are silently breaking userspace because a non failing mlock
> > > > doesn't give guarantees it is supposed to AFAICS.
> > >
> > > It does not bypass the mlock mechanism; rather, it defers the actual
> > > locking operation to the page fault path. Could you clarify what you
> > > mean by "a non-failing mlock"? From what I can see, mlock can indeed
> > > fail if there isn’t sufficient memory available. With this change, we
> > > are simply shifting the potential failure point to the page fault path
> > > instead.
> >
> > Your change will cause mlocked pages (as mlock syscall returns success)
> > to be reclaimable later on. That breaks the basic mlock contract.
> 
> AFAICS, the mlock() behavior was originally designed with only a
> single root memory cgroup in mind. In other words, when mlock() was
> introduced, all locked pages were confined to the same memcg.

Yes, and this is the case for any other syscall that might have an
impact on memory consumption. This is by design. The memory cgroup
controller aims to provide completely transparent resource control
without any modifications to applications. The same holds for all other
cgroup controllers. If memcg (or any other controller) affects the
behavior of a specific syscall, then this has to be communicated
explicitly to the caller.

The purpose of the mlock syscall is to _guarantee_ that memory stays
resident (never swapped out). There might be additional constraints
that prevent mlock from succeeding - e.g. an rlimit, or a memcg that
aims to control the amount of mlocked memory - but those failures need
to be explicitly communicated via syscall failure.

> However, this changed with the introduction of memcg support. Now,
> mlock() can lock pages that belong to a different memcg than the
> current task. This behavior is not explicitly defined in the mlock()
> documentation, which could lead to confusion.

This is more a problem of cgroup configurations where different
resource domains share resources. It is not much different from other
resources (e.g. shmem) being shared across unrelated cgroups.

> To clarify, I propose updating the mlock() documentation as follows:

This is not really possible because you would effectively be breaking
existing userspace.
-- 
Michal Hocko
SUSE Labs
