Re: [RFC PATCH 0/2] memcg: add nomlock to avoid folios being mlocked in a memcg

Yafang Shao <laoar.shao@xxxxxxxxx> · Sun, 22 Dec 2024 10:34:12 +0800

On Sat, Dec 21, 2024 at 3:21 PM Michal Hocko <mhocko@xxxxxxxx> wrote:
>
> On Fri 20-12-24 19:52:16, Yafang Shao wrote:
> > On Fri, Dec 20, 2024 at 6:23 PM Michal Hocko <mhocko@xxxxxxxx> wrote:
> > >
> > > On Sun 15-12-24 15:34:13, Yafang Shao wrote:
> > > > Implementation Options
> > > > ----------------------
> > > >
> > > > - Solution A: Allow file caches on the unevictable list to become
> > > >   reclaimable.
> > > >   This approach would require significant refactoring of the page reclaim
> > > >   logic.
> > > >
> > > > - Solution B: Prevent file caches from being moved to the unevictable list
> > > >   during mlock and ignore the VM_LOCKED flag during page reclaim.
> > > >   This is a more straightforward solution and is the one we have chosen.
> > > >   If the file caches are reclaimed from the download-proxy's memcg and
> > > >   subsequently accessed by tasks in the application’s memcg, a filemap
> > > >   fault will occur. A new file cache will be faulted in, charged to the
> > > >   application’s memcg, and locked there.
> > >
> > > Both options are silently breaking userspace because a non failing mlock
> > > doesn't give guarantees it is supposed to AFAICS.
> >
> > It does not bypass the mlock mechanism; rather, it defers the actual
> > locking operation to the page fault path. Could you clarify what you
> > mean by "a non-failing mlock"? From what I can see, mlock can indeed
> > fail if there isn’t sufficient memory available. With this change, we
> > are simply shifting the potential failure point to the page fault path
> > instead.
>
> Your change will cause mlocked pages (as mlock syscall returns success)
> to be reclaimable later on. That breaks the basic mlock contract.

AFAICS, the mlock() semantics were designed with only a single root
memory cgroup in mind. In other words, when mlock() was introduced,
all locked pages were accounted in the same place.

However, this changed with the introduction of memcg support. Now,
mlock() can lock pages that are charged to a different memcg from the
one the calling task belongs to. This case is not explicitly covered
in the mlock() documentation, which could lead to confusion.

To clarify, I propose updating the mlock() documentation as follows:

When memcg is enabled, the page being locked might be charged to a
memcg other than that of the calling task. In such cases, the page
can still be reclaimed if mlock() is not permitted in the memcg it is
charged to. If the page is reclaimed, the next access faults it back
in, charges it to the calling task's memcg, and locks it there.

Additionally, encountering a single page fault during this process
should be acceptable to most users. If your application cannot
tolerate even a single page fault, you likely wouldn’t enable memcg in
the first place.
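
To make the sequence described above concrete, here is a toy userspace
model of it. This is only a sketch, not kernel code: struct memcg,
struct file_page, and the model_*() helpers are hypothetical names
invented for illustration, and the "nomlock" field merely stands in
for the knob proposed in this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only; none of these types or helpers exist in the kernel. */
struct memcg {
	const char *name;
	bool nomlock;		/* stand-in for the proposed per-memcg knob */
};

struct file_page {
	struct memcg *owner;	/* memcg the page is charged to, NULL if not resident */
	bool mlocked;		/* true if the page sits on the unevictable list */
};

/*
 * Model of mlock() called by a task in @task_cg. It always reports
 * success, but the page only becomes unevictable if the memcg it is
 * charged to permits mlock.
 */
static int model_mlock(struct file_page *pg, struct memcg *task_cg)
{
	assert(pg && task_cg);
	if (!pg->owner)
		pg->owner = task_cg;	/* not resident: fault in, charge the caller */
	pg->mlocked = !pg->owner->nomlock;
	return 0;
}

/* Model of reclaim in the owning memcg. Returns true if the page was evicted. */
static bool model_reclaim(struct file_page *pg)
{
	assert(pg);
	if (!pg->owner || pg->mlocked)
		return false;
	pg->owner = NULL;
	return true;
}

/*
 * Model of an access through a VM_LOCKED mapping by a task in @task_cg.
 * Returns true if the access had to take a page fault.
 */
static bool model_access(struct file_page *pg, struct memcg *task_cg)
{
	assert(pg && task_cg);
	if (pg->owner)
		return false;		/* still resident, no fault */
	pg->owner = task_cg;		/* refault: charged to the accessing task's memcg */
	pg->mlocked = !task_cg->nomlock;
	return true;
}
```

With a page initially charged to a download-proxy memcg that has
nomlock set, model_mlock() from the application's memcg succeeds but
leaves the page evictable; model_reclaim() can then evict it; the next
model_access() from the application takes exactly one fault, after
which the page is charged to the application's memcg, locked, and no
longer reclaimable.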

-- 
Regards
Yafang