Re: [RFC PATCH 0/2] memcg: add nomlock to avoid folios being mlocked in a memcg

Yafang Shao <laoar.shao@xxxxxxxxx> · Wed, 25 Dec 2024 10:23:53 +0800

On Sun, Dec 22, 2024 at 10:34 AM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
>
> On Sat, Dec 21, 2024 at 3:21 PM Michal Hocko <mhocko@xxxxxxxx> wrote:
> >
> > On Fri 20-12-24 19:52:16, Yafang Shao wrote:
> > > On Fri, Dec 20, 2024 at 6:23 PM Michal Hocko <mhocko@xxxxxxxx> wrote:
> > > >
> > > > On Sun 15-12-24 15:34:13, Yafang Shao wrote:
> > > > > Implementation Options
> > > > > ----------------------
> > > > >
> > > > > - Solution A: Allow file caches on the unevictable list to become
> > > > >   reclaimable.
> > > > >   This approach would require significant refactoring of the page reclaim
> > > > >   logic.
> > > > >
> > > > > - Solution B: Prevent file caches from being moved to the unevictable list
> > > > >   during mlock and ignore the VM_LOCKED flag during page reclaim.
> > > > >   This is a more straightforward solution and is the one we have chosen.
> > > > >   If the file caches are reclaimed from the download-proxy's memcg and
> > > > >   subsequently accessed by tasks in the application’s memcg, a filemap
> > > > >   fault will occur. A new file cache will be faulted in, charged to the
> > > > >   application’s memcg, and locked there.
> > > >
> > > > Both options are silently breaking userspace because a non failing mlock
> > > > doesn't give guarantees it is supposed to AFAICS.
> > >
> > > It does not bypass the mlock mechanism; rather, it defers the actual
> > > locking operation to the page fault path. Could you clarify what you
> > > mean by "a non-failing mlock"? From what I can see, mlock can indeed
> > > fail if there isn’t sufficient memory available. With this change, we
> > > are simply shifting the potential failure point to the page fault path
> > > instead.
> >
> > Your change will cause mlocked pages (as mlock syscall returns success)
> > to be reclaimable later on. That breaks the basic mlock contract.
>
> AFAICS, the mlock() behavior was originally designed with only a
> single root memory cgroup in mind. In other words, when mlock() was
> introduced, all locked pages were confined to the same memcg.
>
> However, this changed with the introduction of memcg support. Now,
> mlock() can lock pages that belong to a different memcg than the
> current task. This behavior is not explicitly defined in the mlock()
> documentation, which could lead to confusion.
>
> To clarify, I propose updating the mlock() documentation as follows:
>
> When memcg is enabled, the page being locked might reside in a
> different memcg than the current task. In such cases, the page might
> be reclaimed if mlock() is not permitted in its original memcg. If the
> locked page is reclaimed, it could be faulted back into the current
> task's memcg and then locked again.
>
> Additionally, encountering a single page fault during this process
> should be acceptable to most users. If your application cannot
> tolerate even a single page fault, you likely wouldn’t enable memcg in
> the first place.
>

If you insist on not allowing a single page fault, there is an
alternative approach, though it may require significantly more complex
handling.

- Option C: Reparent the mlocked page to a common ancestor

Consider the following hierarchy:

      A
    /   \
   B     C

If a task in B mlocks a page charged to C, we can reparent that mlocked
page to A, so that A becomes the owner of the mlocked page.

          A
        /   \
       B     C
      / \     \
     D   E     F

In this example, if a task in D mlocks a page charged to F, we will
reparent the mlocked page to A, the nearest common ancestor of D and F.
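
To make the ancestor selection concrete, here is a rough userspace
sketch of how the reparent target could be picked. It is only an
illustration: struct cg_node, cg_depth() and cg_common_ancestor() are
hypothetical names invented for this mail, not existing kernel
interfaces, and the sketch ignores charging, locking and reference
counting entirely.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for a memcg in the hierarchy. This is NOT the
 * kernel's struct mem_cgroup; it only models the parent link.
 */
struct cg_node {
	const char *name;
	struct cg_node *parent;	/* NULL for the root */
};

/* Number of parent links between @n and the root of its tree. */
static int cg_depth(const struct cg_node *n)
{
	int depth = 0;

	while (n->parent) {
		n = n->parent;
		depth++;
	}
	return depth;
}

/*
 * Return the nearest common ancestor of @a and @b, i.e. the candidate
 * reparent target when a task in @a mlocks a page charged to @b.
 * Returns NULL if the two nodes do not share a root.
 */
static struct cg_node *cg_common_ancestor(struct cg_node *a,
					  struct cg_node *b)
{
	int da = cg_depth(a);
	int db = cg_depth(b);

	/* Bring the deeper node up to the level of the shallower one. */
	while (da > db) {
		a = a->parent;
		da--;
	}
	while (db > da) {
		b = b->parent;
		db--;
	}

	/* Walk both up in lockstep until they meet (or both hit NULL). */
	while (a != b) {
		a = a->parent;
		b = b->parent;
	}
	return a;
}
```

With the two trees above, cg_common_ancestor(B, C) and
cg_common_ancestor(D, F) both yield A, while two siblings such as D and
E would yield B.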

- Benefits:
  No user-visible cgroup file setting: This approach avoids introducing
  or modifying cgroup settings that could be visible or configurable by
  users.

- Downsides:
  Increased complexity: This option requires significantly more work in
  terms of managing the reparenting process.

-- 
Regards
Yafang