Re: [RFC] mm: memcg: add priority for soft limit reclaiming

Hillf Danton <hdanton@xxxxxxxx> · Tue, 24 Sep 2019 15:36:42 +0800

On Mon, 23 Sep 2019 21:28:34 Michal Hocko wrote:
> 
> On Mon 23-09-19 21:04:59, Hillf Danton wrote:
> >
> > On Thu, 19 Sep 2019 21:32:31 +0800 Michal Hocko wrote:
> > >
> > > On Thu 19-09-19 21:13:32, Hillf Danton wrote:
> > > >
> > > > Currently the memory controller is playing an increasingly important
> > > > role in how memory is used and how pages are reclaimed under memory
> > > > pressure.
> > > >
> > > > In daily work a memcg is often created for critical tasks, and their
> > > > preconfigured memory usage is supposed to be met even under memory
> > > > pressure. The administrator wants to make it configurable that the
> > > > pages consumed by memcg-B can be reclaimed by page allocations invoked
> > > > not by memcg-A but by memcg-C.
> > >
> > > I am not really sure I understand the usecase well but this sounds like
> > > what memory reclaim protection in v2 is aiming at.
> > >
> Please describe the usecase.
> 
For quite a while task-A has been able to preempt task-B for cpu
cycles. IOW the physical resource of cpu cycles is preemptible.

Are physical pages preemptible too in the same manner?
No, not currently, because no priority is defined for pages (say, a link
between page->nice and task->nice).

The slrp is added for memcg instead of nice because 1) it is only used
in the page reclaiming context (in memcg it is soft limit reclaiming),
and 2) it is difficult to compare reclaimer and reclaimee task->nice
directly in that context as only info about reclaimer and lru page is
available.

Here task->nice is replaced with memcg->slrp in order to do page
preemption, PP. There is no way for task-A to PP task-B, but the
group containing task-A can PP the group containing task-B.
As you can see, that preemption takes fewer than 100 lines of code on
top of the current memory controller framework.
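
The group-level rule can be sketched as a single comparison. This is a
minimal userspace illustration under stated assumptions, not the code
from the patch: struct memcg_sketch and slrp_can_preempt() are
hypothetical names, while the [0-32] range and the default of 0 come from
the documentation hunk at the end of this mail.

```c
#include <assert.h>
#include <stdbool.h>

#define SLRP_MAX 32	/* upper bound of the proposed memory.slrp range */

/* Hypothetical stand-in for a memory cgroup; only the priority matters. */
struct memcg_sketch {
	const char *name;
	unsigned int slrp;	/* soft limit reclaiming priority, default 0 */
};

/*
 * May pages charged to @victim be reclaimed on behalf of @reclaimer?
 * Pages are never taken from a higher-slrp group in favor of a
 * lower-slrp one; groups of equal priority behave as they do today.
 */
static bool slrp_can_preempt(const struct memcg_sketch *reclaimer,
			     const struct memcg_sketch *victim)
{
	assert(reclaimer->slrp <= SLRP_MAX && victim->slrp <= SLRP_MAX);
	return reclaimer->slrp >= victim->slrp;
}
```

With slrp set to 0, 8 and 16 for memcg-A, memcg-B and memcg-C, the check
lets memcg-C preempt pages of memcg-B while memcg-A cannot.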

The user-visible effects of PP include
1) the increase in system-wide configurability,

Combined with and/or in parallel to memcg.high, PP helps the admin
configure and maintain 100 mm groups on a system with 100GB RAM. With
every group's high boundary set to 10MB, he then only needs to fiddle
with the slrps of a handful of groups containing critical tasks.

2) the increase in system-wide responsiveness,

Because critical groups can be configured so that they are not page
preempted.

3) a gradient field grows in a long-running system with priority,

Just like the rivers going through all the ways from mountains to
the seas.

Adding PP in background reclaiming is on the way:
1> define page->nice and link it to task->nice
2> on isolating lru pages check reclaimer->nice against page->nice
   and skip page if reclaimer is lower on priority
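
The two steps above can be sketched roughly as follows. This is only an
illustration in userspace C: page->nice does not exist today, and struct
page_sketch and isolate_pages_sketch() are hypothetical names. It assumes
the usual nice convention, where a numerically lower value means higher
priority.

```c
#include <stddef.h>

/* Hypothetical page descriptor carrying the nice value of its owner. */
struct page_sketch {
	int nice;	/* linked to task->nice; lower value = higher priority */
};

/*
 * Walk @nr candidate pages and store in @taken the indices of those the
 * reclaimer may isolate. A page is skipped when the reclaimer is lower on
 * priority, i.e. its nice value is numerically greater than the page's.
 * Returns the number of indices stored in @taken.
 */
static size_t isolate_pages_sketch(const struct page_sketch *lru, size_t nr,
				   int reclaimer_nice, size_t *taken)
{
	size_t i, n = 0;

	for (i = 0; i < nr; i++) {
		if (reclaimer_nice > lru[i].nice)
			continue;	/* reclaimer is lower on priority */
		taken[n++] = i;
	}
	return n;
}
```

A reclaimer at nice 0 scanning pages owned at nice -5, 0 and 10 would
skip the first page and isolate the other two.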

> > A pointer to the v2 stuff please.
> 
> Documentation/admin-guide/cgroup-v2.rst
> 
Thanks Michal.

To my surprise, slrp happens to fit in line with cgroup-v2.

--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1108,6 +1108,17 @@ PAGE_SIZE multiple when read back.
        Going over the high limit never invokes the OOM killer and
        under extreme conditions the limit may be breached.

+  memory.slrp
+       A read-write single value [0-32] file which exists on non-root
+       cgroups.  The default is "0".
+
+       Soft limit reclaiming priority.  This is the mechanism to control
+       how physical pages are reclaimed when a group's memory usage goes
+       over its high boundary.
+
+       It makes sure that no pages will be reclaimed from any group of
+       higher slrp in favor of a lower-slrp group.
+
   memory.max
        A read-write single value file which exists on non-root
        cgroups.  The default is "max".
--

Hillf