> > Memory controller implements the memory.low best-effort memory > protection mechanism, which works perfectly in many cases and > allows protecting working sets of important workloads from > sudden reclaim. > > But its semantics has a significant limitation: it works > only as long as there is a supply of reclaimable memory. > This makes it pretty useless against any sort of slow memory > leaks or memory usage increases. This is especially true > for swapless systems. If swap is enabled, memory soft protection > effectively postpones problems, allowing a leaking application > to fill all swap area, which makes no sense. > The only effective way to guarantee the memory protection > in this case is to invoke the OOM killer. > > It's possible to handle this case in userspace by reacting > on MEMCG_LOW events; but there is still a place for a fail-safe > in-kernel mechanism to provide stronger guarantees. > > This patch introduces the memory.min interface for cgroup v2 > memory controller. It works very similarly to memory.low > (sharing the same hierarchical behavior), except that it's > not disabled if there is no more reclaimable memory in the system. > > If cgroup is not populated, its memory.min is ignored, > because otherwise even the OOM killer wouldn't be able > to reclaim the protected memory, and the system can stall. > > ... > > --- a/Documentation/cgroup-v2.txt > +++ b/Documentation/cgroup-v2.txt > @@ -1002,6 +1002,29 @@ PAGE_SIZE multiple when read back. > The total amount of memory currently being used by the cgroup > and its descendants. > > + memory.min > + A read-write single value file which exists on non-root > + cgroups. The default is "0". > + > + Hard memory protection. If the memory usage of a cgroup > + is within its effective min boundary, the cgroup's memory > + won't be reclaimed under any conditions. If there is no > + unprotected reclaimable memory available, OOM killer > + is invoked. > + > + Effective low boundary is limited by memory.min values of > + all ancestor cgroups. If there is memory.min overcommitment > + (child cgroup or cgroups are requiring more protected memory > + than parent will allow), then each child cgroup will get > + the part of parent's protection proportional to its > + actual memory usage below memory.min. > + > + Putting more memory than generally available under this > + protection is discouraged and may lead to constant OOMs. > + > + If a memory cgroup is not populated with processes, > + its memory.min is ignored. This is a copy-paste-edit of the memory.low description. Could we please carefully check that it all remains accurate? Should "Effective low boundary" be "Effective min boundary"? Does overcommit still apply to .min? etcetera.