Re: [RFC v3 0/3] vmpressure_fd: Linux VM pressure notifications

David Rientjes <rientjes@xxxxxxxxxx> · Wed, 14 Nov 2012 19:59:52 -0800 (PST)

On Wed, 14 Nov 2012, Anton Vorontsov wrote:

> > I agree that eventfd is the way to go, but I'll also add that this feature 
> > seems to be implemented at far too coarse a level.  Memory, and hence 
> > memory pressure, is constrained by several factors other than just the 
> > amount of physical RAM which vmpressure_fd is addressing.  What about 
> > memory pressure caused by cpusets or mempolicies?  (Memcg has its own 
> > reclaim logic
> 
> Yes, sure, and my plan for per-cgroups vmpressure was to just add the same
> hooks into cgroups reclaim logic (as far as I understand, we can use the
> same scanned/reclaimed ratio + reclaimer priority to determine the
> pressure).
> 

I don't understand: how would this work with cpusets, for example, with 
vmpressure_fd as defined?  The cpuset policy is embedded in the page 
allocator and skips over zones that are not allowed when trying to find a 
page of the specified order.  Imagine a cpuset bound to a single node that 
is under severe memory pressure.  The reclaim logic will get triggered and 
cause a notification on your fd when the rest of the system's nodes may 
have tons of memory available.  So now an application that is actually 
using this interface and is trying to be a good kernel citizen decides to 
free caches back to the kernel, start ratelimiting, etc., even though it 
has no memory allocated on the nearly-oom cpuset's node, so its memory 
freeing achieves nothing.
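
A toy model makes the mismatch concrete.  This is only a sketch in Python, 
not a kernel interface: the node numbers, page counts, watermark and every 
name in it are made up for illustration.  It assumes a listener on a 
system-wide notification drops all of its caches whenever any node is 
under pressure, and counts how many of the freed pages sit on a pressured 
node.

```python
# Toy model only: nodes, page counts and all names here are hypothetical.
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    allowed_nodes: set                               # nodes its cpuset permits
    cache_pages: dict = field(default_factory=dict)  # node -> freeable pages


def nodes_under_pressure(free_pages, low_watermark):
    """Return the nodes whose free page count is below the watermark."""
    return {node for node, free in free_pages.items() if free < low_watermark}


def global_listener_frees(task, free_pages, low_watermark):
    """Model a listener on a system-wide notification.

    It drops every cache it owns as soon as any node is under pressure.
    Returns (pages freed in total, pages freed on a pressured node).
    """
    pressured = nodes_under_pressure(free_pages, low_watermark)
    if not pressured:
        return 0, 0
    freed_total = sum(task.cache_pages.values())
    freed_useful = sum(pages for node, pages in task.cache_pages.items()
                       if node in pressured)
    return freed_total, freed_useful


# Node 0 is nearly out of memory, node 1 has plenty.
free_pages = {0: 10, 1: 50000}
# The "good citizen" runs in a cpuset on node 1 and caches only there.
bystander = Task("bystander", allowed_nodes={1}, cache_pages={1: 2000})
result = global_listener_frees(bystander, free_pages, low_watermark=100)
```

Here the bystander gives up 2000 pages of cache, none of which are on the 
node that is actually short of memory.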

Rather, I think it's much better to be notified when an individual process 
invokes various levels of reclaim, up to and including the oom killer, so 
that we know the context in which memory freeing needs to happen (or, 
optionally, the set of processes that could be sacrificed so that this 
higher priority process may allocate memory).

> > and its own memory thresholds implemented on top of eventfd 
> > that people already use.)  These both cause high levels of reclaim within 
> > the page allocator whereas there may be an abundance of free memory 
> > available on the system.
> 
> Yes, surely global-level vmpressure should be separate from the per-cgroup
> memory pressure.
> 

I disagree.  I think that if you have a per-thread memory pressure 
notification that fires if and when a thread starts down the page 
allocator slowpath, continues through the various states of reclaim 
(perhaps on a scale of 0-100 as described), and includes the oom killer, 
then you can target memory freeing that is actually useful.
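
One way to picture such a per-thread scale is sketched below in Python.  
The mapping is purely an assumption for illustration: entering the 
slowpath reports 0, direct reclaim is spread linearly over 1-99 by reclaim 
priority, and the oom killer reports 100.  DEF_PRIORITY (12) is the 
kernel's starting reclaim priority; the event record and the function 
names are hypothetical.

```python
# Hypothetical mapping from allocator slowpath stages to a 0-100 level.
from dataclasses import dataclass

DEF_PRIORITY = 12   # kernel's initial reclaim priority; lower scans harder
OOM_LEVEL = 100     # assumed: the oom killer is the top of the scale


@dataclass(frozen=True)
class PressureEvent:
    tid: int     # thread that entered the slowpath
    level: int   # 0-100
    stage: str   # "slowpath", "reclaim" or "oom"


def reclaim_level(priority):
    """Map reclaim priority DEF_PRIORITY..0 linearly onto levels 1..99."""
    if not 0 <= priority <= DEF_PRIORITY:
        raise ValueError("priority out of range")
    return 1 + (DEF_PRIORITY - priority) * 98 // DEF_PRIORITY


def make_event(tid, stage, priority=DEF_PRIORITY):
    """Build the event a thread would report at the given stage."""
    if stage == "slowpath":
        level = 0
    elif stage == "reclaim":
        level = reclaim_level(priority)
    elif stage == "oom":
        level = OOM_LEVEL
    else:
        raise ValueError("unknown stage")
    return PressureEvent(tid=tid, level=level, stage=stage)
```

Because each event carries the thread that hit the pressure, a listener 
can decide whether freeing its own memory would help that thread at all.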

> But we still want the "global vmpressure" thing, so that we could use it
> without cgroups too. How to do it -- syscall or sysfs+eventfd doesn't
> matter much (in the sense that I can do eventfd thing if you folks like it
> :).
> 

Most processes aren't going to care if they are running into memory 
pressure and have no implementation to free memory back to the kernel or 
start ratelimiting themselves.  They will just continue happily along 
until they get the memory they want or they get oom killed.  The ones that 
do, however, or a job scheduler or monitor that is watching over the 
memory usage of a set of tasks, will be able to do something when 
notified.

Rather than a reimplementation for each type of memory limitation (it 
seems like what you're suggesting is at least three different APIs: 
system-wide via vmpressure_fd, memcg via memcg thresholds, and cpusets 
through an eventual cpuset threshold), I'm hoping that we can have a 
single interface that can be polled on to determine when individual 
processes are encountering memory pressure.  And if I'm not running in 
your oom cpuset, I don't care about your memory pressure.
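
As a rough userspace sketch of what consuming one pollable interface could 
look like, the Python below uses an ordinary pipe as a stand-in for the 
notification fd.  The record format ("tid cpuset level" per line) and the 
function names are assumptions, not anything proposed in this thread.  The 
listener waits for the fd to become readable and keeps only the events 
that come from its own cpuset.

```python
# Stand-in sketch: a pipe plays the role of the single notification fd.
# The "tid cpuset level" record format is hypothetical.
import os
import select


def encode(tid, cpuset, level):
    """Serialize one pressure event as a text line."""
    return f"{tid} {cpuset} {level}\n".encode()


def relevant_events(fd, my_cpuset, timeout=0.0):
    """Poll fd once and return (tid, level) for events in my_cpuset."""
    readable, _, _ = select.select([fd], [], [], timeout)
    if not readable:
        return []
    events = []
    for line in os.read(fd, 4096).decode().splitlines():
        tid, cpuset, level = line.split()
        if cpuset == my_cpuset:      # someone else's cpuset: ignore it
            events.append((int(tid), int(level)))
    return events


# Two threads report pressure: one oom in "batch", one mid-level in "web".
read_fd, write_fd = os.pipe()
os.write(write_fd, encode(101, "batch", 100))
os.write(write_fd, encode(202, "web", 40))
mine = relevant_events(read_fd, "web")
os.close(read_fd)
os.close(write_fd)
```

A listener in the "web" cpuset sees only the event from thread 202 and 
never reacts to the oom condition in "batch".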