On Wed, 14 Nov 2012, Anton Vorontsov wrote: > Thanks again for your inspirational comments! > Heh, not sure I've been too inspirational (probably more annoying than anything else). I really do want generic memory pressure notifications in the kernel and already have some ideas on how I can tie it into our malloc arenas, so please do keep working on it. > I think I understand what you're saying, and surely it makes sense, but I > don't know how you see this implemented on the API level. > > Getting struct {pid, pressure} pairs that cause the pressure at the > moment? And the monitor only gets <pids> that are in the same cpuset? How > about memcg limits?.. > Depends on whether you want to support mempolicies or not and the argument could go either way: - FOR supporting mempolicies: memory that you're mbind() too can become depleted and since there is no fallback then you have no way to prevent lots of reclaim and/or invoking the oom killer, it would be disappointing to not be able to get notifications of such a condition. - AGAINST supporting mempolicies: you only need to support memory isolation for cgroups (memcg and cpusets) and thus can implement your own memory pressure cgroup that you can use to aggregate tasks and then replace memcg memory thresholds with co-mounting this new cgroup that would notify on an eventfd anytime one of the attached processes experiences memory pressure. > > Most processes aren't going to care if they are running into memory > > pressure and have no implementation to free memory back to the kernel or > > start ratelimiting themselves. They will just continue happily along > > until they get the memory they want or they get oom killed. The ones that > > do, however, or a job scheduler or monitor that is watching over the > > memory usage of a set of tasks, will be able to do something when > > notified. > > Yup, this is exactly how we want to use this. In Android we have "Activity > Manager" thing, which acts exactly how you describe: it's a tasks monitor. > In addition to that, I think I can hook into our implementation of malloc which frees memory back to the kernel with MADV_DONTNEED and zaps individual ptes to poke holes in the memory it allocates to actually cache the memory that we free() and then re-use it under normal circumstances to return cache-hot memory on the next allocation but under memory pressure, as triggered by your interface (but for threads attached to a memcg facing memcg limits), drain the memory back to the kernel immediately. > > In the hopes of a single API that can do all this and not a > > reimplementation for various types of memory limitations (it seems like > > what you're suggesting is at least three different APIs: system-wide via > > vmpressure_fd, memcg via memcg thresholds, and cpusets through an eventual > > cpuset threshold), I'm hoping that we can have a single interface that can > > be polled on to determine when individual processes are encountering > > memory pressure. And if I'm not running in your oom cpuset, I don't care > > about your memory pressure. > > I'm not sure to what exactly you are opposing. :) You don't want to have > three "kinds" pressures, or you don't what to have three different > interfaces to each of them, or both? > The three pressures are a separate topic (I think it would be better to have some measure of memory pressure similar to your reclaim scale and allow users to get notifications at levels they define). I really dislike having multiple interfaces that are all different from one another depending on the context. Given what we have right now with memory thresholds in memcg, if we were to merge vmpressure_fd, then we're significantly limiting the usecase since applications need not know if they are attached to a memcg or not: it's a type of virtualization that the admin may setup but another admin may be running unconstrained on a system with much more memory. So for your usecase of a job monitor, that would work fine for global oom conditions but the application no longer has an API to use if it wants to know when it itself is feeling memory pressure. I think others have voiced their opinion on trying to create a single API for memory pressure notifications as well, it's just a hard problem and takes a lot of work to determine how we can make it easy to use and understand and extendable at the same time. > > I don't understand, how would this work with cpusets, for example, with > > vmpressure_fd as defined? The cpuset policy is embedded in the page > > allocator and skips over zones that are not allowed when trying to find a > > page of the specified order. Imagine a cpuset bound to a single node that > > is under severe memory pressure. The reclaim logic will get triggered and > > cause a notification on your fd when the rest of the system's nodes may > > have tons of memory available. > > Yes, I see your point: we have many ways to limit resources, so it makes > it hard to identify the cause of the "pressure" and thus how to deal with > it, since the pressure might be caused by different kinds of limits, and > freeing memory from one bucket doesn't mean that the memory will be > available to the process that is requesting the memory. > > So we do want to know whether a specific cpuset is under pressure, whether > a specific memcg is under pressure, or whether the system (and kernel > itself) lacks memory. > > And we want to have a single API for this? Heh. :) > Might not be too difficult if you implement your own cgroup to aggregate these tasks for which you want to know memory pressure events; it would have to be triggered for the task trying to allocate memory at any given time and how hard it was to allocate that memory in the slowpath, tie it back to that tasks' memory pressure cgroup, and then report the trigger if it's over a user-defined threshold normalized to the 0-100 scale. Then you could co-mount this cgroup with memcg, cpusets, or just do it for the root cgroup for users who want to monitor the entire system (CONFIG_CGROUPS is enabled by default). -- To unsubscribe from this list: send the line "unsubscribe linux-man" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html