Hi David, Thanks again for your inspirational comments! On Wed, Nov 14, 2012 at 07:59:52PM -0800, David Rientjes wrote: > > > I agree that eventfd is the way to go, but I'll also add that this feature > > > seems to be implemented at a far too coarse of level. Memory, and hence > > > memory pressure, is constrained by several factors other than just the > > > amount of physical RAM which vmpressure_fd is addressing. What about > > > memory pressure caused by cpusets or mempolicies? (Memcg has its own > > > reclaim logic > > > > Yes, sure, and my plan for per-cgroups vmpressure was to just add the same > > hooks into cgroups reclaim logic (as far as I understand, we can use the > > same scanned/reclaimed ratio + reclaimer priority to determine the > > pressure). [Answers reordered] > Rather, I think it's much better to be notified when an individual process > invokes various levels of reclaim up to and including the oom killer so > that we know the context that memory freeing needs to happen (or, > optionally, the set of processes that could be sacrificed so that this > higher priority process may allocate memory). I think I understand what you're saying, and surely it makes sense, but I don't know how you see this implemented on the API level. Getting struct {pid, pressure} pairs that cause the pressure at the moment? And the monitor only gets <pids> that are in the same cpuset? How about memcg limits?.. [...] > > But we still want the "global vmpressure" thing, so that we could use it > > without cgroups too. How to do it -- syscall or sysfs+eventfd doesn't > > matter much (in the sense that I can do eventfd thing if you folks like it > > :). > > > > Most processes aren't going to care if they are running into memory > pressure and have no implementation to free memory back to the kernel or > start ratelimiting themselves. They will just continue happily along > until they get the memory they want or they get oom killed. The ones that > do, however, or a job scheduler or monitor that is watching over the > memory usage of a set of tasks, will be able to do something when > notified. Yup, this is exactly how we want to use this. In Android we have "Activity Manager" thing, which acts exactly how you describe: it's a tasks monitor. > In the hopes of a single API that can do all this and not a > reimplementation for various types of memory limitations (it seems like > what you're suggesting is at least three different APIs: system-wide via > vmpressure_fd, memcg via memcg thresholds, and cpusets through an eventual > cpuset threshold), I'm hoping that we can have a single interface that can > be polled on to determine when individual processes are encountering > memory pressure. And if I'm not running in your oom cpuset, I don't care > about your memory pressure. I'm not sure to what exactly you are opposing. :) You don't want to have three "kinds" pressures, or you don't what to have three different interfaces to each of them, or both? > I don't understand, how would this work with cpusets, for example, with > vmpressure_fd as defined? The cpuset policy is embedded in the page > allocator and skips over zones that are not allowed when trying to find a > page of the specified order. Imagine a cpuset bound to a single node that > is under severe memory pressure. The reclaim logic will get triggered and > cause a notification on your fd when the rest of the system's nodes may > have tons of memory available. Yes, I see your point: we have many ways to limit resources, so it makes it hard to identify the cause of the "pressure" and thus how to deal with it, since the pressure might be caused by different kinds of limits, and freeing memory from one bucket doesn't mean that the memory will be available to the process that is requesting the memory. So we do want to know whether a specific cpuset is under pressure, whether a specific memcg is under pressure, or whether the system (and kernel itself) lacks memory. And we want to have a single API for this? Heh. :) The other idea might be this (I'm describing it in detail so that you could actually comment on what exactly you don't like in this): 1. Obtain the fd via eventfd(); 2. The fd can be passed to these files: I) Say /sys/kernel/mm/memory_pressure If we don't use cpusets/memcg or even have CGROUPS=n, this will be system's/global memory pressure. Pass the fd to this file and start polling. If we do use cpusets or memcg, the API will still work, but we have two options for its behaviour: a) This will only report the pressure when we're reclaiming with say (global_reclaim() && node_isset(zone_to_nid(zone), current->mems_allowed)) == 1. (Basically, we want to see pressure of kernel slabs allocations or any non-soft limits). or b) If 'filtering' cpusets/memcg seems too hard, we can say that these notifications are the "sum" of global+memcg+cpuset. It doesn't make sense to actually monitor these, though, so if the monitor is aware of cgroups, just 'goto II) and/or III)'. II) /sys/fs/cgroup/cpuset/.../cpuset.memory_pressure (yeah, we have it already) Pass the fd to this file to monitor per-cpuset pressure. So, if you get the pressure from here, it makes sense to free resources from this cpuset. III) /sys/fs/cgroup/memory/.../memory.pressure Pass the fd to this file to monitor per-memcg pressure. If you get the pressure from here, it only makes sense to free resources from this memcg. 3. The pressure level values (and their meaning) and the format of the files are the same, and this what defines the "API". So, if "memory monitor/supervisor app" is aware of cpusets, it manages memory at this level. If both cpuset and memcg is used, then it has to monitor both files, and act accordingly. And if we don't use cpusets/memcg (or even have cgroups=n), we can just watch the global reclaimer's pressure. Do I understand correctly that you don't like this? Just to make sure. :) Thanks, Anton. -- To unsubscribe from this list: send the line "unsubscribe linux-man" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html