On Tue, 4 Mar 2014 19:58:38 -0800 (PST) David Rientjes <rientjes@xxxxxxxxxx> wrote:

> This patchset implements userspace out of memory handling.
>
> It is based on v3.14-rc5.  Individual patches will apply cleanly or you
> may pull the entire series from
>
> 	git://git.kernel.org/pub/scm/linux/kernel/git/rientjes/linux.git mm/oom
>
> When the system or a memcg is oom, processes running on that system or
> attached to that memcg cannot allocate memory.  It is impossible for a
> process to reliably handle the oom condition from userspace.
>
> First, consider only system oom conditions.  When memory is completely
> depleted and nothing may be reclaimed, the kernel is forced to free some
> memory; the only way it can do so is to kill a userspace process.  This
> will happen instantaneously and userspace can enforce neither its own
> policy nor collect information.
>
> On system oom, there may be a hierarchy of memcgs that represent user
> jobs, for example.  Each job may have a priority independent of their
> current memory usage.  There is no existing kernel interface to kill the
> lowest priority job; userspace can now kill the lowest priority job or
> allow priorities to change based on whether the job is using more memory
> than its pre-defined reservation.
>
> Additionally, users may want to log the condition or debug applications
> that are using too much memory.  They may wish to collect heap profiles
> or are able to do memory freeing without killing a process by throttling
> or ratelimiting.
>
> Interactive users using X window environments may wish to have a dialogue
> box appear to determine how to proceed -- it may even allow them shell
> access to examine the state of the system while oom.
>
> It's not sufficient to simply restrict all user processes to a subset of
> memory and oom handling processes to the remainder via a memcg hierarchy:
> kernel memory and other page allocations can easily deplete all memory
> that is not charged to a user hierarchy of memory.
>
> This patchset allows userspace to do all of these things by defining a
> small memory reserve that is accessible only by processes that are
> handling the notification.
>
> Second, consider memcg oom conditions.  Processes need no special
> knowledge of whether they are attached to the root memcg, where memcg
> charging will always succeed, or a child memcg where charging will fail
> when the limit has been reached.  This allows those processes handling
> memcg oom conditions to overcharge the memcg by the amount of reserved
> memory.  They need not create child memcgs with smaller limits and
> attach the userspace oom handler only to the parent; such support would
> not allow userspace to handle system oom conditions anyway.
>
> This patchset introduces a standard interface through memcg that allows
> both of these conditions to be handled in the same clean way: users
> define memory.oom_reserve_in_bytes to define the reserve and this
> amount is allowed to be overcharged to the process handling the oom
> condition's memcg.  If used with the root memcg, this amount is allowed
> to be allocated below the per-zone watermarks for root processes that
> are handling such conditions (only root may write to
> cgroup.event_control for the root memcg).

If process A is trying to allocate memory, cannot do so and the
userspace oom-killer is invoked, there must be means via which process A
waits for the userspace oom-killer's action.  And there must be
fallbacks which occur if the userspace oom killer fails to clear the oom
condition, or times out.

Would be interested to see a description of how all this works.

It is unfortunate that this feature is memcg-only.  Surely it could also
be used by non-memcg setups.  Would like to see at least a detailed
description of how this will all be presented and implemented.  We
should aim to make the memcg and non-memcg userspace interfaces and
user-visible behaviour as similar as possible.

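For reference, a minimal sketch of the userspace side of the existing
memcg (cgroup v1) oom notification that the quoted interface builds on:
create an eventfd, open memory.oom_control, and write "<event_fd>
<control_fd>" to cgroup.event_control.  This is an illustration, not
code from the patchset.  The helper names (format_event_control,
register_oom_notifier, wait_for_oom) are hypothetical, and the sketch
does not touch the proposed memory.oom_reserve_in_bytes.  The timeout
here only bounds the handler's own wait; it does not describe how the
allocating process waits or what the kernel falls back to.

```c
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Build the "<event_fd> <control_fd>" line that cgroup v1 expects in
 * cgroup.event_control.  Returns 0 on success, -1 if buf is too small.
 */
static int format_event_control(char *buf, size_t len, int efd, int cfd)
{
	int n = snprintf(buf, len, "%d %d", efd, cfd);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Register for oom notification on the memcg directory memcg_dir
 * (a hypothetical path such as a cgroup v1 memory mount point).
 * Returns the eventfd on success, -1 on failure.
 */
static int register_oom_notifier(const char *memcg_dir)
{
	char path[4096], line[32];
	int efd, cfd, ecfd, ok = 0;

	assert(memcg_dir != NULL);

	efd = eventfd(0, 0);
	if (efd < 0)
		return -1;

	snprintf(path, sizeof(path), "%s/memory.oom_control", memcg_dir);
	cfd = open(path, O_RDONLY);
	snprintf(path, sizeof(path), "%s/cgroup.event_control", memcg_dir);
	ecfd = open(path, O_WRONLY);

	if (cfd >= 0 && ecfd >= 0 &&
	    format_event_control(line, sizeof(line), efd, cfd) == 0 &&
	    write(ecfd, line, strlen(line)) == (ssize_t)strlen(line))
		ok = 1;

	if (ecfd >= 0)
		close(ecfd);
	if (cfd >= 0)
		close(cfd);
	if (!ok) {
		close(efd);
		return -1;
	}
	return efd;
}

/*
 * Wait for an event on efd for at most timeout_ms milliseconds.
 * Returns 1 on event, 0 on timeout, -1 on error.
 */
static int wait_for_oom(int efd, int timeout_ms)
{
	struct pollfd pfd = { .fd = efd, .events = POLLIN };
	uint64_t count;
	int n = poll(&pfd, 1, timeout_ms);

	if (n <= 0)
		return n;	/* 0: timed out, -1: poll failed */
	return read(efd, &count, sizeof(count)) == (ssize_t)sizeof(count) ? 1 : -1;
}
```

A handler would call register_oom_notifier() once on its memcg
directory and then loop on wait_for_oom(), applying its own policy
(logging, killing the lowest priority job, and so on) each time it
returns 1.
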
Patches 1, 2, 3 and 5 appear to be independent and useful so I think
I'll cherrypick those, OK?