On Wed, 5 Mar 2014, Andrew Morton wrote:
> > This patchset introduces a standard interface through memcg that allows
> > both of these conditions to be handled in the same clean way: users
> > define memory.oom_reserve_in_bytes to define the reserve and this
> > amount is allowed to be overcharged to the process handling the oom
> > condition's memcg. If used with the root memcg, this amount is allowed
> > to be allocated below the per-zone watermarks for root processes that
> > are handling such conditions (only root may write to
> > cgroup.event_control for the root memcg).
>
> If process A is trying to allocate memory, cannot do so and the
> userspace oom-killer is invoked, there must be means via which process
> A waits for the userspace oom-killer's action.
It does so by relooping in the page allocator, waiting for memory to be
freed, just as it would if the kernel oom killer had been called and
process A were waiting for the oom kill victim, process B, to exit. We
don't have the ability to put it on a waitqueue because we don't touch
the freeing hotpath. The userspace oom handler may not even kill
anything; it may be able to free its own memory and start throttling
other processes, for example.
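As a purely illustrative sketch (the reserve size and function names are
mine, intended as a drop-in for handle_oom() in the example attached
below), such a handler could simply drop a reserve it allocated up front:

#include <stdlib.h>
#include <string.h>

#define RESERVE_SIZE (32 << 20) /* 32MB, arbitrary */

static void *reserve;

/* called once at startup, before registering for notifications */
static void setup_reserve(void)
{
        reserve = malloc(RESERVE_SIZE);
        if (reserve)
                memset(reserve, 0, RESERVE_SIZE);       /* fault the pages in */
}

/* on notification, give the reserve back instead of killing anything */
static void handle_oom(void)
{
        /*
         * An allocation this large is normally mmap()ed by glibc, so
         * free() returns it to the kernel, which may be all the blocked
         * allocator needs to make forward progress.
         */
        free(reserve);
        reserve = NULL;
}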
> And there must be
> fallbacks which occur if the userspace oom killer fails to clear the
> oom condition, or times out.
>
I agree completely and proposed this before as memory.oom_delay_millisecs
at http://lwn.net/Articles/432226; we use it internally when memory can't
be freed or a memcg's limit cannot be expanded. I guess it makes more
sense alongside the rest of this patchset now, so I can add it as an
additional patch next time around.
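If that knob does go in, I'd expect the handler's supervisor to configure
it next to the reserve, roughly like below. To be clear,
memory.oom_delay_millisecs only exists in that old proposal, so the file
name and semantics are tentative:

#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

/* tentative interface; see the proposal linked above */
static int set_oom_delay(const char *memcg, unsigned int millisecs)
{
        char path[PATH_MAX];
        char buf[32];
        int fd, len, err = 0;

        snprintf(path, PATH_MAX, "%s/memory.oom_delay_millisecs", memcg);
        len = snprintf(buf, sizeof(buf), "%u", millisecs);
        fd = open(path, O_WRONLY);
        if (fd == -1)
                return -errno;
        if (write(fd, buf, len) != len)
                err = -errno;
        close(fd);
        return err;
}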
> Would be interested to see a description of how all this works.
>
There's an article for LWN also being developed on this topic. As
mentioned in that article, I think it would be best to generalize a lot of
the common functions and the eventfd handling entirely into a library.
I've attached an example implementation that just invokes a function to
handle the situation.
For Google's use case specifically, at the root memcg level (system oom)
we want to do priority-based memcg killing: we want to kill from within
the memcg hierarchy that has the lowest priority relative to other memcgs.
This cannot be implemented with /proc/pid/oom_score_adj today. Those
priorities may also change depending on whether a memcg hierarchy is
"overlimit", i.e. its limit has been increased temporarily because it has
hit a memcg oom and additional memory is readily available on the system.
So why not just introduce a memcg tunable that specifies a priority?
Well, it's not that simple. Other users will want to implement different
policies on system oom (think of things like the existing panic_on_oom or
oom_kill_allocating_task sysctls). I introduced oom_kill_allocating_task
originally for SGI because they wanted a fast oom kill rather than an
expensive tasklist scan: the allocating task itself is rather irrelevant;
it was just the unlucky task that happened to be allocating when oom
was triggered. What's guaranteed is that current in that case will always
free memory from under oom (it's not a member of some other mempolicy or
cpuset that would be needlessly killed). Both sysctls could trivially be
reimplemented in userspace with this feature.
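As a rough illustration of the kind of policy that moves to userspace, a
panic_on_oom equivalent at the root level could be nothing more than a
handler that forces a crash through sysrq. This is only one possible
policy, not part of the patchset, and assumes root plus
CONFIG_MAGIC_SYSRQ:

#include <fcntl.h>
#include <unistd.h>

/* sketch of a panic_on_oom-style policy implemented in userspace */
static void handle_oom_panic(void)
{
        int fd = open("/proc/sysrq-trigger", O_WRONLY);

        if (fd != -1) {
                write(fd, "c", 1);      /* 'c' forces an immediate crash */
                close(fd);
        }
}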
I have other customers who don't run in a memcg environment at all: they
simply reattach all processes to root and delete all other memcgs. These
customers are only concerned about system oom conditions and want to do
something "interesting" before a process is killed. Some want to log the
VM statistics as an artifact to examine later, some want to examine heap
profiles, others can start throttling and freeing memory rather than kill
anything. All of this is impossible today because the kernel oom killer
will simply kill something immediately and any stats we collect afterwards
don't represent the oom condition. The heap profiles are lost, throttling
is useless, etc.
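For the logging case, the handler can snapshot the VM statistics itself
the moment the notification arrives, along the lines of the sketch below;
the log path is arbitrary, and this only works because the handler is
allowed to dip into its reserve while everything else is blocked:

#include <stdio.h>

/* append the contents of src (e.g. /proc/vmstat) to the log */
static void dump_file(const char *src, FILE *log)
{
        char line[256];
        FILE *fp = fopen(src, "r");

        if (!fp)
                return;
        fprintf(log, "==> %s <==\n", src);
        while (fgets(line, sizeof(line), fp))
                fputs(line, log);
        fclose(fp);
}

static void handle_oom_log(void)
{
        FILE *log = fopen("/var/log/oom-snapshot.log", "a");    /* arbitrary */

        if (!log)
                return;
        dump_file("/proc/vmstat", log);
        dump_file("/proc/meminfo", log);
        fclose(log);
}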
Jianguo (cc'd) may also have usecases not described here.
> It is unfortunate that this feature is memcg-only. Surely it could
> also be used by non-memcg setups. Would like to see at least a
> detailed description of how this will all be presented and implemented.
> We should aim to make the memcg and non-memcg userspace interfaces and
> user-visible behaviour as similar as possible.
>
It's memcg-only because it can handle both system and memcg oom conditions
with the same clean interface. It would be possible to implement only
system oom condition handling through procfs (a little sloppy, since it
needs to register the eventfd), but then a userspace oom handler would
need to determine which interface to use based on whether it was running
in a memcg or non-memcg environment. I implemented this feature with
userspace in mind: I didn't want it to need two different implementations
to do the same thing depending on memcg. The way it is written, a
userspace oom handler does not know (nor need care) whether it is
constrained by the amount of system RAM or by a memcg limit. It can
simply write the reserve to its memcg's memory.oom_reserve_in_bytes,
attach to memory.oom_control, and be done.
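Concretely, the setup boils down to something like the sketch below.
set_oom_reserve() and the 32MB value are mine for illustration;
register_oom_notifier() and wait_oom_notifier() come from the example
attached at the end of this mail:

#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

/* write the reserve to the handler's own memcg */
static int set_oom_reserve(const char *memcg, unsigned long long bytes)
{
        char path[PATH_MAX];
        char buf[32];
        int fd, len, err = 0;

        snprintf(path, PATH_MAX, "%s/memory.oom_reserve_in_bytes", memcg);
        len = snprintf(buf, sizeof(buf), "%llu", bytes);
        fd = open(path, O_WRONLY);
        if (fd == -1)
                return -errno;
        if (write(fd, buf, len) != len)
                err = -errno;
        close(fd);
        return err;
}

/* reserve, register, then block waiting for notifications */
static int run_handler(const char *memcg)
{
        int eventfd_fd;
        int err;

        err = set_oom_reserve(memcg, 32 << 20); /* 32MB, arbitrary */
        if (err)
                return err;
        eventfd_fd = register_oom_notifier(memcg);
        if (eventfd_fd < 0)
                return eventfd_fd;
        return wait_oom_notifier(eventfd_fd, handle_oom);
}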
This does mean that memcg needs to be enabled for the support, though.
This is already enabled on most distributions; the cgroup just needs to be
mounted. Would it be better to duplicate the interface in two different
spots depending on CONFIG_MEMCG? I didn't think so, and I think the idea
of a userspace library that takes care of this registration (and mounting,
perhaps) proposed on LWN would be the best of both worlds.
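For completeness, making sure the controller is mounted from the
handler's (or library's) own startup could look roughly like this; the
mount point is only an example:

#include <errno.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/types.h>

/* sketch: mount the memory controller if it isn't mounted yet */
static int mount_memcg(const char *mountpoint)
{
        mkdir(mountpoint, 0755);        /* e.g. "/sys/fs/cgroup/memory" */
        if (mount("memcg", mountpoint, "cgroup", 0, "memory") &&
            errno != EBUSY)
                return -errno;
        return 0;
}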
> Patches 1, 2, 3 and 5 appear to be independent and useful so I think
> I'll cherrypick those, OK?
>
Ok! I'm hoping that the PF_MEMPOLICY bit removed in those patches is at
least temporarily reserved for the PF_OOM_HANDLER introduced here; I
removed it purposefully :)
/*
 * Example userspace oom handler: registers an eventfd with a memcg's
 * memory.oom_control via cgroup.event_control and invokes a handler
 * function on each oom notification.
 */
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define STRING_MAX (512)
void handle_oom(void)
{
        printf("notification received\n");
}

int wait_oom_notifier(int eventfd_fd, void (*handler)(void))
{
        uint64_t ret;
        int err;

        for (;;) {
                /* each successful read is one or more oom notifications */
                err = read(eventfd_fd, &ret, sizeof(ret));
                if (err != sizeof(ret)) {
                        fprintf(stderr, "read()\n");
                        return err;
                }
                handler();
        }
}
int register_oom_notifier(const char *memcg)
{
        char path[PATH_MAX];
        char control_string[STRING_MAX];
        int event_control_fd;
        int control_fd;
        int eventfd_fd;
        int err = 0;

        err = snprintf(path, PATH_MAX, "%s/memory.oom_control", memcg);
        if (err < 0) {
                fprintf(stderr, "snprintf()\n");
                goto out;
        }
        control_fd = open(path, O_RDONLY);
        if (control_fd == -1) {
                fprintf(stderr, "open(): %d\n", errno);
                err = -errno;
                goto out;
        }
        eventfd_fd = eventfd(0, 0);
        if (eventfd_fd == -1) {
                fprintf(stderr, "eventfd(): %d\n", errno);
                err = -errno;
                goto out_close_control;
        }
        /* "<eventfd fd> <memory.oom_control fd>" registers for oom events */
        err = snprintf(control_string, STRING_MAX, "%d %d", eventfd_fd,
                       control_fd);
        if (err < 0) {
                fprintf(stderr, "snprintf()\n");
                goto out_close_eventfd;
        }
        err = snprintf(path, PATH_MAX, "%s/cgroup.event_control", memcg);
        if (err < 0) {
                fprintf(stderr, "snprintf()\n");
                goto out_close_eventfd;
        }
        event_control_fd = open(path, O_WRONLY);
        if (event_control_fd == -1) {
                fprintf(stderr, "open(): %d\n", errno);
                err = -errno;
                goto out_close_eventfd;
        }
        if (write(event_control_fd, control_string,
                  strlen(control_string)) < 0) {
                fprintf(stderr, "write(): %d\n", errno);
                err = -errno;
                close(event_control_fd);
                goto out_close_eventfd;
        }
        close(event_control_fd);
        return eventfd_fd;
out_close_eventfd:
        close(eventfd_fd);
out_close_control:
        close(control_fd);
out:
        return err;
}
int main(int argc, char **argv)
{
        int eventfd_fd;
        int err = 0;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <path>\n", argv[0]);
                return -1;
        }
        /* lock everything so the handler itself doesn't fault under oom */
        err = mlockall(MCL_CURRENT | MCL_FUTURE);
        if (err) {
                fprintf(stderr, "%d\n", errno);
                return -1;
        }
        eventfd_fd = register_oom_notifier(argv[1]);
        if (eventfd_fd < 0) {
                err = eventfd_fd;
                fprintf(stderr, "register_oom_notifier(): %d\n", err);
                goto out;
        }
        err = wait_oom_notifier(eventfd_fd, handle_oom);
        if (err) {
                fprintf(stderr, "wait_oom_notifier()\n");
                goto out;
        }
out:
        munlockall();
        return err;
}