On Tue, 3 Aug 2010, KAMEZAWA Hiroyuki wrote:

> > > Then, an applications' oom_score on a host is quite different from on the other
> > > host. This operation is very new rather than a simple interface updates.
> > > This opinion was rejected.
> > >
> >
> > It wasn't rejected, I responded to your comment and you never wrote back.
> > The idea
> >
>
> I just got tired to write the same thing in many times. And I don't have
> strong opinions. I _know_ your patch fixes X-server problem. That was enough
> for me.
>

There are a couple of reasons why I disagree that oom_score_adj should have
memory quantity units.

First, the individual scores that come out of oom_badness() don't mean
anything in isolation; they only mean something when compared to those of
other candidate tasks.  Every application, whether attached to a cpuset, a
mempolicy, a memcg, or none of them, has an allowed set of memory and
competes with other applications for those shared resources.  When deciding
which application is hogging the most memory, which is the one we want to
kill, the candidates are ranked amongst themselves.  Using oom_score_adj as
a proportion, we can say a particular application should be allowed 25% of
resources, other applications should be allowed 5%, and others should be
penalized 10%, for example.  This makes prioritization for oom kill rather
simple.

Second, we don't want to adjust oom_score_adj every time a task is attached
to a cpuset, a mempolicy, or a memcg, or whenever the cpuset's mems change,
the bound mempolicy nodemask changes, or the memcg limit changes.  The
application need not know what its set of allowed memory is, and the kernel
should operate seamlessly regardless of what the attachment is.  These are,
in a sense, "virtualized" systems unto themselves: if a task is moved from a
child cpuset to the root cpuset, its set of allowed memory may become much
larger.  That action shouldn't require an equivalent change to
/proc/pid/oom_score_adj: the priority of the task relative to the other
competing tasks is the same.  Its set of allowed memory may change, but its
priority does not unless explicitly changed by the admin.

> > That would work if you want to setup individual memcgs for every
> > application on your system, know what sane limits are for each one, and
> > want to incur the significant memory expense of enabling
> > CONFIG_CGROUP_MEM_RES_CTLR for its metadata.
> >
> Usual disto alreay enables it.
>

Yes, I'm well aware of my 40MB of lost memory on my laptop :)

> Simply puts all applications to a group and disable oom and set oom_notifier.
> Then,
>  - a "pop-up window" of task list will ask the user "which one do you want to kill ?"
>  - send a packet to ask a administlation server system "which one is killable ?"
>    or "increase memory limit" or "memory hot-add ?"
>

Having user interaction at the time of oom would certainly be nice, but it
is impractical for us.  So we need some way to tell the kernel the relative
importance of a task so that it can act on our behalf when we encounter such
a condition.  I believe oom_score_adj does that quite effectively.

> Possible case will be
>  - send SIGSTOP to all apps at OOM.
>  - rise limit to some extent. or move a killable one to a special group.
>  - wake up a killable one with SIGCONT.
>  - send SIGHUP to stop it safely.
>

We use oom notifiers with cpusets, which in this case can be used
identically to how you're imagining memcg can be used.
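
For completeness, here is roughly what such a userspace notifier could look
like, assuming the memcg oom_control/eventfd notification interface you
describe; the memcg mount point and group name below are only placeholders:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Placeholder memcg mount point and group; adjust to the real hierarchy. */
#define MEMCG	"/cgroups/memory/apps"

int main(void)
{
	uint64_t events;
	char buf[32];
	int efd, ofd, cfd;

	efd = eventfd(0, 0);
	ofd = open(MEMCG "/memory.oom_control", O_RDONLY);
	cfd = open(MEMCG "/cgroup.event_control", O_WRONLY);
	if (efd < 0 || ofd < 0 || cfd < 0) {
		perror("setup");
		return 1;
	}

	/* Register the eventfd for oom notification: "<eventfd> <oom_control fd>" */
	snprintf(buf, sizeof(buf), "%d %d", efd, ofd);
	if (write(cfd, buf, strlen(buf)) < 0) {
		perror("cgroup.event_control");
		return 1;
	}

	/* Blocks until the group hits its limit and reclaim fails. */
	if (read(efd, &events, sizeof(events)) == sizeof(events)) {
		/*
		 * Policy goes here: pop up a task list, ask an admin server,
		 * raise the limit, or pick a victim to signal, as above.
		 */
		printf("oom notification received\n");
	}
	return 0;
}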
This particular change, however, only affects the oom killer: its only scope
is the case where the kernel can't do anything else, no userspace notifier
is attached, and no memory freeing is otherwise going to occur.  I would
love to see a per-cgroup oom notifier that allows userspace to respond to
these conditions in more effective ways, but I still believe there is a
general need for a simple and predictable oom killer heuristic that the user
has full power over.
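
To make the earlier 25%/5%/10% example concrete, a minimal sketch of the
proportional tuning, assuming the proposed per-mille semantics of
oom_score_adj (negative values discount a task's memory usage, positive
values penalize it); the pids are placeholders:

#include <stdio.h>
#include <sys/types.h>

/* Write a per-mille adjustment to a task's /proc/pid/oom_score_adj. */
static int set_oom_score_adj(pid_t pid, int adj)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/oom_score_adj", (int)pid);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%d\n", adj);
	return fclose(f);
}

int main(void)
{
	/* Placeholder pids for the competing applications. */
	set_oom_score_adj(1234, -250);	/* allowed 25% of resources */
	set_oom_score_adj(2345,  -50);	/* allowed 5% of resources */
	set_oom_score_adj(3456,  100);	/* penalized 10% */
	return 0;
}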