Hello, Tim.

On Thu, Dec 12, 2013 at 10:42:20AM -0800, Tim Hockin wrote:
> Yeah sorry.  Replying from my phone is awkward at best.  I know better :)

Heh, sorry about being bitchy. :)

> In my mind, the ONLY point of pulling system-OOM handling into
> userspace is to make it easier for crazy people (Google) to implement
> bizarre system-OOM policies.  Example:

I think that's one of the places where we largely disagree.  If at all
possible, I'd much prefer google's workload to be supported inside the
general boundaries of the upstream kernel without having to punch a
large hole in it.  To me, the general development history of memcg in
general and this thread in particular seem to epitomize why it is a bad
idea to have isolated, large and deep "crazy" use cases.  Punching the
initial hole is the easy part; however, we all are quite limited in
anticipating future needs, and sooner or later that crazy use case is
bound to evolve further toward the isolated extreme it departed for and
require more and larger holes and further contortions to accommodate
such progress.

The concern I have with the suggested solution is not necessarily that
it's more technically complex than it looks on the surface - I'm sure
it can be made to work one way or the other - but that it's a fairly
large step toward an isolated extreme which memcg as a project probably
should not head toward.  There sure are cases where such exceptions
can't be avoided and are good trade-offs but, here, we're talking about
a major architectural decision which affects not only memcg but mm in
general.  I'm afraid this doesn't sound like no-brainer flexibility we
can afford.

> When we have a system OOM we want to do a walk of the administrative
> memcg tree (which is only a couple levels deep, users can make
> non-admin sub-memcgs), selecting the lowest priority entity at each
> step (where both tasks and memcgs have a priority and the priority
> range is much wider than the current OOM scores, and where memcg
> priority is sometimes a function of memcg usage), until we reach a
> leaf.
>
> Once we reach a leaf, I want to log some info about the memcg doing
> the allocation, the memcg being terminated, and maybe some other bits
> about the system (depending on the priority of the selected victim,
> this may or may not be an "acceptable" situation).  Then I want to
> kill *everything* under that memcg.  Then I want to "publish" some
> information through a sane API (e.g. not dmesg scraping).
>
> This is basically our policy as we understand it today.  This is
> notably different than it was a year ago, and it will probably evolve
> further in the next year.

I think per-memcg score and killing is something which makes
fundamental sense.  In fact, killing a single process has never made
much sense to me as that is a unit which ultimately is meaningful only
to the kernel itself and not necessarily to userland, so no matter what
I think we're gonna gain per-memcg behavior, and it seems most, albeit
not all, of what you described above should be implementable through
that.

Ultimately, if the use case calls for a very fine level of control, I
think the right thing to do is making nesting work properly, which is
likely to take some time.  In the meantime, even if such a use case
requires modifying the kernel to tailor the OOM behavior, I think
sticking to kernel OOM provides a far easier path toward eventual
convergence.  Userland system OOM basically means giving up and would
lessen the motivation toward improving the shared infrastructures while
adding significant pressure toward schizophrenic diversion.
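Just to make sure we're reading the walk you described the same way,
here it is as a userspace sketch against cgroupfs.  The
memory.oom_priority knob is made up - nothing like it exists today -
per-task priorities and usage-dependent priorities are ignored,
recursing into sub-memcgs for the kill is left out, and error handling
is mostly elided.

#include <dirent.h>
#include <limits.h>
#include <signal.h>
#include <stdio.h>

/* read the hypothetical per-memcg priority knob */
static long memcg_prio(const char *cg)
{
	char path[PATH_MAX];
	long prio = LONG_MAX;	/* no knob -> never preferred */
	FILE *f;

	snprintf(path, sizeof(path), "%s/memory.oom_priority", cg);
	f = fopen(path, "r");
	if (f) {
		if (fscanf(f, "%ld", &prio) != 1)
			prio = LONG_MAX;
		fclose(f);
	}
	return prio;
}

/* one step of the walk: lowest-priority child memcg, if any */
static int lowest_child(const char *cg, char *child, size_t len)
{
	long best = LONG_MAX;
	struct dirent *de;
	int found = 0;
	DIR *d;

	d = opendir(cg);
	if (!d)
		return 0;
	while ((de = readdir(d)) != NULL) {
		char sub[PATH_MAX];
		long prio;

		if (de->d_type != DT_DIR || de->d_name[0] == '.')
			continue;
		snprintf(sub, sizeof(sub), "%s/%s", cg, de->d_name);
		prio = memcg_prio(sub);
		if (prio < best) {
			best = prio;
			snprintf(child, len, "%s", sub);
			found = 1;
		}
	}
	closedir(d);
	return found;
}

/* SIGKILL every task attached to the victim memcg */
static void kill_memcg(const char *cg)
{
	char path[PATH_MAX];
	int pid;
	FILE *f;

	snprintf(path, sizeof(path), "%s/cgroup.procs", cg);
	f = fopen(path, "r");
	if (!f)
		return;
	while (fscanf(f, "%d", &pid) == 1)
		kill(pid, SIGKILL);
	fclose(f);
}

int main(void)
{
	char cur[PATH_MAX] = "/sys/fs/cgroup/memory";
	char next[PATH_MAX];

	/* descend, always taking the lowest-priority child */
	while (lowest_child(cur, next, sizeof(next)))
		snprintf(cur, sizeof(cur), "%s", next);

	/* the "publish" step would go somewhere saner than stderr */
	fprintf(stderr, "oom victim: %s\n", cur);
	kill_memcg(cur);
	return 0;
}

The point being, whether this loop lives in a daemon like the above or
behind a per-memcg score-and-kill interface in the kernel, the policy
inputs are the same, which is why I think the kernel route covers most
of it.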
> We have a long tail of kernel memory usage.  If we provision machines
> so that the "do work here" first-level memcg excludes the average
> kernel usage, we have a huge number of machines that will fail to
> apply OOM policy because of actual overcommitment.  If we provision
> for 95th or 99th percentile kernel usage, we're wasting large amounts
> of memory that could be used to schedule jobs.  This is the
> fundamental problem we face with static apportionment (and we face it
> in a dozen other situations, too).  Expressing this set-aside memory
> as "off-the-top" rather than absolute limits makes the whole system
> more flexible.

I agree that's pretty sad.  Maybe I shouldn't be surprised given the
far-from-perfect coverage of kmemcg at this point but, again,
*everyone* wants [k]memcg coverage to be more complete and we have been
and still are building the infrastructure to make that possible, so I'm
still of the opinion that making [k]memcg work better is the better
direction to pursue, and given the short development history of kmemcg
I'm fairly sure there's quite a bit of low-hanging fruit.

Another thing which *might* be relevant is the rigidity of the upper
limit and the vagueness of the soft limit in the current
implementation.  I have a rather strong suspicion that the way the
memcg config knobs behave now - one finicky, the other whatever - is
likely hindering the use cases from fanning out more naturally.  I
could be completely wrong on this but your mention of the inflexibility
of absolute limits reminds me of the issue.

Thanks.

--
tejun
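P.S. To be concrete about the two knobs, here's a minimal sketch
against the current memcg v1 interface - the mount point and the "job0"
group are assumed for illustration.  memory.limit_in_bytes is enforced
strictly at charge time, while memory.soft_limit_in_bytes only marks
the group for preferential reclaim under global pressure - the
finicky/whatever split I mean above.

#include <stdio.h>

/* write a value into a memcg v1 knob; error handling elided */
static void set_knob(const char *knob, const char *val)
{
	char path[256];
	FILE *f;

	/* "job0" is a made-up group name, v1 hierarchy assumed */
	snprintf(path, sizeof(path),
		 "/sys/fs/cgroup/memory/job0/%s", knob);
	f = fopen(path, "w");
	if (!f)
		return;
	fputs(val, f);
	fclose(f);
}

int main(void)
{
	/* hard limit: charges beyond this reclaim or OOM immediately */
	set_knob("memory.limit_in_bytes", "512M");
	/* soft limit: only a reclaim hint under global pressure */
	set_knob("memory.soft_limit_in_bytes", "256M");
	return 0;
}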