Hey, Tim.

Sidenote: Please don't top-post with the whole body quoted below unless
you're adding new cc's. Please selectively quote the original message's
body to remind the readers of the context and reply below it. It's
basic lkml etiquette, and with good reason. If you have to top-post for
whatever reason (say, you're typing from a machine which doesn't allow
easy editing of the original message), explain so at the top of the
message, or better yet, wait till you can reply properly unless it's
urgent.

On Wed, Dec 11, 2013 at 09:37:46PM -0800, Tim Hockin wrote:
> The immediate problem I see with setting aside reserves "off the top"
> is that we don't really know a priori how much memory the kernel
> itself is going to use, which could still land us in an overcommitted
> state.
>
> In other words, if I have your 128 MB machine, and I set aside 8 MB
> for OOM handling, and give 120 MB for jobs, I have not accounted for
> the kernel. So I set aside 8 MB for OOM and 100 MB for jobs, leaving
> 20 MB for jobs. That should be enough right? Hell if I know, and
> nothing ensures that.

Yes, sure thing, that's the reason why I mentioned "with some slack" in
the original message and also that it might not be completely the same.
It doesn't allow you to aggressively use system-level OOM handling as
the sizing estimator for the root cgroup; however, that's more of an
implementation detail than something which should guide the overall
architecture. It's a problem which lessens in severity as [k]memcg
improves and its coverage becomes more complete, which is the direction
we should be headed no matter what.

It'd depend on the workload, but with memcg fully configured the
kernel's own usage shouldn't fluctuate wildly. If it does, we need to
hunt down whatever is causing such fluctuation and include it in
kmemcg, right?
That way, memcg as a whole improves for all use cases, not just your
niche one, and I strongly believe that aligning as many use cases as
possible along the same axis, rather than creating a large hole to stow
away the exceptions, is vastly more beneficial to *everyone* in the
long term.

There'd still be all the bells and whistles to configure and monitor
system-level OOM, and if there's a justified need for improvements, we
surely can and should make them; however, with the heavy lifting / hot
path offloaded to the per-memcg userland OOM handlers, I believe it's
reasonable to expect the burden on the system OOM handler to be
noticeably lighter, which is the way it should be. That's the last
guard against the whole system completely locking up; we can't easily
extend its capabilities beyond that, and we most likely don't even want
to.

If I take a step back and look at the two options and their pros and
cons, which path we should take is rather obvious to me. I hope you see
it too.

Thanks.

--
tejun