On Mon, Sep 08, 2014 at 10:53:48PM +0900, Kamezawa Hiroyuki wrote: > (2014/09/08 20:01), Vladimir Davydov wrote: > >But OK, you don't like OOM on hitting anon+swap limit and propose to > >introduce a kind of userspace notification instead, but the problem > >actually isn't *WHAT* we should do on hitting anon+swap limit, but *HOW* > >we should implement it (or should we implement it at all). > > > I'm not sure you're aware of or not, "hardlimit" counter is too expensive > for your purpose. > > If I was you, I'll use some lightweight counter like percpu_counter() or > memcg's event handling system. > Did you see how threshold notifier or vmpressure works ? It's very light weight. OK, after looking through the memory thresholds code and pondering the problem a bit I tend to agree with you. We can tweak the notifiers to trigger on anon+swap thresholds, handle them in userspace and do whatever we like. At least for now, I don't see anything why this could be worse than hard anon+swap limit except it requires more steps to configure. Thank you for your patience while explaining this to me :-) However, there's one thing, which made me start this discussion, and it still bothers me. It's about memsw.limit_in_bytes knob itself. First, its value must be greater or equal to memory.limit_in_bytes. IMO, such a dependency in the user interface isn't great, but it isn't the worst thing. What is worse, there's only point in setting it to infinity if one wants to fully make use of soft limits as I pointed out earlier. So, we have a userspace knob that suits only for strict sand-boxing when one wants to hard-limit the amount of memory and swap an app can use. When it comes to soft limits, you have to set it to infinity, and it'll still be accounted at the cost of performance, but without any purpose. It just seems meaningless to me. Not counting that the knob itself is a kind of confusing IMO. memsw means memory+swap, so one would mistakenly think memsw.limit-mem.limit is the limit on swap usage, but that's wrong. My point is that anon+swap accounting instead of the current anon+file+swap memsw implementation would be more flexible. We could still sandbox apps by setting hard anon+swap and memory limits, but it would also be possible to make use of it in "soft" environments. It wouldn't be mandatory though. If one doesn't like OOM, he can use threshold notifications to restart the container when it starts to behave badly. But if the user just doesn't want to bother about configuration or is OK with OOM-killer, he could set hard anon+swap limit. Besides, it would untie mem.limit knob from memsw.limit, which would make the user interface simpler and cleaner. So, I think anon+swap limit would be more flexible than file+anon+swap limit we have now. Is there any use case where anon+swap and anon+file accounting couldn't satisfy the user requirements while the anon+file+swap and anon+file pair could? > >No matter which way we go, in-kernel OOM or userland notifications, we have to > >*INTRODUCE ANON+SWAP ACCOUNTING* to achieve that so that on breaching a > >predefined threshold we could invoke OOM or issue a userland > >notification or both. And here goes the problem: there's anon+file and > >anon+file+swap resource counters, but no anon+swap counter. To react on > >anon+swap limit breaching, we must introduce one. I propose to *REUSE* > >memsw instead by slightly modifying its meaning. > > > you can see "anon+swap" via memcg's accounting. > > >What we would get then is the ability to react on potentially > >unreclaimable memory growth inside a container. What we would loose is > >the current implementation of memory+swap limit, *BUT* we would still be > >able to limit memory+swap usage by imposing limits on total memory and > >anon+swap usage. > > > > I repeatedly say anon+swap "hardlimit" just means OOM. That's not buy. anon+file+swap hardlimit eventually means OOM too :-/ > >>And your idea can't help swap-out caused by memory pressure comes from "zones". > > > >It would help limit swap-out to a sane value. > > > > > >I'm sorry if I'm not clear or don't understand something that looks > >trivial to you. > > > > It seems your purpose is to avoiding system-wide-oom-situation. Right ? This is the purpose of any hard memory limit, including the current implementation - avoiding global memory pressure in general and system-wide OOM in particular. > Implementing system-wide-oom-kill-avoidance logic in memcg doesn't > sound good to me. It should work under system-wide memory management logic. > If memcg can be a help for it, it will be good. > > > For your purpose, you need to implement your method in system-wide way. > It seems crazy to set per-cgroup-anon-limit for avoding system-wide-oom. > You'll need help of system-wide-cgroup-configuration-middleware even if > you have a method in a cgroup. If you say logic should be in OS kernel, > please implement it in a system wide logic rather than cgroup. What if on global pressure a memory cgroup exceeding its soft limit is being reclaimed, but not fast enough, because it has a lot of anon memory? The global OOM won't be triggered then, because there's still progress, but the system will experience hard pressure due to the reclaimer runs. How can we detect if we should kill the container or not? It smells like one more heuristic to vmscan, IMO. Thanks, Vladimir -- To unsubscribe from this list: send the line "unsubscribe cgroups" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html