On Thu, Jan 30 2014, Michal Hocko wrote:

> On Wed 29-01-14 11:08:46, Greg Thelen wrote:
> [...]
>> The series looks useful.  We (Google) have been using something similar.
>> In practice such a low_limit (or memory guarantee), doesn't nest very
>> well.
>>
>> Example:
>>   - parent_memcg: limit 500, low_limit 500, usage 500
>>     1 privately charged non-reclaimable page (e.g. mlock, slab)
>>   - child_memcg: limit 500, low_limit 500, usage 499
>
> I am not sure this is a good example. Your setup basically says that no
> single page should be reclaimed. I can imagine this might be useful in
> some cases and I would like to allow it but it sounds too extreme (e.g.
> a load which would start thrashing heavily once the reclaim starts and it
> makes more sense to start it again rather than crawl - think about some
> mathematical simulation which might diverge).

Pages will still be reclaimed when usage_in_bytes exceeds
limit_in_bytes.  I see the low_limit as a way to tell the kernel: don't
reclaim my memory due to external pressure, but internal pressure is
different.

>> If a streaming file cache workload (e.g. sha1sum) starts gobbling up
>> page cache it will lead to an oom kill instead of reclaiming.
>
> Does it make any sense to protect all of such memory although it is
> easily reclaimable?

I think protection makes sense in this case.  If I know my workload
needs 500 to operate well, then I reserve 500 using low_limit.  My app
doesn't want to run with less than its reservation.

>> One could argue that this is working as intended because child_memcg
>> was promised 500 but can only get 499.  So child_memcg is oom killed
>> rather than being forced to operate below its promised low limit.
>>
>> This has led to various internal workarounds like:
>> - don't charge any memory to interior tree nodes (e.g. parent_memcg);
>>   only charge memory to cgroup leafs.  This gets tricky when dealing
>>   with reparented memory inherited to parent from child during cgroup
>>   deletion.
>
> Do those need any protection at all?

Interior tree nodes don't need protection from their children.  But
children and interior nodes need protection from siblings and parents.

>> - don't set low_limit on non leafs (e.g. do not set low limit on
>>   parent_memcg).  This constrains the cgroup layout a bit.  Some
>>   customers want to purchase $MEM and set up their workload with a few
>>   child cgroups.  A system daemon hands out $MEM by setting low_limit
>>   for top-level containers (e.g. parent_memcg).  Thereafter such
>>   customers are able to partition their workload with sub memcg below
>>   child_memcg.  Example:
>>      parent_memcg
>>          \
>>           child_memcg
>>            /     \
>>        server   backup
>
> I think that the low_limit makes sense where you actually want to
> protect something from reclaim. And backup sounds like a bad fit for
> that.

The backup job would presumably have a small low_limit, but it may still
have a minimum working set required to make useful forward progress.

Example:
  parent_memcg
    \
     child_memcg    limit 500, low_limit 500, usage 500
       /  \
       |   backup   limit 10, low_limit 10, usage 10
       |
      server        limit 490, low_limit 490, usage 490

One could argue that problems appear when
  server.low_limit + backup.low_limit = child_memcg.limit
So the safer configuration is to leave some padding:
  server.low_limit + backup.low_limit + padding = child_memcg.limit
but this just defers the problem.  As memory is reparented into the
parent, the padding must grow.

>> Thereafter customers often want some weak isolation between server and
>> backup.  To avoid undesired oom kills the server/backup isolation is
>> provided with a softer memory guarantee (e.g. soft_limit).  The soft
>> limit acts like the low_limit until priority becomes desperate.
>
> Johannes was already suggesting that the low_limit should allow for a
> weaker semantic as well. I am not very much inclined to that but I can
> live with a knob which would say oom_on_lowlimit (on by default but
> allowed to be set to 0).
> We would fall back to the full reclaim if
> no groups turn out to be reclaimable.

I like the strong semantic of your low_limit at least at level:1 cgroups
(direct children of root).  But I have also encountered situations where
a strict guarantee is too strict and a mere preference is desirable.
Perhaps the best plan is to continue with the proposed strict low_limit
and eventually provide an additional mechanism which provides weaker
guarantees (e.g. soft_limit or something else if soft_limit cannot be
altered).  These two would offer good support for a variety of use
cases.

I am thinking of something like:

bool mem_cgroup_reclaim_eligible(struct mem_cgroup *memcg,
				 struct mem_cgroup *root,
				 int priority)
{
	/* Walk from memcg up to (but not including) the reclaim root. */
	do {
		if (memcg == root)
			break;
		/* At or below low_limit: strictly protected. */
		if (!res_counter_low_limit_excess(&memcg->res))
			return false;
		/* Early passes only: also skip groups within soft_limit. */
		if ((priority >= DEF_PRIORITY - 2) &&
		    !res_counter_soft_limit_exceed(&memcg->res))
			return false;
	} while ((memcg = parent_mem_cgroup(memcg)));
	return true;
}

But this soft_limit/priority extension can be added later.
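
For illustration only, the same walk can be exercised in user space.
This is a rough sketch under my own assumptions, not kernel code: struct
memcg_sketch, its fields, the two *_excess() helpers and the example
numbers are all made up here, and DEF_PRIORITY is simply defined as 12
to mirror the kernel's starting reclaim priority.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEF_PRIORITY 12	/* assumed: kernel's initial reclaim priority */

/* Hypothetical stand-in for struct mem_cgroup, counters in pages. */
struct memcg_sketch {
	const char *name;
	struct memcg_sketch *parent;
	unsigned long usage;
	unsigned long low_limit;
	unsigned long soft_limit;
};

/* Hypothetical helpers: true when usage is above the given limit. */
static bool low_limit_excess(const struct memcg_sketch *m)
{
	return m->usage > m->low_limit;
}

static bool soft_limit_excess(const struct memcg_sketch *m)
{
	return m->usage > m->soft_limit;
}

/*
 * A group is eligible only if it and every ancestor below root exceed
 * their low_limit.  While priority is still close to DEF_PRIORITY the
 * soft_limit is honoured too; it is ignored once reclaim gets desperate.
 */
bool reclaim_eligible(struct memcg_sketch *memcg,
		      struct memcg_sketch *root, int priority)
{
	assert(memcg != NULL && root != NULL);
	do {
		if (memcg == root)
			break;
		if (!low_limit_excess(memcg))
			return false;
		if (priority >= DEF_PRIORITY - 2 && !soft_limit_excess(memcg))
			return false;
	} while ((memcg = memcg->parent));
	return true;
}

/* Example tree (made-up numbers): root -> child_memcg -> server, backup */
struct memcg_sketch root_cg     = { "root",        NULL,         500,   0,   0 };
struct memcg_sketch child_memcg = { "child_memcg", &root_cg,     500, 400, 450 };
struct memcg_sketch server      = { "server",      &child_memcg, 380, 300, 400 };
struct memcg_sketch backup      = { "backup",      &child_memcg, 120,  10,  50 };
```

With these numbers, backup is over both of its limits and is eligible
from the first pass.  server is above its low_limit but within its
soft_limit, so it is skipped at DEF_PRIORITY and only becomes eligible
once priority drops below DEF_PRIORITY - 2.  If server's usage falls to
its low_limit it is never eligible, whatever the priority.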