On Thu, May 16, 2019 at 08:10:42PM +0200, Michal Hocko wrote:
> On Thu 16-05-19 13:56:55, Johannes Weiner wrote:
> > On Wed, Feb 13, 2019 at 01:47:29PM +0100, Michal Hocko wrote:
> > > On Tue 12-02-19 14:45:42, Andrew Morton wrote:
> > > [...]
> > > > From: Chris Down <chris@xxxxxxxxxxxxxx>
> > > > Subject: mm, memcg: consider subtrees in memory.events
> > > >
> > > > memory.stat and other files already consider subtrees in their output, and
> > > > we should too in order to not present an inconsistent interface.
> > > >
> > > > The current situation is fairly confusing, because people interacting with
> > > > cgroups expect hierarchical behaviour in the vein of memory.stat,
> > > > cgroup.events, and other files. For example, this causes confusion when
> > > > debugging reclaim events under low, as currently these always read "0" at
> > > > non-leaf memcg nodes, which frequently causes people to misdiagnose breach
> > > > behaviour. The same confusion applies to other counters in this file when
> > > > debugging issues.
> > > >
> > > > Aggregation is done at write time instead of at read-time since these
> > > > counters aren't hot (unlike memory.stat which is per-page, so it does it
> > > > at read time), and it makes sense to bundle this with the file
> > > > notifications.
> > > >
> > > > After this patch, events are propagated up the hierarchy:
> > > >
> > > > [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
> > > > low 0
> > > > high 0
> > > > max 0
> > > > oom 0
> > > > oom_kill 0
> > > > [root@ktst ~]# systemd-run -p MemoryMax=1 true
> > > > Running as unit: run-r251162a189fb4562b9dabfdc9b0422f5.service
> > > > [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
> > > > low 0
> > > > high 0
> > > > max 7
> > > > oom 1
> > > > oom_kill 1
> > > >
> > > > As this is a change in behaviour, this can be reverted to the old
> > > > behaviour by mounting with the `memory_localevents' flag set. However, we
> > > > use the new behaviour by default as there's a lack of evidence that there
> > > > are any current users of memory.events that would find this change
> > > > undesirable.
> > > >
> > > > Link: http://lkml.kernel.org/r/20190208224419.GA24772@xxxxxxxxxxxxxx
> > > > Signed-off-by: Chris Down <chris@xxxxxxxxxxxxxx>
> > > > Acked-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> > > > Cc: Michal Hocko <mhocko@xxxxxxxxxx>
> > > > Cc: Tejun Heo <tj@xxxxxxxxxx>
> > > > Cc: Roman Gushchin <guro@xxxxxx>
> > > > Cc: Dennis Zhou <dennis@xxxxxxxxxx>
> > > > Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> > >
> > > FTR: As I've already said here [1] I can live with this change as long
> > > as there is a larger consensus among cgroup v2 users. So let's give this
> > > some more time before merging to see whether there is such a consensus.
> > >
> > > [1] http://lkml.kernel.org/r/20190201102515.GK11599@xxxxxxxxxxxxxx
> >
> > It's been three months without any objections.
>
> It's been three months without any _feedback_ from anybody. It might
> very well be true that people just do not read these emails or do not
> care one way or another.

This is exactly the type of stuff that Mel was talking about at LSFMM
not even two weeks ago: how one objection, however absurd, can cause
"controversy" and block an effort to address a mistake we have made in
the past that is now actively causing problems for real users.

And now, after stalling this fix for three months to wait for unlikely
objections, you're moving the goal post. This is frustrating.
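To make the mechanics in the changelog above concrete: the write-time
aggregation is essentially a walk up the parents, bumping each
ancestor's counter and firing its file notification, with the
`memory_localevents' mount flag cutting the walk short to restore the
old leaf-only behavior. A rough sketch only; the types and helper names
here are simplified and are not the exact upstream identifiers:

	#include <linux/atomic.h>
	#include <linux/cgroup.h>

	/* Illustrative sketch: simplified stand-in for the memcg code. */
	enum memcg_event { MEMCG_LOW, MEMCG_HIGH, MEMCG_MAX, MEMCG_OOM,
			   MEMCG_OOM_KILL, MEMCG_NR_EVENTS };

	struct memcg {
		struct memcg *parent;			/* NULL at the root */
		atomic_long_t events[MEMCG_NR_EVENTS];	/* memory.events counters */
		struct cgroup_file events_file;		/* for poll/inotify wakeups */
	};

	/* Set when the hierarchy is mounted with memory_localevents. */
	static bool memory_localevents;

	static void memcg_memory_event(struct memcg *memcg,
				       enum memcg_event event)
	{
		do {
			/* Count the event at this level and wake up
			 * watchers of this cgroup's memory.events file. */
			atomic_long_inc(&memcg->events[event]);
			cgroup_file_notify(&memcg->events_file);

			/* Old behavior: only the memcg where the event
			 * happened sees it. */
			if (memory_localevents)
				break;
		} while ((memcg = memcg->parent));
	}

Since the counters are bumped on the (rare) event path rather than
summed up on every read, the propagation piggybacks naturally on the
existing file notification.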
Nobody else is speaking up because the current user base is very small
and because the idea that anybody has developed against, and is relying
on, the current problematic behavior is completely contrived. In
reality, the behavior surprises people and causes production issues.

> > Can we merge this for
> > v5.2 please? We still have users complaining about this inconsistent
> > behavior (the last one was yesterday) and we'd rather not carry any
> > out of tree patches.
>
> Could you point me to those complains or is this something internal?

It's something internal, unfortunately, or I'd link to it.

In this report yesterday, the user missed OOM kills that occurred in
nested subgroups of individual job components. They monitor the entire
job's status and health at the top-level "job" cgroup: total memory
usage, VM activity and trends from memory.stat, pressure for cpu, io,
memory etc. All of these are recursive. They assumed they could monitor
memory.events likewise and were left under the impression that
everything was fine, when in reality there was OOM killing going on in
one of the leaves.

Such negative surprises really suck. But what's worse is that now that
they are aware of it, there is still no good solution for them, because
periodically polling the entire subtree for events in the leaves is not
practical: there can be a lot of cgroups, which is why we recently put
so much effort into improving the hierarchical stat aggregation.

I'd really like to get this fixed, and preferably in a way that does
not deviate from upstream and does not force the same downtimes and
wasted engineering hours on everybody who is going to switch to cgroup2
in the next couple of years.
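For what it's worth, the recursive behavior is also what makes a sane
watcher possible in the first place: with events propagated upwards,
monitoring a whole job is a single poll(2) on the top-level
memory.events file instead of a periodic walk over every descendant. A
minimal userspace sketch of such a watcher; the job.slice path is just
an example, not a path from the report:

	#include <fcntl.h>
	#include <poll.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		/* Example path for a hypothetical top-level job cgroup. */
		int fd = open("/sys/fs/cgroup/job.slice/memory.events", O_RDONLY);
		char buf[256];

		if (fd < 0) {
			perror("open");
			return 1;
		}

		for (;;) {
			struct pollfd pfd = { .fd = fd, .events = POLLPRI };
			ssize_t n;

			/* Block until the kernel notifies the file, which
			 * now covers events anywhere in the subtree. */
			if (poll(&pfd, 1, -1) < 0) {
				perror("poll");
				return 1;
			}

			/* Counters are updated in place; re-read from the start. */
			lseek(fd, 0, SEEK_SET);
			n = read(fd, buf, sizeof(buf) - 1);
			if (n > 0) {
				buf[n] = '\0';
				printf("%s", buf);
			}
		}
	}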