Re: + mm-consider-subtrees-in-memoryevents.patch added to -mm tree

On Fri, May 17, 2019 at 02:33:10PM +0200, Michal Hocko wrote:
> On Thu 16-05-19 15:39:43, Johannes Weiner wrote:
> > On Thu, May 16, 2019 at 08:10:42PM +0200, Michal Hocko wrote:
> > > On Thu 16-05-19 13:56:55, Johannes Weiner wrote:
> > > > On Wed, Feb 13, 2019 at 01:47:29PM +0100, Michal Hocko wrote:
> [...]
> > > > > FTR: As I've already said here [1] I can live with this change as long
> > > > > as there is a larger consensus among cgroup v2 users. So let's give this
> > > > > some more time before merging to see whether there is such a consensus.
> > > > > 
> > > > > [1] http://lkml.kernel.org/r/20190201102515.GK11599@xxxxxxxxxxxxxx
> > > > 
> > > > It's been three months without any objections.
> > > 
> > > It's been three months without any _feedback_ from anybody. It might
> > > very well be true that people just do not read these emails or do not
> > > care one way or another.
> > 
> > This is exactly the type of stuff that Mel was talking about at LSFMM
> > not even two weeks ago. How one objection, however absurd, can cause
> > "controversy" and block an effort to address a mistake we have made in
> > the past that is now actively causing problems for real users.
> > 
> > And now after stalling this fix for three months to wait for unlikely
> > objections, you're moving the goal post. This is frustrating.
> 
> I see your frustration but I find the above wording really unfair. Let me
> remind you that this is a considerable user-visible change in the
> semantics, and that always has to be evaluated carefully. A change that
> would clearly regress anybody who relies on the current semantics. This
> is not an internal implementation detail kind of thing.
> 
> I have suggested an option for the new behavior to be opt-in, which
> would be a regression-safe option. You keep insisting that we absolutely
> have to have hierarchical reporting by default for consistency reasons.
> I do understand that argument, but when I weigh consistency against
> potential regression risk I would rather go the conservative way. This
> is the traditional way we deal with semantic changes like this.
> Exceptions are always possible, and that is why I wanted to hear from
> other users of cgroup v2, even from those who are not directly affected now.

I have acknowledged this concern in previous discussions. But the rule
is "don't break userspace", not "never change behavior". We do allow
behavior changes when it's highly unlikely that anyone would mind and
the new behavior is a much better default for current and future users.

Let me try to make the case for exactly this:

- Adoption data suggests that cgroup2 isn't really used yet. RHEL8 was
  just released with cgroup1 as the default. Fedora is currently
  debating a switch. None of the other distros default to cgroup2.
  There is an article on the LWN front page *right now* about Docker
  planning to switch to cgroup2 in the near future. Kubernetes is on
  cgroup1. Android is on cgroup1. Shakeel agrees that Facebook is
  probably the only serious user of cgroup2 right now. The cloud and
  all mainstream container software are still on cgroup1.

- Using this particular part of the interface is a fairly advanced
  step in the cgroup2 adoption process. We've been using cgroup2 for a
  while and we've only now started running into this memory.events
  problem as we're enhancing our monitoring and automation
  infrastructure. If we're the only serious deployment, and we just
  started noticing it, what's the chance of regressing someone else?

- Violating expectations costs users time and money either way, but
  the status quo is much more costly: somebody who expects these
  events to be local could see events from lower in the tree reported
  at an unexpectedly higher level. But somebody who expects these
  events to be hierarchical will miss real events entirely! (See the
  sketch after this list.)

  Now, for an alarm and monitoring infrastructure, which is worse: to
  see OOM kills that did occur reported at a tree level you don't
  expect? Or to *miss* OOM kills that you're trying to look out for?

  Automatic remediation might not be as clear-cut, but for us, and I
  would assume many others, an unnecessary service restart or failover
  would cause a shorter downtime than a missed restart after a kill.

- The status quo is more likely to violate expectations, given how the
  cgroup2 interface as a whole is designed.

  We have seen this in practice: memory.current is hierarchical,
  memory.stat is hierarchical, memory.pressure is hierarchical - users
  expect memory.events to be hierarchical. This misunderstanding has
  already cost us time and money.

  Chances are that even if there were other users of memory.events,
  they would be using the interface incorrectly and not have noticed
  yet, rather than relying on the inconsistency.

  This is not hypothetical; we have seen it with our fleet customers.
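
To make the monitoring failure mode above concrete, here is a minimal
sketch of a watcher that polls a slice's memory.events for oom_kill
events. It is an illustration only, not part of the patch; the
/sys/fs/cgroup/workload.slice path and its job-1 child are made-up
names. With the current local-only counters, a kill inside the child
never shows up in the parent's file and the alert below never fires;
with hierarchical counting it would.

/* Illustration only: poll a parent cgroup's memory.events for OOM kills.
 * The cgroup names below (workload.slice, job-1) are hypothetical. */
#include <stdio.h>
#include <unistd.h>

/* Return the oom_kill counter from a memory.events file, or -1 on error. */
static long read_oom_kills(const char *path)
{
        char line[256];
        long count = -1;
        FILE *f = fopen(path, "r");

        if (!f)
                return -1;
        while (fgets(line, sizeof(line), f))
                if (sscanf(line, "oom_kill %ld", &count) == 1)
                        break;
        fclose(f);
        return count;
}

int main(void)
{
        const char *events = "/sys/fs/cgroup/workload.slice/memory.events";
        long prev = read_oom_kills(events);

        for (;;) {
                long cur;

                sleep(10);
                cur = read_oom_kills(events);
                /*
                 * Local-only counters: an OOM kill in
                 * workload.slice/job-1 leaves this file untouched, so
                 * the alert never fires.  Hierarchical counters: the
                 * kill is reflected here and the alert works.
                 */
                if (prev >= 0 && cur > prev)
                        fprintf(stderr, "oom_kill in workload.slice: +%ld\n",
                                cur - prev);
                prev = cur;
        }
        return 0;
}

Under local semantics, the only workaround for such a watcher is to
enumerate and poll every descendant cgroup individually and keep that
list in sync as jobs come and go, which is exactly the kind of
engineering cost and missed-event window described above.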

So combining what we know about

1. the current adoption rate
2. likely user expectations
3. the failure mode of missing pressure and OOM kill signals

means that going with the conservative option and not fixing this
inconsistency puts pretty much every user who will ever use this
interface at risk of pain, outages, and wasted engineering hours.

Making the fix available but opt-in has the same result for everybody
who isn't following this thread/patch.

All that to protect an unlikely existing cgroup2 user from something
they are even less likely to have noticed, let alone rely on.

This sounds like a terrible trade-off to me. I don't think it's a
close call in this case.

I understand that we have to think harder and be more careful with
changes like this. The bar *should* be high. But in this case, there
doesn't seem to be a real risk of regression for anybody, while the
risk of the status quo causing problems is high and already playing
out. These circumstances should be part of the decision process.


