On Wed, Mar 18, 2020 at 2:57 AM Michal Hocko <mhocko@xxxxxxxxxx> wrote:
>
> On Tue 17-03-20 12:00:45, Ami Fischman wrote:
> > On Tue, Mar 17, 2020 at 11:26 AM Robert Kolchmeyer
> > <rkolchmeyer@xxxxxxxxxx> wrote:
> > >
> > > On Tue, Mar 10, 2020 at 3:54 PM David Rientjes <rientjes@xxxxxxxxxx> wrote:
> > > >
> > > > Robert, could you elaborate on the user-visible effects of this issue that
> > > > caused it to initially get reported?
> > >
> > > Ami (now cc'ed) knows more, but here is my understanding.
> >
> > Robert's description of the mechanics we observed is accurate.
> >
> > We discovered this regression in the oom-killer's behavior when
> > attempting to upgrade our system. The fraction of the system that
> > went unhealthy due to this issue was approximately equal to the
> > _sum_ of all other causes of unhealth, which are many and varied,
> > but each of which contribute only a small amount of
> > unhealth. This issue forced a rollback to the previous kernel
> > where we ~never see this behavior, returning our unhealth levels
> > to the previous background levels.
>
> Could you be more specific on the good vs. bad kernel versions? Because
> I do not remember any oom changes that would affect the
> time-to-check-time-to-kill race. The timing might be slightly different
> in each kernel version of course.

The original upgrade attempt included a large window of kernel
versions: 4.14.137 to 4.19.91.
In attempting to narrow down the failure, we ran tests of 10 runs each
and found:

- 0/10 failures at
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/74fab24be8994bb5bb8d1aa8828f50e16bb38346
  (based on 4.19.60)
- 1/10 failures after the update to
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/6e0fef1b46bb91c196be56365d9af72e52bb4675
  (also based on 4.19.60)
- 9/10 failures after the upgrade to
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/a33dffa8e5c47b877e4daece938a81e3cc810b90
  (which I believe is based on 4.19.72)

(This was all before we had the minimal repro yielding Robert's
61/100->0/100 stat in his previous email.)