Naresh Kamboju writes:
On Thu, 28 May 2020 at 20:33, Michal Hocko <mhocko@xxxxxxxxxx> wrote:
On Fri 22-05-20 02:23:09, Naresh Kamboju wrote:
> My apology !
> As per the test results history this problem started happening from
> Bad : next-20200430 (still reproducible on next-20200519)
> Good : next-20200429
>
> The git tree / tag used for testing is from linux next-20200430 tag and reverted
> following three patches and oom-killer problem fixed.
>
> Revert "mm, memcg: avoid stale protection values when cgroup is above
> protection"
> Revert "mm, memcg: decouple e{low,min} state mutations from protectinn checks"
> Revert "mm-memcg-decouple-elowmin-state-mutations-from-protection-checks-fix"
The discussion has fragmented and I got lost TBH.
In http://lkml.kernel.org/r/CA+G9fYuDWGZx50UpD+WcsDeHX9vi3hpksvBAWbMgRZadb0Pkww@xxxxxxxxxxxxxx
you have said that none of the added tracing output has triggered. Does
this still hold? Because I still have a hard time to understand how
those three patches could have the observed effects.
On the other email thread [1] this issue is concluded.
Yafang wrote on May 22 2020,
Regarding the root cause, my guess is it makes a similar mistake that
I tried to fix in the previous patch that the direct reclaimer read a
stale protection value. But I don't think it is worth to add another
fix. The best way is to revert this commit.
This isn't a conclusion, just a guess (and one I think is unlikely). For this
to reliably happen, it implies that the same race happens the same way each
time.