Re: [BUG -next] "memcg: charge before adding to swapcache on swapin" broken

Shakeel Butt <shakeelb@xxxxxxxxxx> · Wed, 17 Mar 2021 06:33:24 -0700

On Tue, Mar 16, 2021 at 5:46 PM Heiko Carstens <hca@xxxxxxxxxxxxx> wrote:
>
> Hi Shakeel,
>
> > > your commit 3a9ca1b0ac0f ("memcg: charge before adding to swapcache on
> > > swapin") in linux-next 20210316 appears to cause user process faults /
> > > crashes on s390 like:
> > >
> > > User process fault: interruption code 003b ilc:3 in sshd[2aa15280000+df000]
> > > Failing address: 0000000000000000 TEID: 0000000000000800
> > > Fault in primary space mode while using user ASCE.
> > > AS:00000000966b41c7 R3:0000000000000024
> > > CPU: 0 PID: 401 Comm: sshd Not tainted 5.12.0-rc3-00048-geba7667a8534 #10
> > > Hardware name: IBM 8561 T01 703 (z/VM 7.2.0)
> > > User PSW : 0705000180000000 0000000000000000
> > >            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:1 AS:0 CC:0 PM:0 RI:0 EA:3
> > > User GPRS: 0000000000000000 fffffffffffff000 0000000000000001 000002aa157b88f0
> > >            000002aa157c43c0 0000000000000000 0000000000000000 0000000000000000
> > >            0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > >            0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > > User Code: Bad PSW.
> >
> > Thanks for the report. Can you please explain a bit what the above report tells us?
>
> Ah, sorry. This is the s390 output for exception-traces. That is if
> /proc/sys/debug/exception-trace is set to one, and a process gets
> killed because of an unhandled signal.
>
> In this particular case sshd was killed because it tried to access
> address zero, where nothing is mapped.
>
> Given that all higher registers are zero in the register dump above my
> guess would be this happened because a stack page got unmapped, and
> when it got accessed to restore register contents a zero page was
> mapped in instead of the real old page contents.
>
> We have also all other sorts of crashes in our CI with linux-next
> currently, e.g. LTP's testcase "swapping01" seems to be able to make
> (more or less) sure that the init process gets killed (-> panic).

I have tried the elfutils selftests and swapping01 on an x86_64 VM and
I am not able to reproduce the issue. Can you give a bit more detail
about the setup, along with the config file? I am assuming you are not
creating cgroups, as these tests do not manipulate cgroups. Also, is
the memory controller on your system on v1 or v2?
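
In case it helps with the v1/v2 question, here is a minimal sketch that
guesses the hierarchy from /proc/self/mounts. The function name
memcg_hierarchy is made up for illustration. It assumes a v1 memory
controller shows up as a "cgroup" mount with the "memory" option, and
it treats a system with only a "cgroup2" mount as v2. On v2 the memory
controller may still be disabled, so also check that "memory" is listed
in /sys/fs/cgroup/cgroup.controllers.

```python
def memcg_hierarchy(mounts_text):
    """Return "v1", "v2", or "none" for /proc/self/mounts-style text."""
    saw_v2 = False
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip malformed or empty lines
        fstype, options = fields[2], fields[3].split(",")
        # Assumption: a v1 memory controller appears as a "cgroup" mount
        # carrying the "memory" option (this also covers hybrid setups).
        if fstype == "cgroup" and "memory" in options:
            return "v1"
        if fstype == "cgroup2":
            saw_v2 = True
    # Assumption: with only cgroup2 mounted, memcg (if enabled) is on v2.
    return "v2" if saw_v2 else "none"


if __name__ == "__main__":
    try:
        with open("/proc/self/mounts") as f:
            print(memcg_hierarchy(f.read()))
    except OSError as err:
        print(f"could not read /proc/self/mounts: {err}")
```

Run it on the affected machine; a result of "v1" on a system that also
mounts cgroup2 means a hybrid setup with the memory controller on v1.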

I am fine with dropping the patch from mm-tree until we know more
about this issue.