On Wed, Mar 17, 2021 at 8:26 AM Heiko Carstens <hca@xxxxxxxxxxxxx> wrote:
>
> On Wed, Mar 17, 2021 at 06:33:24AM -0700, Shakeel Butt wrote:
> > > Ah, sorry. This is the s390 output for exception-traces. That is if
> > > /proc/sys/debug/exception-trace is set to one, and a process gets
> > > killed because of an unhandled signal.
> > >
> > > In this particular case sshd was killed because it tried to access
> > > address zero, where nothing is mapped.
> > >
> > > Given that all higher registers are zero in the register dump above my
> > > guess would be this happened because a stack page got unmapped, and
> > > when it got accessed to restore register contents a zero page was
> > > mapped in instead of the real old page contents.
> > >
> > > We have also all other sorts of crashes in our CI with linux-next
> > > currently, e.g. LTP's testcase "swapping01" seems to be able to make
> > > (more or less) sure that the init process gets killed (-> panic).
> >
> > I have tried the elfutils selftests and swapping01 on x86_64 VM and I
> > am not able to reproduce the issue. Can you give a bit more detail of
> > the setup along with the config file? I am assuming you are not
> > creating cgroups as these tests do not manipulate cgroups. Also is the
> > memory controller on your system on v1 or v2?
> >
> > I am fine with dropping the patch from mm-tree until we know more
> > about this issue.
>
> This is a Fedora 33 system with 2 CPUs, 2 GB memory and 20 GB swap
> space (yes...).
>
> It should be cgroups v2:
>
> # mount
> ...
> cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel,nsdelegate)
>
> Config below. And the fun thing is that I cannot reproduce it today
> anymore with the elfutils test case - what _seems_ to be different is
> that the test suite runs much faster than yesterday evening. Usually
> an indication that there is no steal time (other guests which steal
> cpu time), which again _could_ indicate a race / lack of locking
> somewhere.
> This is kind of odd, since yesterday evening it was very reliable to
> trigger the bug :/
>

Thanks for the config. One question regarding swap: is it disk-based
swap or zram? By "guests", do you mean there was another significant
workload running on the machine in parallel to the tests? If you don't
mind, can you try swapping01 as well?
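
For reference, in case it helps: a quick way to check which kind of swap
is in use, and to run just that one LTP test case. The /opt/ltp path is
only an assumption about the default LTP install location; adjust it if
your setup differs.

# List swap areas; a /dev/zramN entry means zram, while a disk
# partition or swap file means disk-based swap.
swapon --show        # equivalent: cat /proc/swaps

# Run only the swapping01 test case (assumes LTP installed in /opt/ltp).
cd /opt/ltp && ./runltp -s swapping01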