Re: zswap_writeback_entry crashes in 6.9.5

Nhat Pham <nphamcs@xxxxxxxxx> · Tue, 2 Jul 2024 08:28:00 -0700

On Mon, Jul 1, 2024 at 5:33 PM Builder <yshuiv7@xxxxxxxxx> wrote:
>
> On Sun, Jun 30, 2024 at 10:58 AM Pedro Falcato <pedro.falcato@xxxxxxxxx> wrote:
> >
> > Hi everyone,
>
> Hi,
>
> I think I have hit this problem as well. I actually reported this on Red Hat's
> bug tracker a while back, along with a couple of stack traces:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=2275252
>
> Reverting the commit I mentioned there seems to make this problem go away for
> me. This is a long shot, but I am curious if it will also fix the problem for
> you.
>
> (Also inserting myself into this thread so I will get updates.)
>
> Regards,
> Yuxuan Shui

This looks like a different issue. The hang is one task waiting on the
mutex (&acomp_ctx->mutex) whose holder is the other task, the one that
crashed. Looking at that trace in particular, the line that triggers
the BUG_ON call (mm/zswap.c:1395):

    BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req),
                           &acomp_ctx->wait));

is the compressor failing to decompress the data. This looks like some
sort of memory corruption, and could happen for a lot of reasons - a
zswap bug, a backend allocator bug, a compression library bug, or a
hardware issue that corrupts memory.
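
To illustrate why the second task hangs rather than crashes, here is a
rough userspace analogy using pthreads. It is not kernel code and not
how zswap is implemented: ctx_mutex, holder() and
waiter_result_after_holder_dies() are made-up names, and a thread
exiting with the lock held stands in for the task killed by BUG_ON
while holding &acomp_ctx->mutex. The waiter uses a timed lock only so
the demo terminates; a plain lock call would block forever, which is
the hang in the report.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Hypothetical stand-in for acomp_ctx->mutex. */
static pthread_mutex_t ctx_mutex = PTHREAD_MUTEX_INITIALIZER;

/*
 * Stand-in for the crashing task: it takes the lock and then goes away
 * without ever unlocking it, so the mutex stays locked.
 */
static void *holder(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&ctx_mutex);
    return NULL;
}

/*
 * Stand-in for the hung task: it tries to take the same lock after the
 * holder is gone. Returns the result of the timed lock attempt, or -1
 * if the holder thread could not be started.
 */
int waiter_result_after_holder_dies(void)
{
    pthread_t t;
    struct timespec deadline;

    if (pthread_create(&t, NULL, holder, NULL) != 0)
        return -1;
    pthread_join(t, NULL);          /* holder is dead, lock still held */

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += 1;           /* give up after about one second */

    /* Nobody is left to unlock, so this can only time out. */
    return pthread_mutex_timedlock(&ctx_mutex, &deadline);
}
```

Calling waiter_result_after_holder_dies() should come back with
ETIMEDOUT after about a second, since the only thread that could
release the lock no longer exists.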

If it only happens on 6.8.9 (and not 6.8.5), then it's likely caused
by some change in between, but I'd be very surprised if the bug
somehow comes from the patch you reverted. If you look at the patch's
content, all it essentially does is handle the case where the shrinker
receives a NULL memcg, by using an alternative source of stats. It
could potentially reveal problems that were previously hidden, but it
is definitely not the cause of those problems itself. I'd recommend
that you send a separate bug report with the build config, steps to
reproduce, and more information about your setup overall (which
backend allocator you are using for zswap - it should be zsmalloc btw,
which compression algorithm you are using, etc.)