On Mon, Jul 1, 2024 at 5:33 PM Builder <yshuiv7@xxxxxxxxx> wrote:
>
> On Sun, Jun 30, 2024 at 10:58 AM Pedro Falcato <pedro.falcato@xxxxxxxxx> wrote:
> >
> > Hi everyone,
>
> Hi,
>
> I think I have hit this problem as well. I actually reported this on RedHat's
> bug tracker a while back, along with a couple of stack traces:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=2275252
>
> Reverting the commit I mentioned there seems to make this problem go away for
> me. This is a long shot, but I am curious if it will also fix the problem for
> you.
>
> (Also inserting myself into this thread so I will get updates.)
>
> Regards,
> Yuxuan Shui

This looks like a different issue. The hang is one task waiting for the
mutex (&acomp_ctx->mutex), whose holder is the other task, the one that
crashes.

Looking at that trace in particular, the line that triggers the BUG_ON
(mm/zswap.c:1395):

	BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));

is the compressor failing to decompress the data. This looks like some
sort of memory corruption, and could happen for a lot of reasons - a zswap
bug, a backend allocator bug, a compression library bug, or a hardware
issue that corrupts memory.

If it only happens on 6.8.9 (and not 6.8.5), then it's likely caused by
some change in between, but I'd be very surprised if the bug somehow comes
from the patch you reverted. If you look at the patch's content, all it
does is essentially handle the case where the shrinker receives a NULL
memcg, by using an alternative source of stats. It could potentially
reveal problems that were previously hidden, but it is definitely not the
cause of those problems.

I'd recommend that you send a separate bug report with the build config,
steps to reproduce, and more information about your setup overall (which
backend allocator you are using for zswap (it should be zsmalloc, btw),
which compression algorithm you are using, etc.).
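
For the setup details, something along these lines could collect the zswap
settings in one go. This is only a rough sketch: it assumes the module
parameters are exposed under /sys/module/zswap/parameters (where
"compressor" and "zpool" live on 6.8-era kernels), and the helper names
read_zswap_params() and format_report() are made up for illustration.

```python
#!/usr/bin/env python3
"""Collect zswap settings for a bug report (illustrative sketch)."""
from pathlib import Path

# Assumed location of zswap's module parameters; adjust if your sysfs differs.
ZSWAP_PARAMS = Path("/sys/module/zswap/parameters")


def read_zswap_params(params_dir=ZSWAP_PARAMS):
    """Return {parameter name: value} for every entry in params_dir."""
    params = {}
    if not params_dir.is_dir():
        return params  # zswap not built in, or sysfs not mounted here
    for entry in sorted(params_dir.iterdir()):
        try:
            params[entry.name] = entry.read_text().strip()
        except OSError:
            # Some entries may not be readable by the current user.
            params[entry.name] = "<unreadable>"
    return params


def format_report(params):
    """Render the parameters as 'name: value' lines for pasting into an email."""
    if not params:
        return "zswap parameters not found"
    return "\n".join(f"{name}: {value}" for name, value in params.items())


if __name__ == "__main__":
    print(format_report(read_zswap_params()))
```

It dumps whatever parameters are present rather than hard-coding names, so
it should keep working if the set of parameters differs between kernel
versions. The build config and reproduction steps still need to be attached
by hand.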