https://bugzilla.kernel.org/show_bug.cgi?id=217572

--- Comment #22 from Christian Theune (ct@xxxxxxxxxxxxxxx) ---
(In reply to Dave Chinner from comment #21)
> > This is still an unreproducable, unfixed bug in upstream kernels.
> There is no known reproducer, so actually triggering it and hence
> performing RCA is extremely difficult at this point in time. We don't
> really even know what workload triggers it.

It seems to be related to IO pressure, and we've seen it multiple times with
various PostgreSQL activities. I've planned time next week to analyze this
further and to try to help establish a reproducer.

> > We've had a multitude of crashes in the last weeks with the following
> > statistics:
> >
> > 6.1.31 - 2 affected machines
> > 6.1.35 - 1 affected machine
> > 6.1.37 - 1 affected machine
> > 6.1.51 - 5 affected machines
> > 6.1.55 - 2 affected machines
> > 6.1.57 - 2 affected machines
>
> Do these machines have ECC memory?

The physical hosts do. The affected systems are all QEMU/KVM virtual machines,
though.

> > Here's the more detailed behaviour of one of the machines with 6.1.57.
> >
> > $ uptime
> > 16:10:23 up 13 days 19:00, 1 user, load average: 3.21, 1.24, 0.57
>
> Yeah, that's the problem - such a rare, one off issue that we don't
> really even know where to begin looking. :(
>
> Given you seem to have a workload that occasionally triggers it,
> could you try to craft a reproducer workload that does stuff similar
> to your production workload and see if you can find out something
> that makes this easier to trigger?

Yup. I'm prioritizing this for the coming weeks.

> This implies you are using memcg to constrain memory footprint of
> the applications? Are these workloads running in memcgs that
> experience random memcg OOM conditions? Or maybe the failure
> correlates with global OOM conditions triggering memcg reclaim?

I'll have to read up on what memcg is and whether we're doing anything with it
on purpose.
At the moment I think this is just whatever we get from our baseline
environment with kernel or distro defaults.

How do I notice a memcg OOM? I've always tried to correlate all kernel log
messages and haven't seen any tracebacks other than the ones I posted. A global
(so, I guess, "regular") OOM wasn't involved in any case so far.

I can try digging deeper into the system VM statistics. We're running
telegraf/prometheus and monitor a fairly extensive set of system variables on
all systems. Is there anything specific I should look for?

>
> Cheers,
>
> Dave.
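
On the question of how to notice a memcg OOM, here is a minimal sketch of one possible check. It assumes the machines use the cgroup v2 unified hierarchy mounted at /sys/fs/cgroup, where each non-root cgroup exposes a memory.events file with "oom" and "oom_kill" counters. The function names are made up for illustration and are not part of any existing tool, and this has not been run on the affected systems.

```python
import os


def parse_memory_events(text):
    """Parse the 'key value' lines of a cgroup v2 memory.events file."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def find_memcg_ooms(root="/sys/fs/cgroup"):
    """Return sorted (cgroup_dir, oom, oom_kill) tuples with non-zero counters.

    Assumption: `root` is a cgroup v2 mount point. On cgroup v1 systems
    there is no memory.events file and the result is simply empty.
    """
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if "memory.events" not in filenames:
            continue
        try:
            with open(os.path.join(dirpath, "memory.events")) as fh:
                counters = parse_memory_events(fh.read())
        except OSError:
            continue  # the cgroup vanished or is not readable; skip it
        oom = counters.get("oom", 0)
        oom_kill = counters.get("oom_kill", 0)
        if oom or oom_kill:
            hits.append((dirpath, oom, oom_kill))
    return sorted(hits)


if __name__ == "__main__":
    for path, oom, oom_kill in find_memcg_ooms():
        print(f"{path}: oom={oom} oom_kill={oom_kill}")
```

If this prints nothing on an affected VM, that would suggest no memcg hit its limit since the cgroup was created, at least as far as these counters show. The same counters could be scraped periodically and lined up against the crash timestamps.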