[Bug 217572] Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13

https://bugzilla.kernel.org/show_bug.cgi?id=217572

--- Comment #22 from Christian Theune (ct@xxxxxxxxxxxxxxx) ---
(In reply to Dave Chinner from comment #21)
> 
> This is still an unreproducible, unfixed bug in upstream kernels.
> There is no known reproducer, so actually triggering it and hence
> performing RCA is extremely difficult at this point in time. We don't
> really even know what workload triggers it.

It seems I/O-pressure related, and we've seen it multiple times with various
PostgreSQL activities.

I've planned time for next week to analyze this further and to try to help
establish a reproducer.

> > We've had a multitude of crashes in the last weeks with the following
> > statistics:
> > 
> > 6.1.31 - 2 affected machines
> > 6.1.35 - 1 affected machine
> > 6.1.37 - 1 affected machine
> > 6.1.51 - 5 affected machines
> > 6.1.55 - 2 affected machines
> > 6.1.57 - 2 affected machines
> 
> Do these machines have ECC memory?

The physical hosts do. The affected systems are all Qemu/KVM virtual machines,
though.

> > Here's the more detailed behaviour of one of the machines with 6.1.57.
> > 
> > $ uptime
> >  16:10:23  up 13 days 19:00,  1 user,  load average: 3.21, 1.24, 0.57
> 
> Yeah, that's the problem - such a rare, one off issue that we don't
> really even know where to begin looking. :(
> 
> Given you seem to have a workload that occasionally triggers it,
> could you try to craft a reproducer workload that does stuff similar
> to your production workload and see if you can find out something
> that makes this easier to trigger?

Yup. I'm prioritizing this for the coming weeks.


> This implies you are using memcg to constrain memory footprint of
> the applications? Are these workloads running in memcgs that
> experience random memcg OOM conditions? Or maybe the failure
> correlates with global OOM conditions triggering memcg reclaim?

I'll have to read up on what memcg is and whether we're using it on purpose. At
the moment I believe this is just whatever we get from our baseline environment
with kernel or distro defaults.

How do I notice a memcg OOM? I've always tried to correlate all kernel log
messages and haven't seen any tracebacks other than the ones I posted.

A global (so I guess a "regular") OOM wasn't involved in any case so far.

I can try digging deeper into system VM statistics. We're running
telegraf/prometheus and monitor a relatively exhaustive set of system metrics
on all machines. Anything specific I could look for?
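(For anyone following along: memcg here means the memory control group. One way
to spot past memcg OOM kills, assuming a cgroup v2 hierarchy mounted at
/sys/fs/cgroup as on current systemd distros, is to scan each cgroup's
memory.events file for a non-zero "oom_kill" counter; the paths and depth below
are illustrative, not specific to our machines:)

```shell
#!/bin/sh
# Sketch: walk the first two levels of the cgroup v2 hierarchy and report
# any cgroup whose memory.events shows oom_kill > 0, i.e. a memcg-local
# OOM kill that may not stand out in the global kernel log.
found=0
for f in /sys/fs/cgroup/*/memory.events /sys/fs/cgroup/*/*/memory.events; do
    [ -f "$f" ] || continue
    # memory.events lines look like "oom_kill 3"; pull the counter value.
    kills=$(awk '/^oom_kill /{print $2}' "$f")
    if [ "${kills:-0}" -gt 0 ]; then
        echo "$f: oom_kill=$kills"
        found=1
    fi
done
if [ "$found" -eq 0 ]; then
    echo "no memcg OOM kills recorded"
fi
# Memcg OOM kills also leave "Memory cgroup out of memory" lines in the
# kernel log, so this is worth cross-checking:
#   dmesg | grep -i 'Memory cgroup out of memory'
```

On a machine without cgroup v2 (or with no kills recorded) this just prints
"no memcg OOM kills recorded", so it's safe to run anywhere.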

> 
> Cheers,
> 
> Dave.
