[Bug 217572] Initial blocked tasks causing deterioration over hours until (nearly) complete system lockup and data loss with PostgreSQL 13

bugzilla-daemon@xxxxxxxxxx · Tue, 07 Nov 2023 10:25:59 +0000

https://bugzilla.kernel.org/show_bug.cgi?id=217572

--- Comment #24 from KN (kernel@xxxxxxxxxxxxxxxxxxxx) ---
Long time lurker here offering a potential workaround.

We experienced kernel issues nearly identical to those described here, with a
completely different setup. We saw the issue on our OKD cluster (4.12 and 4.13)
running on Fedora CoreOS (37 and 38). We had ~70 nodes with a specific workload
profile, and of these, anywhere between 1 and 5 would run into this issue each
night on our production cluster. These nodes were very IO intensive (Druid
middlemanager/ingest nodes) but not database related. The persistent volumes
that contributed the majority of the disk IO were formatted as xfs. We tried
for weeks to reproduce this error on demand but could not.

Whilst we have to accept this is a kernel bug and not an xfs bug, we *resolved*
our issues by switching from xfs to ext4. We haven't had a single instance of
this error since we migrated our persistent volumes away from xfs.

3 weeks and counting and not a single failure.
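
For anyone wanting to check which volumes on a node are still xfs before trying
the same switch, here is a minimal sketch. It assumes the standard
whitespace-separated /proc/mounts layout (device, mount point, filesystem type,
options, ...); the function name xfs_mounts is hypothetical and not part of any
tool mentioned in this bug.

```python
def xfs_mounts(mounts_text):
    """Return (device, mount_point) pairs for every xfs entry.

    mounts_text is expected to be the contents of /proc/mounts
    (assumption: one mount per line, whitespace-separated fields).
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass.
        if len(fields) >= 3 and fields[2] == "xfs":
            found.append((fields[0], fields[1]))
    return found
```

On a Linux node this could be called as xfs_mounts(open("/proc/mounts").read()).
It only lists candidates; the migration itself (new ext4 volume, data copy) is
not covered here.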
