Re: ext4 damage suspected in between 5.15.167 - 5.15.170

Jan Kara <jack@xxxxxxx> · Mon, 16 Dec 2024 13:59:41 +0100

Hi Nikolai!

On Sat 14-12-24 22:58:24, Nikolai Zhubr wrote:
> On 12/13/24 19:12, Theodore Ts'o wrote:
> > Note that some hardware errors can be caused by one-off errors, such
> > as cosmic rays causing a bit-flip in memory DIMM.  If that happens,
> > RAID won't save you, since the error was introduced before an updated
> 
> Certainly cosmic rays is a possibility, but based on previous episodes I'd
> still rather bet on a more usual "subtle interaction" problem, either exact
> same or some similar to [1].
> I even tried to run an existing test for this particular case as described
> in [2] but it is not too user-friendly and somehow exits abnormally without
> actually doing any interesting work. I'll get back to it later when I have
> some time.
> 
> [1] https://lore.kernel.org/stable/20231205122122.dfhhoaswsfscuhc3@quack3/
> [2] https://lwn.net/Articles/954364/
> 
> > The location of block allocation bitmaps never gets changed, so this
> > sort of thing only happens due to hardware-induced corruption.
> 
> Well, unless e.g. some modified sectors start being flushed to random wrong
> offsets, like in [1] above, or something similar.

Note that the above bug led to writing file data to another position in
the same file. As such, it cannot really lead to metadata corruption.
Corrupting data in a file is a relatively frequent event (given the wide
variety of manipulations we do with file data). OTOH I've never seen
metadata corrupted like this (in particular because ext4 has additional
sanity checks that newly allocated blocks don't overlap with critical fs
metadata). In theory, there could be a software bug that leads to writing
a sector to a wrong position but frankly, in all the cases I've
investigated so far, such problems ended up being HW related.

> > Otherwise, I strongly encourage you to learn, and to take
> > responsibility for the health of your own system.  And ideally, you
> > can also use that knowledge to help other users out, which is the only
> > way the free-as-in-beer ecosystem can flourish; by having everybody
> 
> True. Generally I try to follow that, as much as appears possible.
> It is sad a direct communication end-user-to-developer for solving issues is
> becoming increasingly problematic here.

On one hand I understand you, on the other hand back in the good old days
(and I remember those as well ;) you wouldn't get much help when running
a kernel over three years old either. I understand you're running a stable
kernel that gets at least some updates, but that's meant more for companies
that build their products on top of it and have teams available for
debugging issues. For an end user I find some distribution kernels (Debian,
Ubuntu, openSUSE, Fedora) more suitable, as they get much more scrutiny
before being released than -stable and people there are also more willing
to look at issues with older kernels (that are still supported by the
distro).

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR