Re: ext4 damage suspected in between 5.15.167 - 5.15.170

Nikolai Zhubr <zhubr.2@xxxxxxxxx> · Sat, 14 Dec 2024 22:58:24 +0300

Hi Ted,

On 12/13/24 19:12, Theodore Ts'o wrote:
stable@xxxxxxxxxx" to the commit description.  However, they are not
obligated to do that, so there is an auxiliary system which uses AI to
intuit which patches might be a bug fix.  There are also automated
systems that try to automatically figure out which patches might be

Oh, so meanwhile it has gotten even worse than I imagined :-) Thanks for
pointing that out.

Note that some hardware errors can be caused by one-off errors, such
as cosmic rays causing a bit-flip in memory DIMM.  If that happens,
RAID won't save you, since the error was introduced before an updated

Cosmic rays are certainly a possibility, but based on previous episodes
I'd still rather bet on a more usual "subtle interaction" problem,
either exactly the same as [1] or something similar.
I even tried to run the existing test for this particular case as
described in [2], but it is not very user-friendly and somehow exits
abnormally without doing any interesting work. I'll get back to
it later when I have some time.

[1] https://lore.kernel.org/stable/20231205122122.dfhhoaswsfscuhc3@quack3/
[2] https://lwn.net/Articles/954364/

The location of block allocation bitmaps never gets changed, so this
sort of thing only happens due to hardware-induced corruption.

Well, unless e.g. some modified sectors start being flushed to random 
wrong offsets, like in [1] above, or something similar.

Looking at the dumpe2fs output, it looks like it was created
relatively recently (July 2024) but it doesn't have the metadata
checksum feature enabled, which has been enabled for quite a long

Yes. That was intentional, for better compatibility with even more
ancient stuff. Maybe the time has come to reconsider that approach, though.
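
For anyone who wants to check the same thing on their own filesystems, here
is a minimal sketch that looks for metadata_csum in the "Filesystem features:"
line of `dumpe2fs -h` output. The helper names (parse_features,
has_metadata_csum) and the device path in the usage note are made up for
illustration; only the dumpe2fs invocation and the feature name are real.

```python
import subprocess


def parse_features(dumpe2fs_header):
    """Return the set of feature names listed on the
    'Filesystem features:' line of `dumpe2fs -h` output."""
    for line in dumpe2fs_header.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Filesystem features":
            return set(value.split())
    # No features line found (e.g. not an ext2/3/4 superblock dump).
    return set()


def has_metadata_csum(device):
    """Hypothetical helper: run `dumpe2fs -h` (superblock summary only)
    on the given device and report whether metadata_csum is enabled.
    Usually needs root to read the block device."""
    result = subprocess.run(
        ["dumpe2fs", "-h", device],
        capture_output=True,
        text=True,
        check=True,
    )
    return "metadata_csum" in parse_features(result.stdout)
```

Usage would be something like has_metadata_csum("/dev/sdXN"), with the device
path replaced by a real one. As far as I know the feature can be turned on for
an existing, unmounted filesystem with tune2fs -O metadata_csum, but check
tune2fs(8) for your e2fsprogs version before relying on that.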

You got lucky because the block allocation bitmap location was
corrupted to an obviously invalid value.  But if it had been a

Absolutely. I was really amazed when I realized that :-)
It saved me days or even weeks of unnecessary verification work.

Otherwise, I strongly encourage you to learn, and to take
responsibility for the health of your own system.  And ideally, you
can also use that knowledge to help other users out, which is the only
way the free-as-in-beer ecosystem can flourish; by having everybody

True. Generally I try to follow that, as much as appears possible.
It is sad that direct end-user-to-developer communication for solving
issues is becoming increasingly problematic here.
Anyway, thank you for the friendly words, useful hints and good references!

Regards,

Nick

helping each other.  Who knows, maybe you could even get a job doing
it for a living.  :-) :-) :-)

Cheers,