Hi Nikolai!

On Sat 14-12-24 22:58:24, Nikolai Zhubr wrote:
> On 12/13/24 19:12, Theodore Ts'o wrote:
> > Note that some hardware errors can be caused by one-off errors, such
> > as cosmic rays causing a bit-flip in memory DIMM. If that happens,
> > RAID won't save you, since the error was introduced before an updated
>
> Certainly cosmic rays is a possibility, but based on previous episodes I'd
> still rather bet on a more usual "subtle interaction" problem, either exact
> same or some similar to [1].
> I even tried to run an existing test for this particular case as described
> in [2] but it is not too user-friendly and somehow exits abnormally without
> actually doing any interesting work. I'll get back to it later when I have
> some time.
>
> [1] https://lore.kernel.org/stable/20231205122122.dfhhoaswsfscuhc3@quack3/
> [2] https://lwn.net/Articles/954364/
>
> > The location of block allocation bitmaps never gets changed, so this
> > sort of thing only happens due to hardware-induced corruption.
>
> Well, unless e.g. some modified sectors start being flushed to random wrong
> offsets, like in [1] above, or something similar.

Note that the above bug led to writing file data to another position in
that file. As such it cannot really lead to metadata corruption. Corrupting
data in a file is a relatively frequent event (given the wide variety of
manipulations we do with file data). OTOH I've never seen metadata
corrupted like this (in particular because ext4 has additional sanity
checks that newly allocated blocks don't overlap with critical fs
metadata). In theory, there could be a software bug leading to writing a
sector to a wrong position but frankly, in all the cases I've investigated
so far such bugs ended up being HW related.

> > Otherwise, I strongly encourage you to learn, and to take
> > responsibility for the health of your own system.
> > And ideally, you can also use that knowledge to help other users out,
> > which is the only way the free-as-in-beer ecosystem can flurish; by
> > having everybody
>
> True. Generally I try to follow that, as much as appears possible.
> It is sad a direct communication end-user-to-developer for solving issues is
> becoming increasingly problematic here.

On one hand I understand you, on the other hand back in the good old days
(and I remember those as well ;) you wouldn't get much help when running a
kernel over three years old either. And I understand you're running a
stable kernel that gets at least some updates but that's meant more for
companies that build their products on top of that and have teams available
for debugging issues. For an end user I find some distribution kernels
(Debian, Ubuntu, openSUSE, Fedora) more suitable as they get much more
scrutiny before being released than -stable and also people there are more
willing to look at issues with older kernels (that are still supported by
the distro).

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR