On 2024-09-30 22:29, Andreas Dilger wrote:
On Sep 27, 2024, at 8:38 AM, Jesper Dybdal <jd-ext4@xxxxxxxxx> wrote:
A few times now, I have experienced a problem where the i_blocks field of a few inodes gets corrupted (replaced by extremely large numbers).
I don't believe that it is a disk error - the file system is on a RAID1 partition and the RAID consistency is checked regularly.
I also find it hard to believe that it is a RAM error - the machine has run memtest86+ overnight without finding anything.
The files I've seen corrupted are simple small text files that are modified only using an ordinary text editor (emacs).
Fsck fixes it.
The system is an up-to-date Debian Bookworm:
Linux nuser 6.1.0-25-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.106-3 (2024-08-26) x86_64 GNU/Linux
I do one thing that is not the default for ext4: I use the "nodelalloc" option (because several years ago, there was a discussion about "delalloc or not" from which I got the impression that nodelalloc was probably slightly safer - if the resulting performance reduction is not a problem, which it is not for me):
/dev/md0 on / type ext4 (rw,relatime,nodelalloc,errors=remount-ro)
Three examples follow below. Note that the bad field values, when interpreted as 48-bit signed numbers, are numerically small negative numbers (-25, -9, -3, respectively).
Excerpts from the fsck logs:
root: Inode 10748715, i_blocks is 281474976710631, should be 5. FIXED.
root: Inode 10751288, i_blocks is 281474976710647, should be 3. FIXED.
root: Inode 10748542, i_blocks is 281474976710653, should be 1. FIXED.
I don't know when the first two of these corruptions occurred, but the last one happened yesterday or the day before. The file in question was /etc/fstab, and I discovered the problem after I had edited fstab on Wednesday and rebooted on Thursday.
The corrupted files can be read and copied without problems. I have not dared to delete any of those files before fsck had fixed them.
What is going on here?
This looks like an underflow of the used blocks count on the inode:
281474976710631 = 0xffffffffffe7
281474976710647 = 0xfffffffffff7
281474976710653 = 0xfffffffffffd
These values are just below 2^48 blocks, which is the limit for the number
of blocks that fit into the available inode fields (32-bit i_blocks_lo,
16-bit i_blocks_hi), so a small negative count has wrapped around.
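
The wrap-around can be checked with a few lines of Python. This is only a sketch of the arithmetic, not ext4 code; the helper name as_signed is made up for illustration, and the three values are the ones from the fsck log above.

```python
BITS = 48  # 32-bit i_blocks_lo + 16-bit i_blocks_hi, as described above


def as_signed(value: int, bits: int = BITS) -> int:
    """Reinterpret an unsigned `bits`-wide value as two's-complement signed.

    Hypothetical helper for illustration only.
    """
    if value >= 1 << (bits - 1):
        return value - (1 << bits)
    return value


# i_blocks values reported by fsck in this thread
reported = [281474976710631, 281474976710647, 281474976710653]

for v in reported:
    # Show each value in decimal, in hex, and as a 48-bit signed number
    print(f"{v} = {v:#x} -> {as_signed(v)}")
```

Each value comes out as a small negative number (-25, -9, -3), which is what one would expect if the block count was decremented below zero.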
There is likely some kind of accounting error in the code. Is anything
unusual with access patterns for those files (large xattrs/ACLs; are they
files, directories, or special files; mmap, truncate, fallocate, etc.)?
No. They are all simple small text configuration files, and I edit them
using Emacs. The only slightly unusual thing is, as I wrote earlier,
that the file system is mounted with the nodelalloc option.
The files I have identified are fstab and two postfix configuration
files: /etc/postfix/{main.cf,master.cf}. The problem has actually hit
master.cf twice.
I have verified that the only reboot that happened between the fstab
edit on Wednesday and seeing the problem on Thursday was a clean,
deliberate reboot - no power outage or similar.
If you are able to reproduce with the /etc/fstab editing, possibly strace
could help to identify if something unusual is being done to the file.
I'll try, but I do not really expect Emacs to do strange things to the file.
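
To notice a recurrence soon after an edit, rather than at the next fsck, something like the following could be run after editing. It is a rough sketch under assumptions: that stat() on a mounted file system reports the corrupted i_blocks value in st_blocks (not verified here), and that a 1 MiB slack is a reasonable threshold. The function name suspicious_files and the slack value are invented for this example; preallocated or otherwise unusual files may show up as false positives.

```python
import os
import stat


def suspicious_files(root: str, slack_bytes: int = 1 << 20):
    """Yield (path, st_size, st_blocks) for regular files whose allocated
    block count is implausibly large compared with their size.

    Heuristic only; the slack threshold is an arbitrary assumption.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(st.st_mode):
                continue  # only regular files are of interest here
            # st_blocks is counted in 512-byte units on Linux
            if st.st_blocks * 512 > st.st_size + slack_bytes:
                yield path, st.st_size, st.st_blocks


if __name__ == "__main__":
    for path, size, blocks in suspicious_files("/etc"):
        print(f"{path}: size={size} st_blocks={blocks}")
```

If a file turns up right after an edit, the time window in which to look at the editor's system calls becomes much smaller.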
Cheers, Andreas
Thanks,
Jesper
--
Jesper Dybdal
https://www.dybdal.dk