On Mon, Apr 18, 2022 at 07:33:38PM +1000, Peter Urbanec wrote:
> First I updated e2fsprogs to 1.46.5 and then tried:
>
> # e2fsck -f -C 0 -b 32768 -n /dev/md0
> e2fsck 1.46.5 (30-Dec-2021)
> Pass 1: Checking inodes, blocks, and sizes
> Error reading block 32 (Attempt to read block from filesystem resulted in
> short read). Ignore error? no
>
> e2fsck: aborted
>
> Initially things looked promising and there were no errors being reported
> even at 47% of the way through the check. The check terminated at
> approximately the 50% mark with what appears to be a 32-bit overflow issue.

The 32-bit overflow issue is in the reporting of the error.
struct_io_channel's read_error() callback function, which is set to
e2fsck_handle_read_error(), is only called when there is an I/O error
(that's the short read).  So it's not the cause of the failure, but
because of that bug, we don't know the block number where the apparent
I/O error was found.

Did you check the system logs to see if there were any kernel messages
that might give us a hint?

Since you built e2fsprogs 1.46.5, can you build it making sure that
debugging symbols are included, and then run the binary under a
debugger, so that when you get the error you can hit ^C to escape out
to the debugger, print the stack trace, and step up the stack to see
where in the e2fsck program we were trying to read the block, and what
the block number should have been.  So something like:

	./configure CFLAGS=-g ; make ; gdb e2fsck/e2fsck

> Ideally, the `block` member of `struct_io_channel` would be of type
> `unsigned long long`, but changing this may introduce binary compatibility
> issues. As an alternative, it may be prudent to perform an early check for
> the total number of blocks in a file system and refuse to run e2fsck (and
> other tools) if that number would cause an overflow.

Well, what we can do is to add new callback functions which are 64-bit
clean, and new versions of e2fsck can provide both the 32-bit and
64-bit error functions.  Refusing to run is probably a bit extreme,
since this is merely a cosmetic issue, and it only shows up when there
is an I/O error.

Creating 64-bit clean callback functions for io_channel's read_error()
and write_error() was always part of the plan, and I thought we had
done it already --- but because it was less critical, since RAID arrays
never fail ("What never? Well, hardly ever!"[1]), I just never got
around to it.  :-/

[1] https://gsarchive.net/pinafore/web_opera/pin04.html

> Once I transplant the drives to a 64-bit machine, is there a way I could use
> e2image to create a file that I can use to test whether an e2fsck run will
> work?

An alternative which might be easier would be to create a scratch test
file system using dm-thin, which will allow you to simulate a very
large file system with thin provisioning[2].  It will only take
(roughly) as much space as you write into it; a rough sketch of the
setup is below.

[2] https://wiki.archlinux.org/title/LVM#Thin_provisioning

Linux will tend to spread the use of its block groups over the LBA
space as you start filling it and creating a bunch of directories.  So
if you, say, unpack a Linux kernel tree and build it, and then unpack a
different Linux kernel version tree into a different directory
hierarchy and build it, that should be enough to make sure that you are
using a variety of block groups spaced out through the block device.
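For example, something along these lines should do the trick.  The
volume group name (vg0), the pool and volume names, the sizes, and the
mount point are all just placeholders --- adjust them to whatever you
actually have:

	# thin pool backed by ~32G of real space, in an existing VG
	lvcreate --type thin-pool -L 32G -n testpool vg0
	# a 20T thin volume carved out of that pool; space is only
	# allocated as it is actually written
	lvcreate --thin -V 20T -n scratch vg0/testpool
	mkfs.ext4 /dev/vg0/scratch
	mkdir -p /mnt/scratch
	mount /dev/vg0/scratch /mnt/scratch
	# ... then unpack and build a couple of kernel trees in /mnt/scratch ...

The -V size is what ext4 will believe it has; the -L size is the real
space backing the pool, so keep an eye on the Data% column in "lvs"
output to make sure the pool doesn't fill up while you're building
kernels.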
You could then try running e2fsck on a 32-bit platform and see if you
can replicate the problem, and then do an off-line resize, etc.,
without risking your data and without requiring a huge amount of
space.

Cheers,

						- Ted
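P.S.  To be concrete about the debugging suggestion above: instead of
trying to hit ^C at just the right moment, you could set a breakpoint
on e2fsck_handle_read_error(), which is the function that gets called
when the short read is reported, and let gdb stop there for you.
Roughly (which frame you need to go "up" to, and what the block-number
variable is called there, will depend on where it stops):

	% ./configure CFLAGS=-g
	% make
	% gdb --args e2fsck/e2fsck -f -C 0 -b 32768 -n /dev/md0
	(gdb) break e2fsck_handle_read_error
	(gdb) run
	(gdb) bt
	(gdb) up
	      ... repeat "up" until you are in the caller that knows which
	      block it was trying to read, then "print" the variable that
	      holds the block number in that frame ...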