On Thu, Sep 05, 2024 at 09:15:42AM -0400, Theodore Ts'o wrote:
> Hmm, so you can reliably reproduce this when you boot the VM into
> 6.10, but not when using 6.7?  And this is on an arm64 system,
> correct?  What is the VM doing?  Does this show up when you boot it,
> or when you kick off some kind of appliance?  What sort of devices are
> you using with the VM, and what is the VMM that you are using (e.g.,
> qemu, Oracle's Cloud, etc.)?
>
> Is the VM image something you can share?  Or can you share the
> metadata-only image using the e2image -Q command?

Hi Ted,

I've had another instance of it, discovered today. A VM that was
running fine yesterday was dead this morning - ssh closed any attempt
to connect within a few seconds, and trying to log in through the
hypervisor console got a response to typing and then nothing further.
Even sysrq was dead, because the console had been closed.

I dumped the VM's memory to disk (it's only 128MB) and set about
extracting the kernel messages, and found these as the last messages:

EXT4-fs error (device vda1): ext4_lookup:1854: inode #32801: comm mandb: iget: checksum invalid
Aborting journal on device vda1-8.
EXT4-fs (vda1): Remounting filesystem read-only

So we now have the cause, the inode number, and the reason (an invalid
checksum).

I discovered today that ext4 records details of the error on the
filesystem, accessible using tune2fs:

First error time:         Fri Oct 11 07:18:38 2024
First error function:     ext4_lookup
First error line #:       1854
First error inode #:      32801
First error err:          EFSBADCRC
Last error time:          Fri Oct 11 07:18:38 2024
Last error function:      ext4_lookup
Last error line #:        1854
Last error inode #:       32801
Last error err:           EFSBADCRC
Checksum type:            crc32c
Checksum:                 0xb71f356b

Running e2fsck -n on the filesystem reveals no checksum errors, only
two block/inode counts being wrong:

e2fsck 1.46.2 (28-Feb-2021)
Warning: skipping journal recovery because doing a read-only filesystem check.
/dev/loop0p1 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (106574, counted=106578).
Fix? no

Free inodes count wrong (82348, counted=82353).
Fix? no

/dev/loop0p1: 48724/131072 files (1.2% non-contiguous), 417202/523776 blocks

So I think we can conclude that the on-disk checksums are correct, and
apart from the filesystem still being mounted in the guest (albeit now
in a read-only errored state with an unrecovered journal), the e2fsck
result looks healthy.

This reminded me of our previous thread where checksums were apparently
invalid, which came down to a compiler bug miscompiling the ext4
checksum code:

https://lore.kernel.org/all/20210105195004.GF1551@xxxxxxxxxxxxxxxxxxxxx/

However, having looked at the generated code, this isn't a repeat of
that (thankfully!) It does, though, have the feel of something very,
very subtle - a specific set of Swiss cheese slices coming together
such that all the holes line up...
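Going back to the tune2fs output for a moment: those error records live
in the superblock itself, so they can be pulled straight out of an
image without tune2fs. Here's a minimal sketch of doing that - the
field offsets are from my reading of fs/ext4/ext4.h, so treat them as
assumptions and verify them against your tree:

#include <stdio.h>
#include <stdint.h>
#include <time.h>

static uint32_t le32(const unsigned char *p)
{
	return p[0] | p[1] << 8 | p[2] << 16 | (uint32_t)p[3] << 24;
}

int main(int argc, char *argv[])
{
	unsigned char sb[1024];
	FILE *f;
	time_t t;

	if (argc != 2 || !(f = fopen(argv[1], "rb")))
		return 1;

	/* the primary superblock sits 1024 bytes into the device/image */
	if (fseek(f, 1024, SEEK_SET) != 0 || fread(sb, sizeof(sb), 1, f) != 1)
		return 1;

	/* Assumed offsets within struct ext4_super_block:
	 *   0x194 s_error_count       0x198 s_first_error_time
	 *   0x19c s_first_error_ino   0x1a8 s_first_error_func[32]
	 *   0x1c8 s_first_error_line
	 */
	printf("error count:         %u\n", le32(sb + 0x194));
	t = (time_t)le32(sb + 0x198);
	printf("first error time:    %s", ctime(&t));
	printf("first error inode #: %u\n", le32(sb + 0x19c));
	printf("first error at:      %.32s:%u\n",
	       (const char *)(sb + 0x1a8), le32(sb + 0x1c8));
	fclose(f);
	return 0;
}

Pointed at the image (or the loop device), it should print the same
"first error" values that tune2fs reports above.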
Let's have a look at the data for the inode on disk:

debugfs:  ncheck 32801
Inode	Pathname
32801	/usr/share/man/man3/tmpnam.3.gz

debugfs:  stat <32801>
Inode: 32801   Type: regular    Mode:  0644   Flags: 0x80000
Generation: 1819410294    Version: 0x00000000:00000004
User:     0   Group:     0   Project:     0   Size: 1623
File ACL: 0
Links: 1   Blockcount: 8
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x649205ad:2957c9bc -- Tue Jun 20 21:01:49 2023
 atime: 0x64920601:4d5bb184 -- Tue Jun 20 21:03:13 2023
 mtime: 0x644e45fe:00000000 -- Sun Apr 30 11:42:06 2023
crtime: 0x649205ac:7432ceb4 -- Tue Jun 20 21:01:48 2023
Size of extra inode fields: 32
Inode checksum: 0x569f27e5
EXTENTS:
(0):59627

debugfs:  imap <32801>
Inode 32801 is part of block group 4
	located at block 2339, offset 0x0000

debugfs:  id <32801>
0000  a481 0000 5706 0000 0106 9264 ad05 9264  ....W......d...d
0020  fe45 4e64 0000 0000 0000 0100 0800 0000  .ENd............
0040  0000 0800 0400 0000 0af3 0100 0400 0000  ................
0060  0000 0000 0000 0000 0100 0000 ebe8 0000  ................
0100  0000 0000 0000 0000 0000 0000 0000 0000  ................
*
0140  0000 0000 76ff 716c 0000 0000 0000 0000  ....v.ql........
0160  0000 0000 0000 0000 0000 0000 e527 0000  .............'..
0200  2000 9f56 bcc9 5729 0000 0000 84b1 5b4d   ..V..W)......[M
0220  ac05 9264 b4ce 3274 0000 0000 0000 0000  ...d..2t........
0240  0000 0000 0000 0000 0000 0000 0000 0000  ................
*

We can see the checksum in there, so let's go searching the memory dump
for it. (Offsets below are hexdump -C offsets from the start of memory;
PAGE_OFFSET=0xc0000000.) The only place the bytes corresponding to the
checksum (e5, 27, ... 57, 29) appear is:

07973000  a4 c7 04 12 9b 86 ca 80  e4 21 cd 00 4f 80 3d 88  |.........!..O.=.|
07973010  fe 45 4e 64 00 00 00 00  00 00 01 00 08 00 00 00  |.ENd............|
07973020  00 00 08 00 04 00 00 00  0a f3 01 00 04 00 00 00  |................|
07973030  00 00 00 00 00 00 00 00  01 00 00 00 eb e8 00 00  |................|
07973040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
07973060  00 00 00 00 76 ff 71 6c  00 00 00 00 00 00 00 00  |....v.ql........|
07973070  00 00 00 00 00 00 00 00  00 00 00 00 e5 27 00 00  |.............'..|
07973080  20 00 9f 56 bc c9 57 29  00 00 00 00 84 b1 5b 4d  | ..V..W)......[M|
07973090  ac 05 92 64 b4 ce 32 74  00 00 00 00 00 00 00 00  |...d..2t........|
079730a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

All other bytes correspond except for the first 16 (well, the first
byte is correct, but the rest of that row is different). Note the block
number for the data - 0xe8eb (59627).

To make sure that there isn't a journal entry pending on disk that
might update the inode with what I found in memory, I naively searched
the journal blocks for the block number in the embedded ext4 extent in
the inode, and found nothing. Either that isn't sufficient, or there
isn't anything outstanding. In any case, if the timestamps had changed
(bytes 8-15), then the nanosecond/epoch parts that appear later in the
ext4_inode structure should also be different. It would be very lucky
to only change the seconds part of the timestamp!

We know that the ext4_inode is looked up through a buffer head, and
from the imap command in debugfs, this inode is at block 2339 (0x923),
offset 0.
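None of the memory searching here is clever, by the way - it amounts to
something like the sketch below, which scans the raw dump for a byte
pattern (the run spanning the two inode checksum words is just an
illustrative choice; any distinctive run of bytes would do) and prints
each hit as a file offset plus the kernel virtual address it would
correspond to, assuming as above that virtual = dump offset +
PAGE_OFFSET:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_OFFSET 0xc0000000UL

int main(int argc, char *argv[])
{
	/* illustrative pattern: the inode bytes from i_checksum_lo
	 * through i_checksum_hi (e5 27 00 00 20 00 9f 56) */
	static const unsigned char pat[] = {
		0xe5, 0x27, 0x00, 0x00, 0x20, 0x00, 0x9f, 0x56,
	};
	unsigned char *buf;
	long size, off;
	FILE *f;

	if (argc != 2 || !(f = fopen(argv[1], "rb")))
		return 1;
	fseek(f, 0, SEEK_END);
	size = ftell(f);
	rewind(f);
	if (!(buf = malloc(size)) || fread(buf, 1, size, f) != (size_t)size)
		return 1;

	for (off = 0; off + (long)sizeof(pat) <= size; off++)
		if (!memcmp(buf + off, pat, sizeof(pat)))
			/* the virt column assumes dump offset 0 maps to
			 * PAGE_OFFSET, i.e. a linear lowmem mapping */
			printf("offset 0x%08lx  virt 0x%08lx\n",
			       off, (unsigned long)off + PAGE_OFFSET);
	free(buf);
	fclose(f);
	return 0;
}

Searching for a little-endian pointer value (e.g. 00 30 97 c7 for
0xc7973000) works the same way, which is what comes next.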
We find a pointer to 0xc7973000 only once in the entire memory dump, at
offset 0x185af9c:

0185af80  19 00 00 00 80 af 85 c1  2c a4 fe c7 00 00 00 00  |........,.......|
0185af90  23 09 00 00 00 00 00 00  00 10 00 00 00 30 97 c7  |#............0..|
0185afa0  80 a0 80 c1 50 84 43 c0  00 00 00 00 ac af 85 c1  |....P.C.........|
0185afb0  ac af 85 c1 00 00 00 00  00 00 00 00 00 00 00 00  |................|

which appears to be the struct buffer_head:

b_state = 0x00000019 (BH_Mapped | BH_Req | BH_Uptodate)
b_this_page = 0xc185af80
b_page/b_folio = 0xc7fea42c
b_blocknr = 0x923 (2339) - as expected
b_size = 0x1000 (4096) - page size
b_data = 0xc7973000
b_bdev = 0xc180a080
b_end_io = 0xc0438450
b_private = NULL
b_assoc_buffers.next = 0xc185afac \ empty
b_assoc_buffers.prev = 0xc185afac / list
b_count = 0

Chasing this to the struct page (I don't think it's a folio, because it
isn't large enough) at offset 0x7fea42c:

07fea420  ff ff ff ff 02 00 00 00  00 70 6f c2 24 42 00 14  |.........po.$B..|
07fea430  a0 a3 fe c7 74 a5 fe c7  20 a3 80 c1 23 09 00 00  |....t... ...#...|
07fea440  80 af 85 c1 ff ff ff ff  02 00 00 00 00 d0 d6 c2  |................|

flags = 0x14004224, which I believe means PG_referenced | PG_lru |
  PG_workingset | PG_private | LRU_REFS(2) | LRU_GEN(2)
lru.next = 0xc7fea3a0
lru.prev = 0xc7fea574
mapping = 0xc180a320
index = 0x923 (2339) - as expected
private = 0xc185af80 - pointer to the struct buffer_head above
_mapcount = 0xffffffff
_refcount = 2
memcg_data = 0xc2d6d000

So it looks like the page is still in use and not free. This is the
only place I can find a pointer to the struct buffer_head in the memory
dump, apart from buffer_head.b_this_page.

Now, given that the memory dump of the VM was created hours after it
crashed, I can't see the CPU cache being an issue: if the data had been
corrupted in the cache at the time ext4 came across the error, it would
have been evicted hours ago.

I thought maybe something could be gleaned from the values in the
corruption, but they make no sense to me - they don't seem to be valid
pointers:

corrupted       on-disk
0x1204c7a4      0x000081a4
0x80ca869b      0x00000657
0x00cd21e4      0x64920601
0x883d804f      0x649205ad

So I'm rather stumped at the moment. I'm now thinking that this isn't
an ext4 issue, but instead could be something really very subtle in the
MM, virtio, or even qemu. I don't see it being qemu, because 6.7 was
fine. This is the second 6.10 VM that has died on this machine - it's
running exactly the same kernel binary as the other VM that previously
died.

I've saved the qcow of the filesystem, and I also have the coredump of
the VM's memory in case it needs further investigation. I'm about to
throw 6.11 on at least some of these VMs (including the two that
failed), so the on-disk filesystem is going to be e2fscked shortly.

As I said above, I don't think this is an ext4 issue, but something
corrupting ext4's in-memory data structures. It could be related to the
VM having a relatively small amount of memory by modern standards
(maybe adding enough MM pressure to tickle a bug in there). Or maybe we
have another case of a tail-call optimisation gone wrong, corrupting a
pointer and causing ext4's in-memory data to be scribbled over. I'm
grasping at straws at the moment, though...
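For completeness, the structure decoding above was done by hand; an
equivalent throwaway decoder would look something like this, printing
the buffer_head fields from a raw image at a given file offset using
the 32-bit layout assumed above (b_state, b_this_page, b_page, a 64-bit
b_blocknr, b_size, b_data, b_bdev, b_end_io, b_private). The layout
varies with kernel version and config, so treat it as a sketch, not a
reference:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

static uint32_t rd32(FILE *f)
{
	unsigned char b[4];

	if (fread(b, 1, 4, f) != 4)
		exit(1);
	return b[0] | b[1] << 8 | b[2] << 16 | (uint32_t)b[3] << 24;
}

int main(int argc, char *argv[])
{
	FILE *f;
	uint32_t lo, hi;

	/* usage: ./bhdump dump.bin 0x185af80 */
	if (argc != 3 || !(f = fopen(argv[1], "rb")))
		return 1;
	fseek(f, (long)strtoul(argv[2], NULL, 0), SEEK_SET);

	printf("b_state     = 0x%08x\n", rd32(f));
	printf("b_this_page = 0x%08x\n", rd32(f));
	printf("b_page      = 0x%08x\n", rd32(f));
	lo = rd32(f);
	hi = rd32(f);		/* b_blocknr is a 64-bit sector_t */
	printf("b_blocknr   = 0x%08x%08x\n", hi, lo);
	printf("b_size      = 0x%08x\n", rd32(f));
	printf("b_data      = 0x%08x\n", rd32(f));
	printf("b_bdev      = 0x%08x\n", rd32(f));
	printf("b_end_io    = 0x%08x\n", rd32(f));
	printf("b_private   = 0x%08x\n", rd32(f));
	fclose(f);
	return 0;
}

--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!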