Yes, we can collect the information from the block associated with this ram_addr. But instead of duplicating the necessary code in both i386 and ARM, I went back to adding the change to the kvm_hwpoison_page_add() function, which is called from both the i386- and ARM-specific code. I also needed a way to retrieve the information while we are dealing with the SIGBUS signal, so I created a new function (with an associated struct) that gathers the information from the RAMBlock:

    qemu_ram_block_location_info_from_addr(ram_addr_t ram_addr, struct RAMBlockInfo *b_info)

This way we can take RCU_READ_LOCK_GUARD() and retrieve all the data.
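To make the idea concrete, here is a rough sketch of what the struct and the lookup could look like; the field names, the includes, the RAMBLOCK_FOREACH() walk and the pstrcpy() copy are my assumptions for illustration, not necessarily what the actual patch does:

#include "qemu/osdep.h"
#include "qemu/cutils.h"
#include "qemu/rcu.h"
#include "exec/ramblock.h"
#include "exec/ramlist.h"

struct RAMBlockInfo {
    char idstr[256];        /* name of the backing RAMBlock, e.g. "ram-node1" */
    ram_addr_t offset;      /* offset of the faulting address within the block */
    size_t page_size;       /* backing page size (may be a 2M/1G huge page) */
};

/* Walk the RAMBlock list under RCU and copy out the location data for
 * the block covering ram_addr, so the caller can report it later. */
void qemu_ram_block_location_info_from_addr(ram_addr_t ram_addr,
                                            struct RAMBlockInfo *b_info)
{
    RAMBlock *rb;

    RCU_READ_LOCK_GUARD();
    RAMBLOCK_FOREACH(rb) {
        if (ram_addr >= rb->offset &&
            ram_addr < rb->offset + rb->used_length) {
            pstrcpy(b_info->idstr, sizeof(b_info->idstr), rb->idstr);
            b_info->offset = ram_addr - rb->offset;
            b_info->page_size = rb->page_size;
            return;
        }
    }
}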
Makes sense.
Note about ARM failing on large pages:
--------------------------------------
I could verify that ARM VMs impacted by memory errors on a large underlying memory page can end up looping while reporting the error: a VM encountering such an error has a high probability of crashing, and it can then try to save a vmcore in a kdump phase.
Yeah, that's what I thought. If you rip out 1 GiB of memory, your VM is going to have a bad time :/
This fix introduces qemu messages reporting errors when they are relayed to the VM. A large page poisoned by an error on ARM can make a VM loop on the vmcore collection phase, and the console shows messages like the following appearing every 10 seconds (before the change):

vvv
Starting Kdump Vmcore Save Service...
[ 3.095399] kdump[445]: Kdump is using the default log level(3).
[ 3.173998] kdump[481]: saving to /sysroot/var/crash/127.0.0.1-2025-01-27-20:17:40/
[ 3.189683] kdump[486]: saving vmcore-dmesg.txt to /sysroot/var/crash/127.0.0.1-2025-01-27-20:17:40/
[ 3.213584] kdump[492]: saving vmcore-dmesg.txt complete
[ 3.220295] kdump[494]: saving vmcore
[ 10.029515] EDAC MC0: 1 UE unknown on unknown memory ( page:0x116c60 offset:0x0 grain:1 - APEI location: )
[ 10.033647] [Firmware Warn]: GHES: Invalid address in generic error data: 0x116c60000
[ 10.036974] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[ 10.040514] {2}[Hardware Error]: event severity: recoverable
[ 10.042911] {2}[Hardware Error]: Error 0, type: recoverable
[ 10.045310] {2}[Hardware Error]: section_type: memory error
[ 10.047666] {2}[Hardware Error]: physical_address: 0x0000000116c60000
[ 10.050486] {2}[Hardware Error]: error_type: 0, unknown
[ 20.053205] EDAC MC0: 1 UE unknown on unknown memory ( page:0x116c60 offset:0x0 grain:1 - APEI location: )
[ 20.057416] [Firmware Warn]: GHES: Invalid address in generic error data: 0x116c60000
[ 20.060781] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[ 20.065472] {3}[Hardware Error]: event severity: recoverable
[ 20.067878] {3}[Hardware Error]: Error 0, type: recoverable
[ 20.070273] {3}[Hardware Error]: section_type: memory error
[ 20.072686] {3}[Hardware Error]: physical_address: 0x0000000116c60000
[ 20.075590] {3}[Hardware Error]: error_type: 0, unknown
^^^

With the fix, we now have a flood of messages like:

vvv
qemu-system-aarch64: Memory Error on large page from ram-node1:d5e00000+0 +200000
qemu-system-aarch64: Guest Memory Error at QEMU addr 0xffff35c79000 and GUEST addr 0x115e79000 of type BUS_MCEERR_AR injected
qemu-system-aarch64: Memory Error on large page from ram-node1:d5e00000+0 +200000
qemu-system-aarch64: Guest Memory Error at QEMU addr 0xffff35c79000 and GUEST addr 0x115e79000 of type BUS_MCEERR_AR injected
qemu-system-aarch64: Memory Error on large page from ram-node1:d5e00000+0 +200000
qemu-system-aarch64: Guest Memory Error at QEMU addr 0xffff35c79000 and GUEST addr 0x115e79000 of type BUS_MCEERR_AR injected
^^^

In both cases, this situation loops indefinitely! I'm just pointing out the change of behavior; fixing this issue would most probably require VM kernel modifications, or a workaround in qemu when errors are reported too often, but that is out of the scope of this qemu fix.
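For illustration, here is a minimal sketch of how such a "large page" warning could be emitted from the hwpoison path; the page-size check, the use of error_report()/qemu_target_page_size() and the message wording are assumptions of mine, not necessarily what the patch does:

#include "qemu/osdep.h"
#include "qemu/error-report.h"
#include "exec/target_page.h"

/* Called from the i386 and ARM SIGBUS handling code for each poisoned page. */
void kvm_hwpoison_page_add(ram_addr_t ram_addr)
{
    struct RAMBlockInfo info;

    qemu_ram_block_location_info_from_addr(ram_addr, &info);
    if (info.page_size > qemu_target_page_size()) {
        /* Poisoning one target page actually takes out a whole huge page. */
        error_report("Memory Error on large page from %s:%" PRIx64 " +%zx",
                     info.idstr, (uint64_t)info.offset, info.page_size);
    }

    /* ... existing code adding the page to the hwpoison list ... */
}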
Agreed. I think one problem is that kdump cannot really cope with new memory errors (it tries not to touch pages that had a memory error in the old kernel).
Maybe this is also due to the fact that we inform the kernel only about a single page vanishing, whereas actually a whole 1 GiB is vanishing.
-- Cheers, David / dhildenb