Hi David,

On 3/27/20 5:06 PM, David Hildenbrand wrote:
> On 27.03.20 17:56, James Morse wrote:
>> On 3/27/20 9:30 AM, David Hildenbrand wrote:
>>> On 26.03.20 19:07, James Morse wrote:
>>>> An image loaded for kexec is not stored in place, instead its segments
>>>> are scattered through memory, and are re-assembled when needed. In the
>>>> meantime, the target memory may have been removed.
>>>>
>>>> Because mm is not aware that this memory is still in use, it allows it
>>>> to be removed.
>>>>
>>>> Add a memory notifier to prevent the removal of memory regions that
>>>> overlap with a loaded kexec image segment. e.g., when triggered from the
>>>> Qemu console:
>>>> | kexec_core: memory region in use
>>>> | memory memory32: Offline failed.

>>>> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
>>>> index c19c0dad1ebe..ba1d91e868ca 100644
>>>> --- a/kernel/kexec_core.c
>>>> +++ b/kernel/kexec_core.c
>>
>>> E.g., in kernel/kexec_core.c:kimage_alloc_pages()
>>>
>>> "SetPageReserved(pages + i);"
>>>
>>> Pages that are reserved cannot get offlined. How are you able to trigger
>>> that before this patch? (where is the allocation path for kexec, which
>>> will not set the pages reserved?)
>>
>> This sets page reserved on the memory it gets back from
>> alloc_pages() in kimage_alloc_pages(). This is when you load the image[0].
>>
>> The problem I see is for the target or destination memory once you execute the
>> image. Once machine_kexec() runs, it tries to write to this, assuming it is
>> still present...

> Let's recap
>
> 1. You load the image. You allocate memory for e.g., the kexec kernel.
>    The pages will be marked PG_reserved, so they cannot be offlined.
>
> 2. You do the kexec. The kexec kernel will only operate on a reserved
>    memory region (reserved via e.g., kernel cmdline crashkernel=128M).

I think you are merging the kexec and kdump behaviours.
(Wrong terminology?
The things behind 'kexec -l Image' and 'kexec -p Image')

For kdump, yes, the new kernel is loaded into the crashkernel reservation, and
confined to it.

For regular kexec, the new kernel can be loaded anywhere in memory. There might
be a difference with how this works on arm64....

The regular kexec kernel isn't stored in its final location when it's loaded;
it's relocated there when the image is executed. The target/destination memory
may have been removed in the meantime.

(an example recipe below should clarify this)


> Is it that in 2., the reserved memory region (for the crashkernel) could
> have been offlined in the meantime?

No, for kdump: the crashkernel reservation is PG_reserved, and it's not
something mm knows how to move, so that region can't be taken offline.

(On arm64 we additionally prevent the boot-memory from being removed as it is
all described as present by UEFI. The crashkernel reservation would always be
from this type of memory)


This is about a regular kexec, any crashdump reservation is irrelevant.
This kexec kernel is temporarily stored out of line, then relocated when
executed.


A recipe so that we're at least on the same terminal!
This is on a TX2 running arm64's for-next/core using Qemu-TCG to emulate x86.
(Sorry for the bizarre config, it's because Qemu supports hotremove on x86, but
not yet on arm64).


Insert the memory:

(qemu) object_add memory-backend-ram,id=mem1,size=1G
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1

| root@vm:~# free -m
|               total        used        free      shared ...
| Mem:            918          52         814           0 ...
| Swap:             0           0           0


Bring it online:

| root@vm:~# cd /sys/devices/system/memory/
| root@vm:/sys/devices/system/memory# for F in memory3*; do echo \
| online_movable > $F/state; done
| Built 1 zonelists, mobility grouping on.  Total pages: 251049
| Policy zone: DMA32
| -bash: echo: write error: Invalid argument

| root@vm:/sys/devices/system/memory# free -m
|               total        used        free      shared ...
| Mem:           1942          53        1836           0 ...
| Swap:             0           0           0


Load kexec:

| root@vm:/sys/devices/system/memory# kexec -l /root/bzImage --reuse-cmdline


Press the Attention button to request removal:

(qemu) device_del dimm1

| Offlined Pages 32768
| Offlined Pages 32768
| Offlined Pages 32768
| Offlined Pages 32768
| Offlined Pages 32768
| Offlined Pages 32768
| Offlined Pages 32768
| Offlined Pages 32768
| Built 1 zonelists, mobility grouping on.  Total pages: 233728
| Policy zone: DMA32


The memory is gone:

| root@vm:/sys/devices/system/memory# free -m
|               total        used        free      shared ...
| Mem:            918          89         769           0 ...
| Swap:             0           0           0


Trigger kexec:

| root@vm:/sys/devices/system/memory# kexec -e

[...]

| sd 0:0:0:0: [sda] Synchronizing SCSI cache
| kexec_core: Starting new kernel

... and Qemu restarts the platform firmware instead of proceeding with kexec.
(I assume this is a triple fault)

You can use mem-min and mem-max to control where kexec's user space will place
the memory.


If you apply this patch, the above sequence will fail at the device remove
step, as the physical addresses match the loaded kexec image:

| Offlined Pages 32768
| Offlined Pages 32768
| Offlined Pages 32768
| Offlined Pages 32768
| Offlined Pages 32768
| Offlined Pages 32768
| Offlined Pages 32768
| kexec_core: Memory region in use
| kexec_core: Memory region in use
| memory memory39: Offline failed.
| Built 1 zonelists, mobility grouping on.  Total pages: 299212
| Policy zone: Normal

| root@vm:/sys/devices/system/memory# free -m
|               total        used        free      shared ...
| Mem:           1942          90        1793           0 ...
| Swap:             0           0           0


I can't remove the DIMM, because we failed to offline it:

(qemu) object_del mem1
object 'mem1' is in use, can not be deleted

and I can trigger kexec and boot the new kernel.

kexec user-space here comes from Debian bullseye. It picked the removable
memory all by itself without any additional arguments.


(a different issue that can be ignored for now: x86 additionally fails to
reboot if I remove memory, even if it's not in use by the kexec image.
This doesn't cause Qemu to reboot via firmware; I think it dies before the
console. It doesn't happen on arm64. I suspect the memory map is snapshotted
and assumed to still be correct when the image is executed.)


Thanks,

James
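
PS: for anyone skimming the thread, the check behind "Memory region in use" amounts to an interval-overlap test between the range going offline and the destination of each loaded segment. A minimal user-space sketch of that idea, not the patch itself; `struct seg` and `range_in_use()` are names made up for this illustration, and the real code would run from a memory notifier in the kernel:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the destination of one loaded kexec segment. */
struct seg {
	uint64_t mem;   /* physical start address of the destination */
	uint64_t memsz; /* size of the destination in bytes */
};

/*
 * Return true if the half-open physical range [start, end) intersects the
 * destination of any segment. An offline request for such a range is the
 * case that should be refused.
 */
static bool range_in_use(const struct seg *segs, size_t nr,
			 uint64_t start, uint64_t end)
{
	assert(start <= end);

	for (size_t i = 0; i < nr; i++) {
		uint64_t seg_start = segs[i].mem;
		uint64_t seg_end = seg_start + segs[i].memsz;

		/* Two half-open intervals overlap iff each starts before the other ends. */
		if (start < seg_end && seg_start < end)
			return true;
	}

	return false;
}
```

With one segment destined for 16M at 0x100000000, a 128M block starting at that address is reported as in use, while the next 128M block is not.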