On 23.04.20 18:29, Eric W. Biederman wrote:
> David Hildenbrand <david@xxxxxxxxxx> writes:
>
>>> The confusing part was talking about memory being still in use,
>>> that is actually scheduled for use in the future.
>>
>> +1
>>
>>>
>>>>> Usually somewhere in the loaded image
>>>>> is a copy of the memory map at the time the kexec kernel was loaded.
>>>>> That will invalidate the memory map as well.
>>>>
>>>> Ah, unconditionally. Sure, x86 needs this.
>>>> (arm64 re-discovers the memory map from firmware tables after kexec)
>>
>> Does this include hotplugged DIMMs e.g., under KVM?
>> [...]
>
> As far as I know. If the memory map changes we need to drop the loaded
> image.
>
>
> Having thought about it a little more I suspect it would be the
> other way and just block all hotplug actions after a kexec_load.
> As all we expect to happen is running shutdown scripts.
>
> If blocking the hotplug action uses printk to print a nice message
> saying something like: "Hotplug blocked because of a loaded kexec image",
> then people will be able to figure out what is going on and
> call kexec -u if they haven't started the shutdown scripts yet.
>
>
> Either way it is something simple and unconditional that will make
> things work.
>

Personally, I consider memory hotplug more important than keeping loaded
kexec data alive (just because somebody once decided to do a "kexec -l"
and never did a "kexec -e" we should not block any memory hot(un)plug -
especially in virtualized environments - for all eternity).

So IMHO we would invalidate loaded kexec data (not the crashkernel, of
course) on memory hot(un)plug and print a warning. In addition, we can
let kexec-tools try to reload whatever they loaded after getting
notified that something changed.

The "something changed" is visible to user space e.g., via udev events
for /sys/devices/system/memory/memoryX/

>>>>> All of this should be for a very brief window of a few seconds, as
>>>>> the loaded kexec image is quite short.
>>>>
>>>> It seems I'm the outlier anticipating anything could happen between
>>>> those syscalls.
>>>
>>> The design is:
>>> 	sys_kexec_load()
>>> 	shutdown scripts
>>> 	sys_reboot(LINUX_REBOOT_CMD_KEXEC);
>>>
>>> There are two system call simply so that the shutdown scripts can run.
>>> Now maybe someone somewhere does something different but that is not
>>> expected.
>>>
>>> Only the kexec on panic kernel is expected to persist somewhat
>>> indefinitely. But that should be in memory that is reserved from boot
>>> time, and so the memory hotplug should have enough visibility to not
>>> allow that memory to be given up.
>>
>> Yes, and AFAIK, memory blocks which hold the reserved crashkernel area
>> can usually not get offlined and, therefore, the memory cannot get removed.
>>
>> Interestingly, s390x even has a hotplug notifier for that
>>
>> arch/s390/kernel/setup.c:kdump_mem_notifier()
>>
>> (offlining of memory on s390x can result in memory getting depopulated
>> in the hypervisor, so after it would have been offlined, it would no
>> longer be accessible. I somewhat doubt that this notifier is really
>> needed - all pages in the crashkernel area should look like ordinary
>> allocated pages when the area is reserved early during boot via the
>> memblock allocator, and therefore offlining cannot succeed. But that's a
>> different story - and I suspect this is a leftover from pre-memblock times.)
>
> It might be worth seeing if that is true, or if we need to generalize the
> s390x code.

I'll try to find some time to test if the s390x handler is still relevant.

-- 
Thanks,

David / dhildenb
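
For illustration only, the "invalidate and warn" policy suggested above
(drop the loaded kexec image on memory hot(un)plug, leave the crashkernel
alone) could be modelled in user space roughly like this. It is a sketch,
not kernel code: struct kexec_state, enum mem_event and
kexec_on_memory_event() are hypothetical names made up for this example,
and the real thing would presumably hang off a memory hotplug notifier
and use printk instead of fprintf.

```c
/* Userspace model only; none of these names exist in the kernel. */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for memory hot(un)plug notifications. */
enum mem_event { MEM_EVENT_HOTPLUG, MEM_EVENT_HOTUNPLUG };

struct kexec_state {
	bool image_loaded;    /* set by "kexec -l", cleared by "kexec -u" */
	bool crash_loaded;    /* crashkernel: lives in memory reserved at boot */
	unsigned int dropped; /* number of times an image was invalidated */
};

/*
 * React to a memory map change: invalidate a loaded (non-crash) image
 * and warn, so that user space can notice and reload it.
 * Returns true if an image was dropped.
 */
bool kexec_on_memory_event(struct kexec_state *st, enum mem_event ev)
{
	assert(st != NULL);

	/* Nothing loaded: hot(un)plug proceeds without side effects. */
	if (!st->image_loaded)
		return false;

	/* The crashkernel is deliberately left untouched. */
	st->image_loaded = false;
	st->dropped++;
	fprintf(stderr,
		"kexec: loaded image invalidated by memory %s, please reload\n",
		ev == MEM_EVENT_HOTPLUG ? "hotplug" : "hotunplug");
	return true;
}
```

The alternative from Eric's mail (block hotplug while an image is loaded)
would invert the check: refuse the event and print the "Hotplug blocked"
message instead of clearing image_loaded.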