> Adding a sentence about the way kexec load works may help, the first paragraph
> would read:
>
> | Kexec allows user-space to specify the address that the kexec image should be
> | loaded to. Because this memory may be in use, an image loaded for kexec is not
> | stored in place, instead its segments are scattered through memory, and are
> | re-assembled when needed. In the meantime, the target memory may have been
> | removed.
>
> Do you think thats clearer?

Yes, very much. Maybe add, that the target is described by user space
during kexec_load() and that user space - right now - parses /proc/iomem
to find applicable system memory.

> [...]
>
>>> Load kexec:
>>> | root@vm:/sys/devices/system/memory# kexec -l /root/bzImage --reuse-cmdline
>>>
>>
>> I assume this will trigger
>>
>> kexec_load -> do_kexec_load -> kimage_load_segment ->
>> kimage_load_normal_segment -> kimage_alloc_page -> kimage_alloc_pages
>>
>> Which will just allocate a bunch of pages and mark them reserved.
>>
>> Now, AFAIKs, all allocations will be unmovable. So none of the kexec
>> segment allocations will actually end up on your DIMM (as it is onlined
>> online_movable).
>>
>> So, the loaded image (with its segments) from user won't be problematic
>> and not get placed on your DIMM.
>>
>>
>> Now, the problematic part is (via man kexec_load) "mem and memsz specify
>> a physical address range that is the target of the copy."
>>
>> So the place where the image will be "assembled" at when doing the
>> reboot. Understood :)
>
> Yup.
>
> [...]
>
>> I wonder if we should instead make the "kexec -e" fail. It tries to
>> touch random system memory.
>
> Heh, isn't touching random system memory what kexec does?!

Having a racy user interface that can trigger kernel crashes feels very
wrong. We should limit the impact.

>
> Its all described to user-space as 'System RAM'. Teaching it to probe
> /sys/devices/memory/... would require a user-space change.

I think we should really rename hotplugged memory on all architectures.

This is especially relevant for virtio-mem/hyper-v balloon, where some
pieces of (hotplugged) memory blocks are partially unavailable and should
not be touched - accessing them results in unpredictable behavior (e.g.,
crashes or discarded writes).

[...]

>> Will probably need some thought. But it will actually also bail out when
>> user space passes wrong physical memory addresses, instead of
>> triple-faulting silently.
>
> With this change, the reboot(LINUX_REBOOT_CMD_KEXEC), call would fail. This
> thing doesn't usually return, so we're likely to trigger error-handling that has
> never run before.
>
> (Last time I debugged one of these, it turned out kexec had taken the network
> interfaces down, meaning the nfsroot was no longer accessible)
>
> How can user-space know whether kexec is going to succeed, or fail like this?
> Any loaded kexec kernel could secretly be in this broken state.
>
> Can user-space know what caused this to become unreliable? (without reading the
> kernel source)
>
>
> Given kexec can be unloaded by user-space, I think its better to prevent us
> getting into the broken state, preferably giving the hint that kexec us using
> that memory. The user can 'kexec -u', then retry removing the memory.
>
> I think forbidding the memory-offline is simpler for user-space to deal with.

I thought about this over the weekend, and I don't think it's the right
approach.

1. It's racy. If memory is getting offlined/unplugged just while user
space is about to trigger the kexec_load(), you end up with the very same
triple-fault.

2. It's semantically wrong. kexec does not need online memory ("managed
by the buddy"), but you would still disallow offlining memory.

I would much rather see user-space choosing boot memory (e.g., by
renaming hotplugged memory on all architectures), and checking during
"kexec -e" if the selected memory is actually "there", before trying to
write to it.

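To illustrate the two pieces discussed above (user space picking target
memory from /proc/iomem, and a check that the chosen range is still
"there"), here is a rough sketch in Python. It is not taken from
kexec-tools or the kernel: the function names and the sample /proc/iomem
contents are made up for illustration, and the real check would have to
happen in the kernel at "kexec -e" time. Only the "mem"/"memsz" naming
follows the kexec_load man page.

```python
import re

# /proc/iomem lines look like "00100000-bffdffff : System RAM";
# child resources are indented below their parent.
_IOMEM_LINE = re.compile(r"^\s*([0-9a-fA-F]+)-([0-9a-fA-F]+) : (.+)$")


def system_ram_ranges(iomem_text):
    """Return sorted, inclusive (start, end) ranges named 'System RAM'."""
    ranges = []
    for line in iomem_text.splitlines():
        m = _IOMEM_LINE.match(line)
        if m and m.group(3).strip() == "System RAM":
            ranges.append((int(m.group(1), 16), int(m.group(2), 16)))
    return sorted(ranges)


def segment_is_backed(mem, memsz, ranges):
    """True if [mem, mem + memsz) is fully covered by the sorted ranges."""
    if memsz <= 0:
        return False
    cursor, last = mem, mem + memsz - 1
    for start, end in ranges:
        if start <= cursor <= end:
            if last <= end:
                return True
            cursor = end + 1  # carry on into a directly adjacent range
    return False


# Made-up sample; the last entry stands in for a hotplugged DIMM.
SAMPLE = """\
00001000-0009fbff : System RAM
000a0000-000bffff : PCI Bus 0000:00
00100000-bffdffff : System RAM
  01000000-01c031d0 : Kernel code
100000000-13fffffff : System RAM
140000000-17fffffff : System RAM
"""

if __name__ == "__main__":
    ram = system_ram_ranges(SAMPLE)
    print(segment_is_backed(0x140000000, 0x1000, ram))       # DIMM present
    print(segment_is_backed(0x140000000, 0x1000, ram[:-1]))  # DIMM removed
```

Note that with everything labelled plain 'System RAM', the sketch cannot
tell boot memory from hotplugged memory, which is the reason for wanting
the rename.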
-- 
Thanks,

David / dhildenb