Hi David,

On 3/30/20 2:13 PM, David Hildenbrand wrote:
>> Adding a sentence about the way kexec load works may help, the first paragraph
>> would read:
>>
>> | Kexec allows user-space to specify the address that the kexec image should be
>> | loaded to. Because this memory may be in use, an image loaded for kexec is not
>> | stored in place, instead its segments are scattered through memory, and are
>> | re-assembled when needed. In the meantime, the target memory may have been
>> | removed.
>>
>> Do you think thats clearer?
>
> Yes, very much. Maybe add, that the target is described by user space
> during kexec_load() and that user space - right now - parses /proc/iomem
> to find applicable system memory.

(I don't think x86 parses /proc/iomem anymore).

I'll repost this patch with that expanded commit message, once we've agreed this
is the right thing to do!


>>> I wonder if we should instead make the "kexec -e" fail. It tries to
>>> touch random system memory.
>>
>> Heh, isn't touching random system memory what kexec does?!
>
> Having a racy user interface that can trigger kernel crashes feels very
> wrong. We should limit the impact.

>> Its all described to user-space as 'System RAM'. Teaching it to probe
>> /sys/devices/memory/... would require a user-space change.
>
> I think we should really rename hotplugged memory on all architectures.
>
> Especially also relevant for virtio-mem/hyper-v balloon, where some
> pieces of (hotplugged )memory blocks are partially unavailable and
> should not be touched - accessing them results in unpredictable behavior
> (e.g., crashes or discarded writes).

I'll need to look into these. I'd assume for KVM that virtio-mem can be brought
back when it's accessed ... it's just going to be slow.


>>> Will probably need some thought. But it will actually also bail out when
>>> user space passes wrong physical memory addresses, instead of
>>> triple-faulting silently.
>>
>> With this change, the reboot(LINUX_REBOOT_CMD_KEXEC), call would fail. This
>> thing doesn't usually return, so we're likely to trigger error-handling that has
>> never run before.
>>
>> (Last time I debugged one of these, it turned out kexec had taken the network
>> interfaces down, meaning the nfsroot was no longer accessible)
>>
>> How can user-space know whether kexec is going to succeed, or fail like this?
>> Any loaded kexec kernel could secretly be in this broken state.
>>
>> Can user-space know what caused this to become unreliable? (without reading the
>> kernel source)
>>
>>
>> Given kexec can be unloaded by user-space, I think its better to prevent us
>> getting into the broken state, preferably giving the hint that kexec us using
>> that memory. The user can 'kexec -u', then retry removing the memory.
>>
>> I think forbidding the memory-offline is simpler for user-space to deal with.
>
> I thought about this over the weekend, and I don't think it's the right
> approach.

> 1. It's racy. If memory is getting offlined/unplugged just while user
> space is about to trigger the kexec_load(), you end up with the very
> same triple-fault.

load? How is this different to user-space providing a bogus address?

Sure, user-space may take a nap between parsing /proc/iomem and calling
kexec_load(), but the kernel should reject these as they would never work.

(I can't see where sanity_check_segment_list() considers the platform's memory.
If it doesn't, we should fix it)

Once the image is loaded, and clashes with a request to remove the memory there
are two choices: secretly unload the image, or prevent the memory being taken
offline.


> 2. It's semantically wrong. kexec does not need online memory ("managed
> by the buddy"), but still you disallow offlining memory.

It does need the memory if you want 'kexec -e' to succeed.
If there were any sanity tests, they should have happened at load time.

The memory is effectively in use by the loaded kexec image.
User-space told the kernel to use this memory; you should not then be able to
remove it without unloading the kexec image first.

Are you saying feeding bogus addresses to kexec_load() is _expected_ to blow up
like this?


> I would really much rather want to see user-space choosing boot memory
> (e.g., renaming hotplugged memory on all architectures), and checking
> during "kexec -e" if the selected memory is actually "there", before
> trying to write to it.

How does 'kexec -e' know where the kexec kernel was loaded? You'd need to pass
something between 'load' and 'exec'.

How do you keep existing user-space working as much as possible?

What do you do if the memory isn't there? User-space just called reboot(); it
would be better to avoid getting into the situation where we have to fail that
call.


Solving the bigger problem would mean adding a 'kexec_it_now' flag to the
kexec_load() call. This would make the window where 'stuff' can change much
smaller.

Things changing while user-space sleeps isn't a solvable problem; these would
need to be rejected by sanity tests at load time.


Thanks,

James
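
P.S. To make "rejected by sanity tests at load time" a bit more concrete, here
is a rough user-space sketch of the kind of check I mean. It is only an
illustration, not kernel code and not what kexec-tools does: the helper names
(system_ram_ranges, segment_is_backed) and the sample /proc/iomem text are made
up, and it assumes a segment must sit entirely inside a single top-level
'System RAM' entry (adjacent entries are not merged).

```python
import re

# Matches top-level /proc/iomem lines such as "40000000-7fffffff : System RAM".
# Nested (indented) entries do not start with a hex digit, so they are skipped.
_IOMEM_LINE = re.compile(r"^([0-9a-fA-F]+)-([0-9a-fA-F]+) : (.*)$")


def system_ram_ranges(iomem_text):
    """Return inclusive (start, end) pairs for top-level 'System RAM' entries."""
    ranges = []
    for line in iomem_text.splitlines():
        match = _IOMEM_LINE.match(line)
        if match and match.group(3).strip() == "System RAM":
            ranges.append((int(match.group(1), 16), int(match.group(2), 16)))
    return ranges


def segment_is_backed(start, size, ranges):
    """True if [start, start + size) lies entirely inside one RAM range."""
    if size <= 0:
        return False
    end = start + size - 1  # last byte of the segment, inclusive
    return any(lo <= start and end <= hi for lo, hi in ranges)


# Hypothetical sample input. On a real system, read /proc/iomem with
# sufficient privileges, otherwise the addresses are reported as zeroes.
SAMPLE_IOMEM = """\
00000000-3fffffff : reserved
40000000-7fffffff : System RAM
  40080000-41ffffff : Kernel code
80000000-bfffffff : System RAM
"""

ranges = system_ram_ranges(SAMPLE_IOMEM)
print(segment_is_backed(0x48000000, 0x200000, ranges))  # inside the first range
print(segment_is_backed(0xC0000000, 0x200000, ranges))  # not described as RAM
```

A check like this at load time only covers the bogus-address case. It does
nothing for memory that is removed after the image was loaded, which is the
case the patch is about.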