Hi David,

On 3/27/20 6:52 PM, David Hildenbrand wrote:
>>> 2. You do the kexec. The kexec kernel will only operate on a reserved
>>> memory region (reserved via e.g., kernel cmdline crashkernel=128M).
>>
>> I think you are merging the kexec and kdump behaviours.
>> (Wrong terminology? The things behind 'kexec -l Image' and 'kexec -p Image')
>
> Oh, I see - I think your example below clarifies things. Something like
> that should go in the cover letter if we end up in this patch being
> required :)

Do you mean the commit message? I think it's far too long...

Adding a sentence about the way kexec load works may help. The first paragraph
would read:
| Kexec allows user-space to specify the address that the kexec image should be
| loaded to. Because this memory may be in use, an image loaded for kexec is not
| stored in place, instead its segments are scattered through memory, and are
| re-assembled when needed. In the meantime, the target memory may have been
| removed.

Do you think that's clearer?


> (I missed that the problematic part is "random" addresses passed by user
> space to the kernel, where it wants data to be loaded to on kexec -e)

[...]

>> Load kexec:
>> | root@vm:/sys/devices/system/memory# kexec -l /root/bzImage --reuse-cmdline
>>
>
> I assume this will trigger
>
> kexec_load -> do_kexec_load -> kimage_load_segment ->
> kimage_load_normal_segment -> kimage_alloc_page -> kimage_alloc_pages
>
> Which will just allocate a bunch of pages and mark them reserved.
>
> Now, AFAIKs, all allocations will be unmovable. So none of the kexec
> segment allocations will actually end up on your DIMM (as it is onlined
> online_movable).
>
> So, the loaded image (with its segments) from user won't be problematic
> and not get placed on your DIMM.
>
>
> Now, the problematic part is (via man kexec_load) "mem and memsz specify
> a physical address range that is the target of the copy."
>
> So the place where the image will be "assembled" at when doing the
> reboot.
> Understood :)

Yup.

[...]

> I wonder if we should instead make the "kexec -e" fail. It tries to
> touch random system memory.

Heh, isn't touching random system memory what kexec does?!

It's all described to user-space as 'System RAM'. Teaching it to probe
/sys/devices/memory/... would require a user-space change.


> Denying to offline MOVABLE memory should be avoided - and what kexec
> does here sounds dangerous to me (allowing it to write random system
> memory).

> Roughly what I am thinking is this:
>
> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
> index ba1d91e868ca..70c39a5307e5 100644
> --- a/kernel/kexec_core.c
> +++ b/kernel/kexec_core.c
> @@ -1135,6 +1135,10 @@ int kernel_kexec(void)
>  		error = -EINVAL;
>  		goto Unlock;
>  	}
> +	if (!kexec_image_validate()) {
> +		error = -EINVAL;
> +		goto Unlock;
> +	}
>
>  #ifdef CONFIG_KEXEC_JUMP
>  	if (kexec_image->preserve_context) {
>
>
> kexec_image_validate() would go over all segments and validate that the
> involved pages are actual valid memory (pfn_to_online_page()).
>
> All we have to do is protect from memory hotplug until we switch to the
> new kernel.

(migrate_to_reboot_cpu() can sleep), I think you'd end up with something like
this patch, but only while kexec_in_progress. I don't think letting kexec fail
if the events occur in a different order is good for user-space.


> Will probably need some thought. But it will actually also bail out when
> user space passes wrong physical memory addresses, instead of
> triple-faulting silently.

With this change, the reboot(LINUX_REBOOT_CMD_KEXEC) call would fail. This
thing doesn't usually return, so we're likely to trigger error-handling that
has never run before.

(Last time I debugged one of these, it turned out kexec had taken the network
interfaces down, meaning the nfsroot was no longer accessible)

How can user-space know whether kexec is going to succeed, or fail like this?
Any loaded kexec kernel could secretly be in this broken state.
Can user-space know what caused this to become unreliable? (without reading the
kernel source)

Given kexec can be unloaded by user-space, I think it's better to prevent us
getting into the broken state, preferably giving the hint that kexec is using
that memory. The user can 'kexec -u', then retry removing the memory.

I think forbidding the memory-offline is simpler for user-space to deal with.


Thanks,

James
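
P.S. For anyone following the thread, here is a rough user-space model of the
check being discussed: walk every segment's destination range (the mem/memsz
pair from kexec_load(2)) and confirm it still lies in online memory. This is
only a sketch to make the idea concrete. It is not kernel code, and
struct seg_range, struct online_range and both function names are hypothetical
stand-ins, not anything that exists in kexec_core.c.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one segment's destination (mem/memsz). */
struct seg_range {
	uint64_t mem;	/* physical start address of the destination */
	uint64_t memsz;	/* length of the destination in bytes */
};

/* Hypothetical stand-in for a range of memory that is currently online. */
struct online_range {
	uint64_t start;
	uint64_t end;	/* exclusive */
};

/*
 * Return true if [start, end) lies entirely inside a single online range.
 * Simplification: a destination straddling two adjacent ranges is rejected.
 */
static bool range_is_online(uint64_t start, uint64_t end,
			    const struct online_range *online,
			    size_t nr_online)
{
	for (size_t i = 0; i < nr_online; i++) {
		if (start >= online[i].start && end <= online[i].end)
			return true;
	}
	return false;
}

/* Model of the check: every segment destination must still be online. */
static bool image_destinations_online(const struct seg_range *segs,
				      size_t nr_segs,
				      const struct online_range *online,
				      size_t nr_online)
{
	for (size_t i = 0; i < nr_segs; i++) {
		uint64_t end = segs[i].mem + segs[i].memsz;

		if (end < segs[i].mem)	/* reject address wrap-around */
			return false;
		if (!range_is_online(segs[i].mem, end, online, nr_online))
			return false;
	}
	return true;
}
```

The same walk could in principle run at reboot time (fail the kexec) or at
memory-offline time (refuse the offline); the model says nothing about which
is preferable, or about the locking either would need.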