Re: [PATCH 1/3] kexec: Prevent removal of memory in use by a loaded kexec image

David Hildenbrand <david@xxxxxxxxxx> · Mon, 30 Mar 2020 15:13:28 +0200

> Adding a sentence about the way kexec load works may help, the first paragraph
> would read:
> 
> | Kexec allows user-space to specify the address that the kexec image should be
> | loaded to. Because this memory may be in use, an image loaded for kexec is not
> | stored in place, instead its segments are scattered through memory, and are
> | re-assembled when needed. In the meantime, the target memory may have been
> | removed.
> 
> Do you think thats clearer?

Yes, very much. Maybe add, that the target is described by user space
during kexec_load() and that user space - right now - parses /proc/iomem
to find applicable system memory.

> [...]
> 
>>> Load kexec:
>>> | root@vm:/sys/devices/system/memory# kexec -l /root/bzImage --reuse-cmdline
>>>
>>
>> I assume this will trigger
>>
>> kexec_load -> do_kexec_load -> kimage_load_segment ->
>> kimage_load_normal_segment -> kimage_alloc_page -> kimage_alloc_pages
>>
>> Which will just allocate a bunch of pages and mark them reserved.
>>
>> Now, AFAIKs, all allocations will be unmovable. So none of the kexec
>> segment allocations will actually end up on your DIMM (as it is onlined
>> online_movable).
>>
>> So, the loaded image (with its segments) from user won't be problematic
>> and not get placed on your DIMM.
>>
>>
>> Now, the problematic part is (via man kexec_load) "mem and memsz specify
>> a physical address range that is the target of the copy."
>>
>> So the place where the image will be "assembled" at when doing the
>> reboot. Understood :)
> 
> Yup.
> 
> [...]
> 
>> I wonder if we should instead make the "kexec -e" fail. It tries to
>> touch random system memory.
> 
> Heh, isn't touching random system memory what kexec does?!

Having a racy user interface that can trigger kernel crashes feels very
wrong. We should limit the impact.

> 
> Its all described to user-space as 'System RAM'. Teaching it to probe
> /sys/devices/memory/... would require a user-space change.

I think we should really rename hotplugged memory on all architectures.

Especially also relevant for virtio-mem/hyper-v balloon, where some
pieces of (hotplugged )memory blocks are partially unavailable and
should not be touched - accessing them results in unpredictable behavior
(e.g., crashes or discarded writes).

[...]

>> Will probably need some thought. But it will actually also bail out when
>> user space passes wrong physical memory addresses, instead of
>> triple-faulting silently.
> 
> With this change, the reboot(LINUX_REBOOT_CMD_KEXEC), call would fail. This
> thing doesn't usually return, so we're likely to trigger error-handling that has
> never run before.
> 
> (Last time I debugged one of these, it turned out kexec had taken the network
> interfaces down, meaning the nfsroot was no longer accessible)
> 
> How can user-space know whether kexec is going to succeed, or fail like this?
> Any loaded kexec kernel could secretly be in this broken state.
> 
> Can user-space know what caused this to become unreliable? (without reading the
> kernel source)
> 
> 
> Given kexec can be unloaded by user-space, I think its better to prevent us
> getting into the broken state, preferably giving the hint that kexec us using
> that memory. The user can 'kexec -u', then retry removing the memory.
> 
> I think forbidding the memory-offline is simpler for user-space to deal with.

I thought about this over the weekend, and I don't think it's the right
approach.

1. It's racy. If memory is getting offlined/unplugged just while user
space is about to trigger the kexec_load(), you end up with the very
same triple-fault.

2. It's semantically wrong. kexec does not need online memory ("managed
by the buddy"), but still you disallow offlining memory.

I would really much rather want to see user-space choosing boot memory
(e.g., renaming hotplugged memory on all architectures), and checking
during "kexec -e" if the selected memory is actually "there", before
trying to write to it.

-- 
Thanks,

David / dhildenb