Re: [RFC v1 0/8] RFC v1: Kernel handling of CPU and memory hot un/plug for crash

Eric DeVolder <eric.devolder@xxxxxxxxxx> · Mon, 29 Nov 2021 14:00:17 -0600

Hi, see below. eric

On 11/29/21 02:45, Sourabh Jain wrote:
Hello Eric,

On 18/11/21 23:19, Eric DeVolder wrote:
When the kdump service is loaded, if a CPU or memory is hot
un/plugged, the crash elfcorehdr which describes the CPUs and memory
in the system, must also be updated, else the resulting vmcore is
inaccurate (eg. missing either CPU context or memory regions).

The current solution utilizes udev to initiate an unload-then-reload
of the kdump image (eg. kernel, initrd, boot_params, purgatory and
elfcorehdr) by the userspace kexec utility.

In the post https://lkml.org/lkml/2020/12/14/532 I outlined two
problems with this userspace-initiated unload-then-reload approach as
it pertains to supporting CPU and memory hot un/plug for kdump.
(Note in that post, I erroneously called the elfcorehdr the vmcoreinfo
structure. There is a vmcoreinfo structure, but it has a different
purpose. So in that post substitute "elfcorehdr" for "vmcoreinfo".)

The first problem is the time needed to complete the unload-then-
reload of the kdump image, and the second is the effective race
window that the unload-then-reload effort creates.

In the scenario I measured, a 32GiB guest was resized to 512GiB, and
it took over 4 minutes for udev to "settle down" and complete the
unload-then-reload for the resulting 3840 hot plug events.
Empirical evidence within our fleet substantiates this problem.
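
As a quick sanity check on the 3840 figure, here is a small sketch of
the arithmetic. It assumes one hot plug event per 128MiB memory block
(the block size cited later in this mail); the function name is
hypothetical and not part of the patchset.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: one hot plug event per 128MiB memory block. */
#define MEMBLOCK_SIZE_MIB 128ULL

/*
 * Number of per-memblock hot plug events expected when a guest grows
 * from old_gib to new_gib of memory (hypothetical helper).
 */
uint64_t hotplug_events(uint64_t old_gib, uint64_t new_gib)
{
	assert(new_gib >= old_gib);
	return (new_gib - old_gib) * 1024ULL / MEMBLOCK_SIZE_MIB;
}
```

With these assumptions, hotplug_events(32, 512) gives 3840, matching
the number of events observed above.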

Each unload-then-reload creates a race window the size of which is the
time it takes to reload the complete kdump image. Within the race
window, kdump is not loaded and should a panic occur, the kernel halts
rather than dumping core via kdump.

This patchset significantly improves upon the current solution by
enabling the kernel to update only the necessary items of the kdump
image. In the case of x86_64, that is just the elfcorehdr and the
purgatory segments. These updates occur as fast as the hot un/plug
events and significantly reduce the size of the race window.

This patchset introduces a generic crash hot un/plug handler that
registers with the CPU and memory notifiers. Upon CPU or memory
changes, this generic handler is invoked and performs important
housekeeping, for example obtaining the appropriate lock, and then
invokes an architecture specific handler to do the appropriate
updates.
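
The split between the generic handler and the arch specific handler
can be modelled in plain userspace C. This is only a sketch of the
control flow, not the kernel code: the enum, the hook type, the
atomic_flag standing in for the kexec lock, and all function names
are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical event types; the names are illustrative only. */
enum hp_action { HP_ADD_CPU, HP_REMOVE_CPU, HP_ADD_MEMORY, HP_REMOVE_MEMORY };

/* Hypothetical arch hook signature: returns 0 on success. */
typedef int (*arch_hotplug_fn)(enum hp_action action, unsigned int cpu);

static atomic_flag image_lock = ATOMIC_FLAG_INIT; /* stand-in for the kexec lock */
static bool crash_image_loaded;                   /* is a kdump image present? */
static arch_hotplug_fn arch_handler;              /* set by the architecture */
static unsigned int arch_calls;                   /* demo counter */

/* Demo arch handler: only counts how often it is invoked. */
int demo_arch_handler(enum hp_action action, unsigned int cpu)
{
	(void)action;
	(void)cpu;
	arch_calls++;
	return 0;
}

/* Register the arch hook and record whether a crash image is loaded. */
void crash_hotplug_setup(arch_hotplug_fn fn, bool loaded)
{
	assert(fn != NULL);
	arch_handler = fn;
	crash_image_loaded = loaded;
}

/* Generic handler: housekeeping first, then the arch specific update. */
int generic_crash_hotplug(enum hp_action action, unsigned int cpu)
{
	int ret = 0;

	/* Housekeeping: take the lock, or report busy if it is held. */
	if (atomic_flag_test_and_set(&image_lock))
		return -EBUSY;

	/* Nothing to update unless a crash image is loaded. */
	if (crash_image_loaded && arch_handler)
		ret = arch_handler(action, cpu);

	atomic_flag_clear(&image_lock);
	return ret;
}
```

In this model every CPU or memory notifier callback would funnel into
generic_crash_hotplug(), so the locking and the "is kdump loaded"
check live in one place and each architecture only supplies its hook.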

In the case of x86_64, the arch specific handler generates a new
elfcorehdr, which reflects the current CPUs and memory regions, into a
buffer. Since purgatory also does an integrity check via hash digests
of the loaded segments, purgatory must also be updated with the new
digests. The arch handler also generates a new purgatory into a
buffer, performs the hash digests of the new memory segments, and then
patches purgatory with the new digests.  If all succeeds, then the
elfcorehdr and purgatory buffers overwrite the existing buffers and
the new kdump image is live and ready to go. No involvement with
userspace at all.
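
The "generate into a buffer, then overwrite only if all succeeds"
step can be sketched as a stage-then-commit pattern. This is a
userspace illustration under assumed names (update_live_segment,
generate_fn and the two demo generators are hypothetical); the kernel
code works on kimage segments rather than malloc'd buffers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical generator: fills buf with new contents, 0 on success. */
typedef int (*generate_fn)(uint8_t *buf, size_t size);

/*
 * Build the new contents in a staging buffer and copy them over the
 * live segment only when generation succeeded. On failure the live
 * segment is left untouched.
 */
int update_live_segment(uint8_t *live, size_t size, generate_fn generate)
{
	uint8_t *staging;
	int ret;

	assert(live != NULL && generate != NULL);

	staging = malloc(size);
	if (!staging)
		return -1;

	ret = generate(staging, size);
	if (ret == 0)
		memcpy(live, staging, size);	/* commit */

	free(staging);
	return ret;
}

/* Demo generators, for illustration only. */
int generate_ok(uint8_t *buf, size_t size)
{
	memset(buf, 0xAB, size);
	return 0;
}

int generate_fail(uint8_t *buf, size_t size)
{
	memset(buf, 0xFF, size / 2);	/* partial output, then an error */
	return -1;
}
```

The point of the staging buffer is that a failed regeneration never
leaves a half-written elfcorehdr or purgatory in the loaded image.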

To accommodate a growing number of resources via hotplug, the
elfcorehdr memory must be large enough to accommodate changes. The
CRASH_HOTPLUG_ELFCOREHDR_SZ configuration item does just this.
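
To get a feel for how large that buffer needs to be, the size of an
ELF64 core header can be estimated from the number of CPUs and memory
ranges. The sketch below uses the userspace <elf.h> types on Linux;
the program header layout in the comment is my assumption about what
the header contains, and the helper name is hypothetical.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>

/*
 * Estimated bytes needed for an ELF64 crash header.
 *
 * Assumed layout: one PT_NOTE per CPU, one PT_NOTE for vmcoreinfo,
 * one PT_LOAD for the kernel text mapping, and one PT_LOAD per
 * memory range.
 */
size_t elfcorehdr_size(size_t nr_cpus, size_t nr_mem_ranges)
{
	size_t nr_phdr = nr_cpus + 1 + 1 + nr_mem_ranges;

	return sizeof(Elf64_Ehdr) + nr_phdr * sizeof(Elf64_Phdr);
}
```

The configured size would have to cover the largest CPU count and
memory range count the system can reach through hotplug, not just the
counts present at load time.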

To realize the benefits/test this patchset, one must make a couple
of minor changes to userspace:

  - Disable the udev rule for updating kdump on hot un/plug changes
    Eg. on RHEL: rm -f /usr/lib/udev/rules.d/98-kexec.rules
    or other technique to neuter the rule.

  - Change to the kexec_file_load for loading the kdump kernel:
    Eg. on RHEL: in /usr/bin/kdumpctl, change to:
     standard_kexec_args="-p -d -s"
    which adds the -s to select kexec_file_load syscall.

This work has raised the following questions for me:

First and foremost, this patch only works for the kexec_file_load
syscall path (via "kexec -s -p" utility). The reason being that, for
x86_64 anyway, the purgatory blob provided by userspace cannot be
readily decoded in order to update the hash digest. (The
kexec_file_load purgatory is actually a small ELF object with symbols,
so can be patched at run time.) With no way to update purgatory, the
integrity check will always fail and cause purgatory to hang at
panic time.

We are designing a solution for a similar problem on PowerPC. I agree that
manipulating a kexec segment in the kernel for the kexec_load system call is
a bit more complex than for the kexec_file_load system call, due to the SHA
verification in purgatory.

What if we have a pre-allocated memory hole for the kexec segment,
ask kexec to use it, and skip the SHA verification for that segment?
For example, on PowerPC, all the CPU and memory-related info is part
of the FDT. Whenever there is a hotplug event, we have to update the kdump
FDT segment to provide correct details to the kdump kernel.

One way to keep the kdump FDT up-to-date with the latest CPU and memory
is to load the kdump FDT into the pre-allocated memory hole for both the
kexec_load and kexec_file_load system calls and let the kernel keep updating
the FDT on each hotplug event.

Adapting the above solution for the kexec_file_load case is easy because
we do everything in the kernel. But what about the kexec_load system call? How
will the kexec tool know about this pre-allocated memory hole? What will happen
to digest verification if the kernel updates the kdump FDT segment after the
kdump load?

The kernel will expose the pre-allocated memory to userspace via sysfs. When
the kexec tool loads the kexec segments, it will check for this pre-allocated
memory for the kdump FDT and, if available, it will use it and skip the SHA
verification for that segment.

Please provide your input on the above method of handling things for the kexec_file system call.

While I am not at all familiar with PPC FDT, this sounds quite doable; the pre-allocated
memory for FDT sounds quite similar to the handling of the crashkernel= parameter.

From the description provided above, it sounds to me that excluding (from the purgatory
integrity check) the PPC FDT would be quite similar to what Baoquan is proposing by
excluding (from the purgatory integrity check) the elfcorehdr for x86.

If we can achieve a consensus on excluding from the purgatory check the elfcorehdr (for
x86) and the FDT (for PPC), then I believe that support for kexec_load and hotplug is
readily achievable.
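
To make the exclusion idea concrete, here is a small userspace model
of an integrity check that skips selected segments. It is a sketch
only: purgatory uses SHA-256, whereas this example substitutes the
much simpler FNV-1a hash so that it stays self-contained, and the
struct, the skip_digest flag and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct segment {
	const uint8_t *buf;
	size_t len;
	bool skip_digest;	/* hypothetical: excluded from the check */
};

/* FNV-1a, used here only as a stand-in for SHA-256. */
uint64_t digest_update(uint64_t h, const uint8_t *p, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		h ^= p[i];
		h *= 0x100000001b3ULL;
	}
	return h;
}

/* Digest over all segments that are not marked as excluded. */
uint64_t image_digest(const struct segment *segs, size_t nr)
{
	uint64_t h = 0xcbf29ce484222325ULL;

	assert(segs != NULL || nr == 0);

	for (size_t i = 0; i < nr; i++) {
		if (segs[i].skip_digest)
			continue;	/* eg. elfcorehdr or FDT */
		h = digest_update(h, segs[i].buf, segs[i].len);
	}
	return h;
}
```

With the elfcorehdr (or FDT) segment marked as excluded, the kernel
could rewrite it on hotplug without invalidating the digest recorded
at load time, while changes to any other segment are still detected.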

eric

  I am still reviewing your patch series.

Thanks,
Sourabh Jain

That being said, I actually developed this against the kexec_load path
and did have that working by making two one-line changes to the
userspace kexec utility: one change that is effectively the equivalent
of CRASH_HOTPLUG_ELFCOREHDR_SZ and the other to disable the integrity
check. But that does not seem to be a long term solution. A possible
long term solution would be to allow the use of the kexec_file_load
purgatory ELF object with the kexec_load path. While I believe that
would work, I am unsure if there are any downsides to doing so.

The second problem is the use of CPUHP_AP_ONLINE_DYN. The
cpuhp_setup_state_nocalls() is invoked with parameter
CPUHP_AP_ONLINE_DYN. While this works, when a CPU is being unplugged,
the CPU still shows up in for_each_present_cpu() during the
regeneration of the elfcorehdr, thus the need to explicitly check and
exclude the soon-to-be offlined CPU in crash_prepare_elf64_headers().
Perhaps if a state other than CPUHP_AP_ONLINE_DYN were passed to
cpuhp_setup_state(), the offlined CPU would no longer be in
for_each_present_cpu(), and this change could be eliminated. I do
not understand cpuhp_setup_state() well enough to choose, or create,
an appropriate value.
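
The explicit exclusion described above amounts to the following
logic, shown here as a userspace model rather than the actual change
to crash_prepare_elf64_headers(). The 64-bit mask standing in for the
present CPU mask, the NO_CPU sentinel and the function name are all
assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define NO_CPU (-1)	/* hypothetical: no CPU is being offlined */

/*
 * Count the CPUs that would get a per-CPU note in the regenerated
 * header: every CPU in present_mask except the one that is in the
 * middle of being unplugged.
 */
unsigned int count_crash_cpus(uint64_t present_mask, int offlining_cpu)
{
	unsigned int n = 0;

	assert(offlining_cpu >= NO_CPU && offlining_cpu < 64);

	for (int cpu = 0; cpu < 64; cpu++) {
		if (!(present_mask & (1ULL << cpu)))
			continue;	/* not present */
		if (cpu == offlining_cpu)
			continue;	/* still present, but going away */
		n++;
	}
	return n;
}
```

If the callback ran at a point where the CPU had already left the
present mask, the second check would be redundant, which is the
simplification being asked about here.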

The third problem is the number of memory hot un/plug events. If, for
example, a 1GiB DIMM is hotplugged, that generates 8 memory events, one
for each 128MiB memblock, yet the walk_system_ram_res() that is used
to obtain the list of memory regions reports the single 1GiB; thus
there are 7 unnecessary trips through this crash hotplug handler.
Perhaps there is another way of handling memory events that would see
the single 1GiB DIMM rather than each memblock?
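
The mismatch between eight memblock events and one reported region
can be illustrated by merging adjacent ranges. This is a userspace
sketch with hypothetical names, not a proposal for how the kernel
should handle the events; it only shows that eight contiguous 128MiB
blocks describe the same memory as a single 1GiB range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct range {
	uint64_t start;
	uint64_t end;	/* exclusive */
};

/*
 * Merge adjacent or overlapping ranges in place and return the new
 * count. The input must be sorted by start address.
 */
size_t merge_ranges(struct range *r, size_t nr)
{
	size_t out = 0;

	if (nr == 0)
		return 0;

	for (size_t i = 1; i < nr; i++) {
		assert(r[i].start >= r[i - 1].start);	/* sorted input */

		if (r[i].start <= r[out].end) {
			/* Touches or overlaps the current range: extend it. */
			if (r[i].end > r[out].end)
				r[out].end = r[i].end;
		} else {
			/* Gap: start a new output range. */
			r[++out] = r[i];
		}
	}
	return out + 1;
}
```

Feeding it the eight 128MiB memblocks of one DIMM yields a single
range, so only one of the eight handler invocations can change the
resulting elfcorehdr contents.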

Regards,
eric

Eric DeVolder (8):
   crash: fix minor typo/bug in debug message
   crash hp: Introduce CRASH_HOTPLUG configuration options
   crash hp: definitions and prototypes for crash hotplug support
   crash hp: generic crash hotplug support infrastructure
   crash hp: kexec_file changes for use by crash hotplug handler
   crash hp: Add x86 crash hotplug state items to kimage
   crash hp: Add x86 crash hotplug support for kexec_file_load
   crash hp: Add x86 crash hotplug support for bzImage

  arch/x86/Kconfig                  |  26 +++
  arch/x86/include/asm/kexec.h      |  10 ++
  arch/x86/kernel/crash.c           | 257 +++++++++++++++++++++++++++++-
  arch/x86/kernel/kexec-bzimage64.c |  12 ++
  include/linux/kexec.h             |  22 ++-
  kernel/crash_core.c               | 118 ++++++++++++++
  kernel/kexec_file.c               |  19 ++-
  7 files changed, 455 insertions(+), 9 deletions(-)

_______________________________________________
kexec mailing list
kexec@xxxxxxxxxxxxxxxxxxx
http://lists.infradead.org/mailman/listinfo/kexec