Hello Eric,
On 13/03/23 21:12, Eric DeVolder wrote:
On 3/12/23 13:11, Sourabh Jain wrote:
The Problem:
============
After hotplug/DLPAR events, the capture kernel holds stale information about
the system. Dump collection with a stale capture kernel might end in a dump
capture failure or an inaccurate dump.
Existing solution:
==================
The existing solution keeps the capture kernel up to date by monitoring
hotplug events via a udev rule and triggering a full capture kernel reload
for every hotplug event.
Shortcomings:
------------------------------------------------
- Leaves a window where a kernel crash might not lead to a successful dump
  collection.
- Reloading all kexec components for each hotplug is inefficient.
- udev rules are prone to races if hotplug events are frequent.
More about the issues with the existing solution is posted here:
- https://lkml.org/lkml/2020/12/14/532
- https://lists.ozlabs.org/pipermail/linuxppc-dev/2022-February/240254.html
Proposed Solution:
==================
Instead of reloading all kexec segments on every hotplug event, this patch
series focuses on updating only the relevant kexec segment. Once the kexec
segments are loaded in the kernel reserved area, an arch-specific hotplug
handler updates the relevant kexec segment based on the hotplug event type.
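The difference between a full reload and a targeted update can be sketched
roughly as below. This is only an illustration in Python, not code from the
patch series: the segment names and the event-to-segment mapping are
assumptions made up for the example, and the real handler is arch-specific
kernel code.

```python
# Hypothetical list of kexec segments loaded for the capture kernel.
# The names are illustrative assumptions, not taken from the patch series.
ALL_SEGMENTS = ["kernel", "initrd", "purgatory", "elfcorehdr", "fdt"]

# Hypothetical mapping from hotplug event type to the segments that a
# targeted handler would refresh. The mapping itself is an assumption.
SEGMENTS_BY_EVENT = {
    "cpu": ["fdt"],
    "memory": ["elfcorehdr", "fdt"],
}


def segments_to_update(event_type):
    """Return the segments a targeted handler would touch for one event."""
    if event_type in SEGMENTS_BY_EVENT:
        return list(SEGMENTS_BY_EVENT[event_type])
    # Unknown event type: fall back to the old behaviour and refresh everything.
    return list(ALL_SEGMENTS)
```

With such a mapping, a CPU event touches one segment instead of five, which
is the efficiency argument made above.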
Series Dependencies
==================
This patch series implements the crash hotplug handler on PowerPC. The
generic crash hotplug update is introduced by the
https://lkml.org/lkml/2023/3/6/1358 patch series.
Git tree for testing:
=====================
The below git tree has this patch series applied on top of the dependent
patch series.
https://github.com/sourabhjains/linux/tree/in-kernel-crash-update-v9
To realise the feature, the kdump udev rules must be disabled for CPU/memory
hotplug events. Comment out the below lines in the kdump udev rule file:
RHEL: /usr/lib/udev/rules.d/98-kexec.rules
#SUBSYSTEM=="cpu", ACTION=="online", GOTO="kdump_reload_cpu"
#SUBSYSTEM=="memory", ACTION=="online", GOTO="kdump_reload_mem"
#SUBSYSTEM=="memory", ACTION=="offline", GOTO="kdump_reload_mem"
SLES: /usr/lib/kdump/70-kdump.rules
#SUBSYSTEM=="memory", ACTION=="add|remove", GOTO="kdump_try_restart"
#SUBSYSTEM=="cpu", ACTION=="online", GOTO="kdump_try_restart"
Sourabh,
The above seems to contradict what I anticipate to be udev rules
changes once the base series is accepted. Specifically I'm suggesting
the following:
- Prevent udev from updating kdump crash kernel on hot un/plug changes.
Add the following as the first lines to the RHEL udev rule file
/usr/lib/udev/rules.d/98-kexec.rules:
# The kernel handles updates to crash elfcorehdr for cpu and memory changes
SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
With this changeset applied, the two rules evaluate to false for
cpu and memory change events and thus skip the userspace
unload-then-reload of kdump.
The above additions allow distros to deploy the udev rule immediately
and work properly even if the base patchset isn't yet merged, or down
the road, enabled/configured.
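For anyone testing this, a quick way to see whether the rules above would
match on a given machine is to read the crash_hotplug attribute directly.
The sketch below assumes the attribute is exposed at
/sys/devices/system/cpu/crash_hotplug and
/sys/devices/system/memory/crash_hotplug; those paths are an assumption and
are not stated in this mail. The helper name is made up for illustration.

```python
from pathlib import Path


def kernel_handles_crash_hotplug(subsystem, sysfs_root="/sys/devices/system"):
    """Return True if <sysfs_root>/<subsystem>/crash_hotplug reads "1".

    The sysfs location is an assumption. A missing or unreadable attribute
    is treated as "kernel does not handle it", in which case the udev rule
    would not match and the userspace reload would still run.
    """
    attr = Path(sysfs_root) / subsystem / "crash_hotplug"
    try:
        return attr.read_text().strip() == "1"
    except OSError:
        return False


if __name__ == "__main__":
    for subsystem in ("cpu", "memory"):
        handled = kernel_handles_crash_hotplug(subsystem)
        print(f"{subsystem}: in-kernel crash hotplug update = {handled}")
```

On a kernel without the base series the attribute is absent, so the helper
reports False for both subsystems.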
Am I missing something such that your recommendation is different than
mine?
I suggested disabling the udev rules only for testing; your udev rule
changes are the way forward.
I will use the above changes to control the kdump service reload.
Note: only the kexec_file_load syscall will work. For kexec_load, minor
changes are required in the kexec tool.
Will this be the same/similar change as I have posted, or do you
envision something different?
I think the generic changes will be the same. I might need to add some
PowerPC-specific changes to make sure the elfcorehdr and FDT kexec segments
have additional buffer space to accommodate additional memory ranges.
Thanks for the suggestion, I will align the PowerPC kexec tool changes
with your changes.
- Sourabh