The Problem:
============
After CPU/Memory hot plug/unplug and online/offline events, the loaded kdump
kernel image holds stale information about the system. Dump collection with a
stale kdump kernel might end up in a dump capture failure or an inaccurate
dump.

Existing solution:
==================
The existing solution keeps the kdump kernel up-to-date by monitoring
CPU/Memory hotplug/online/offline events via a udev rule and triggering a full
kdump kernel reload for every hotplug event.

Shortcomings:
------------------------------------------------
- Leaves a window where a kernel crash might not lead to a successful dump
  collection.
- Reloading all kexec components for each hotplug event is inefficient.
- udev rules are prone to races if hotplug events are frequent.

More about the issues with the existing solution is posted here:
 - https://lkml.org/lkml/2020/12/14/532
 - https://lists.ozlabs.org/pipermail/linuxppc-dev/2022-February/240254.html

Proposed Solution:
==================
Instead of reloading all kexec segments on every CPU/Memory
hotplug/online/offline event, this patch series updates only the relevant
kexec segment. Once the kexec segments are loaded into the kernel reserved
area, an arch-specific hotplug handler updates the relevant kexec segment
based on the hotplug event type.

Series Dependencies:
====================
This patch series implements the crash hotplug handler on PowerPC. The generic
crash hotplug handler is introduced by the https://lkml.org/lkml/2023/4/4/1136
patch series.

Git tree for testing:
=====================
The git tree below has this patch series applied on top of the dependent
patch series:
https://github.com/sourabhjains/linux/tree/e21-s10

To realise the feature, the kdump udev rule must be updated to avoid reloading
the kdump kernel on CPU/Memory hotplug/online/offline events.
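With the updated rule, whether a full reload is still needed depends on the
crash_hotplug attribute that the RHEL rule change matches with
ATTRS{crash_hotplug}=="1". The C sketch below shows that decision outside of
udev. It assumes the attribute is readable at
/sys/devices/system/cpu/crash_hotplug and /sys/devices/system/memory/crash_hotplug
and holds 1 when the kernel updates the kdump segments itself. The helpers
crash_hotplug_supported() and kdump_reload_needed() are hypothetical and are
not part of the kernel or kexec-tools.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical helper: returns true when the crash_hotplug attribute at
 * `path` reads 1, i.e. the kernel itself updates the loaded kdump segments
 * on hotplug. A missing or unparsable attribute is treated as unsupported.
 */
static bool crash_hotplug_supported(const char *path)
{
	FILE *f = fopen(path, "r");
	int value = 0;

	if (f == NULL)
		return false;	/* attribute absent: kernel without this support */
	if (fscanf(f, "%d", &value) != 1)
		value = 0;	/* unexpected content: assume no support */
	fclose(f);
	return value == 1;
}

/*
 * Hypothetical helper mirroring the updated udev rule: a full kdump reload
 * is needed only when the kernel does not handle the hotplug event itself.
 */
static bool kdump_reload_needed(const char *crash_hotplug_path)
{
	return !crash_hotplug_supported(crash_hotplug_path);
}
```

For a CPU event a caller would pass the cpu attribute path, and for a memory
event the memory one. Treating a missing attribute as "reload needed" keeps
the old udev behaviour on kernels that lack this series.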
RHEL: /usr/lib/udev/rules.d/98-kexec.rules

-SUBSYSTEM=="cpu", ACTION=="online", GOTO="kdump_reload_cpu"
-SUBSYSTEM=="memory", ACTION=="online", GOTO="kdump_reload_mem"
-SUBSYSTEM=="memory", ACTION=="offline", GOTO="kdump_reload_mem"
+SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
+SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"

Note: only the kexec_file_load syscall will work. For kexec_load, minor
changes are required in the kexec tool.

---
Changelog:

v10:
  - Drop the patch that adds the fdt_index attribute to struct kimage_arch;
    find the fdt segment index when needed.
  - Added more details to the commit messages.
  - Rebased onto 6.3.0-rc5

v9:
  - Removed the patch to prepare elfcorehdr crash notes for possible CPUs.
    The patch is moved to the generic patch series that introduces the
    generic infrastructure for in-kernel crash update.
  - Removed the patch to pass the hotplug action type to the arch crash
    hotplug handler function. The generic patch series has introduced the
    hotplug action type in the kimage struct.
  - Added a detailed commit message for better understanding.

v8:
  - Restrict fdt_index initialization to machine_kexec_post_load so that it
    works for both kexec_load and kexec_file_load. [3/8] Laurent Dufour
  - Updated the logic to find the number of offline cores. [6/8]
  - Changed the logic to find the elfcore program header to accommodate
    future memory ranges due to memory hotplug events. [8/8]

v7:
  - Added a new config option to configure this feature
  - Pass the hotplug action type to the arch-specific handler

v6:
  - Added crash memory hotplug support

v5:
  - Replace CONFIG_CRASH_HOTPLUG with CONFIG_HOTPLUG_CPU.
  - Move fdt segment identification for the kexec_load case to the load path
    instead of the crash hotplug handler.
  - Keep the new attribute defined under kimage_arch to track the FDT segment
    under the CONFIG_HOTPLUG_CPU config.

v4:
  - Update the logic to find the additional space needed for hot-added CPUs
    post kexec load. Refer to the "[RFC v4 PATCH 4/5] powerpc/crash hp: add
    crash hotplug support for kexec_file_load" patch to know more about the
    change.
  - Fix a couple of typos.
  - Replace pr_err with pr_info_once to warn the user about memory hotplug
    support.
  - In the crash hotplug handler, exit the for loop once the FDT segment is
    found.

v3:
  - Move fdt_index and fdt_index_vaild variables to the kimage_arch struct.
  - Rebase patches on top of https://lkml.org/lkml/2022/3/3/674 [v5]
  - Fixed warnings reported by the checkpatch script

v2:
  - Use the generic hotplug handler introduced by
    https://lkml.org/lkml/2022/2/9/1406, a significant change from v1.

Sourabh Jain (5):
  powerpc/kexec: turn some static helper functions public
  powerpc/crash: introduce a new config option CRASH_HOTPLUG
  powerpc/crash: add crash CPU hotplug support
  crash: forward memory_notify args to arch crash hotplug handler
  powerpc/kexec: add crash memory hotplug support

 arch/powerpc/Kconfig                    |  12 +
 arch/powerpc/include/asm/kexec.h        |  10 +
 arch/powerpc/include/asm/kexec_ranges.h |   1 +
 arch/powerpc/kexec/core_64.c            | 301 ++++++++++++++++++++++++
 arch/powerpc/kexec/elf_64.c             |  12 +-
 arch/powerpc/kexec/file_load_64.c       | 212 ++++------------
 arch/powerpc/kexec/ranges.c             |  85 +++++++
 arch/x86/include/asm/kexec.h            |   2 +-
 arch/x86/kernel/crash.c                 |   3 +-
 include/linux/kexec.h                   |   2 +-
 kernel/crash_core.c                     |  14 +-
 11 files changed, 479 insertions(+), 175 deletions(-)

-- 
2.39.2

_______________________________________________
kexec mailing list
kexec@xxxxxxxxxxxxxxxxxxx
http://lists.infradead.org/mailman/listinfo/kexec