On 25/02/23 01:46, Eric DeVolder wrote:
On 2/24/23 02:34, Sourabh Jain wrote:
On 24/02/23 02:04, Eric DeVolder wrote:
On 2/10/23 00:29, Sourabh Jain wrote:
On 10/02/23 01:09, Eric DeVolder wrote:
On 2/9/23 12:43, Sourabh Jain wrote:
Hello Eric,
On 09/02/23 23:01, Eric DeVolder wrote:
On 2/8/23 07:44, Thomas Gleixner wrote:
Eric!
On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote:
On 2/1/23 05:33, Thomas Gleixner wrote:
So my latest solution is introduce two new CPUHP states,
CPUHP_AP_ELFCOREHDR_ONLINE
for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for offlining.
I'm open to better names.
The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after
CPUHP_BRINGUP_CPU. My
attempts at locating this state failed when inside the
STARTING section, so I located
this just inside the ONLINE sectoin. The crash hotplug handler
is registered on
this state as the callback for the .startup method.
The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before
CPUHP_TEARDOWN_CPU, and I
placed it at the end of the PREPARE section. This crash
hotplug handler is also
registered on this state as the callback for the .teardown
method.
TBH, that's still overengineered. Something like this:
bool cpu_is_alive(unsigned int cpu)
{
struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
return data_race(st->state) <= CPUHP_AP_IDLE_DEAD;
}
and use this to query the actual state at crash time. That
spares all
those callback heuristics.
I'm making my way though percpu crash_notes, elfcorehdr,
vmcoreinfo,
makedumpfile and (the consumer of it all) the userspace crash
utility,
in order to understand the impact of moving from
for_each_present_cpu()
to for_each_online_cpu().
Is the packing actually worth the trouble? What's the actual win?
Thanks,
tglx
Thomas,
I've investigated the passing of crash notes through the vmcore.
What I've learned is that:
- linux/fs/proc/vmcore.c (which makedumpfile references to do
its job) does
not care what the contents of cpu PT_NOTES are, but it does
coalesce them together.
- makedumpfile will count the number of cpu PT_NOTES in order to
determine its
nr_cpus variable, which is reported in a header, but otherwise
unused (except
for sadump method).
- the crash utility, for the purposes of determining the cpus,
does not appear to
reference the elfcorehdr PT_NOTEs. Instead it locates the various
cpu_[possible|present|online]_mask and computes nr_cpus from
that, and also of
course which are online. In addition, when crash does
reference the cpu PT_NOTE,
to get its prstatus, it does so by using a percpu technique
directly in the vmcore
image memory, not via the ELF structure. Said differently, it
appears to me that
crash utility doesn't rely on the ELF PT_NOTEs for cpus;
rather it obtains them
via kernel cpumasks and the memory within the vmcore.
With this understanding, I did some testing. Perhaps the most
telling test was that I
changed the number of cpu PT_NOTEs emitted in the
crash_prepare_elf64_headers() to just 1,
hot plugged some cpus, then also took a few offline sparsely via
chcpu, then generated a
vmcore. The crash utility had no problem loading the vmcore, it
reported the proper number
of cpus and the number offline (despite only one cpu PT_NOTE),
and changing to a different
cpu via 'set -c 30' and the backtrace was completely valid.
My take away is that crash utility does not rely upon ELF cpu
PT_NOTEs, it obtains the
cpu information directly from kernel data structures. Perhaps at
one time crash relied
upon the ELF information, but no more. (Perhaps there are other
crash dump analyzers
that might rely on the ELF info?)
So, all this to say that I see no need to change
crash_prepare_elf64_headers(). There
is no compelling reason to move away from
for_each_present_cpu(), or modify the list for
online/offline.
Which then leaves the topic of the cpuhp state on which to
register. Perhaps reverting
back to the use of CPUHP_BP_PREPARE_DYN is the right answer.
There does not appear to
be a compelling need to accurately track whether the cpu went
online/offline for the
purposes of creating the elfcorehdr, as ultimately the crash
utility pulls that from
kernel data structures, not the elfcorehdr.
I think this is what Sourabh has known and has been advocating
for an optimization
path that allows not regenerating the elfcorehdr on cpu changes
(because all the percpu
structs are all laid out). I do think it best to leave that as
an arch choice.
Since things are clear on how the PT_NOTES are consumed in kdump
kernel [fs/proc/vmcore.c],
makedumpfile, and crash tool I need your opinion on this:
Do we really need to regenerate elfcorehdr for CPU hotplug events?
If yes, can you please list the elfcorehdr components that
changes due to CPU hotplug.
Due to the use of for_each_present_cpu(), it is possible for the
number of cpu PT_NOTEs
to fluctuate as cpus are un/plugged. Onlining/offlining of cpus
does not impact the
number of cpu PT_NOTEs (as the cpus are still present).
From what I understood, crash notes are prepared for possible
CPUs as system boots and
could be used to create a PT_NOTE section for each possible CPU
while generating the elfcorehdr
during the kdump kernel load.
Now once the elfcorehdr is loaded with PT_NOTEs for every
possible CPU there is no need to
regenerate it for CPU hotplug events. Or do we?
For onlining/offlining of cpus, there is no need to regenerate the
elfcorehdr. However,
for actual hot un/plug of cpus, the answer is yes due to
for_each_present_cpu(). The
caveat here of course is that if crash utility is the only
coredump analyzer of concern,
then it doesn't care about these cpu PT_NOTEs and there would be
no need to re-generate them.
Also, I'm not sure if ARM cpu hotplug, which is just now coming
into mainstream, impacts
any of this.
Perhaps the one item that might help here is to distinguish
between actual hot un/plug of
cpus, versus onlining/offlining. At the moment, I can not
distinguish between a hot plug
event and an online event (and unplug/offline). If those were
distinguishable, then we
could only regenerate on un/plug events.
Or perhaps moving to for_each_possible_cpu() is the better choice?
Yes, because once elfcorehdr is built with possible CPUs we don't
have to worry about
hot[un]plug case.
Here is my view on how things should be handled if a core-dump
analyzer is dependent on
elfcorehdr PT_NOTEs to find online/offline CPUs.
A PT_NOTE in elfcorehdr holds the address of the corresponding
crash notes (kernel has
one crash note per CPU for every possible CPU). Though the crash
notes are allocated
during the boot time they are populated when the system is on the
crash path.
This is how crash notes are populated on PowerPC and I am expecting
it would be something
similar on other architectures too.
The crashing CPU sends IPI to every other online CPU with a
callback function that updates the
crash notes of that specific CPU. Once the IPI completes the
crashing CPU updates its own crash
note and proceeds further.
The crash notes of CPUs remain uninitialized if the CPUs were
offline or hot unplugged at the time
system crash. The core-dump analyzer should be able to identify
[un]/initialized crash notes
and display the information accordingly.
Thoughts?
- Sourabh
I've been examining what it would mean to move to
for_each_possible_cpu() in crash_prepare_elf64_headers(). I think it
means:
- Changing for_each_present_cpu() to for_each_possible_cpu() in
crash_prepare_elf64_headers().
- For kexec_load() syscall path, rewrite the incoming/supplied
elfcorehdr immediately on the load with the elfcorehdr generated by
crash_prepare_elf64_headers().
- Eliminate/remove the cpuhp machinery for handling crash hotplug
events.
If for_each_present_cpu is replaced with for_each_possible_cpu I
still need cpuhp machinery
to update FDT kexec segment for CPU hot add case.
Ah, ok, that's important! So the cpuhp callbacks are still needed.
This would then setup PT_NOTEs for all possible cpus, which should
in theory accommodate crash analyzers that rely on ELF PT_NOTEs for
crash_notes.
If staying with for_each_present_cpu() is ultimately decided, then I
think leaving the cpuhp machinery in place and each arch could
decide how to handle crash cpu hotplug events. The overhead for
doing this is very minimal, and the events are likely very infrequent.
I agree. Some architectures may need cpuhp machinery to update kexec
segment[s] other then elfcorehdr. For example FDT on PowerPC.
- Sourabh Jain
OK, I was thinking that the desire was to eliminate the cpuhp
callbacks. In reality, the desire is to change to
for_each_possible_cpu(). Given that the kernel creates crash_notes for
all possible cpus upon kernel boot, there seems to be no reason to not
do this?
HOWEVER...
It's not clear to me that this particular change needs to be part of
this series. It's inclusion would facilitate PPC support, but doesn't
"solve" anything in general. In fact it causes kexec_load and
kexec_file_load to deviate (kexec_load via userspace kexec does the
equivalent of for_each_present_cpu() where as with this change
kexec_file_load would do for_each_possible_cpu(); until a hot plug
event then both would do for_each_possible_cpu()). And if this change
were to arrive as part of Sourabh's PPC support, then it does not
appear to impact x86 (not sure about other arches). And the 'crash'
dump analyzer doesn't care either way.
Including this change would enable an optimization path (for x86 at
least) that short-circuits cpu hotplug changes in the arch crash
handler, for example:
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index aca3f1817674..0883f6b11de4 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -473,6 +473,11 @@ void arch_crash_handle_hotplug_event(struct
kimage *image)
unsigned long mem, memsz;
unsigned long elfsz = 0;
+ if (image->file_mode && (
+ image->hp_action == KEXEC_CRASH_HP_ADD_CPU ||
+ image->hp_action == KEXEC_CRASH_HP_REMOVE_CPU))
+ return;
+
/*
* Create the new elfcorehdr reflecting the changes to CPU and/or
* memory resources.
I'm not sure that is compelling given the infrequent nature of cpu
hotplug events.
It certainly closes/reduces the window where kdump is not active due
kexec segment update.|
In my mind I still have a question about kexec_load() path. The
userspace kexec can not do the equivalent of for_each_possible_cpu().
It can obtain max possible cpus from /sys/devices/system/cpu/possible,
but for those cpus not present the /sys/devices/system/cpu/cpuXX is
not available and so the crash_notes entries is not available. My
attempts to expose all cpuXX lead to odd behavior that was requiring
changes in ACPI and arch code that looked untenable.
There seem to be these options available for kexec_load() path:
- immediately rewrite the elfcorehdr upon load via a call to
crash_prepare_elf64_headers(). I've made this work with the following,
as proof of concept:
Yes regenerating/patching the elfcorehdr could be an option for
kexec_load syscall.
diff --git a/kernel/kexec.c b/kernel/kexec.c
index cb8e6e6f983c..4eb201270f97 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -163,6 +163,12 @@ static int do_kexec_load(unsigned long entry,
unsigned long
kimage_free(image);
out_unlock:
kexec_unlock();
+ if (IS_ENABLED(CONFIG_CRASH_HOTPLUG)) {
+ if ((flags & KEXEC_ON_CRASH) && kexec_crash_image) {
+ crash_handle_hotplug_event(KEXEC_CRASH_HP_NONE,
KEXEC_CRASH_HP_INVALID_CPU);
+ }
+ }
return ret;
}
- Another option is spend the time to determine whether exposing all
cpuXX is a viable solution; I have no idea what impacts to userspace
would be for possible-but-not-yet-present cpuXX entries would be. It
might also mean requiring a 'present' entry available within the cpuXX.
- Another option is to simply let the hot plug events rewrite the
elfcorehdr on demand. This is what I've originally put forth, but not
sure how this impacts PPC given for_each_possible_cpu() change.
Given that /sys/devices/system/cpu/cpuXX is not present for
possbile-but-not-yet-present CPUs, I am wondering do we even have crash
notes for possible CPUs on x86?
The concern is that today, both kexec_load and kexec_file_load mirror
each other with respect to for_each_present_cpu(); that is userspace
kexec is able to generate the elfcorehdr the same as would
kexec_file_load, for cpus. But by changing to for_each_possible_cpu(),
the two would deviate.
Thanks,
Sourabh Jain
_______________________________________________
kexec mailing list
kexec@xxxxxxxxxxxxxxxxxxx
http://lists.infradead.org/mailman/listinfo/kexec