Re: [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On 2/28/23 12:52, Eric DeVolder wrote:


On 2/28/23 06:44, Baoquan He wrote:
On 02/13/23 at 10:10am, Sourabh Jain wrote:

On 11/02/23 06:05, Eric DeVolder wrote:


On 2/10/23 00:29, Sourabh Jain wrote:

On 10/02/23 01:09, Eric DeVolder wrote:


On 2/9/23 12:43, Sourabh Jain wrote:
Hello Eric,

On 09/02/23 23:01, Eric DeVolder wrote:


On 2/8/23 07:44, Thomas Gleixner wrote:
Eric!

On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote:
On 2/1/23 05:33, Thomas Gleixner wrote:

So my latest solution is introduce two new CPUHP
states, CPUHP_AP_ELFCOREHDR_ONLINE
for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for
offlining. I'm open to better names.

The CPUHP_AP_ELFCOREHDR_ONLINE needs to be
placed after CPUHP_BRINGUP_CPU. My
attempts at locating this state failed when
inside the STARTING section, so I located
this just inside the ONLINE sectoin. The crash
hotplug handler is registered on
this state as the callback for the .startup method.

The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be
placed before CPUHP_TEARDOWN_CPU, and I
placed it at the end of the PREPARE section.
This crash hotplug handler is also
registered on this state as the callback for the .teardown method.

TBH, that's still overengineered. Something like this:

bool cpu_is_alive(unsigned int cpu)
{
     struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);

     return data_race(st->state) <= CPUHP_AP_IDLE_DEAD;
}

and use this to query the actual state at crash
time. That spares all
those callback heuristics.

I'm making my way though percpu crash_notes,
elfcorehdr, vmcoreinfo,
makedumpfile and (the consumer of it all) the
userspace crash utility,
in order to understand the impact of moving from
for_each_present_cpu()
to for_each_online_cpu().

Is the packing actually worth the trouble? What's the actual win?

Thanks,

          tglx



Thomas,
I've investigated the passing of crash notes through the
vmcore. What I've learned is that:

- linux/fs/proc/vmcore.c (which makedumpfile references
to do its job) does
   not care what the contents of cpu PT_NOTES are, but it
does coalesce them together.

- makedumpfile will count the number of cpu PT_NOTES in
order to determine its
   nr_cpus variable, which is reported in a header, but
otherwise unused (except
   for sadump method).

- the crash utility, for the purposes of determining the
cpus, does not appear to
   reference the elfcorehdr PT_NOTEs. Instead it locates the various
   cpu_[possible|present|online]_mask and computes
nr_cpus from that, and also of
   course which are online. In addition, when crash does
reference the cpu PT_NOTE,
   to get its prstatus, it does so by using a percpu
technique directly in the vmcore
   image memory, not via the ELF structure. Said
differently, it appears to me that
   crash utility doesn't rely on the ELF PT_NOTEs for
cpus; rather it obtains them
   via kernel cpumasks and the memory within the vmcore.

With this understanding, I did some testing. Perhaps the
most telling test was that I
changed the number of cpu PT_NOTEs emitted in the
crash_prepare_elf64_headers() to just 1,
hot plugged some cpus, then also took a few offline
sparsely via chcpu, then generated a
vmcore. The crash utility had no problem loading the
vmcore, it reported the proper number
of cpus and the number offline (despite only one cpu
PT_NOTE), and changing to a different
cpu via 'set -c 30' and the backtrace was completely valid.

My take away is that crash utility does not rely upon
ELF cpu PT_NOTEs, it obtains the
cpu information directly from kernel data structures.
Perhaps at one time crash relied
upon the ELF information, but no more. (Perhaps there
are other crash dump analyzers
that might rely on the ELF info?)

So, all this to say that I see no need to change
crash_prepare_elf64_headers(). There
is no compelling reason to move away from
for_each_present_cpu(), or modify the list for
online/offline.

Which then leaves the topic of the cpuhp state on which
to register. Perhaps reverting
back to the use of CPUHP_BP_PREPARE_DYN is the right
answer. There does not appear to
be a compelling need to accurately track whether the cpu
went online/offline for the
purposes of creating the elfcorehdr, as ultimately the
crash utility pulls that from
kernel data structures, not the elfcorehdr.

I think this is what Sourabh has known and has been
advocating for an optimization
path that allows not regenerating the elfcorehdr on cpu
changes (because all the percpu
structs are all laid out). I do think it best to leave
that as an arch choice.

Since things are clear on how the PT_NOTES are consumed in
kdump kernel [fs/proc/vmcore.c],
makedumpfile, and crash tool I need your opinion on this:

Do we really need to regenerate elfcorehdr for CPU hotplug events?
If yes, can you please list the elfcorehdr components that
changes due to CPU hotplug.
Due to the use of for_each_present_cpu(), it is possible for the
number of cpu PT_NOTEs
to fluctuate as cpus are un/plugged. Onlining/offlining of cpus
does not impact the
number of cpu PT_NOTEs (as the cpus are still present).


  From what I understood, crash notes are prepared for
possible CPUs as system boots and
could be used to create a PT_NOTE section for each possible
CPU while generating the elfcorehdr
during the kdump kernel load.

Now once the elfcorehdr is loaded with PT_NOTEs for every
possible CPU there is no need to
regenerate it for CPU hotplug events. Or do we?

For onlining/offlining of cpus, there is no need to regenerate
the elfcorehdr. However,
for actual hot un/plug of cpus, the answer is yes due to
for_each_present_cpu(). The
caveat here of course is that if crash utility is the only
coredump analyzer of concern,
then it doesn't care about these cpu PT_NOTEs and there would be
no need to re-generate them.

Also, I'm not sure if ARM cpu hotplug, which is just now coming
into mainstream, impacts
any of this.

Perhaps the one item that might help here is to distinguish
between actual hot un/plug of
cpus, versus onlining/offlining. At the moment, I can not
distinguish between a hot plug
event and an online event (and unplug/offline). If those were
distinguishable, then we
could only regenerate on un/plug events.

Or perhaps moving to for_each_possible_cpu() is the better choice?

Yes, because once elfcorehdr is built with possible CPUs we don't
have to worry about
hot[un]plug case.

Here is my view on how things should be handled if a core-dump
analyzer is dependent on
elfcorehdr PT_NOTEs to find online/offline CPUs.

A PT_NOTE in elfcorehdr holds the address of the corresponding crash
notes (kernel has
one crash note per CPU for every possible CPU). Though the crash
notes are allocated
during the boot time they are populated when the system is on the
crash path.

This is how crash notes are populated on PowerPC and I am expecting
it would be something
similar on other architectures too.

The crashing CPU sends IPI to every other online CPU with a callback
function that updates the
crash notes of that specific CPU. Once the IPI completes the
crashing CPU updates its own crash
note and proceeds further.

The crash notes of CPUs remain uninitialized if the CPUs were
offline or hot unplugged at the time
system crash. The core-dump analyzer should be able to identify
[un]/initialized crash notes
and display the information accordingly.

Thoughts?

- Sourabh

In general, I agree with your points. You've presented a strong case to
go with for_each_possible_cpu() in crash_prepare_elf64_headers() and
those crash notes would always be present, and we can ignore changes to
cpus wrt/ elfcorehdr updates.

But what do we do about kexec_load() syscall? The way the userspace
utility works is it determines cpus by:
  nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
which is not the equivalent of possible_cpus. So the complete list of
cpu PT_NOTEs is not generated up front. We would need a solution for
that?
Hello Eric,

The sysconf document says _SC_NPROCESSORS_CONF is processors configured,
isn't that equivalent to possible CPUs?

What exactly sysconf(_SC_NPROCESSORS_CONF) returns on x86? IIUC, on powerPC
it is possible CPUs.

Baoquan,

 From sysconf man page, with my understanding, _SC_NPROCESSORS_CONF is
returning the possible cpus, while _SC_NPROCESSORS_ONLN returns present
cpus. If these are true, we can use them.

Thomas Gleixner has pointed out that:

  glibc tries to evaluate that in the following order:
   1) /sys/devices/system/cpu/cpu*
      That's present CPUs not possible CPUs
   2) /proc/stat
      That's online CPUs
   3) sched_getaffinity()
      That's online CPUs at best. In the worst case it's an affinity mask
      which is set on a process group

meaning that _SC_NPROCESSORS_CONF is not equivalent to possible_cpus(). Furthermore, the /sys/system/devices/cpus/cpuXX entries are not available for not-present-but-possible cpus; thus userspace kexec utility can not write out the elfcorehdr with all possible cpus listed.


But I am wondering why the existing present cpu way is going to be
discarded. Sorry, I tried to go through this thread, it's too long, can
anyone summarize the reason with shorter and clear sentences. Sorry
again for that.

By utilizing for_each_possible_cpu() in crash_prepare_elf64_headers(), in the case of the kexec_file_load(), this change would simplify some issues Sourabh has encountered for PPC support. It would also enable an optimization that permits NOT re-generating the elfcorehdr on cpu changes, as all the [possible] cpus are already described in the elfcorehdr.

I've pointed out that this change would have kexec_load (as kexec-tools can only write out, initially, the present_cpus()) initially deviate from kexec_file_load (which would now write out the possible_cpus()). This deviation would disappear after the first hotplug event (due to calling crash_prepare_elf64_headers()). Or I've provided a simple way for kexec_load to rewrite its elfcorehdr upon initial load (by calling into the crash hotplug handler).

Can you think of any side effects of going to for_each_possible_cpu()?

Thanks,
eric

Well, this won't be shorter sentences, but hopefully it makes the case clearer. Below I've cut-n-pasted my current patch w/ commit message which explains it all.

Please let me know if you can think of any side effects not addressed!
Thanks,
eric





In case sysconf(_SC_NPROCESSORS_CONF) is not consistent then we can go with:
/sys/devices/system/cpu/possible for kexec_load case.

Thoughts?

- Sourabh Jain



From b56aa428b07d970f26e3c3704d54ce8805f05ddc Mon Sep 17 00:00:00 2001
From: Eric DeVolder <eric.devolder@xxxxxxxxxx>
Date: Tue, 28 Feb 2023 14:20:04 -0500
Subject: [PATCH v19 3/7] crash: change crash_prepare_elf64_headers() to
 for_each_possible_cpu()

The function crash_prepare_elf64_headers() generates the elfcorehdr
which describes the cpus and memory in the system for the crash kernel.
In particular, it writes out ELF PT_NOTEs for memory regions and the
processors in the system.

With respect to the cpus, the current implementation utilizes
for_each_present_cpu() which means that as cpus are added and removed,
the elfcorehdr must again be updated to reflect the new set of cpus.

The reasoning behind the change to use for_each_possible_cpu(), is:

- At kernel boot time, all percpu crash_notes are allocated for all
  possible cpus; that is, crash_notes are not allocated dynamically
  when cpus are plugged/unplugged. Thus the crash_notes for each
  possible cpu are always available.

- The crash_prepare_elf64_headers() creates an ELF PT_NOTE per cpu.
  Changing to for_each_possible_cpu() is valid as the crash_notes
  pointed to by each cpu PT_NOTE are present and always valid.

Furthermore, examining a common crash processing path of:

 kernel panic -> crash kernel -> makedumpfile -> 'crash' analyzer
           elfcorehdr      /proc/vmcore     vmcore

reveals how the ELF cpu PT_NOTEs are utilized:

- Upon panic, each cpu is sent an IPI and shuts itself down, recording
 its state in its crash_notes. When all cpus are shutdown, the
 crash kernel is launched with a pointer to the elfcorehdr.

- The crash kernel via linux/fs/proc/vmcore.c does not examine or
 use the contents of the PT_NOTEs, it exposes them via /proc/vmcore.

- The makedumpfile utility uses /proc/vmcore and reads the cpu
 PT_NOTEs to craft a nr_cpus variable, which is reported in a
 header but otherwise generally unused. Makedumpfile creates the
 vmcore.

- The 'crash' dump analyzer does not appear to reference the cpu
 PT_NOTEs. Instead it looks-up the cpu_[possible|present|onlin]_mask
 symbols and directly examines those structure contents from vmcore
 memory. From that information it is able to determine which cpus
 are present and online, and locate the corresponding crash_notes.
 Said differently, it appears to me that 'crash' analyzer does not
 rely on the ELF PT_NOTEs for cpus; rather it obtains the information
 directly via kernel symbols and the memory within the vmcore.

(There maybe other vmcore generating and analysis tools that do use
these PT_NOTEs, but 'makedumpfile' and 'crash' seem to me to be the
most common solution.)

This change results in the benefit of having all cpus described in
the elfcorehdr, and therefore reducing the need to re-generate the
elfcorehdr on cpu changes, at the small expense of an additional
56 bytes per PT_NOTE for not-present-but-possible cpus.

On systems where kexec_file_load() syscall is utilized, all the above
is valid. On systems where kexec_load() syscall is utilized, there
may be the need for the elfcorehdr to be regenerated once. The reason
being that some archs only populate the 'present' cpus in the
/sys/devices/system/cpus entries, which the userspace 'kexec' utility
uses to generate the userspace-supplied elfcorehdr. In this situation,
one memory or cpu change will rewrite the elfcorehdr via the
crash_prepare_elf64_headers() function and now all possible cpus will
be described, just as with kexec_file_load() syscall.

Suggested-by: Sourabh Jain <sourabhjain@xxxxxxxxxxxxx>
Signed-off-by: Eric DeVolder <eric.devolder@xxxxxxxxxx>
---
 kernel/crash_core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index dba4b75f7541..537b199a8774 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -365,7 +365,7 @@ int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map,
 	ehdr->e_phentsize = sizeof(Elf64_Phdr);

 	/* Prepare one phdr of type PT_NOTE for each present CPU */
-	for_each_present_cpu(cpu) {
+	for_each_possible_cpu(cpu) {
 		phdr->p_type = PT_NOTE;
 		notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
 		phdr->p_offset = phdr->p_paddr = notes_addr;
--
2.31.1


_______________________________________________
kexec mailing list
kexec@xxxxxxxxxxxxxxxxxxx
http://lists.infradead.org/mailman/listinfo/kexec




[Index of Archives]     [LM Sensors]     [Linux Sound]     [ALSA Users]     [ALSA Devel]     [Linux Audio Users]     [Linux Media]     [Kernel]     [Gimp]     [Yosemite News]     [Linux Media]

  Powered by Linux