Re: [PATCH v18 5/7] kexec: exclude hot remove cpu from elfcorehdr notes

Sourabh Jain <sourabhjain@xxxxxxxxxxxxx> · Mon, 13 Feb 2023 10:10:24 +0530

On 11/02/23 06:05, Eric DeVolder wrote:

On 2/10/23 00:29, Sourabh Jain wrote:

On 10/02/23 01:09, Eric DeVolder wrote:

On 2/9/23 12:43, Sourabh Jain wrote:
Hello Eric,

On 09/02/23 23:01, Eric DeVolder wrote:

On 2/8/23 07:44, Thomas Gleixner wrote:
Eric!

On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote:
On 2/1/23 05:33, Thomas Gleixner wrote:

So my latest solution is introduce two new CPUHP states, 
CPUHP_AP_ELFCOREHDR_ONLINE
for onlining and CPUHP_BP_ELFCOREHDR_OFFLINE for offlining. I'm 
open to better names.

The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after 
CPUHP_BRINGUP_CPU. My
attempts at locating this state failed when inside the STARTING 
section, so I located
this just inside the ONLINE sectoin. The crash hotplug handler 
is registered on
this state as the callback for the .startup method.

The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before 
CPUHP_TEARDOWN_CPU, and I
placed it at the end of the PREPARE section. This crash hotplug 
handler is also
registered on this state as the callback for the .teardown method.

TBH, that's still overengineered. Something like this:

bool cpu_is_alive(unsigned int cpu)
{
    struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);

    return data_race(st->state) <= CPUHP_AP_IDLE_DEAD;
}

and use this to query the actual state at crash time. That spares 
all
those callback heuristics.

I'm making my way though percpu crash_notes, elfcorehdr, 
vmcoreinfo,
makedumpfile and (the consumer of it all) the userspace crash 
utility,
in order to understand the impact of moving from 
for_each_present_cpu()
to for_each_online_cpu().

Is the packing actually worth the trouble? What's the actual win?

Thanks,

         tglx

Thomas,
I've investigated the passing of crash notes through the vmcore. 
What I've learned is that:

- linux/fs/proc/vmcore.c (which makedumpfile references to do its 
job) does
  not care what the contents of cpu PT_NOTES are, but it does 
coalesce them together.

- makedumpfile will count the number of cpu PT_NOTES in order to 
determine its
  nr_cpus variable, which is reported in a header, but otherwise 
unused (except
  for sadump method).

- the crash utility, for the purposes of determining the cpus, 
does not appear to
  reference the elfcorehdr PT_NOTEs. Instead it locates the various
  cpu_[possible|present|online]_mask and computes nr_cpus from 
that, and also of
  course which are online. In addition, when crash does reference 
the cpu PT_NOTE,
  to get its prstatus, it does so by using a percpu technique 
directly in the vmcore
  image memory, not via the ELF structure. Said differently, it 
appears to me that
  crash utility doesn't rely on the ELF PT_NOTEs for cpus; rather 
it obtains them
  via kernel cpumasks and the memory within the vmcore.

With this understanding, I did some testing. Perhaps the most 
telling test was that I
changed the number of cpu PT_NOTEs emitted in the 
crash_prepare_elf64_headers() to just 1,
hot plugged some cpus, then also took a few offline sparsely via 
chcpu, then generated a
vmcore. The crash utility had no problem loading the vmcore, it 
reported the proper number
of cpus and the number offline (despite only one cpu PT_NOTE), and 
changing to a different
cpu via 'set -c 30' and the backtrace was completely valid.

My take away is that crash utility does not rely upon ELF cpu 
PT_NOTEs, it obtains the
cpu information directly from kernel data structures. Perhaps at 
one time crash relied
upon the ELF information, but no more. (Perhaps there are other 
crash dump analyzers
that might rely on the ELF info?)

So, all this to say that I see no need to change 
crash_prepare_elf64_headers(). There
is no compelling reason to move away from for_each_present_cpu(), 
or modify the list for
online/offline.

Which then leaves the topic of the cpuhp state on which to 
register. Perhaps reverting
back to the use of CPUHP_BP_PREPARE_DYN is the right answer. There 
does not appear to
be a compelling need to accurately track whether the cpu went 
online/offline for the
purposes of creating the elfcorehdr, as ultimately the crash 
utility pulls that from
kernel data structures, not the elfcorehdr.

I think this is what Sourabh has known and has been advocating for 
an optimization
path that allows not regenerating the elfcorehdr on cpu changes 
(because all the percpu
structs are all laid out). I do think it best to leave that as an 
arch choice.

Since things are clear on how the PT_NOTES are consumed in kdump 
kernel [fs/proc/vmcore.c],
makedumpfile, and crash tool I need your opinion on this:

Do we really need to regenerate elfcorehdr for CPU hotplug events?
If yes, can you please list the elfcorehdr components that changes 
due to CPU hotplug.
Due to the use of for_each_present_cpu(), it is possible for the 
number of cpu PT_NOTEs
to fluctuate as cpus are un/plugged. Onlining/offlining of cpus does 
not impact the
number of cpu PT_NOTEs (as the cpus are still present).

 From what I understood, crash notes are prepared for possible CPUs 
as system boots and
could be used to create a PT_NOTE section for each possible CPU 
while generating the elfcorehdr
during the kdump kernel load.

Now once the elfcorehdr is loaded with PT_NOTEs for every possible 
CPU there is no need to
regenerate it for CPU hotplug events. Or do we?

For onlining/offlining of cpus, there is no need to regenerate the 
elfcorehdr. However,
for actual hot un/plug of cpus, the answer is yes due to 
for_each_present_cpu(). The
caveat here of course is that if crash utility is the only coredump 
analyzer of concern,
then it doesn't care about these cpu PT_NOTEs and there would be no 
need to re-generate them.

Also, I'm not sure if ARM cpu hotplug, which is just now coming into 
mainstream, impacts
any of this.

Perhaps the one item that might help here is to distinguish between 
actual hot un/plug of
cpus, versus onlining/offlining. At the moment, I can not 
distinguish between a hot plug
event and an online event (and unplug/offline). If those were 
distinguishable, then we
could only regenerate on un/plug events.

Or perhaps moving to for_each_possible_cpu() is the better choice?

Yes, because once elfcorehdr is built with possible CPUs we don't 
have to worry about
hot[un]plug case.

Here is my view on how things should be handled if a core-dump 
analyzer is dependent on
elfcorehdr PT_NOTEs to find online/offline CPUs.

A PT_NOTE in elfcorehdr holds the address of the corresponding crash 
notes (kernel has
one crash note per CPU for every possible CPU). Though the crash 
notes are allocated
during the boot time they are populated when the system is on the 
crash path.

This is how crash notes are populated on PowerPC and I am expecting 
it would be something
similar on other architectures too.

The crashing CPU sends IPI to every other online CPU with a callback 
function that updates the
crash notes of that specific CPU. Once the IPI completes the crashing 
CPU updates its own crash
note and proceeds further.

The crash notes of CPUs remain uninitialized if the CPUs were offline 
or hot unplugged at the time
system crash. The core-dump analyzer should be able to identify 
[un]/initialized crash notes
and display the information accordingly.

Thoughts?

- Sourabh

In general, I agree with your points. You've presented a strong case 
to go with for_each_possible_cpu() in crash_prepare_elf64_headers() 
and those crash notes would always be present, and we can ignore 
changes to cpus wrt/ elfcorehdr updates.

But what do we do about kexec_load() syscall? The way the userspace 
utility works is it determines cpus by:
 nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
which is not the equivalent of possible_cpus. So the complete list of 
cpu PT_NOTEs is not generated up front. We would need a solution for 
that?
Hello Eric,

The sysconf document says _SC_NPROCESSORS_CONF is processors configured, 
isn't that equivalent to possible CPUs?

What exactly sysconf(_SC_NPROCESSORS_CONF) returns on x86? IIUC, on 
powerPC it is possible CPUs.

In case sysconf(_SC_NPROCESSORS_CONF) is not consistent then we can go with:
/sys/devices/system/cpu/possible for kexec_load case.

Thoughts?

- Sourabh Jain

_______________________________________________
kexec mailing list
kexec@xxxxxxxxxxxxxxxxxxx
http://lists.infradead.org/mailman/listinfo/kexec