On 02/13/23 at 10:10am, Sourabh Jain wrote:
>
> On 11/02/23 06:05, Eric DeVolder wrote:
> >
> > On 2/10/23 00:29, Sourabh Jain wrote:
> > >
> > > On 10/02/23 01:09, Eric DeVolder wrote:
> > > >
> > > > On 2/9/23 12:43, Sourabh Jain wrote:
> > > > > Hello Eric,
> > > > >
> > > > > On 09/02/23 23:01, Eric DeVolder wrote:
> > > > > >
> > > > > > On 2/8/23 07:44, Thomas Gleixner wrote:
> > > > > > > Eric!
> > > > > > >
> > > > > > > On Tue, Feb 07 2023 at 11:23, Eric DeVolder wrote:
> > > > > > > > On 2/1/23 05:33, Thomas Gleixner wrote:
> > > > > > > >
> > > > > > > > So my latest solution is to introduce two new CPUHP
> > > > > > > > states, CPUHP_AP_ELFCOREHDR_ONLINE for onlining and
> > > > > > > > CPUHP_BP_ELFCOREHDR_OFFLINE for offlining. I'm open to
> > > > > > > > better names.
> > > > > > > >
> > > > > > > > The CPUHP_AP_ELFCOREHDR_ONLINE needs to be placed after
> > > > > > > > CPUHP_BRINGUP_CPU. My attempts at locating this state
> > > > > > > > failed when inside the STARTING section, so I located it
> > > > > > > > just inside the ONLINE section. The crash hotplug handler
> > > > > > > > is registered on this state as the callback for the
> > > > > > > > .startup method.
> > > > > > > >
> > > > > > > > The CPUHP_BP_ELFCOREHDR_OFFLINE needs to be placed before
> > > > > > > > CPUHP_TEARDOWN_CPU, and I placed it at the end of the
> > > > > > > > PREPARE section. The crash hotplug handler is also
> > > > > > > > registered on this state as the callback for the
> > > > > > > > .teardown method.
> > > > > > >
> > > > > > > TBH, that's still overengineered. Something like this:
> > > > > > >
> > > > > > > bool cpu_is_alive(unsigned int cpu)
> > > > > > > {
> > > > > > >         struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
> > > > > > >
> > > > > > >         return data_race(st->state) > CPUHP_AP_IDLE_DEAD;
> > > > > > > }
> > > > > > >
> > > > > > > and use this to query the actual state at crash time. That
> > > > > > > spares all those callback heuristics.
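To make the suggestion concrete: the consumer of such a helper would
simply skip dead cpus while walking them at crash time, rather than
keeping state in sync via callbacks. A minimal sketch, assuming the
cpu_is_alive() helper above were exported from kernel/cpu.c;
crash_save_this_cpu() is a hypothetical stand-in for the per-cpu work,
not an existing kernel function:

static void crash_save_alive_cpus(void)
{
        unsigned int cpu;

        for_each_present_cpu(cpu) {
                /* Offline or dying at crash time: nothing to record. */
                if (!cpu_is_alive(cpu))
                        continue;
                crash_save_this_cpu(cpu);
        }
}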
> > > > > > >
> > > > > > > > I'm making my way through percpu crash_notes,
> > > > > > > > elfcorehdr, vmcoreinfo, makedumpfile and (the consumer
> > > > > > > > of it all) the userspace crash utility, in order to
> > > > > > > > understand the impact of moving from
> > > > > > > > for_each_present_cpu() to for_each_online_cpu().
> > > > > > >
> > > > > > > Is the packing actually worth the trouble? What's the
> > > > > > > actual win?
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > >         tglx
> > > > > >
> > > > > > Thomas,
> > > > > > I've investigated the passing of crash notes through the
> > > > > > vmcore. What I've learned is that:
> > > > > >
> > > > > > - linux/fs/proc/vmcore.c (which makedumpfile references to
> > > > > > do its job) does not care what the contents of the cpu
> > > > > > PT_NOTEs are, but it does coalesce them together.
> > > > > >
> > > > > > - makedumpfile will count the number of cpu PT_NOTEs in
> > > > > > order to determine its nr_cpus variable, which is reported
> > > > > > in a header but otherwise unused (except for the sadump
> > > > > > method).
> > > > > >
> > > > > > - the crash utility, for the purposes of determining the
> > > > > > cpus, does not appear to reference the elfcorehdr PT_NOTEs.
> > > > > > Instead it locates the various
> > > > > > cpu_[possible|present|online]_mask symbols and computes
> > > > > > nr_cpus from those, and also of course which cpus are
> > > > > > online. In addition, when crash does reference a cpu
> > > > > > PT_NOTE to get its prstatus, it does so using a percpu
> > > > > > technique directly on the vmcore image memory, not via the
> > > > > > ELF structure. Said differently, it appears to me that the
> > > > > > crash utility doesn't rely on the ELF PT_NOTEs for cpus;
> > > > > > rather it obtains them via the kernel cpumasks and the
> > > > > > memory within the vmcore.
> > > > > >
> > > > > > With this understanding, I did some testing. Perhaps the
> > > > > > most telling test: I changed the number of cpu PT_NOTEs
> > > > > > emitted in crash_prepare_elf64_headers() to just 1, hot
> > > > > > plugged some cpus, then also took a few offline sparsely
> > > > > > via chcpu, then generated a vmcore. The crash utility had
> > > > > > no problem loading the vmcore; it reported the proper
> > > > > > number of cpus and the number offline (despite only one cpu
> > > > > > PT_NOTE), and switching to a different cpu via 'set -c 30'
> > > > > > gave a completely valid backtrace.
> > > > > >
> > > > > > My takeaway is that the crash utility does not rely upon
> > > > > > the ELF cpu PT_NOTEs; it obtains the cpu information
> > > > > > directly from kernel data structures. Perhaps at one time
> > > > > > crash relied upon the ELF information, but no more.
> > > > > > (Perhaps there are other crash dump analyzers that might
> > > > > > rely on the ELF info?)
> > > > > >
> > > > > > So, all this to say that I see no need to change
> > > > > > crash_prepare_elf64_headers(). There is no compelling
> > > > > > reason to move away from for_each_present_cpu(), or to
> > > > > > modify the list for online/offline.
> > > > > >
> > > > > > Which then leaves the topic of the cpuhp state on which to
> > > > > > register. Perhaps reverting back to the use of
> > > > > > CPUHP_BP_PREPARE_DYN is the right answer. There does not
> > > > > > appear to be a compelling need to accurately track whether
> > > > > > the cpu went online/offline for the purposes of creating
> > > > > > the elfcorehdr, as ultimately the crash utility pulls that
> > > > > > from kernel data structures, not the elfcorehdr.
> > > > > >
> > > > > > I think this is what Sourabh has known and has been
> > > > > > advocating for: an optimization path that avoids
> > > > > > regenerating the elfcorehdr on cpu changes (because all the
> > > > > > percpu structs are already laid out). I do think it best to
> > > > > > leave that as an arch choice.
> > > > >
> > > > > Since things are clear on how the PT_NOTEs are consumed in
> > > > > the kdump kernel [fs/proc/vmcore.c], makedumpfile, and the
> > > > > crash tool, I need your opinion on this:
> > > > >
> > > > > Do we really need to regenerate the elfcorehdr for CPU
> > > > > hotplug events? If yes, can you please list the elfcorehdr
> > > > > components that change due to CPU hotplug.
> > > >
> > > > Due to the use of for_each_present_cpu(), it is possible for
> > > > the number of cpu PT_NOTEs to fluctuate as cpus are un/plugged.
> > > > Onlining/offlining of cpus does not impact the number of cpu
> > > > PT_NOTEs (as the cpus are still present).
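For reference, the loop in question looks roughly like the following
(paraphrased from kernel/kexec_file.c, so treat the details as
approximate); the whole debate reduces to which iterator is used:

        /* Prepare one PT_NOTE program header per cpu. */
        for_each_present_cpu(cpu) {     /* vs. for_each_possible_cpu(cpu) */
                phdr->p_type = PT_NOTE;
                /* Physical address of this cpu's crash_notes buffer. */
                notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
                phdr->p_offset = phdr->p_paddr = notes_addr;
                phdr->p_filesz = phdr->p_memsz = sizeof(note_buf_t);
                (ehdr->e_phnum)++;
                phdr++;
        }

With for_each_possible_cpu() the PT_NOTE count is fixed at boot, so hot
un/plug would no longer change e_phnum and no regeneration would be
needed.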
> > > > >
> > > > > From what I understood, crash notes are prepared for possible
> > > > > CPUs as the system boots, and they could be used to create a
> > > > > PT_NOTE section for each possible CPU while generating the
> > > > > elfcorehdr during the kdump kernel load.
> > > > >
> > > > > Now once the elfcorehdr is loaded with PT_NOTEs for every
> > > > > possible CPU, there is no need to regenerate it for CPU
> > > > > hotplug events. Or do we?
> > > >
> > > > For onlining/offlining of cpus, there is no need to regenerate
> > > > the elfcorehdr. However, for actual hot un/plug of cpus, the
> > > > answer is yes, due to for_each_present_cpu(). The caveat here,
> > > > of course, is that if the crash utility is the only coredump
> > > > analyzer of concern, then it doesn't care about these cpu
> > > > PT_NOTEs and there would be no need to re-generate them.
> > > >
> > > > Also, I'm not sure if ARM cpu hotplug, which is just now coming
> > > > into the mainstream, impacts any of this.
> > > >
> > > > Perhaps the one item that might help here is to distinguish
> > > > between actual hot un/plug of cpus and onlining/offlining. At
> > > > the moment, I can not distinguish between a hot plug event and
> > > > an online event (or between unplug and offline). If those were
> > > > distinguishable, then we could regenerate only on un/plug
> > > > events.
> > > >
> > > > Or perhaps moving to for_each_possible_cpu() is the better
> > > > choice?
> > >
> > > Yes, because once the elfcorehdr is built with possible CPUs we
> > > don't have to worry about the hot[un]plug case.
> > >
> > > Here is my view on how things should be handled if a core-dump
> > > analyzer is dependent on the elfcorehdr PT_NOTEs to find
> > > online/offline CPUs.
> > >
> > > A PT_NOTE in the elfcorehdr holds the address of the
> > > corresponding crash notes (the kernel has one crash note per CPU
> > > for every possible CPU). Though the crash notes are allocated
> > > during boot, they are populated when the system is on the crash
> > > path.
> > >
> > > This is how crash notes are populated on PowerPC, and I am
> > > expecting it would be something similar on other architectures
> > > too.
> > >
> > > The crashing CPU sends an IPI to every other online CPU with a
> > > callback function that updates the crash notes of that specific
> > > CPU. Once the IPI completes, the crashing CPU updates its own
> > > crash note and proceeds further.
> > >
> > > The crash notes of CPUs remain uninitialized if the CPUs were
> > > offline or hot unplugged at the time of the system crash. The
> > > core-dump analyzer should be able to identify [un]initialized
> > > crash notes and display the information accordingly.
> > >
> > > Thoughts?
> > >
> > > - Sourabh
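On identifying [un]initialized crash notes: crash_notes is allocated
with alloc_percpu(), which zero-fills, so a note that was never written
on the crash path should still carry an all-zero ELF note header. A
hedged analyzer-side sketch of that check (the function name is made up
for illustration):

#include <elf.h>
#include <stdbool.h>

/*
 * A populated per-cpu crash note carries a "CORE"/NT_PRSTATUS entry;
 * a never-written (zero-filled) buffer cannot, so its note header
 * still reads as zeroes.
 */
static bool crash_note_initialized(const Elf64_Nhdr *nhdr)
{
        return nhdr->n_namesz != 0 && nhdr->n_descsz != 0 &&
               nhdr->n_type == NT_PRSTATUS;
}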
> >
> > In general, I agree with your points. You've presented a strong
> > case to go with for_each_possible_cpu() in
> > crash_prepare_elf64_headers(): those crash notes would always be
> > present, and we could ignore changes to cpus wrt/ elfcorehdr
> > updates.
> >
> > But what do we do about the kexec_load() syscall? The way the
> > userspace utility works is it determines the cpus by:
> >
> >   nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
> >
> > which is not the equivalent of possible_cpus. So the complete list
> > of cpu PT_NOTEs would not be generated up front. We would need a
> > solution for that?
>
> Hello Eric,
>
> The sysconf documentation says _SC_NPROCESSORS_CONF is the number of
> processors configured; isn't that equivalent to possible CPUs?
>
> What exactly does sysconf(_SC_NPROCESSORS_CONF) return on x86? IIUC,
> on PowerPC it is the possible CPUs.

From the sysconf man page, as I understand it, _SC_NPROCESSORS_CONF
returns the possible cpus, while _SC_NPROCESSORS_ONLN returns the
present cpus. If both of these are true, we can use them.

But I am wondering why the existing present-cpu approach is going to
be discarded. Sorry, I tried to go through this thread, but it's too
long; can anyone summarize the reason in a few short, clear sentences?
Sorry again for that.

> In case sysconf(_SC_NPROCESSORS_CONF) is not consistent, then we can
> go with /sys/devices/system/cpu/possible for the kexec_load case.
>
> Thoughts?
>
> - Sourabh Jain
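For whoever picks up the sysconf question, a quick userspace check to
compare the sources being discussed on a given machine (a rough,
untested sketch; error handling elided):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
        char range[64];
        FILE *f = fopen("/sys/devices/system/cpu/possible", "r");

        /* glibc's view of configured/online processors. */
        printf("_SC_NPROCESSORS_CONF: %ld\n", sysconf(_SC_NPROCESSORS_CONF));
        printf("_SC_NPROCESSORS_ONLN: %ld\n", sysconf(_SC_NPROCESSORS_ONLN));
        /* The kernel's possible-cpu range, e.g. "0-511". */
        if (f && fgets(range, sizeof(range), f))
                printf("cpu/possible        : %s", range);
        if (f)
                fclose(f);
        return 0;
}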