Paranoia is usually a good thing in this industry, and you know this
code far better than I do...

For the older kernels that don't have cpu_present_map, if they still
have the x8664_pda structure, the code my patch changes shouldn't get
executed. It's the deprecation of the x8664_pda structure (between
SLES10 and SLES11 in our case) that exposes this issue.

The setting of the other CPUs to offline (IPI REBOOT_VECTOR) is done in
native_smp_send_stop() [arch/x86/kernel/smp.c], which is called by
panic(). Note that the SLES11 version of the 2.6.32 kernel allows
calling crash_kexec() after calling atomic_notifier_call_chain() in
panic().

The flow during an oops or keyboard-induced crash does not use this same
code. In that case crash_kexec() is called by oops_end(), which is
called by die().

Jeff

-----Original Message-----
From: crash-utility-bounces@xxxxxxxxxx [mailto:crash-utility-bounces@xxxxxxxxxx] On Behalf Of Dave Anderson
Sent: Thursday, September 23, 2010 1:55 PM
To: Discussion list for crash utility usage,maintenance and development
Subject: Re: Question on online/present/possible CPUS

----- "Jeffrey Hagen" <Jeffrey.Hagen@xxxxxxxxxxxx> wrote:

> Hi Dave,
>
> Attached is our suggested patch for the issue with CPU count in
> an NMI switch induced coredump. Basically the change uses the
> cpu_present_mask instead of the cpu_online_mask in x86_64_per_cpu_init
> and x86_64_get_smp_cpus.

I understand why you need to do it that way, but to make a change
like this makes me a little nervous because nobody's ever reported
this situation before, and I'm somewhat paranoid it may lead to
unexpected behavior. Plus there are old kernels that don't even
have a cpu_present_map.

> In answer to your question below: "Are you saying that the NMI
> switch shutdown handler takes the other cpus offline?" --- Yes!!

Where exactly? Can you point me to the kernel code that does that?

Dave

>
> Thanks,
>
> Jeff
>
>
> -----Original Message-----
> From: crash-utility-bounces@xxxxxxxxxx
> [mailto:crash-utility-bounces@xxxxxxxxxx] On Behalf Of Dave Anderson
> Sent: Thursday, August 12, 2010 6:22 AM
> To: Discussion list for crash utility usage,maintenance and development
> Subject: Re: Question on online/present/possible CPUS
>
>
> ----- "Jeffrey Hagen" <Jeffrey.Hagen@xxxxxxxxxxxx> wrote:
>
> > Hi Petr and Dave,
> >
> > I have a couple of comments on Petr's email regarding CPU count.
> >
> > When the dump is the result of an NMI (nmi switch pressed) due to a
> > hung system, one often needs to analyze the state and backtrace for
> > all the CPU's. Since the kernel halts all but CPU0, the crash
> > utility cannot see the other "offline" CPU's.
>
> I've never seen that behavior before. Probably because I've never seen
> an x86_64 dumpfile that was created as a result of the NMI switch being
> pressed? Anyway, are you saying that the NMI switch shutdown handler
> takes the other cpus offline?
>
> > This behavior has changed for the x86 architecture somewhere between
> > 2.6.16 (SLES10) and 2.6.32 (SLES11) due to the removal of the
> > x8664_pda structure.
> > The function x86_64_init (in x86_64.c) now calls x86_64_per_cpu_init
> > which doesn't count the offline CPUS when calculating the number of
> > CPU's. Previously, x86_64_cpu_pda_init (called if x8664_pda exists),
> > didn't check for online/offline status.
>
> Again -- I've never seen this behaviour before.
>
> In any case, I'll look at any patch suggestions you guys have in mind.
>
> Thanks,
> Dave
>
>
> > Regarding #3 in Petr's email. It appears that the set command won't
> > accept a value >= kt_cpus (number of CPUS). It doesn't check if the
> > CPU is offline or not.
> >
> > Thanks,
> >
> > Jeff Hagen
> >
> >
> > > Hi all,
> > >
> > > before making a larger cleanup, I want to ask here for your
> > > opinion. It seems that there is quite a bit of confusion about the
> > > meaning of CPU count printed out by the crash utility.
> > >
> > > 1. Number of CPUs
> > >
> > > Some people think that crash should always output the number of
> > > CPUs in the system (ie. a quad-core server should always output
> > > 'CPUS: 4'), while other people think that only online CPUs should
> > > be counted.
> > >
> > > 2. CPU numbering
> > >
> > > For example, if there are 4 CPUs in the system, but some of them
> > > are taken offline (e.g. CPU 1 and CPU 3), _and_ crash output the
> > > number of online CPUs, it would print out 'CPUS: 2'. It's not easy
> > > to find out that valid CPU numbers are 0 and 2 in this case.
> >
> > Hi Petr,
> >
> > For all but ppc64, the number shown by the initial banner and the
> > "sys" command is essentially "the-highest-cpu-number-plus-one".
> > For ppc64 (as requested and implemented by the IBM/ppc64
> > maintainers), it shows the number of online cpus. There's reasons
> > for doing it either of the two ways, but I'm on vacation now, and
> > you can research the list archives for the various arguments
> > for-and-against doing it either way. Check the changelog.html for
> > when it was changed for ppc64, and then cross-reference the revision
> > date with the list archives.
> >
> > > 3. Examining offline CPU
> > >
> > > Sometimes, it may be useful to examine the state of an offline
> > > CPU. Now, I know that the saved state is most likely stale, but it
> > > can be useful in some cases (e.g. a crash after dropping to kdb).
> > > The crash utility currently refuses to select an offline CPU with
> > > 'set -c #'. Are there any concerns about allowing it?
> >
> > I tend to agree with you, but the only thing that's useful and
> > available from an offline cpu is the swapper task for that cpu
> > and the runqueue for that cpu. And both of those entities are
> > readily accessible if you really need them. Although I don't know
> > anything about kdb status, so maybe there's something of per-cpu
> > interest, but I don't know why it would be necessary to "set"
> > that cpu?
> >
> > In any case, like I said before, I'm just temporarily online while
> > on vacation, and will be back to work on the 9th.
> >
> > Thanks,
> > Dave
> >
> > --
> > Crash-utility mailing list
> > Crash-utility@xxxxxxxxxx
> > https://www.redhat.com/mailman/listinfo/crash-utility
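
The counting conventions argued over in this thread (number of online
CPUs versus "the-highest-cpu-number-plus-one", and online mask versus
present mask) can be illustrated with a small sketch. This is not code
from the crash utility or the kernel: cpu_mask_t and all three helper
names are hypothetical, and a plain 64-bit word is assumed in place of
the kernel's cpumask_t.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a kernel cpumask: bit N set means CPU N
 * belongs to the set (online, present, ...). Limited to 64 CPUs. */
typedef uint64_t cpu_mask_t;

/* Number of CPUs in the mask, i.e. the "count only online CPUs"
 * convention when given an online mask. */
static int count_cpus(cpu_mask_t mask)
{
    int n = 0;

    while (mask) {
        mask &= mask - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Highest CPU number in the mask plus one; 0 for an empty mask. */
static int highest_cpu_plus_one(cpu_mask_t mask)
{
    int n = 0;

    while (mask) {
        mask >>= 1;
        n++;
    }
    return n;
}

/* Returns 1 if the given CPU is a member of the mask, else 0. */
static int cpu_in_mask(cpu_mask_t mask, int cpu)
{
    assert(cpu >= 0 && cpu < 64);
    return (int)((mask >> cpu) & 1);
}
```

With Petr's example (4 CPUs, CPU 1 and CPU 3 offline) the online mask
is 0x5: count_cpus() gives 2, while highest_cpu_plus_one() gives 3. In
the NMI case Jeff describes, an online mask of 0x1 yields 1 under
either convention, whereas a present mask of 0xF yields 4, which is the
difference his patch relies on.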