Re: Handle the NT_PRSTATUS lost for the "bt" command

Dave Anderson <anderson@xxxxxxxxxx> · Tue, 19 Jun 2012 14:25:50 -0400 (EDT)

----- Original Message -----

> 
> OK, now I'm getting confused...
>

The more I look at this patch, the more confused I get...

During initialization, the ELF notes contained in the
dumpfile file header are scanned, and if an NT_PRSTATUS
note is seen, a pointer to its location in the dumpfile
is saved in dd->nt_prstatus_percpu[num] and the "num"
of valid notes is kept in dd->num_prstatus_notes.

If the dd->num_prstatus_notes is equal to the online cpu
count, then it is presumed that there is a one-to-one 
relationship, where the cpu number can be used as the 
index into the dd->nt_prstatus_percpu[num] array.
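
To make the one-to-one case concrete, here is a minimal sketch of it.
It is not the actual crash code: struct dd_sketch, MAX_CPUS_SKETCH,
save_prstatus_note() and prstatus_for_cpu() are hypothetical names made
up for illustration, and the bounds check against the note count is my
assumption of how the lookup behaves.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CPUS_SKETCH 8   /* hypothetical bound, for illustration only */

/* Hypothetical stand-in for the two diskdump fields discussed above. */
struct dd_sketch {
    void *nt_prstatus_percpu[MAX_CPUS_SKETCH]; /* saved note locations */
    int   num_prstatus_notes;                  /* number of valid notes */
};

/* Called once per NT_PRSTATUS note seen while scanning the header. */
static int save_prstatus_note(struct dd_sketch *dd, void *note)
{
    assert(dd != NULL);
    if (dd->num_prstatus_notes >= MAX_CPUS_SKETCH)
        return -1;                             /* no room left */
    dd->nt_prstatus_percpu[dd->num_prstatus_notes++] = note;
    return 0;
}

/* One-to-one case: the cpu number is used directly as the index. */
static void *prstatus_for_cpu(const struct dd_sketch *dd, int cpu)
{
    if (cpu < 0 || cpu >= dd->num_prstatus_notes)
        return NULL;                           /* assumed bounds check */
    return dd->nt_prstatus_percpu[cpu];
}
```

With two notes saved, cpus 0 and 1 resolve to their notes, and any
higher cpu number yields NULL.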

If the number of notes is not equal to the number of online
cpus, then the "mapping" function is called, where if a
cpu is found to be offline, then its (incorrectly) associated
entry in the dd->nt_prstatus_percpu[num] array is "pushed up"
to the next higher entry.  But the dd->num_prstatus_notes
does not seem to get incremented to reflect that move, so
then it seems like diskdump_get_prstatus_percpu() can
possibly return NULL when there actually is a relevant 
NT_PRSTATUS note.  
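
The concern can be reproduced in a small sketch.  Again, this is not
the crash source: struct dd_remap_sketch, remap_notes(),
lookup_by_count() and lookup_by_slot() are hypothetical names, and the
remapping loop is only my reading of the behavior described above.
It assumes cpus 0 and 2 are online, cpu 1 is offline, and two notes
were found.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_CPUS_REMAP 8    /* hypothetical bound, for illustration only */

struct dd_remap_sketch {
    void *nt_prstatus_percpu[MAX_CPUS_REMAP];
    int   num_prstatus_notes;
};

/* Spread the packed notes out so that each online cpu gets the next note. */
static void remap_notes(struct dd_remap_sketch *dd, const int *online,
                        int nr_cpus)
{
    void *packed[MAX_CPUS_REMAP];
    int cpu, next = 0;

    assert(nr_cpus <= MAX_CPUS_REMAP);
    memcpy(packed, dd->nt_prstatus_percpu, sizeof(packed));

    for (cpu = 0; cpu < MAX_CPUS_REMAP; cpu++)
        dd->nt_prstatus_percpu[cpu] = NULL;

    for (cpu = 0; cpu < nr_cpus; cpu++) {
        if (online[cpu] && next < dd->num_prstatus_notes)
            dd->nt_prstatus_percpu[cpu] = packed[next++];
    }
    /* num_prstatus_notes is left unchanged, as in the case described. */
}

/* Lookup bounded by the note count: misses notes moved past the count. */
static void *lookup_by_count(const struct dd_remap_sketch *dd, int cpu)
{
    if (cpu < 0 || cpu >= dd->num_prstatus_notes)
        return NULL;
    return dd->nt_prstatus_percpu[cpu];
}

/* Lookup bounded by the array size: finds the moved note. */
static void *lookup_by_slot(const struct dd_remap_sketch *dd, int cpu)
{
    if (cpu < 0 || cpu >= MAX_CPUS_REMAP)
        return NULL;
    return dd->nt_prstatus_percpu[cpu];
}
```

In that scenario the note for cpu 2 ends up in slot 2, but the count
is still 2, so the count-bounded lookup returns NULL for cpu 2 even
though the slot holds a valid note.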

That seems to be a bug (?), but it's not particularly important,
because for x86 and x86_64, the data in the NT_PRSTATUS notes is
only used as the starting point for backtraces if the PC/SP pair
cannot be determined otherwise, and it can be determined otherwise
virtually all of the time.  So the registers found in the NT_PRSTATUS
notes are pretty much useless...
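
Put another way, the note registers are only a last resort.  A rough
sketch of that ordering follows; struct start_regs and
choose_bt_start() are hypothetical names for illustration and do not
correspond to any function in the crash source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical PC/SP pair used as a backtrace starting point. */
struct start_regs {
    unsigned long pc;
    unsigned long sp;
};

/*
 * Prefer a PC/SP pair determined by other means; fall back to the
 * registers from the NT_PRSTATUS note only when none is available.
 * Returns 1 and fills *out on success, 0 if no starting point exists.
 */
static int choose_bt_start(const struct start_regs *derived,
                           const struct start_regs *from_note,
                           struct start_regs *out)
{
    assert(out != NULL);
    if (derived != NULL) {
        *out = *derived;
        return 1;
    }
    if (from_note != NULL) {
        *out = *from_note;
        return 1;
    }
    return 0;
}
```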

Now, to complicate matters, your patch does not look at the
NT_PRSTATUS notes in the dumpfile header, but instead looks
at the base kernel's original notes, and verifies their
contents, and correlates what's found there against what was
found in the dumpfile?  So I don't understand what you are 
attempting to do -- what is the difference between the notes 
that are copied into the dumpfile vs. what you are looking at 
in the base kernel?

I'm also wondering what would happen in your case if there
were a combination of "lost" notes *and* offline cpus?  How
would that work?

So at this point I really don't want to add this patch
at all because it touches common code, and I don't want to
risk breaking the other arches.  Nobody has ever reported
any "lost" cpus so far, probably because the kdump facility
uses NMIs to shut down the non-panicking cpus.
This is such a highly unlikely corner case that it does
not even seem worth addressing for fear of breaking something
else.

I didn't look at the reasoning behind why you ran into a
segmentation violation, but since the PPC code path would be:

 ...
   back_trace()
    get_diskdump_regs()
      get_diskdump_regs_ppc()

perhaps you can rework your patch so that it is segregated
to PPC only? 

Dave
