[patch 0/9] kdump: Patch series for s390 support

schwidefsky@xxxxxxxxxx (Martin Schwidefsky) · Wed, 13 Jul 2011 18:46:11 +0200

On Wed, 13 Jul 2011 12:02:39 -0400
Vivek Goyal <vgoyal at redhat.com> wrote:

> On Mon, Jul 11, 2011 at 05:56:26PM +0200, Martin Schwidefsky wrote:
> 
> [..]
> > > kexec-tools purgatory code already has the checksum logic. So you don't
> > > have to redo that in stand alone tools. I think you probably need an
> > > s390-specific purgatory that jumps to IPLing the stand alone kernel if the
> > > kdump kernel is corrupted, instead of rebooting or spinning infinitely
> > > in the loop.
> > 
> > I can not quite follow you here. The purgatory code is part of the kdump kernel,
> > no? When we trigger a dump with the stand-alone tools we will start executing
> > code in the assembler function of that stand-alone tools. We can not trust
> > the kdump kernel yet, not without doing the checksums first.
> 
> Purgatory is another piece of binary code which is loaded along with kdump
> kernel in reserved memory area. So yes, there is a chance that this code
> itself gets corrupted.

Yes, that is one of the possible failure scenarios.

> So in case of stand alone dump, you save the calculated checksum of the
> kdump kernel on disk and not in memory? And then calculate the checksum
> of the memory image of the kdump kernel and decide whether the kdump kernel
> is corrupted or not?
> 
> If yes, this sounds more reliable as checksum of kernel is stored on
> some disk/tape.

No, the checksum for the purgatory code is stored in memory. If the purgatory
code is corrupted, the checksum would have to be corrupted in a very specific
way as well for the verification to pass wrongly. The likelihood of that is
very low, but if it does happen we still have a fallback plan: before we branch
to the purgatory code we invalidate the checksum. If the purgatory code is
corrupted even though the checksum said it was fine, the machine will crash
again. If we then start the stand-alone dump tool again, it finds the
invalidated checksum and creates a full dump. But mind you, that second IPL of
the stand-alone dump tool is only required in a very, very rare case.
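
To make the fallback concrete, here is a rough sketch of the check as the
stand-alone dump tool might do it. This is illustration only, not the actual
patch code: struct kdump_image, image_checksum() and standalone_dump_decide()
are made-up names, and the rotate-and-xor checksum is just a placeholder for
whatever algorithm is really used.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor of the preloaded purgatory/kdump image. */
struct kdump_image {
	const uint8_t *base;	/* start of the image in reserved memory */
	size_t len;		/* length of the image in bytes */
	uint32_t checksum;	/* checksum stored in memory at load time */
	int checksum_valid;	/* cleared before we branch to purgatory */
};

enum dump_action { DUMP_VIA_KDUMP, DUMP_FULL_STANDALONE };

/* Placeholder checksum (rotate and xor), not the real algorithm. */
static uint32_t image_checksum(const uint8_t *p, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i < len; i++)
		sum = ((sum << 1) | (sum >> 31)) ^ p[i];
	return sum;
}

/* Decide what the stand-alone dump tool does after it has been IPLed. */
static enum dump_action standalone_dump_decide(struct kdump_image *img)
{
	/* Invalidated by an earlier attempt: kdump failed, do a full dump. */
	if (!img->checksum_valid)
		return DUMP_FULL_STANDALONE;
	/* Image corrupted: do not trust it, do a full dump. */
	if (image_checksum(img->base, img->len) != img->checksum)
		return DUMP_FULL_STANDALONE;
	/* Invalidate first, so a crash in purgatory/kdump is detected. */
	img->checksum_valid = 0;
	return DUMP_VIA_KDUMP;
}
```

The first call on an intact image returns DUMP_VIA_KDUMP and clears the valid
flag; a second call after a re-IPL then returns DUMP_FULL_STANDALONE.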

> [..]
> > > Ok. So again why not reuse the checksum capability of kexec-tools and
> > > instead of infinite looping you can jump to stand alone tools + IPL etc.
> > > I understand this will require a tighter integration with kexec-tools
> > > and using ELF header mechanism and will not cover the early kernel
> > > crashes.
> > 
> > Imho the checksum of kexec-tools is in the wrong place.
> 
> Because you think that stored checksum can get corrupted?

No, what I meant is that the code that verifies the checksum has to be part
of the stand-alone dump tool and not the purgatory code.

> [..]
> > > To me we seem to be diverging a lot from existing kdump+kexec-tools
> > > mechanism just to solve the case of early crash dumping. If we break
> > > down the problem in two parts and do things the kexec-tools way (with a
> > > backup path of booting stand alone kernel if kdump kernel is corrupted),
> > > things might be better.
> > 
> > The "backup path of booting stand alone kernel" would result in passing
> > the control twice, once from the stand-alone dumper to the kexec purgatory
> > (after the purgatory checksum has been verified), then doing more checks 
> > in the kdump kernel, only to return to the stand-alone dumper if some check
> > fails. Does not really sound enticing to me.
> 
> What I am suggesting is that stand alone dumper gets control only if
> kdump kernel is corrupted.
> 
> So following sequence.
> 
> Kernel Crash ---> purgatory --> either kdump kernel/IPL stand alone tools
> 
> Here only drawback seems to be that we assume that purgatory code and
> pre-calculated checksum has not been corrupted. The big advantage is
> that s390 kdump support looks very similar to other arches and
> understanding and supporting kdump across architectures becomes easy.

My problem with that is the following: how do we get from the "Kernel Crash"
step to the purgatory code? It does work for "normal" panics, but it fails
miserably for a hard crash that does not even get as far as panic. That is
why we insist on a possible second order of things:

Kernel Crash --> IPL of stand-alone dump tool --> branch to kdump if the
checksums turn out ok. 

If the kernel called panic itself and branched to the purgatory code, but the
checksum turned out to be bad, we just stop there. Then the operator has to
do a manual IPL of the stand-alone dump tool to get the dump.
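
Written down as a decision function, the two entry paths look roughly like
this. Again only a sketch with invented names (crash_kind, outcome,
dump_path), and it assumes that a bad checksum on the stand-alone path ends in
a full dump, as described further up.

```c
/* How the crashed kernel went down (hypothetical classification). */
enum crash_kind {
	CRASH_PANIC,	/* kernel reached panic() and branched to purgatory */
	CRASH_HARD	/* hard crash, operator IPLs the stand-alone dump tool */
};

enum outcome {
	OUT_KDUMP,			/* kdump kernel creates the dump */
	OUT_STOP_WAIT_FOR_OPERATOR,	/* stop, manual IPL required */
	OUT_FULL_STANDALONE_DUMP	/* stand-alone tool creates a full dump */
};

static enum outcome dump_path(enum crash_kind kind, int checksum_ok)
{
	if (kind == CRASH_PANIC)
		/* Purgatory verified the checksum itself. */
		return checksum_ok ? OUT_KDUMP : OUT_STOP_WAIT_FOR_OPERATOR;
	/* Stand-alone dump tool was IPLed and verified the checksums. */
	return checksum_ok ? OUT_KDUMP : OUT_FULL_STANDALONE_DUMP;
}
```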

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.