My position on general ``RAS'' tool support infrastructure

vgoyal@xxxxxxxxxx (Vivek Goyal) · Tue, 18 Sep 2007 09:58:22 +0530

On Mon, Sep 17, 2007 at 06:38:53PM -0700, Randy Dunlap wrote:
> On Thu, 13 Sep 2007 07:21:10 -0600 Eric W. Biederman wrote:
> 
> > Pete/Piet Delaney <pete at bluelane.com> writes:
> > 
> > > Jason, Eric:
> > >
> > > Did you read Keith Owens suggestion on RAS tools from:
> 
> 
> Yes.  and I re-read it.
> 
> There are several things in Keith's email that make sense:
> 
> a.  all RAS tools should use a common interface
> b.  it's not the kernel's job to decide which RAS tool runs first
> 
> 
> Eric makes some good points too.  I'm mostly similar to Eric:
> paranoid about trusting software/hardware after a panic (or oops).
> 
> So if someone wants to use multiple RAS tools on a panic event,
> enabling an admin to set priorities is OK with me, but I'll only
> trust the first one that is used, and even that one may have
> problems.  IOW, I don't see a big need to support multiple RAS
> tools at one time.  (speaking for myself)
> 

I would be nice to have a kernel debugger co-exist with crash dumping.

I like Eric's idea of debugger putting a break point on panic(). This
would mean that rest of the post panic() actions have to be performed
by second kernel which can perform those actions much more reliably.

But this also brings in the additional requirement of passing all the
required context to second kernel. For example, in the past somebody wanted
to send a message to a remote node that sytem crashed so that standby can
take over.  If the same job has to be done in second kernel, it requires all
the relavant information like remote host IP, port etc passed to the second
kernel which I think makes the job little harder. May be one can pre-configure
these parameters in user space and let the job be done either from initrd
or user space scripts in second kernel.

Thanks
Vivek