Re: meta: should i chase this down?

NeilBrown <neilb@xxxxxxx> · Wed, 7 Dec 2011 11:47:26 +1100

On Tue, 06 Dec 2011 16:02:44 -0800 Keith Keller
<kkeller@xxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:

> Hi all,
> 
> A little while back, I had a strange issue, where reshaping a RAID6 to
> add a disk, then performing significant write activity (in this case, an
> rsnapshot), would cause a kernel crash.  I only attempted this twice,
> and neglected to write down the kernel oops errors, but I saw a few
> calls that seemed to imply that the md driver might be involved.  (Doing
> the same write activity during a rebuild is fine, which is another
> reason I suspected the reshape code in the md driver.  If it's of
> interest, I'm using kernel 2.6.39-4.el5.elrepo from ELRepo on a CentOS
> 5.7 box.)  It's certainly possible that I have a hardware issue, but not
> being able to reliably replicate the issue outside a reshape complicates
> debugging.
> 
> My question is, should I try to hunt down the actual source of this
> crash, and if so, what would be the best way to go about that?  I am
> decidedly not a kernel developer, and am not familiar with how to obtain
> debugging information in that environment.  I'm happy enough for this
> machine to suffer crashes, but I prefer not to work with the existing
> RAID6 if possible, and would want a more reliable way of collecting the
> kernel's debug output beyond writing it down on paper.  :)
> 

I'm always happy to receive detailed crash reports.  However, I cannot measure
how much your time is worth, nor can I guarantee that what you find won't
already have been fixed (though 2.6.39 is quite recent and I don't recall any
recent kernel-crash-during-reshape bugs, nor can I find any in a quick scan
through the logs).
So I cannot advise you on whether it is "worth the effort".  I would
appreciate it though.

The best way I have found to catch kernel messages is using netconsole.
See Documentation/networking/netconsole.txt

You need a wired network port and another machine on the same network that
can capture the messages.
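
As a rough sketch of the capturing side (my own illustration, not something
taken from netconsole.txt): netconsole sends each kernel message as a UDP
packet, so the other machine only needs to listen on the target port and
append whatever arrives to a file.  The code below assumes the default
target port of 6666; the function names open_listener and capture and the
log file name are made up for this example, and the crashing machine still
has to be pointed at the listener as described in the documentation.

```python
import socket


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket for incoming netconsole packets.

    Assumption: 6666 is the default netconsole target port; change it
    if the sending machine was configured with a different one.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def capture(sock, out, max_packets=None):
    """Copy each received datagram to `out`.

    The output is flushed after every packet so the last lines before
    a crash are already on disk if the capture is interrupted.
    Returns the number of packets written.
    """
    count = 0
    while max_packets is None or count < max_packets:
        data, _addr = sock.recvfrom(65535)
        # Kernel messages are plain text; replace any undecodable bytes.
        out.write(data.decode("utf-8", errors="replace"))
        out.flush()
        count += 1
    return count


# Example use (runs until interrupted with Ctrl-C):
#
#     with open("netconsole.log", "a") as log:   # hypothetical file name
#         capture(open_listener(), log)
```

Start the listener before beginning the reshape, and check that ordinary
kernel messages from the test machine show up in the file first, so you
know the path works before the crash happens.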

You will almost certainly need some spare disks to build a test RAID6 from.
You could try loop-back devices over files, but the timing is likely to be
very different and so the chance of reproducing the bug correspondingly small.

But if you do manage to get a crash message I would be very happy to
interpret it and work to fix the bug that causes it.

Thanks,
NeilBrown