Re: linux-image-2.6.32-5-686: kernel BUG at ... build/source_i386_none/drivers/md/raid5.c:2764!

Jose Manuel dos Santos Calhariz <jose.spam@xxxxxxxxxxx> · Mon, 25 Jun 2012 11:59:04 +0100

On Mon, Jun 25, 2012 at 04:42:30PM +1000, NeilBrown wrote:
> On Mon, 25 Jun 2012 11:58:33 +0900 Christian Balzer <chibi@xxxxxxx> wrote:
> 
> > On Mon, 25 Jun 2012 12:39:06 +1000 NeilBrown wrote:
> > 
> > > On Sun, 24 Jun 2012 18:02:34 +0100 Jose Manuel dos Santos Calhariz
> > > <jose.spam@xxxxxxxxxxx> wrote:
> > > 
> > > > On Sun, Jun 24, 2012 at 06:21:46PM +1000, NeilBrown wrote:
> > > > > On Fri, 22 Jun 2012 13:19:53 +0100 Jose Manuel dos Santos Calhariz
> > > > > <jose.spam@xxxxxxxxxxx> wrote:
> > > > > 
> > > > > > 
> > > > > > In another day during the periodic mdadm RAID check: 
> > > > > >  - the linux kernel gave a kernel BUG, 
> > > > > >  - tried to kick out a failed disk and 
> > > > > >  - stopped accepting I/O to the affected raid.  
> > > > > > 
> > > > > > The affected programs were in state D.  The only way to recover
> > > > > > was to do a reboot.  After reboot the problematic disk was
> > > > > > replaced.
> > > > > > 
> > > > > > > I reported the bug to Debian, and all the information about
> > > > > > > it is there:
> > > > > > 
> > > > > > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=675969
> > > > > > 
> > > > > > I was asked to report the BUG here in case someone knows what
> > > > > > happened.
> > > > > > 
> > > > > > Here is a summary of the more relevant information:
> > > > > > 
> > > > > > > This machine has 2 x RAID6 with 6 disks each, for a total of 12
> > > > > > disks. 
> > > > > > 
> > > > > > I have 5 systems with a similar setup and only one failed, maybe
> > > > > > because of the failing disk.  I will use one of the systems to try
> > > > > > > to reproduce the bug, before trying a new kernel.
> > > > > > 
> > > > > > 
> > > > > > The proprietary module is the openafs filesystem v1.6.1 backported
> > > > > > from Debian testing.
> > > > > > 
> > > > > > The kernel bug is:
> > > > > > 
> > > > > > 
> > > > > > build/source_i386_none/drivers/md/raid5.c:2764!
> > > > 
> > > > > 
> > > > > This bug was fixed in 2.6.32.49 and 3.2
> > > > > 
> > > > > http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=commitdiff;h=61d433c479a6ccfed6a7e73e6111ca8fa0348c63
> > > > > 
> > > > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=9a3f530f39f4490eaa18b02719fb74ce5f4d2d86
> > > > > 
> > > > > NeilBrown
> > > > 
> > > > The failing kernel already had that fix.  The machine was running
> > > > the kernel Debian 2.6.32-41squeeze2.  Looking into the changelog,
> > > > this kernel has all the fixes up to 2.6.32.51 plus other fixes.
> > > > 
> > > >      Jose Calhariz
> > > > 
> > > 
> > > The oops report said:
> > > 
> > > (2.6.32-5-686 #1)
> > > 
> > > is "5" the same as "41squeeze2" ???  This is a genuine question - I have
> > > little idea about Debian versioning so maybe these are the same thing
> > > somehow.  But they look different.
> > > 
> > Yes, the "name" of the kernel and its actual detailed version are disjunct
> > like that in Debian; the current kernel of that vintage is:
> > ---
> > Package: linux-image-2.6.32-5-amd64
> > Source: linux-2.6
> > Version: 2.6.32-44
> > ---
> 
> Ok.
> So the version number reported by "uname -a" doesn't change when you upgrade
> a Debian kernel?  That's rather sad.
> It means that one has to take the reporter's word for which kernel was running
> rather than looking in the oops message for where the kernel tells me
> what version it was.
> 
> Given the report, it is entirely possible that an older kernel was running
> while a newer kernel was installed.
> 
> Jose: how certain are you that the kernel that was running at the time was
> exactly the kernel that was installed at the time.  i.e. you had not
> performed a software update since the last reboot?

Whenever I reboot a server I run a script to collect information about
it: Kernel boot messages, kernel version, kernel modules, md raid
information, etc.

So I have the kernel boot messages for the precise boot that gave the
BUG.  From that boot log:

[    0.000000] Linux version 2.6.32-5-686 (Debian 2.6.32-41squeeze2)
(dannf@xxxxxxxxxx) (gcc version 4.3.5 (Debian 4.3.5-4) ) #1 SMP Mon
Mar 26 05:20:33 UTC 2012
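
Since "uname -a" shows only the ABI name (2.6.32-5-686), the package
version has to be read from the banner itself.  A minimal sketch of
doing that automatically, assuming the banner layout shown above; the
function name and the regular expression are illustrative, not part of
any existing tool, and other Debian releases may format the banner
differently:

```python
import re

# Assumption: the banner looks like the boot log line above, i.e.
# "Linux version <abi-name> (Debian <package-version>) ...".
BANNER_RE = re.compile(r"Linux version (?P<abi>\S+) \(Debian (?P<pkg>[^)]+)\)")


def parse_debian_banner(banner):
    """Return (abi_name, package_version), or None if the layout differs."""
    match = BANNER_RE.search(banner)
    if match is None:
        return None
    return match.group("abi"), match.group("pkg")
```

Feeding it the first line of the boot log, or the contents of
/proc/version on a running system, should give the ABI name and the
package version as a pair.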

The version of the running kernel is 2.6.32-41squeeze2.  In the
changelog of the Debian package, for version 2.6.32-41: 

   * Add longterm release 2.6.32.54

The complete changelog, in case someone wants to look into it:
http://packages.debian.org/changelogs/pool/main/l/linux-2.6/linux-2.6_2.6.32-45/changelog

In the previous Debian version, 2.6.32-40, there is this entry in the
changelog:

   * Add longterm release 2.6.32.49, including:
     - SCSI: st: fix race in st_scsi_execute_end
     - NFS/sunrpc: don't use a credential with extra groups.
     - netlink: validate NLA_MSECS length
     - hfs: add sanity check for file name length (CVE-2011-4330)
     - md/raid5: abort any pending parity operations when array fails.
     - mm: avoid null pointer access in vm_struct via /proc/vmallocinfo
     - ipv6: udp: fix the wrong headroom check (CVE-2011-4326)
     - USB: Fix Corruption issue in USB ftdi driver ftdi_sio.c
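
The reasoning above is just a version comparison: the fix went into
longterm release 2.6.32.49, and the running package merged 2.6.32.54.
A small sketch of that check, assuming longterm releases in one series
are cumulative and that Debian did not drop the patch (both are
assumptions; the helper names are made up for illustration):

```python
def stable_version(text):
    """Turn an upstream version such as '2.6.32.49' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def includes_release(merged, fix_release):
    """True if longterm release `merged` is at or after `fix_release`."""
    merged_v = stable_version(merged)
    fix_v = stable_version(fix_release)
    # The comparison is only meaningful inside one longterm series,
    # i.e. when the first three components (2.6.32) are the same.
    if merged_v[:3] != fix_v[:3]:
        raise ValueError("versions belong to different stable series")
    return merged_v >= fix_v
```

With the numbers from the two changelog entries, the check is
includes_release("2.6.32.54", "2.6.32.49").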

The complete boot log is at:

http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=15;filename=kernel-boot;att=1;bug=675969

> 
> However even if you can confirm that a new kernel was running I doubt I could
> find an answer.  There isn't really much info to go on.  So unless you can
> reproduce the problem, I doubt I'll even start looking.

I have plenty of information about the system that gave the BUG, but no
way to sort out what is relevant and what is not.  Is
there anything more you would like to know?

I understand if you can't help me.  I have 5 similar servers that have
been running 2.6.32.x for 3 months, but I have seen only 1 BUG.  I have one
server where I am trying to reproduce the BUG, to no avail.

 - Doing a re-sync of the RAID when there is an "error read corrected"
   doesn't trigger the BUG.

 - Hot-unplugging a disk doesn't trigger the BUG.
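
For anyone who wants to repeat the test, a check pass like the periodic
one can be requested by hand through the md sysfs file sync_action.  A
rough sketch, assuming the array is md0 and the script runs as root;
the function name and the sysfs_root parameter are only there for
illustration and testing:

```python
import os


def request_md_action(md_name, action="check", sysfs_root="/sys/block"):
    """Write an action such as 'check' to <sysfs_root>/<md_name>/md/sync_action."""
    # Only a few of the values the md driver accepts are allowed here.
    if action not in ("check", "repair", "idle"):
        raise ValueError("unsupported action: %r" % (action,))
    path = os.path.join(sysfs_root, md_name, "md", "sync_action")
    with open(path, "w") as handle:
        handle.write(action + "\n")
    return path
```

Calling request_md_action("md0") asks the kernel to start a check on
md0, and request_md_action("md0", "idle") asks it to stop.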

My guess is that this bug is related to bad disks and the error messages
that the disks sometimes give to the kernel.  But it is more difficult to
find disks that give these error messages in a reproducible way than
to find disks with bad sectors for the test server.

> 
> NeilBrown

-- 
--

Ambition: an overmastering desire to be vilified by enemies while living and made ridiculous by friends when dead

--Ambrose Bierce