Re: Reproducable OOPS with MD RAID-5 on 2.6.0-test11

Nathan Scott <nathans@sgi.com> · Wed, 3 Dec 2003 14:32:29 +1100

On Tue, Dec 02, 2003 at 09:10:02PM +1100, Nathan Scott wrote:
> On Tue, Dec 02, 2003 at 09:27:13AM +0100, Jens Axboe wrote:
> > On Mon, Dec 01 2003, Kevin P. Fleming wrote:
> > > 
> > > without device-mapper in place, though, and I could not reproduce the 
> > > problem! I copied > 500MB of stuff to the XFS filesystem created using 
> > > the entire /dev/md/0 device without a single unusual message. I then 
> > > unmounted the filesystem and used pvcreate/vgcreate/lvcreate to make a 
> > > 3G volume on the array, made an XFS filesystem on it, mounted it, and 
> > > tried copying data over. The oops message came back.
> > 
> > Smells like a bio stacking problem in raid/dm then. I'll take a quick
> > look and see if anything obvious pops up, otherwise the maintainers of
> > those areas should take a closer look.
> 
> One thing that might be of interest - XFS does tend to pass
> variable size requests down to the block layer, and this has
> tripped up md and other drivers in 2.4 in the distant past.
> 
> Log IO is typically 512 byte aligned (as opposed to block or
> page size aligned), as are IOs into several of XFS' metadata
> structures.

The XFS tests just tripped up a panic in raid5 in -test11 -- a kdb
stacktrace follows.  Seems to be reproducible, but not always the
same test that causes it.  And I haven't seen a double bio_put yet,
this first problem keeps getting in the way I guess.

Looks like its in a raid5 kernel thread, doing asynchronous stuff?,
so I don't really have any extra hints about what XFS was doing at
the time for y'all either.

cheers.

--
Nathan

XFS mounting filesystem md0
Unable to handle kernel paging request at virtual address d1c92c00
 printing eip:
c0387be6
*pde = 00048067
*pte = 11c92000
Oops: 0000 [#1]
CPU:    3
EIP:    0060:[<c0387be6>]    Not tainted
EFLAGS: 00010086
EIP is at handle_stripe+0xda6/0xef0
eax: f315df94   ebx: 00000000   ecx: 00000000   edx: f6d25ef8
esi: d1c92bfc   edi: d1c92bfc   ebp: f36d3f88   esp: f36d3ef8
ds: 007b   es: 007b   ss: 0068
Process md0_raid5 (pid: 1435, threadinfo=f36d2000 task=f684a9d0)
Stack: f6d25ef8 f2f84ebc f302e000 00000020 f2f84fc0 f7127000 f712760c f36d3f30
       f315de3c f7101ef8 00000000 00000000 f36d3f3c f315df68 c04fde00 f7f9a9d0
       f684a9d0 df449de8 00000000 f315df94 00000000 00000000 00000001 00000000
Call Trace:
 [<c0388173>] raid5d+0x73/0x120
 [<c039048c>] md_thread+0xbc/0x180
 [<c0118ef0>] default_wake_function+0x0/0x30
 [<c03903d0>] md_thread+0x0/0x180
 [<c010750d>] kernel_thread_helper+0x5/0x18

Code: 8b 56 04 8b 48 58 8b 58 5c 8b 06 83 c1 08 83 d3 00 39 da 72

Entering kdb (current=0xf684a9d0, pid 1435) on processor 3 Oops: Oops
due to oops @ 0xc0387be6
eax = 0xf315df94 ebx = 0x00000000 ecx = 0x00000000 edx = 0xf6d25ef8
esi = 0xd1c92bfc edi = 0xd1c92bfc esp = 0xf36d3ef8 eip = 0xc0387be6
ebp = 0xf36d3f88 xss = 0xc0390068 xcs = 0x00000060 eflags = 0x00010086
xds = 0xf6d2007b xes = 0x0000007b origeax = 0xffffffff &regs = 0xf36d3ec4
[3]kdb> bt
Stack traceback for pid 1435
0xf684a9d0     1435        1  1    3   R  0xf684ad00 *md0_raid5
EBP        EIP        Function (args)
0xf36d3f88 0xc0387be6 handle_stripe+0xda6 (0xf315dea0, 0x292, 0xf36d2000, 0xf5e90578, 0xf5e90580)
                               kernel <NULL> 0x0 0xc0386e40 0xc0387d30
0xf36d3fa4 0xc0388173 raid5d+0x73 (0xf6d25ef8, 0x0, 0xf36d2000, 0xf36d2000, 0xf36d2000)
                               kernel <NULL> 0x0 0xc0388100 0xc0388220
0xf36d3fec 0xc039048c md_thread+0xbc
                               kernel <NULL> 0x0 0xc03903d0 0xc0390550
           0xc010750d kernel_thread_helper+0x5
                               kernel <NULL> 0x0 0xc0107508 0xc0107520
[3]kdb>

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html