Re: Re[2]: RAID1 submirror failure causes reboot?

Neil Brown <neilb@xxxxxxx> · Mon, 13 Nov 2006 18:17:57 +1100

On Friday November 10, klimov@xxxxxxxxxxx wrote:
> Hello Neil,
> 
> >> [87398.531579] blk: request botched
> NB>                  ^^^^^^^^^^^^^^^^^^^^
> 
> NB> That looks bad.  Possibly some bug in the IDE controller or elsewhere
> NB> in the block layer.  Jens: What might cause that?
> 
> NB> --snip--
> 
> NB> That doesn't look like raid was involved.  If it was you would expect
> NB> to see raid1_end_write_request or raid1_end_read_request in that
> NB> trace.
> So that might be the hard or soft part of IDE layer failing the
> system, or a PCI problem for example?

What I think is happening here (and Jens: if you could tell me how
impossible this is, that would be good) is this:

Some error handling somewhere in the low-level ide driver is getting
confused and somehow one of the sector counts in the 'struct request' is
getting set wrongly.  blk_recalc_rq_sectors notices this and says
"blk: request botched".  It tries to auto-correct by increasing
rq->nr_sectors to be consistent with the other counts.
I'm *guessing* this is the wrong thing to do, and that it has the
side-effect that bi_end_io is getting called on the bio twice.
The second time, the bio has already been freed and reused, so the wrong
bi_end_io is called and it does the wrong thing.
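
To make the guess concrete, here is a small user-space sketch of that
failure mode.  It is not kernel code: every toy_* name is made up for
illustration, and the one-slot pool only stands in for an allocator
that hands freed memory straight back out.  It assumes the lower-level
bio was owned by raid1 and that the same memory is then reused for a
bio whose completion handler is an mpage-style one.

```c
#include <assert.h>
#include <stddef.h>

struct toy_bio;
typedef void (*toy_end_io_fn)(struct toy_bio *);

/* Hypothetical stand-in for struct bio; end_io plays the role of bi_end_io. */
struct toy_bio {
    toy_end_io_fn end_io;
    int in_use;
};

/* One-slot pool, so a freed bio is reused by the very next allocation. */
struct toy_bio toy_pool_slot;

int toy_raid1_calls;  /* completions that reached the raid1-style handler */
int toy_mpage_calls;  /* completions that reached the mpage-style handler */

struct toy_bio *toy_bio_alloc(toy_end_io_fn fn)
{
    assert(!toy_pool_slot.in_use);
    toy_pool_slot.in_use = 1;
    toy_pool_slot.end_io = fn;
    return &toy_pool_slot;
}

void toy_bio_free(struct toy_bio *bio)
{
    bio->in_use = 0;
}

/* Each handler counts the call and releases the bio, as a final completion would. */
void toy_raid1_end_io(struct toy_bio *bio)
{
    toy_raid1_calls++;
    toy_bio_free(bio);
}

void toy_mpage_end_io(struct toy_bio *bio)
{
    toy_mpage_calls++;
    toy_bio_free(bio);
}

/* Stand-in for bio_endio: call whatever handler the bio carries right now. */
void toy_bio_endio(struct toy_bio *bio)
{
    bio->end_io(bio);
}

/* Complete the same bio twice, with a reuse of its memory in between. */
void toy_double_completion(void)
{
    struct toy_bio *lower;
    struct toy_bio *other;

    toy_raid1_calls = 0;
    toy_mpage_calls = 0;
    toy_pool_slot.in_use = 0;

    lower = toy_bio_alloc(toy_raid1_end_io);
    toy_bio_endio(lower);            /* first completion: raid1 handler, bio freed */

    other = toy_bio_alloc(toy_mpage_end_io);  /* same memory, new owner */
    (void)other;

    toy_bio_endio(lower);            /* second completion through the stale pointer */
}
```

After toy_double_completion() each handler has run exactly once: the
second completion never reaches the raid1-style handler, it lands in
whichever handler the reused memory now carries, and the new owner's
bio is completed and freed underneath it.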

This sounds a bit far-fetched, but it is the only explanation I can
come up with for the observed back trace which is:

[87403.706012]  [<c0103871>] error_code+0x39/0x40
[87403.710794]  [<c0180e0a>] mpage_end_io_read+0x5e/0x72
[87403.716154]  [<c0164af9>] bio_endio+0x56/0x7b
[87403.720798]  [<c0256778>] __end_that_request_first+0x1e0/0x301
[87403.726985]  [<c02568a4>] end_that_request_first+0xb/0xd
[87403.732699]  [<c02bd73c>] __ide_end_request+0x54/0xe1
[87403.738214]  [<c02bd807>] ide_end_request+0x3e/0x5c
[87403.743382]  [<c02c35df>] task_error+0x5b/0x97
[87403.748113]  [<c02c36fa>] task_in_intr+0x6e/0xa2
[87403.753120]  [<c02bf19e>] ide_intr+0xaf/0x12c
[87403.757815]  [<c013e5a7>] handle_IRQ_event+0x23/0x57
[87403.763135]  [<c013e66f>] __do_IRQ+0x94/0xfd
[87403.767802]  [<c0105192>] do_IRQ+0x32/0x68
[87403.772278]  [<c010372e>] common_interrupt+0x1a/0x20

i.e. bio_endio goes straight to mpage_end_io_read despite the fact that the
filesystem is mounted over md/raid1.

Is the kernel compiled with CONFIG_DEBUG_SLAB=y and
CONFIG_DEBUG_PAGEALLOC=y?
They might help trigger the error earlier and so make the problem more
obvious.
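
As a rough illustration of why poisoning freed memory surfaces this
kind of bug sooner, here is a user-space sketch.  The toy_* names are
hypothetical and the explicit check in toy_complete() is only for
demonstration: in the kernel there is no such check, the expectation
is simply that a call through a poisoned pointer faults at once
instead of quietly running some unrelated handler.  The 0x6b byte is
the pattern Linux slab debugging writes over freed objects.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_POISON_FREE 0x6b  /* byte pattern written over freed objects */

/* Hypothetical object carrying a completion callback. */
struct toy_obj {
    void (*end_io)(struct toy_obj *);
    unsigned int sectors;
};

int toy_done;  /* number of completions actually delivered */

void toy_handler(struct toy_obj *obj)
{
    (void)obj;
    toy_done++;
}

/* "Free" the object by overwriting every byte with the poison pattern. */
void toy_free_poisoned(struct toy_obj *obj)
{
    assert(obj != NULL);
    memset(obj, TOY_POISON_FREE, sizeof(*obj));
}

/* Return 1 if every byte of the object still holds the poison pattern. */
int toy_looks_freed(const struct toy_obj *obj)
{
    const unsigned char *p = (const unsigned char *)obj;
    size_t i;

    for (i = 0; i < sizeof(*obj); i++)
        if (p[i] != TOY_POISON_FREE)
            return 0;
    return 1;
}

/* Deliver a completion; return -1 instead of calling through a stale object. */
int toy_complete(struct toy_obj *obj)
{
    if (toy_looks_freed(obj))
        return -1;
    obj->end_io(obj);
    return 0;
}
```

The first toy_complete() on a live object runs the handler; a second
one after toy_free_poisoned() is caught at the point of misuse, before
the memory has had a chance to be reused for something else.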

NeilBrown