Re: 2.6.17-mm5

Andrew Morton <akpm@xxxxxxxx> · Sat, 1 Jul 2006 14:30:47 -0700

On Sat, 1 Jul 2006 15:24:19 +0100
Grant Wilson <grant.wilson@xxxxxxxxx> wrote:

> More RAID1 problems - OOPS on shutdown.

Thanks.  Please copy the mailing lists on these reports - I'm not an MD,
SCSI or SATA developer, and this is in their area.

> [   37.482699] md: Autodetecting RAID arrays.
> [   37.547908] md: autorun ...
> [   37.566449] md: considering sdb2 ...
> [   37.589664] md:  adding sdb2 ...
> [   37.610757] md:  adding sda2 ...
> [   37.632116] md: created md1
> [   37.650587] md: bind<sda2>
> [   37.668571] md: bind<sdb2>
> [   37.686541] md: running: <sdb2><sda2>
> [   37.710807] raid1: raid set md1 active with 2 out of 2 mirrors
> [   37.747557] md: ... autorun DONE.
> [   37.784444] EXT3-fs: INFO: recovery required on readonly filesystem.
> [   37.824275] EXT3-fs: write access will be enabled during recovery.
> [   38.814113] kjournald starting.  Commit interval 5 seconds
> [   38.848761] EXT3-fs: sdc1: orphan cleanup on readonly fs
> [   38.985436] EXT3-fs: sdc1: 7 orphan inodes deleted
> [   39.015845] EXT3-fs: recovery complete.
> [   39.072168] EXT3-fs: mounted filesystem with ordered data mode.
> [   44.693986] Adding 995988k swap on /dev/sda1.  Priority:-1 extents:1 across:995988k
> [   44.744558] Adding 995988k swap on /dev/sdb1.  Priority:-2 extents:1 across:995988k
> [   44.966034] EXT3 FS on sdc1, internal journal
> [   49.305350] device-mapper: ioctl: 4.8.0-ioctl (2006-06-24) initialised: dm-devel@xxxxxxxxxx
> [   64.091331] raid1: Disk failure on sdb2, disabling device. 
> [   64.091333] 	Operation continuing on 1 devices
> [   64.212624] RAID1 conf printout:
> [   64.233951]  --- wd:1 rd:2
> [   64.252195]  disk 0, wo:0, o:1, dev:sda2
> [   64.277712]  disk 1, wo:1, o:0, dev:sdb2
> [   64.305627] RAID1 conf printout:
> [   64.326977]  --- wd:1 rd:2
> [   64.345220]  disk 0, wo:0, o:1, dev:sda2
> [

Which device drivers are being used for these disks?

> [  155.123022] Unable to handle kernel NULL pointer dereference at 0000000000000048 RIP: 
> [  155.155867]  [<ffffffff8047157a>] md_error+0x45/0x91
> [  155.200353] PGD 77954067 PUD 726e5067 PMD 0 
> [  155.226233] Oops: 0000 [1] PREEMPT SMP 
> [  155.249516] last sysfs file: /devices/system/cpu/cpu0/cpufreq/scaling_setspeed
> [  155.292808] CPU 0 
> [  155.304968] Modules linked in: dm_mod evdev
> [  155.330331] Pid: 0, comm: swapper Not tainted 2.6.17-mm5 #1
> [  155.363697] RIP: 0010:[<ffffffff8047157a>]  [<ffffffff8047157a>] md_error+0x45/0x91
> [  155.409638] RSP: 0018:ffffffff807a0c50  EFLAGS: 00010046
> [  155.441445] RAX: 0000000000000000 RBX: ffff81007aa34708 RCX: 000000000000003f
> [  155.484216] RDX: 00000000fffffffb RSI: ffff81007a821d28 RDI: ffff81007aa34708
> [  155.526989] RBP: ffffffff807a0c60 R08: 0000000000000000 R09: ffff81007aac43b0
> [  155.569759] R10: ffffffff804221e5 R11: 0000000000000058 R12: ffff81007aac4ab0
> [  155.612533] R13: ffff81007aac43b0 R14: ffff81007aac4ab0 R15: 00000000fffffffb
> [  155.655303] FS:  00002aeb361606d0(0000) GS:ffffffff80a46000(0000) knlGS:0000000000000000
> [  155.703791] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> [  155.738195] CR2: 0000000000000048 CR3: 0000000070997000 CR4: 00000000000006e0
> [  155.780969] Process swapper (pid: 0, threadinfo ffffffff80a64000, task ffffffff80696a00)
> [  155.829404] Stack:  ffff81007a821d28 ffff81007aa34708 ffffffff807a0c80 ffffffff804728d9
> [  155.877840]  ffff81007a821d28 ffff81007aa34708 ffffffff807a0cc0 ffffffff8047409c
> [  155.922535]  00001000807a0d00 ffff81007aac4ab0 00000000fffffffb ffff81007aac4ab0
> [  155.966085] Call Trace:
> [  155.982416]  [<ffffffff804728d9>] super_written+0x30/0x65
> [  156.015292]  [<ffffffff8047409c>] super_written_barrier+0xc4/0xd1
> [  156.052297]  [<ffffffff8023a5a5>] bio_endio+0x56/0x5b
> [  156.082688]  [<ffffffff8022d21b>] __end_that_request_first+0x1c9/0x4c9
> [  156.122068]  [<ffffffff8024a0d6>] end_that_request_first+0xc/0xe
> [  156.158343]  [<ffffffff8036a692>] blk_ordered_complete_seq+0x7c/0x8b
> [  156.196705]  [<ffffffff8036a6d1>] post_flush_end_io+0x30/0x35
> [  156.231419]  [<ffffffff8036a5b5>] end_that_request_last+0xd9/0xf6
> [  156.268215]  [<ffffffff80422204>] scsi_end_request+0xad/0xd7
> [  156.302573]  [<ffffffff80422637>] scsi_io_completion+0x3e1/0x3f0
> [  156.339004]  [<ffffffff8042266c>] scsi_blk_pc_done+0x26/0x28
> [  156.373357]  [<ffffffff8041d11e>] scsi_finish_command+0xa9/0xb2
> [  156.409264]  [<ffffffff804229f9>] scsi_softirq_done+0xf4/0xfd
> [  156.444143]  [<ffffffff80237f66>] blk_done_softirq+0x70/0x7f
> [  156.478323]  [<ffffffff80211366>] __do_softirq+0x67/0xf4
> [  156.510224]  [<ffffffff8025f95e>] call_softirq+0x1e/0x28
> [  156.542083] 
> [  156.542083] Code: 48 8b 40 48 48 85 c0 74 3f ff d0 f0 0f ba ab e0 01 00 00 03 

The barrier code is in there again.

mddev->pers is NULL in md_error(), so the test of
!mddev->pers->error_handler oopsed: RAX is 0, CR2 is 0x48, and the first
Code: bytes (48 8b 40 48) decode to mov 0x48(%rax),%rax.  Perhaps this is a
real MD bug which is now being exposed by the new barrier-handling problem.


This should get you further, but...

From: Andrew Morton <akpm@xxxxxxxx>

Cc: Neil Brown <neilb@xxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxx>
---

 drivers/md/md.c |    2 ++
 1 file changed, 2 insertions(+)

diff -puN drivers/md/md.c~md-oops-workaround drivers/md/md.c

--- a/drivers/md/md.c~md-oops-workaround
+++ a/drivers/md/md.c
@@ -4586,6 +4586,8 @@ void md_error(mddev_t *mddev, mdk_rdev_t
 		__builtin_return_address(0),__builtin_return_address(1),
 		__builtin_return_address(2),__builtin_return_address(3));
 */
+	if (!mddev->pers)
+		return;
 	if (!mddev->pers->error_handler)
 		return;
 	mddev->pers->error_handler(mddev,rdev);
_
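
For anyone who wants to poke at the guard outside the kernel, here is a
minimal userspace sketch of the same check order.  The struct and function
names (struct mddev, struct personality, md_error_sketch, mark_faulty) are
hypothetical stand-ins, not the real mddev_t/mdk_rdev_t definitions, and the
return value exists only so the behaviour can be observed; the real
md_error() returns void.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; layouts are NOT the real ones. */
struct rdev { int faulty; };
struct mddev;
struct personality {
	void (*error_handler)(struct mddev *mddev, struct rdev *rdev);
};
struct mddev { struct personality *pers; };

/* Counts how often a handler actually ran (illustration only). */
static int handler_calls;

/* Example handler: mark the device faulty, as a RAID personality might. */
static void mark_faulty(struct mddev *mddev, struct rdev *rdev)
{
	(void)mddev;
	rdev->faulty = 1;
	handler_calls++;
}

/*
 * Same check order as the workaround: test pers before dereferencing it,
 * then test the handler pointer.  Returns 1 if a handler ran, 0 if the
 * error was silently dropped.
 */
static int md_error_sketch(struct mddev *mddev, struct rdev *rdev)
{
	if (!mddev->pers)
		return 0;	/* no personality attached (array stopped) */
	if (!mddev->pers->error_handler)
		return 0;	/* personality has no error handler */
	mddev->pers->error_handler(mddev, rdev);
	return 1;
}
```

Note that this only stops the oops.  An error reported against an array with
no personality is dropped on the floor, which is why it is a workaround and
not a fix.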
