Re: out of sync raid 5 + xfs = kernel startup problem

On Tuesday April 12, robey@xxxxxxxxxxxxxxxxxxx wrote:
> My raid5 system recently went through a sequence of power outages.  When 
> everything came back on the drives were out of sync.  No big issue... 
> just sync them back up again.  But something is going wrong.  Any help 
> is appreciated.  dmesg provides the following (the network stuff is 
> mixed in):
> 
..
> md: raidstart(pid 220) used deprecated START_ARRAY ioctl. This will not 
> be supported beyond 2.6

First hint: don't use 'raidstart'.  It works OK when everything is
healthy, but when things aren't working, raidstart makes matters worse.
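
For normal startup, something like this does the same job without the
risk (assuming the array is /dev/md0 assembled from /dev/sd[a-f]2, as
your log suggests):
  mdadm --assemble /dev/md0 /dev/sd[a-f]2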

> md: could not bd_claim sdf2.

That's odd... Maybe it is trying to 'claim' it twice, because it
certainly seems to have got it below..
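
If it happens again, it is worth checking whether a half-started array
is already holding sdf2, and stopping it before trying again,
something like (assuming that array shows up as md0):
  cat /proc/mdstat
  mdadm --stop /dev/md0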

> md: autorun ...
> md: considering sdd2 ...
> md:  adding sdd2 ...
> md:  adding sde2 ...
> md:  adding sdf2 ...
> md:  adding sdc2 ...
> md:  adding sdb2 ...
> md:  adding sda2 ...
> md: created md0
> md: bind<sda2>
> md: bind<sdb2>
> md: bind<sdc2>
> md: bind<sdf2>
> md: bind<sde2>
> md: bind<sdd2>
> md: running: <sdd2><sde2><sdf2><sdc2><sdb2><sda2>
> md: kicking non-fresh sdd2 from array!

So sdd2 is not fresh.  Must have been missing at one stage, so it
probably has old data.
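
You can confirm that by comparing the event counters in the
superblocks, e.g. (device names from your log):
  mdadm --examine /dev/sdd2 | grep Events
  mdadm --examine /dev/sda2 | grep Events
sdd2 should show an older (smaller) count than the others.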


> md: unbind<sdd2>
> md: export_rdev(sdd2)
> md: md0: raid array is not clean -- starting background reconstruction
> raid5: device sde2 operational as raid disk 4
> raid5: device sdf2 operational as raid disk 3
> raid5: device sdc2 operational as raid disk 2
> raid5: device sdb2 operational as raid disk 1
> raid5: device sda2 operational as raid disk 0
> raid5: cannot start dirty degraded array for md0


Here's the main problem.

You've got a degraded, unclean array.  That is, one drive is
failed/missing, and md isn't confident that all the parity blocks are
correct, because of an unclean shutdown (it could have happened in the
middle of a write).
This means you could have undetectable data corruption.

md wants you to know this and not assume that everything is perfectly
OK.

You can still start the array, but you will need to use
  mdadm --assemble --force
which means getting the machine booted some other way first ... got a
boot CD?

I should add a "raid=force-start" or similar boot option, but I
haven't yet.

So, boot somehow, and run
  mdadm --assemble /dev/md0 --force /dev/sd[a-f]2

then add the kicked-out drive back so it gets rebuilt:
  mdadm /dev/md0 -a /dev/sdd2

and wait for the resync to complete (not absolutely needed).
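
If you want to watch the resync, something like
  cat /proc/mdstat
or
  mdadm --detail /dev/md0
will show its progress.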

Reboot.

> XFS: SB read failed
> Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
> <ffffffff802c4d5d>{raid5_unplug_device+13}

Hmm.. This is a bit of a worry.  I should be doing
	mddev->queue->unplug_fn = raid5_unplug_device;
	mddev->queue->issue_flush_fn = raid5_issue_flush;
a bit later in run() in drivers/md/raid5.c, after the last 'goto
abort'... I'll have to think it through a bit to be sure.

NeilBrown
