Re: Need help recovering RAID5 array

NeilBrown <neilb@xxxxxxx> · Tue, 9 Aug 2011 12:55:49 +1000

On Mon, 8 Aug 2011 22:29:10 -0400 Stephen Muskiewicz
<stephen_muskiewicz@xxxxxxx> wrote:
> 
> Well, it looks like the first try didn't work, but adding the --force 
> seems to have done the trick!  Here are the results:
> 

snip

> 
> So it looks like I'm in business again!  Many thanks!

Great!

> 
> This does lead to a question: Do you recommend (and is it safe on CentOS 
> 5.5?) for me to use the updated (3.2.2 with your patch) version of mdadm 
> going forward in place of the CentOS version (2.6.9)?

I wouldn't keep that patch.  It was a little hack to get your array working
again.  I wouldn't recommend using it without expert advice...

Other than that ... 3.2.2 certainly fixes bugs and adds features over 2.6.9,
but maybe it adds some bugs too...  I would say that it is safe, but probably
not really necessary.
i.e. up to you :-)

> 
> > I wonder how the event count got that high.  There aren't enough seconds
> > since the birth of the universe for it to have happened naturally...
> >
> Any chance it might be related to these kernel messages? I just noticed 
> (guess I should be paying more attention to my logs) that there are tons 
> of these messages repeated in my /var/log/messages file.  However, as far 
> as the RAID arrays themselves go, we haven't seen any problems while they 
> are running, so I'm not sure what's causing these or whether they are 
> insignificant.  Again, speculation on my part, but given the huge event 
> count from mdadm and the number of these messages, it seems they might 
> be somehow related....
> 
> Jul 31 04:02:13 libthumper1 kernel: program diskmond is using a deprecated SCSI ioctl, please convert it to SG_IO
> Jul 31 04:02:26 libthumper1 last message repeated 47 times
> Jul 31 04:12:11 libthumper1 kernel: md: bug in file drivers/md/md.c, line 1659

I need to know the exact kernel version to find out what this line is.... I
could guess but I would probably be wrong.

> Jul 31 04:12:11 libthumper1 kernel:
> Jul 31 04:12:11 libthumper1 kernel: md: **********************************
> Jul 31 04:12:11 libthumper1 kernel: md: * <COMPLETE RAID STATE PRINTOUT> *
> Jul 31 04:12:11 libthumper1 kernel: md: **********************************
> Jul 31 04:12:11 libthumper1 kernel: md53: <sdk1><sdai1><sds1><sdam1><sdo1><sdau1><sdaq1><sdw1><sdaa1><sdae1>
> Jul 31 04:12:11 libthumper1 kernel: md: rdev sdk1, SZ:488383744 F:0 S:1 DN:10
> Jul 31 04:12:11 libthumper1 kernel: md: rdev superblock:
> Jul 31 04:12:11 libthumper1 kernel: md:  SB: (V:1.0.0) ID:<be475f67.00000000.00000000.00000000> CT:81f4e22f
> Jul 31 04:12:11 libthumper1 kernel: md:     L-2009873429 S1801675106 ND:1834971253 RD:1869771369 md114 LO:65536 CS:196610
> Jul 31 04:12:11 libthumper1 kernel: md:     UT:00000000 ST:0 AD:976767728 WD:0 FD:976767984 SD:0 CSUM:00000000 E:00000000
> Jul 31 04:12:11 libthumper1 kernel:      D  0:  DISK<N:-1,(-1,-1),R:-1,S:-1>
> Jul 31 04:12:11 libthumper1 kernel:      D  1:  DISK<N:-1,(-1,-1),R:-1,S:-1>
> Jul 31 04:12:11 libthumper1 kernel:      D  2:  DISK<N:-1,(-1,-1),R:-1,S:-1>
> Jul 31 04:12:11 libthumper1 kernel:      D  3:  DISK<N:-1,(-1,-1),R:-1,S:-1>
> Jul 31 04:12:11 libthumper1 kernel: md:     THIS:  DISK<N:0,(0,0),R:0,S:0>
> Jul 31 04:12:11 libthumper1 kernel: md: rdev superblock:
> Jul 31 04:12:11 libthumper1 kernel: md:  SB: (V:1.0.0) ID:<be475f67.00000000.00000000.00000000> CT:81f4e22f
> Jul 31 04:12:11 libthumper1 kernel: md:     L-2009873429 S1801675106 ND:1834971253 RD:1869771369 md114 LO:65536 CS:196610
> Jul 31 04:12:11 libthumper1 kernel: md:     UT:00000000 ST:0 AD:976767728 WD:0 FD:976767984 SD:0 CSUM:00000000 E:00000000
> 
> <snip...and on and on>

Did it really start repeating at this point?  I would have expected a bit
more first.

So if you get me the kernel version and confirm that this really is all that
is in the logs apart from identical repeats, I'll see if I can figure out what
might have caused it - and then whether it could be related to your original
problem.
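
One way to gather both pieces of information is a small script along these
lines.  This is only a sketch, not part of mdadm or the kernel: the function
names are made up for illustration, and it assumes the usual syslog layout
seen in the quoted log ("Mon DD HH:MM:SS host kernel: message").  It strips
the timestamp and host so that repeated printouts compare equal, then counts
each distinct message in first-seen order.

```python
import platform
import re

# Assumed syslog prefix, e.g. "Jul 31 04:12:11 libthumper1 kernel: ".
# The "kernel:" tag is optional so "last message repeated" lines also match.
PREFIX = re.compile(r"^\w{3}\s+\d+\s+\d\d:\d\d:\d\d\s+\S+\s+(?:kernel:\s?)?")


def strip_prefix(line):
    """Drop the timestamp/host/tag prefix so repeats compare equal."""
    return PREFIX.sub("", line.rstrip("\n"), count=1)


def distinct_messages(lines):
    """Map each distinct message to its count, in first-seen order."""
    counts = {}
    for line in lines:
        msg = strip_prefix(line)
        counts[msg] = counts.get(msg, 0) + 1
    return counts


def kernel_release():
    """Kernel release string of the running system (same as `uname -r`)."""
    return platform.release()


def report(path):
    """Build a text report: kernel release, then each distinct message."""
    with open(path, errors="replace") as f:
        counts = distinct_messages(f)
    out = ["kernel: " + kernel_release()]
    out += ["%6d  %s" % (n, msg) for msg, n in counts.items()]
    return "\n".join(out)
```

Calling report("/var/log/messages") on the affected host (reading that file
usually needs root) would give the kernel release plus every distinct line
exactly once, which makes it easy to see whether anything other than
identical repeats is present.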

> 
> Of course given how old the CentOS mdadm is, maybe by updating it I'll 
> be fixing this problem as well?

In general, running newer code should be safer and easier to support.  I don't
know yet whether it would fix this problem, though.

NeilBrown

> If not, I'd be willing to help delve deeper if it's something worth 
> investigating.
> 
> Again, Thanks a ton for all your help and quick replies!
> 
> Cheers!
> -steve
> 
> > Thanks,
> > NeilBrown
> >
> > diff --git a/super1.c b/super1.c
> > index 35e92a3..4a3341a 100644
> > --- a/super1.c
> > +++ b/super1.c
> > @@ -803,6 +803,8 @@ static int update_super1(struct supertype *st, struct mdinfo *info,
> >   		       __le64_to_cpu(sb->data_size));
> >   	} else if (strcmp(update, "_reshape_progress")==0)
> >   		sb->reshape_position = __cpu_to_le64(info->reshape_progress);
> > +	else if (strcmp(update, "summaries") == 0)
> > +		sb->events = __cpu_to_le64(4);
> >   	else
> >   		rv = -1;
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
