Re: MD bug or me being stupid?

Neil Brown <neilb@xxxxxxxxxxxxxxx> · Fri, 13 May 2005 12:55:29 +1000

On Thursday May 12, molle.bestefich@xxxxxxxxx wrote:
> On 4/22/05, Molle Bestefich wrote:
> > Just upgraded a MD RAID 5 box to 2.6.11 from 2.4.something.
> > 
> > Found out one disk was failing completely, got a replacement from Maxtor.  Neat.
> > Replaced disk, rebooted..
> > Added the new disk to the array with 'raidhotadd'.
> > MD started syncing.
> > 
> > A couple of minutes into the process, it started *seriously* spamming
> > the console with messages:
> > 
> > ==========================
> > Apr 22 01:47:00 linux kernel: ..<6>md: syncing RAID array md1
> > Apr 22 01:47:00 linux kernel: md: minimum _guaranteed_ reconstruction
> > speed: 1000 KB/sec/disc.
> > Apr 22 01:47:00 linux kernel: md: using maximum available idle IO bandwith (but
> > not more than 200000 KB/sec) for reconstruction.
> > Apr 22 01:47:00 linux kernel: md: using 128k window, over a total of
> > 199141632 blocks.
> > Apr 22 01:47:00 linux kernel: md: md1: sync done.
> > Apr 22 01:47:00 linux kernel: ..<6>md: syncing RAID array md1
> > Apr 22 01:47:01 linux kernel: md: minimum _guaranteed_ reconstruction
> > speed: 1000 KB/sec/disc.
> > Apr 22 01:47:01 linux kernel: md: using maximum available idle IO bandwith (but
> > not more than 200000 KB/sec) for reconstruction.
> > Apr 22 01:47:01 linux kernel: md: using 128k window, over a total of
> > 199141632 blocks.
> > Apr 22 01:47:01 linux kernel: md: md1: sync done.
> > ==========================
> 
> [snip]
> 
> > afterwards, I can see that the above messages repeat themselves.
> > cat /var/log/messages | grep md | grep 'Apr 22 01:47:01' | grep 'sync done'
> > tells me that the messages were repeated 12 times per second.  The
> 
> Ping!...
> Neil, just wondering, any comments regarding this particular endless loop in MD?
> (Anything I can test or some such?)

Thanks for the ping, things sometimes get lost in the noise....

This sounds a bit like the problem that is addressed by 
  md-make-raid5-and-raid6-robust-against-failure-during-recovery.patch 
in the current -mm patches (look in the brokenout directory).

This would only happen if more than one device in the array has
failed.  So maybe another device failed while the rebuild was running
(which seems to happen more and more as device sizes increase and
reliability goes the other way).

Could this (another drive failure) be the case?
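
One rough way to check is to look in /proc/mdstat for members flagged
(F).  The Python sketch below is only an illustration: failed_members()
and the SAMPLE text are made up for this example (the sample is not
output from your box), and it assumes the usual "name[N](F)" marking of
faulty members on the array's status line.

```python
import re

# Hypothetical sample in /proc/mdstat layout, used only as a fallback
# when the real file is not readable. Not taken from the reporter's system.
SAMPLE = """\
Personalities : [raid5]
md1 : active raid5 sdd1[4] sdc1[2](F) sdb1[1] sda1[0]
      597424896 blocks level 5, 64k chunk, algorithm 2 [4/2] [UU__]

unused devices: <none>
"""


def failed_members(mdstat_text):
    """Map each md array name to the list of members flagged (F)."""
    result = {}
    for line in mdstat_text.splitlines():
        # Array status lines look like "md1 : active raid5 sda1[0] ..."
        m = re.match(r"^(md\d+)\s*:\s*(.*)$", line)
        if not m:
            continue
        name, rest = m.groups()
        # A faulty member is written as "device[slot](F)".
        result[name] = re.findall(r"([^\s\[]+)\[\d+\]\(F\)", rest)
    return result


if __name__ == "__main__":
    try:
        with open("/proc/mdstat") as f:
            text = f.read()
    except OSError:
        text = SAMPLE
    for array, failed in failed_members(text).items():
        print(array, "failed members:", ", ".join(failed) or "none")
```

If an array shows an (F) member in addition to the disk you already
replaced, that would point at a second failure during the rebuild.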

NeilBrown