RE: Array rebuilding over and over

From: NeilBrown [neilb@xxxxxxx]
Sent: Friday, August 17, 2012 5:17 PM
To: Singer, Benedict
Cc: linux-raid@xxxxxxxxxxxxxxx
Subject: Re: Array rebuilding over and over

On Fri, 17 Aug 2012 17:53:51 +0000 "Singer, Benedict" <bzsing@xxxxxxxxx>
wrote:

> Hello all,
>
> Got a strange issue here. The setup is as follows: 4 RAID5 arrays of 3 members each, where each member is a partition on one of 3 physical drives. We run some monitoring software to keep them all happy, i.e. removing a drive from all arrays when it goes bad in any one of them, partitioning and adding new drives to all of them, tracking rebuild progress, etc. The situation we're in now is that it's somehow possible to get an array into a state where the rebuild consistently fails; the monitoring software then detects the degraded array plus the available physical disk and starts a new rebuild, which fails again (and again, and again...).
>
> The really baffling part is that no information is present in any log about why the rebuild fails, making this very hard to diagnose. The array appears to start rebuilding, fail, and finish all within a second, suggesting it's not actually trying to rebuild but that something else is stopping it. The smartctl short self-test reports no errors on any disk. This happens most of the time with md4, and the only way out is to reboot the system and try the rebuild again, which suggests a kernel issue rather than a hardware issue. I have seen it once on a different array, but in that case the rebuild worked the second time, whereas with md4 it appears to be impossible to get out of this state without rebooting. While it's in this state I can remove the disk from md4, add it back, and idle the other rebuild to trigger the failure on demand, so if there are any debugging suggestions it's easy to reproduce at this point.
>
> I've checked the kernel config to see if there is any other block/md logging to turn on, but didn't find any; watching the console and dmesg doesn't show any other useful information either.
>
> Here's the relevant info:
>
> uname:
> Linux showstore-1 3.2.23 #1 SMP Mon Aug 13 21:19:58 UTC 2012 i686 GNU/Linux
>
> mdadm:
> mdadm - v3.1.4 - 31st August 2010
>
> /proc/mdstat - this is in the middle of a rebuild cycle (2/4 finished, 1/4 in progress, and the last one (md4) will fail and start the whole process again):
> Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
> md4 : active raid5 sda4[3] sdc4[2] sdb4[1]
>       2096640 blocks level 5, 256k chunk, algorithm 2 [3/2] [_UU]
>       resync=DELAYED
>
> md126 : active raid5 sda3[4] sdc3[5] sdb3[3]
>       1936479232 blocks super 1.2 level 5, 256k chunk, algorithm 2 [3/2] [_UU]
>       [=========>...........]  recovery = 49.1% (476073216/968239616) finish=92.8min speed=88333K/sec
>
> md127 : active raid5 sda2[4] sdc2[5] sdb2[3]
>       8386048 blocks super 1.2 level 5, 256k chunk, algorithm 2 [3/3] [UUU]
>
> md1 : active raid5 sda1[0] sdc1[2] sdb1[1]
>       6290944 blocks level 5, 256k chunk, algorithm 2 [3/3] [UUU]
>
> unused devices: <none>
>
> Log entries from syslog and messages from the last time the failure occurred:
>
> syslog (covering the full rebuild cycle on all 4 arrays):
> Aug 17 05:13:20 showstore-1 mdadm[5991]: RebuildStarted event detected on md device /dev/md1
> Aug 17 05:13:24 showstore-1 kernel: EXT2-fs (md1): warning: maximal mount count reached, running e2fsck is recommended
> Aug 17 05:13:26 showstore-1 kernel: scsi_verify_blk_ioctl: 58 callbacks suppressed
> Aug 17 05:13:26 showstore-1 kernel: grub-probe: sending ioctl 1261 to a partition!
> Aug 17 05:14:00 showstore-1 kernel: last message repeated 9 times
> Aug 17 05:14:00 showstore-1 mdadm[5991]: RebuildStarted event detected on md device /dev/md/status
> Aug 17 05:14:00 showstore-1 mdadm[5991]: RebuildFinished event detected on md device /dev/md1
> Aug 17 05:14:01 showstore-1 mdadm[5991]: SpareActive event detected on md device /dev/md1, component device /dev/sda1
> Aug 17 05:14:29 showstore-1 kernel: EXT2-fs (md1): warning: maximal mount count reached, running e2fsck is recommended
> Aug 17 05:14:29 showstore-1 kernel: scsi_verify_blk_ioctl: 612 callbacks suppressed
> Aug 17 05:14:29 showstore-1 kernel: grub-probe: sending ioctl 1261 to a partition!
> Aug 17 05:14:49 showstore-1 kernel: last message repeated 9 times
> Aug 17 05:14:49 showstore-1 mdadm[5991]: RebuildStarted event detected on md device /dev/md/content
> Aug 17 05:14:49 showstore-1 mdadm[5991]: RebuildFinished event detected on md device /dev/md/status
> Aug 17 05:14:49 showstore-1 mdadm[5991]: SpareActive event detected on md device /dev/md/status
> Aug 17 05:48:10 showstore-1 mdadm[5991]: Rebuild20 event detected on md device /dev/md/content
> Aug 17 06:38:10 showstore-1 mdadm[5991]: Rebuild48 event detected on md device /dev/md/content
> Aug 17 07:11:31 showstore-1 mdadm[5991]: Rebuild66 event detected on md device /dev/md/content
> Aug 17 07:44:51 showstore-1 mdadm[5991]: Rebuild82 event detected on md device /dev/md/content
> Aug 17 08:29:41 showstore-1 kernel: md/raid:md4: Disk failure on sda4, disabling device.
> Aug 17 08:29:41 showstore-1 kernel: md/raid:md4: Operation continuing on 2 devices.
> Aug 17 08:29:41 showstore-1 mdadm[5991]: FailSpare event detected on md device /dev/md4, component device /dev/sda4
> Aug 17 08:29:41 showstore-1 mdadm[5991]: RebuildFinished event detected on md device /dev/md/content
> Aug 17 08:29:41 showstore-1 mdadm[5991]: SpareActive event detected on md device /dev/md/content
>
> messages (also covering the full rebuild cycle):
> Aug 17 05:13:20 showstore-1 kernel: md: bind<sda1>
> Aug 17 05:13:20 showstore-1 kernel: md: recovery of RAID array md1
> Aug 17 05:13:20 showstore-1 kernel: md: minimum _guaranteed_  speed: 20000 KB/sec/disk.
> Aug 17 05:13:20 showstore-1 kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
> Aug 17 05:13:20 showstore-1 kernel: md: using 128k window, over a total of 3145472k.
> Aug 17 05:13:21 showstore-1 kernel: md: bind<sda2>
> Aug 17 05:13:21 showstore-1 kernel: md: delaying recovery of md127 until md1 has finished (they share one or more physical units)
> Aug 17 05:13:21 showstore-1 kernel: md: bind<sda3>
> Aug 17 05:13:22 showstore-1 kernel: md: delaying recovery of md126 until md127 has finished (they share one or more physical units)
> Aug 17 05:13:22 showstore-1 kernel: md: delaying recovery of md127 until md1 has finished (they share one or more physical units)
> Aug 17 05:13:22 showstore-1 kernel: md: bind<sda4>
> Aug 17 05:13:22 showstore-1 kernel: md: delaying recovery of md4 until md126 has finished (they share one or more physical units)
> Aug 17 05:13:22 showstore-1 kernel: md: delaying recovery of md127 until md1 has finished (they share one or more physical units)
> Aug 17 05:13:22 showstore-1 kernel: md: delaying recovery of md126 until md127 has finished (they share one or more physical units)
> Aug 17 05:13:26 showstore-1 kernel: grub-probe[21583]: segfault at 18 ip 08049057 sp bfa87b60 error 4 in grub-probe[8048000+1f000]
> Aug 17 05:13:59 showstore-1 kernel: md: md1: recovery done.
> Aug 17 05:14:00 showstore-1 kernel: md: delaying recovery of md126 until md127 has finished (they share one or more physical units)
> Aug 17 05:14:00 showstore-1 kernel: md: recovery of RAID array md127
> Aug 17 05:14:00 showstore-1 kernel: md: minimum _guaranteed_  speed: 20000 KB/sec/disk.
> Aug 17 05:14:00 showstore-1 kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
> Aug 17 05:14:00 showstore-1 kernel: md: using 128k window, over a total of 4193024k.
> Aug 17 05:14:00 showstore-1 kernel: md: delaying recovery of md4 until md126 has finished (they share one or more physical units)
> Aug 17 05:14:49 showstore-1 kernel: md: md127: recovery done.
> Aug 17 05:14:49 showstore-1 kernel: md: delaying recovery of md4 until md126 has finished (they share one or more physical units)
> Aug 17 05:14:49 showstore-1 kernel: md: recovery of RAID array md126
> Aug 17 05:14:49 showstore-1 kernel: md: minimum _guaranteed_  speed: 20000 KB/sec/disk.
> Aug 17 05:14:49 showstore-1 kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
> Aug 17 05:14:49 showstore-1 kernel: md: using 128k window, over a total of 968239616k.
> Aug 17 05:40:27 showstore-1 rsyslogd: -- MARK --
> Aug 17 06:00:27 showstore-1 rsyslogd: -- MARK --
> Aug 17 06:20:28 showstore-1 rsyslogd: -- MARK --
> Aug 17 06:40:28 showstore-1 rsyslogd: -- MARK --
> Aug 17 07:00:28 showstore-1 rsyslogd: -- MARK --
> Aug 17 07:20:28 showstore-1 rsyslogd: -- MARK --
> Aug 17 07:40:28 showstore-1 rsyslogd: -- MARK --
> Aug 17 08:00:28 showstore-1 rsyslogd: -- MARK --
> Aug 17 08:20:28 showstore-1 rsyslogd: -- MARK --
> Aug 17 08:29:41 showstore-1 kernel: md: md126: recovery done.
> Aug 17 08:29:41 showstore-1 kernel: md: recovery of RAID array md4
> Aug 17 08:29:41 showstore-1 kernel: md: minimum _guaranteed_  speed: 20000 KB/sec/disk.
> Aug 17 08:29:41 showstore-1 kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
> Aug 17 08:29:41 showstore-1 kernel: md: using 128k window, over a total of 1048320k.
> Aug 17 08:29:41 showstore-1 kernel: md: md4: recovery done.
>
> Any suggestions gratefully received!

Very strange indeed.

I assume you've tried disabling all of your monitoring code and can still
reproduce the problem with md4??

If you have dynamic debugging enabled in the kernel you could try

 echo file raid5.c +p > /sys/kernel/debug/dynamic_debug/control

(I think that is the right syntax), then reproduce the problem, then

 echo file raid5.c -p > /sys/kernel/debug/dynamic_debug/control

and post the debugging messages that appeared in the kernel.  There might be a
hint there.
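
A quick way to check whether the option is there at all (this assumes your
kernel config is installed under /boot, or that /proc/config.gz is enabled;
adjust as needed):

 grep DYNAMIC_DEBUG /boot/config-$(uname -r)
 # or, if the kernel exposes its own config:
 zgrep DYNAMIC_DEBUG /proc/config.gz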

I can't think of anything else at the moment, though maybe posting the
  mdadm -E
output of the devices in md4 might help.

NeilBrown
---------------------------
Unless the problem has to do with the way we partition the disk and add it to the arrays, the monitoring code should have no effect; indeed, with the array in this state I can repeat the following sequence over and over manually with mdadm: remove the failed disk from md4, add it back to md4, idle the rebuild on md126 -> md4 fails to rebuild.
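
Concretely, the sequence I run by hand looks roughly like this (device names as in the mdstat above; writing "idle" to sync_action is simply how I interrupt the md126 recovery so md4's gets scheduled):

 # the member has already been kicked out of md4 as faulty at this point
 mdadm /dev/md4 --remove /dev/sda4
 mdadm /dev/md4 --add /dev/sda4
 # interrupt the md126 recovery so the delayed md4 recovery runs next
 echo idle > /sys/block/md126/md/sync_action
 # md4 then starts recovering and immediately "finishes", as in the log above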

I don't have dynamic kernel debugging enabled, but I'll try to get that set up next week and see what I can find; thanks for the tip! Roughly what I expect that to involve is sketched below, followed by the -E output for all 3 partitions in that array, which looks fine.
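
A rough outline of the plan (assuming we rebuild from the same 3.2.23 tree we ship, and that debugfs ends up mounted at /sys/kernel/debug; the log filename is just an example):

 # 1. rebuild the kernel with CONFIG_DYNAMIC_DEBUG=y (under "Kernel hacking"
 #    in make menuconfig), install it and reboot
 # 2. make sure debugfs is mounted
 mount -t debugfs none /sys/kernel/debug 2>/dev/null
 # 3. turn on the pr_debug() output in drivers/md/raid5.c
 echo file raid5.c +p > /sys/kernel/debug/dynamic_debug/control
 # 4. reproduce: remove/re-add sda4 and idle the md126 rebuild as above
 # 5. turn the debug output back off and save whatever was logged
 echo file raid5.c -p > /sys/kernel/debug/dynamic_debug/control
 dmesg > /tmp/md4-raid5-debug.log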

# mdadm -E /dev/sda4
/dev/sda4:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 675968fd:4ca97ba9:3d186b3c:53958f34
  Creation Time : Wed May 23 10:57:50 2012
     Raid Level : raid5
  Used Dev Size : 1048320 (1023.92 MiB 1073.48 MB)
     Array Size : 2096640 (2047.84 MiB 2146.96 MB)
   Raid Devices : 3
  Total Devices : 3
Preferred Minor : 4

    Update Time : Fri Aug 17 11:48:00 2012
          State : clean
 Active Devices : 2
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 1
       Checksum : 8ddbecae - correct
         Events : 978

         Layout : left-symmetric
     Chunk Size : 256K

      Number   Major   Minor   RaidDevice State
this     3       8        4        3      spare   /dev/sda4

   0     0       0        0        0      removed
   1     1       8       20        1      active sync   /dev/sdb4
   2     2       8       36        2      active sync   /dev/sdc4
   3     3       8        4        3      spare   /dev/sda4
# mdadm -E /dev/sdb4
/dev/sdb4:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 675968fd:4ca97ba9:3d186b3c:53958f34
  Creation Time : Wed May 23 10:57:50 2012
     Raid Level : raid5
  Used Dev Size : 1048320 (1023.92 MiB 1073.48 MB)
     Array Size : 2096640 (2047.84 MiB 2146.96 MB)
   Raid Devices : 3
  Total Devices : 3
Preferred Minor : 4

    Update Time : Fri Aug 17 11:48:00 2012
          State : clean
 Active Devices : 2
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 1
       Checksum : 8ddbecc0 - correct
         Events : 978

         Layout : left-symmetric
     Chunk Size : 256K

      Number   Major   Minor   RaidDevice State
this     1       8       20        1      active sync   /dev/sdb4

   0     0       0        0        0      removed
   1     1       8       20        1      active sync   /dev/sdb4
   2     2       8       36        2      active sync   /dev/sdc4
   3     3       8        4        3      spare   /dev/sda4
# mdadm -E /dev/sdc4
/dev/sdc4:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 675968fd:4ca97ba9:3d186b3c:53958f34
  Creation Time : Wed May 23 10:57:50 2012
     Raid Level : raid5
  Used Dev Size : 1048320 (1023.92 MiB 1073.48 MB)
     Array Size : 2096640 (2047.84 MiB 2146.96 MB)
   Raid Devices : 3
  Total Devices : 3
Preferred Minor : 4

    Update Time : Fri Aug 17 11:48:00 2012
          State : clean
 Active Devices : 2
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 1
       Checksum : 8ddbecd2 - correct
         Events : 978

         Layout : left-symmetric
     Chunk Size : 256K

      Number   Major   Minor   RaidDevice State
this     2       8       36        2      active sync   /dev/sdc4

   0     0       0        0        0      removed
   1     1       8       20        1      active sync   /dev/sdb4
   2     2       8       36        2      active sync   /dev/sdc4
   3     3       8        4        3      spare   /dev/sda4

Benedict

