Thanks Neil,

I just gave this patched module a shot on four systems. So far, I haven't seen the device number inappropriately increment, though as per a mail I sent a short while ago, that already seemed to be remedied by using the 1.2 superblock, for some reason. However, the patch appears to have introduced a new issue, and another remains unresolved by it; both are described below.
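For reference, the arrays under test are 14-member RAID10 sets (offset layout, 512K chunk, 1.2 metadata), per the -D output further down. A create invocation along these lines reproduces that shape; the device list and md name here are placeholders, not necessarily the exact command I used:

  mdadm --create /dev/md0 --metadata=1.2 --level=10 --layout=o2 \
        --chunk=512 --raid-devices=14 /dev/dm-{0..13}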
// BUG 1
The single-command syntax to fail and remove a drive is still failing; I do not know whether this is somehow contributing to the further (new) issue below:
[root@gtmp06 tmp]# mdadm /dev/md0 --fail /dev/dm-0 --remove /dev/dm-0
mdadm: set /dev/dm-0 faulty in /dev/md0
mdadm: hot remove failed for /dev/dm-0: Device or resource busy
[root@gtmp06 tmp]# mdadm /dev/md0 --remove /dev/dm-0
mdadm: hot removed /dev/dm-0

// BUG 2
Now, upon adding or re-adding a "fail...remove"'d drive, it is not used for resync. I had previously noticed that newly added drives weren't resynced until the existing array build was done, at which point they were picked up. This, however, is a clean/active array that is rejecting the drive.
I've performed this identically on both a clean/active array and a newly-created (resyncing) array, to the same effect. Even after a rebuild or reboot, the removed drive isn't taken back: it remains listed as a "faulty spare", with dmesg indicating that it is "non-fresh". The dmesg line and array status follow; the re-add sequence I plan to try next is sketched below.
// DMESG:
md: kicking non-fresh dm-0 from array!

// ARRAY status ('mdadm -D /dev/md0'):
             State : active, degraded
    Active Devices : 13
   Working Devices : 13
    Failed Devices : 1
     Spare Devices : 0

            Layout : near=1, offset=2
        Chunk Size : 512K

              Name : 0
              UUID : 05c2faf4:facfcad3:ba33b140:100f428a
            Events : 22

    Number   Major   Minor   RaidDevice State
       0     253        1        0      active sync   /dev/dm-1
       1     253        2        1      active sync   /dev/dm-2
       2     253        5        2      active sync   /dev/dm-5
       3     253        4        3      active sync   /dev/dm-4
       4     253        6        4      active sync   /dev/dm-6
       5     253        3        5      active sync   /dev/dm-3
       6     253       13        6      active sync   /dev/dm-13
       7       0        0        7      removed
       8     253        7        8      active sync   /dev/dm-7
       9     253        8        9      active sync   /dev/dm-8
      10     253        9       10      active sync   /dev/dm-9
      11     253       11       11      active sync   /dev/dm-11
      12     253       10       12      active sync   /dev/dm-10
      13     253       12       13      active sync   /dev/dm-12

       7     253        0        -      faulty spare   /dev/dm-0

Let me know what more I can do to help track this down. I'm reverting this patch, since it is behaving less well than before. I'll be happy to try others.
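For the record, this is the re-add sequence mentioned above that I plan to try next, on the theory that the stale superblock (with its old event count) is what triggers the non-fresh kick. It is only a sketch, and note that --zero-superblock wipes the md metadata on the member, so it is only appropriate for a device that is meant to come back as a blank spare:

  # Fail and remove as two separate steps, retrying the remove briefly,
  # since the combined fail+remove invocation reports "Device or resource busy".
  mdadm /dev/md0 --fail /dev/dm-0
  for i in 1 2 3 4 5; do
      mdadm /dev/md0 --remove /dev/dm-0 && break
      sleep 1
  done

  # Wipe the stale superblock so the old event count cannot cause the member
  # to be kicked as "non-fresh", then add it back as a fresh spare.
  mdadm --zero-superblock /dev/dm-0
  mdadm /dev/md0 --add /dev/dm-0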
Attached are typescripts of the drive remove/add sessions and all output.

/eli

Neil Brown wrote:
On Friday October 6, estair@xxxxxxx wrote:
>
> This patch has resolved the immediate issue I was having on 2.6.18 with
> RAID10. Previous to this change, after removing a device from the array
> (with mdadm --remove), physically pulling the device and
> changing/re-inserting, the "Number" of the new device would be
> incremented on top of the highest-present device in the array. Now, it
> resumes its previous place.
>
> Does this look to be 'correct' output for a 14-drive array, which dev 8
> was failed/removed from then "add"'ed? I'm trying to determine why the
> device doesn't get pulled back into the active configuration and
> re-synced. Any comments?

Does this patch help?

Fix count of degraded drives in raid10.

Signed-off-by: Neil Brown <neilb@xxxxxxx>

### Diffstat output
 ./drivers/md/raid10.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff .prev/drivers/md/raid10.c ./drivers/md/raid10.c
--- .prev/drivers/md/raid10.c   2006-10-09 14:18:00.000000000 +1000
+++ ./drivers/md/raid10.c       2006-10-05 20:10:07.000000000 +1000
@@ -2079,7 +2079,7 @@ static int run(mddev_t *mddev)
                disk = conf->mirrors + i;
 
                if (!disk->rdev ||
-                   !test_bit(In_sync, &rdev->flags)) {
+                   !test_bit(In_sync, &disk->rdev->flags)) {
                        disk->head_position = 0;
                        mddev->degraded++;
                }

NeilBrown
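For completeness, the degraded/working counts that this code touches can be eyeballed from userspace with nothing fancier than the following (just a sanity check, before and after applying the patch):

  cat /proc/mdstat
  mdadm --detail /dev/md0 | grep -E 'State :|Devices :'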
Attachment: gzKY3Inxrxoy.gz
Description: GNU Zip compressed data

Attachment: gzHKRDSqUyeA.gz
Description: GNU Zip compressed data