Re: [PATCH] md: Fix bug where new drives added to an md array sometimes don't sync properly.

It looks like this issue isn't fully resolved after all. After spending some time trying to get the re-added drive to sync, I removed and added it again, which reproduced the behaviour I saw previously: the drive loses its original numeric position and becomes "14".

This now looks 100% repeatable and appears to be a race condition. One item of note: if I build the array with a version 1.2 superblock, the mis-numbering behaviour seems to disappear (I've run through the procedure five times since without a recurrence).
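
(For reference, an array with the newer superblock can be created along these lines -- the device list and options below are illustrative only, not the exact command used:)

[root@gtmp03 ~]# mdadm --create /dev/md0 --metadata=1.2 --level=10 \
        --raid-devices=14 /dev/dm-[0-9] /dev/dm-1[0-3]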

Doing a single-command fail/remove fails the device but errors on removal:

[root@gtmp03 ~]# mdadm /dev/md0 --fail /dev/dm-13 --remove /dev/dm-13
mdadm: set /dev/dm-13 faulty in /dev/md0
mdadm: hot remove failed for /dev/dm-13: Device or resource busy
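
Splitting the fail and remove into separate invocations (so md has a moment to finish marking the device faulty before the hot-remove) should sidestep the busy error -- an untested sketch, using the same device as above:

[root@gtmp03 ~]# mdadm /dev/md0 --fail /dev/dm-13
[root@gtmp03 ~]# sleep 1
[root@gtmp03 ~]# mdadm /dev/md0 --remove /dev/dm-13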
    Number   Major   Minor   RaidDevice State
       0     253        0        0      active sync   /dev/dm-0
       1     253        1        1      active sync   /dev/dm-1
       2     253        2        2      active sync   /dev/dm-2
       3     253        3        3      active sync   /dev/dm-3
       4     253        4        4      active sync   /dev/dm-4
       5     253        5        5      active sync   /dev/dm-5
       6     253        6        6      active sync   /dev/dm-6
       7     253        7        7      active sync   /dev/dm-7
       8       0        0        8      removed
       9     253        9        9      active sync   /dev/dm-9
      10     253       10       10      active sync   /dev/dm-10
      11     253       11       11      active sync   /dev/dm-11
      12     253       12       12      active sync   /dev/dm-12
      13     253       13       13      active sync   /dev/dm-13

      14     253        8        -      spare   /dev/dm-8
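
(To check what the superblock on the mis-numbered device actually recorded for its slot, examining it directly might be informative, e.g.:)

[root@gtmp03 ~]# mdadm --examine /dev/dm-8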



Eli Stair wrote:

This patch has resolved the immediate issue I was having on 2.6.18 with
RAID10.  Prior to this change, after removing a device from the array
(with mdadm --remove), physically pulling the device, and
swapping/re-inserting it, the "Number" of the new device would be
incremented past the highest-numbered device present in the array.  Now
it resumes its previous place.

Does this look like 'correct' output for a 14-drive array from which dev
8 was failed/removed and then "add"ed?  I'm trying to determine why the
device doesn't get pulled back into the active configuration and
re-synced.  Any comments?

Thanks!

/eli

For example, currently when device dm-8 is removed and added back, it shows up like this:



     Number   Major   Minor   RaidDevice State
        0     253        0        0      active sync   /dev/dm-0
        1     253        1        1      active sync   /dev/dm-1
        2     253        2        2      active sync   /dev/dm-2
        3     253        3        3      active sync   /dev/dm-3
        4     253        4        4      active sync   /dev/dm-4
        5     253        5        5      active sync   /dev/dm-5
        6     253        6        6      active sync   /dev/dm-6
        7     253        7        7      active sync   /dev/dm-7
        8       0        0        8      removed
        9     253        9        9      active sync   /dev/dm-9
       10     253       10       10      active sync   /dev/dm-10
       11     253       11       11      active sync   /dev/dm-11
       12     253       12       12      active sync   /dev/dm-12
       13     253       13       13      active sync   /dev/dm-13

        8     253        8        -      spare   /dev/dm-8


Previously, however, it would come back with the "Number" as 14, not 8 as
it should.  Shortly thereafter things got all out of whack, in addition
to just not working properly :)  Now I've just got to figure out how to
get the re-introduced drive to participate in the array again like it
should.
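
(One approach that might force a clean resync -- untested here -- is to
wipe the stale superblock on the re-inserted disk before adding it back,
so md treats it as a brand-new spare:)

[root@gtmp03 ~]# mdadm /dev/md0 --remove /dev/dm-8
[root@gtmp03 ~]# mdadm --zero-superblock /dev/dm-8
[root@gtmp03 ~]# mdadm /dev/md0 --add /dev/dm-8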

Eli Stair wrote:
 >
 >
 > I'm actually seeing similar behaviour on RAID10 (2.6.18): after
 > removing a drive from an array, re-adding it sometimes results in it
 > still being listed as a faulty-spare and not being "taken" for resync.
 > In the same scenario, after swapping drives, doing a fail and remove
 > followed by an 'add' doesn't work; only a re-add will even get the
 > drive listed by mdadm.
 >
 >
 > What's the failure mode/symptoms that this patch is resolving?
 >
 > Is it possible this affects the RAID10 module/mode as well?  If not,
 > I'll start a new thread for that.  I'm testing this patch to see if it
 > does remedy the situation on RAID10, and will update after some
 > significant testing.
 >
 >
 > /eli
 >
 >
 >
 >
 >
 >
 >
 >
 > NeilBrown wrote:
 >  > There is a nasty bug in md in 2.6.18 affecting at least raid1.
 >  > This fixes it (and has already been sent to stable@xxxxxxxxxx).
 >  >
 >  > ### Comments for Changeset
 >  >
 >  > This fixes a bug introduced in 2.6.18.
 >  >
 >  > If a drive is added to a raid1 using older tools (mdadm-1.x or
 >  > raidtools) then it will be included in the array without any resync
 >  > happening.
 >  >
 >  > It has been submitted for 2.6.18.1.
 >  >
 >  >
 >  > Signed-off-by: Neil Brown <neilb@xxxxxxx>
 >  >
 >  > ### Diffstat output
 >  >  ./drivers/md/md.c |    1 +
 >  >  1 file changed, 1 insertion(+)
 >  >
 >  > diff .prev/drivers/md/md.c ./drivers/md/md.c
 >  > --- .prev/drivers/md/md.c       2006-09-29 11:51:39.000000000 +1000
 >  > +++ ./drivers/md/md.c   2006-10-05 16:40:51.000000000 +1000
 >  > @@ -3849,6 +3849,7 @@ static int hot_add_disk(mddev_t * mddev,
 >  >         }
 >  >         clear_bit(In_sync, &rdev->flags);
 >  >         rdev->desc_nr = -1;
 >  > +       rdev->saved_raid_disk = -1;
 >  >         err = bind_rdev_to_array(rdev, mddev);
 >  >         if (err)
 >  >                 goto abort_export;

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
