Re: RAID1 removing failed disk returns EBUSY

----- Original Message -----
> From: "Joe Lawrence" <joe.lawrence@xxxxxxxxxxx>
> To: "XiaoNi" <xni@xxxxxxxxxx>
> Cc: "NeilBrown" <neilb@xxxxxxx>, linux-raid@xxxxxxxxxxxxxxx, "Bill Kuzeja" <william.kuzeja@xxxxxxxxxxx>
> Sent: Thursday, January 15, 2015 9:22:10 PM
> Subject: Re: RAID1 removing failed disk returns EBUSY
> 
> On Wed, 14 Jan 2015 20:41:16 +0800
> XiaoNi <xni@xxxxxxxxxx> wrote:
> 
> > On 11/17/2014 07:03 AM, NeilBrown wrote:
> > > On Thu, 13 Nov 2014 09:05:49 -0500 Joe Lawrence <joe.lawrence@xxxxxxxxxxx>
> > > wrote:
> > >
> > >> On Wed, 29 Oct 2014 13:36:04 -0400
> > >> Joe Lawrence <joe.lawrence@xxxxxxxxxxx> wrote:
> > >>
> > >>> On Wed, 29 Oct 2014 08:41:13 +1100
> > >>> NeilBrown <neilb@xxxxxxx> wrote:
> > >>>
> > >>>> On Mon, 27 Oct 2014 16:27:48 -0400 Joe Lawrence <joe.lawrence@xxxxxxxxxxx> wrote:
> > >>>>
> > >>>>> Hi Neil,
> > >>>>>
> > >>>>> We've encountered changes in MD and mdadm that have broken our
> > >>>>> automated disk removal script.  In the past, we've been able to run
> > >>>>> the following after a RAID1 disk component removal:
> > >>>>>
> > >>>>> % echo fail > /sys/block/md3/md/dev-sdr5/state
> > >>>>> % echo remove > /sys/block/md3/md/dev-sdr5/state
> > >>>>>
> > >>>>> However, the latest RHEL6.6 code drop has rebased to sufficiently
> > >>>>> recent MD kernel and mdadm changes, in which the previous commands
> > >>>>> occasionally fail like so:
> > >>>>>
> > >>>>> * MD array is usually resyncing or checking
> > >>>>> * Component disk /dev/sdr removed via HBA sysfs PCI removal
> > >>>>> * Following UDEV rule fires:
> > >>>>>
> > >>>>> SUBSYSTEM=="block", ACTION=="remove", ENV{ID_PATH}=="?*", \
> > >>>>>          RUN+="/sbin/mdadm -If $name --path $env{ID_PATH}"
> > >>>>>
> > >>>>> % mdadm --detail /dev/md3
> > >>>>> /dev/md3:
> > >>>>>          Version : 1.1
> > >>>>>    Creation Time : Tue Oct 14 17:31:59 2014
> > >>>>>       Raid Level : raid1
> > >>>>>       Array Size : 25149440 (23.98 GiB 25.75 GB)
> > >>>>>    Used Dev Size : 25149440 (23.98 GiB 25.75 GB)
> > >>>>>     Raid Devices : 2
> > >>>>>    Total Devices : 2
> > >>>>>      Persistence : Superblock is persistent
> > >>>>>
> > >>>>>    Intent Bitmap : Internal
> > >>>>>
> > >>>>>      Update Time : Wed Oct 15 14:22:34 2014
> > >>>>>            State : active, degraded
> > >>>>>   Active Devices : 1
> > >>>>> Working Devices : 1
> > >>>>>   Failed Devices : 1
> > >>>>>    Spare Devices : 0
> > >>>>>
> > >>>>>             Name : localhost.localdomain:3
> > >>>>>             UUID : 40ed68ee:ba41d4cd:28c361ed:be7470b8
> > >>>>>           Events : 142
> > >>>>>
> > >>>>>      Number   Major   Minor   RaidDevice State
> > >>>>>         0      65       21        0      faulty
> > >>>>>         1      65        5        1      active sync   /dev/sdj5
> > >>>>>
> > >>>>> All attempts to remove this device fail:
> > >>>>>
> > >>>>> % echo remove > /sys/block/md3/md/dev-sdr5/state
> > >>>>> -bash: echo: write error: Device or resource busy
> > >>>>>
> > >>>>> This can be traced to state_store():
> > >>>>>
> > >>>>>          } else if (cmd_match(buf, "remove")) {
> > >>>>>                  if (rdev->raid_disk >= 0)
> > >>>>>                          err = -EBUSY;
> > >>>>>
> > >>>>> After much debugging and systemtapping, I think I've figured out
> > >>>>> that the sysfs scripting may fail after the following combination
> > >>>>> of changes:
> > >>>>>
> > >>>>> mdadm  8af530b07fce "Enhance incremental removal."
> > >>>>> kernel 30b8feb730f9 "md/raid5: avoid deadlock when raid5 array has
> > >>>>>                      unack badblocks during md_stop_writes"
> > >>>>>
> > >>>>> With these two changes:
> > >>>>>
> > >>>>> 1 - On the user side, mdadm is trying to set the array_state to
> > >>>>>     read-auto on incremental removal (as invoked by UDEV rule).
> > >>>>>
> > >>>>> 2 - Kernel side, md_set_readonly() will set the MD_RECOVERY_FROZEN
> > >>>>>     flag, wake up the mddev->thread and if there is a sync_thread,
> > >>>>>     it will set MD_RECOVERY_INTR and then wait until the sync_thread
> > >>>>>     is set to NULL.
> > >>>>>
> > >>>>>     When md_check_recovery() gets a chance to run as part of the
> > >>>>>     raid1d() mddev->thread, it may or may not ever get to an
> > >>>>>     invocation of remove_and_add_spares(), as there are *many*
> > >>>>>     conditional early exits along the way -- for example, if
> > >>>>>     MD_RECOVERY_FROZEN is set, the following condition will bounce
> > >>>>>     out of the routine:
> > >>>>>
> > >>>>>          if (!test_and_clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
> > >>>>>              test_bit(MD_RECOVERY_FROZEN, &mddev->recovery))
> > >>>>>                  goto unlock;
> > >>>>>
> > >>>>>     the next time around, MD_RECOVERY_NEEDED will have been cleared,
> > >>>>>     so all future tests will return 0 and the negation will always
> > >>>>>     take the early exit path.
> > >>>>>
> > >>>>>     Back in md_set_readonly(), it may notice that the MD is still in
> > >>>>>     use, so it clears the MD_RECOVERY_FROZEN and then returns -EBUSY,
> > >>>>>     without setting mddev->ro.  But the damage has been done as
> > >>>>>     conditions have been set such that md_check_recovery() will never
> > >>>>>     call remove_and_add_spares().
> > >>>>>
> > >>>>> This would also explain why an "idle" sync_action clears the wedge:
> > >>>>> it sets MD_RECOVERY_NEEDED allowing md_check_recovery() to continue
> > >>>>> executing to remove_and_add_spares().
> > >>>>>
> > >>>>> As far as I can tell, this is what is happening to prevent the
> > >>>>> "remove" write to /sys/block/md3/md/dev-sdr5/state from succeeding.
> > >>>>> There are certainly a lot of little bit-states between disk removal,
> > >>>>> UDEV mdadm, and various MD kernel threads, so apologies if I missed
> > >>>>> an important transition.
> > >>>>>
> > >>>>> Would you consider writing "idle" to the MD array sync_action file
> > >>>>> as a safe and reasonable intermediate workaround step for our script?
> > >>>>>
> > >>>>> And of course, any suggestions as to whether this is intended
> > >>>>> behavior (i.e., the removed component disk is failed, but stuck in
> > >>>>> the array)?
> > >>>>>
> > >>>>> This is fairly easy for us to reproduce with multiple MD arrays per
> > >>>>> disk (one per partition) and interrupting a raid check on all of
> > >>>>> them (especially when they are delayed waiting for the first to
> > >>>>> finish) by removing the component disk via sysfs PCI removal.  We
> > >>>>> can provide additional debug or testing if required.
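> > >>>>>
> > >>>>> Roughly, the reproduction amounts to the following (a sketch only;
> > >>>>> the md names, the PCI address, and the component device entry are
> > >>>>> placeholders for whatever the test system actually uses):
> > >>>>>
> > >>>>> % for md in md1 md2 md3; do echo check > /sys/block/$md/md/sync_action; done
> > >>>>> % echo 1 > /sys/bus/pci/devices/0000:0b:00.0/remove   # hot-remove the HBA; UDEV then runs mdadm -If
> > >>>>> % echo remove > /sys/block/md3/md/dev-sdr5/state      # occasionally fails with EBUSY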
> > >>>>>
> > >>>> Hi Joe,
> > >>>>   thanks for the detailed analysis!!
> > >>>>
> > >>>> I think the correct fix would be that MD_RECOVERY_NEEDED should be
> > >>>> set after clearing MD_RECOVERY_FROZEN, like the patch below.
> > >>>> Can you confirm that it works for you?
> > >>>>
> > >>>> Writing 'idle' should in general be safe, so that could be used as an
> > >>>> interim.
> > >>>>
> > >>>> Thanks,
> > >>>> NeilBrown
> > >>>>
> > >>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
> > >>>> index c03d87b6890a..2c73fcb82593 100644
> > >>>> --- a/drivers/md/md.c
> > >>>> +++ b/drivers/md/md.c
> > >>>> @@ -5261,6 +5261,7 @@ static int md_set_readonly(struct mddev *mddev,
> > >>>> struct block_device *bdev)
> > >>>>   		printk("md: %s still in use.\n",mdname(mddev));
> > >>>>   		if (did_freeze) {
> > >>>>   			clear_bit(MD_RECOVERY_FROZEN,&mddev->recovery);
> > >>>> +			set_bit(MD_RECOVERY_NEEDED,&mddev->recovery);
> > >>>>   			md_wakeup_thread(mddev->thread);
> > >>>>   		}
> > >>>>   		err = -EBUSY;
> > >>>> @@ -5275,6 +5276,8 @@ static int md_set_readonly(struct mddev *mddev,
> > >>>> struct block_device *bdev)
> > >>>>   		mddev->ro = 1;
> > >>>>   		set_disk_ro(mddev->gendisk, 1);
> > >>>>   		clear_bit(MD_RECOVERY_FROZEN,&mddev->recovery);
> > >>>> +		set_bit(MD_RECOVERY_NEEDED,&mddev->recovery);
> > >>>> +		md_wakeup_thread(mddev->thread);
> > >>>>   		sysfs_notify_dirent_safe(mddev->sysfs_state);
> > >>>>   		err = 0;
> > >>>>   	}
> > >>>> @@ -5318,6 +5321,7 @@ static int do_md_stop(struct mddev *mddev, int
> > >>>> mode,
> > >>>>   		mutex_unlock(&mddev->open_mutex);
> > >>>>   		if (did_freeze) {
> > >>>>   			clear_bit(MD_RECOVERY_FROZEN,&mddev->recovery);
> > >>>> +			set_bit(MD_RECOVERY_NEEDED,&mddev->recovery);
> > >>>>   			md_wakeup_thread(mddev->thread);
> > >>>>   		}
> > >>>>   		return -EBUSY;
> > >>> Hi Neil,
> > >>>
> > >>> In my tests, the UDEV "mdadm -If" invocation fails *and* removes the
> > >>> pulled disk from the MD array.  This is okay for our intentions, but I
> > >>> wanted to make sure that it's okay to skip any failed-but-not-removed
> > >>> state.
> > >>>
> > >>> Tested-by: Joe Lawrence <joe.lawrence@xxxxxxxxxxx>
> > >>>
> > >>> and should this have a
> > >>>
> > >>> Fixes: 30b8feb730f9 ("md/raid5: avoid deadlock when raid5 array has
> > >>> unack badblocks during md_stop_writes")
> > >>>
> > >>> tag to mark for stable?
> > >>
> > >> Hi Neil,
> > >>
> > >> Would you like me to write up a proper patch, or is this one in the
> > >> queue?
> > >>
> > > Several times over the last week I've thought that I should probably
> > > push that patch along ... but each time something else seemed more
> > > interesting.  But it's a new week now.  I've just posted a pull request.
> > >
> > > Thanks for the prompt (and the report and testing of course).
> > >
> > > NeilBrown
> > Hi Neil and Joe
> > 
> >      Any update on this? I tried this with 3.18.2 and the problem still
> > exists.
> > 
> >     When it tries to remove the failed disk, it finds that the Blocked
> > flag in rdev->flags is set, so it can't remove the disk. Is this the
> > right reason?
> 
> Hi Xiao,
> 
> It's been a while since I've looked at this patch, but it looks like it
> made it into 3.18, so it should be present on 3.18.2.
> 
> What version of mdadm are you running?
> 
> Does writing an "idle" sync_action clear this condition?
> 
> Regards,
> 
> -- Joe
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

Hi Joe

   Thanks for the reminder; I hadn't tried that. The disk can now be removed
successfully after writing "idle" to sync_action.

   I had wrongly assumed that the patch referenced in this mail already fixed the problem.

Best Regards
Xiao



