----- Original Message -----
> From: "Joe Lawrence" <joe.lawrence@xxxxxxxxxxx>
> To: "XiaoNi" <xni@xxxxxxxxxx>
> Cc: "NeilBrown" <neilb@xxxxxxx>, linux-raid@xxxxxxxxxxxxxxx, "Bill Kuzeja" <william.kuzeja@xxxxxxxxxxx>
> Sent: Thursday, January 15, 2015 9:22:10 PM
> Subject: Re: RAID1 removing failed disk returns EBUSY
>
> On Wed, 14 Jan 2015 20:41:16 +0800
> XiaoNi <xni@xxxxxxxxxx> wrote:
>
> > On 11/17/2014 07:03 AM, NeilBrown wrote:
> > > On Thu, 13 Nov 2014 09:05:49 -0500 Joe Lawrence <joe.lawrence@xxxxxxxxxxx>
> > > wrote:
> > >
> > >> On Wed, 29 Oct 2014 13:36:04 -0400
> > >> Joe Lawrence <joe.lawrence@xxxxxxxxxxx> wrote:
> > >>
> > >>> On Wed, 29 Oct 2014 08:41:13 +1100
> > >>> NeilBrown <neilb@xxxxxxx> wrote:
> > >>>
> > >>>> On Mon, 27 Oct 2014 16:27:48 -0400 Joe Lawrence <joe.lawrence@xxxxxxxxxxx>
> > >>>> wrote:
> > >>>>
> > >>>>> Hi Neil,
> > >>>>>
> > >>>>> We've encountered changes in MD and mdadm that have broken our automated
> > >>>>> disk removal script.  In the past, we've been able to run the following
> > >>>>> after a RAID1 disk component removal:
> > >>>>>
> > >>>>>   % echo fail > /sys/block/md3/md/dev-sdr5/state
> > >>>>>   % echo remove > /sys/block/md3/md/dev-sdr5/state
> > >>>>>
> > >>>>> However, the latest RHEL6.6 code drop has rebased to sufficiently recent
> > >>>>> MD kernel and mdadm changes, in which the previous commands occasionally
> > >>>>> fail like so:
> > >>>>>
> > >>>>>   * MD array is usually resyncing or checking
> > >>>>>   * Component disk /dev/sdr removed via HBA sysfs PCI removal
> > >>>>>   * Following UDEV rule fires:
> > >>>>>
> > >>>>>     SUBSYSTEM=="block", ACTION=="remove", ENV{ID_PATH}=="?*", \
> > >>>>>       RUN+="/sbin/mdadm -If $name --path $env{ID_PATH}"
> > >>>>>
> > >>>>>   % mdadm --detail /dev/md3
> > >>>>>   /dev/md3:
> > >>>>>           Version : 1.1
> > >>>>>     Creation Time : Tue Oct 14 17:31:59 2014
> > >>>>>        Raid Level : raid1
> > >>>>>        Array Size : 25149440 (23.98 GiB 25.75 GB)
> > >>>>>     Used Dev Size : 25149440 (23.98 GiB 25.75 GB)
> > >>>>>      Raid Devices : 2
> > >>>>>     Total Devices : 2
> > >>>>>       Persistence : Superblock is persistent
> > >>>>>
> > >>>>>     Intent Bitmap : Internal
> > >>>>>
> > >>>>>       Update Time : Wed Oct 15 14:22:34 2014
> > >>>>>             State : active, degraded
> > >>>>>    Active Devices : 1
> > >>>>>   Working Devices : 1
> > >>>>>    Failed Devices : 1
> > >>>>>     Spare Devices : 0
> > >>>>>
> > >>>>>              Name : localhost.localdomain:3
> > >>>>>              UUID : 40ed68ee:ba41d4cd:28c361ed:be7470b8
> > >>>>>            Events : 142
> > >>>>>
> > >>>>>     Number   Major   Minor   RaidDevice State
> > >>>>>        0      65       21        0      faulty
> > >>>>>        1      65        5        1      active sync   /dev/sdj5
> > >>>>>
> > >>>>> All attempts to remove this device fail:
> > >>>>>
> > >>>>>   % echo remove > /sys/block/md3/md/dev-sdr5/state
> > >>>>>   -bash: echo: write error: Device or resource busy
> > >>>>>
> > >>>>> This can be traced to state_store():
> > >>>>>
> > >>>>>   } else if (cmd_match(buf, "remove")) {
> > >>>>>           if (rdev->raid_disk >= 0)
> > >>>>>                   err = -EBUSY;
> > >>>>>
> > >>>>> After much debugging and systemtapping, I think I've figured out that
> > >>>>> the sysfs scripting may fail after the following combination of changes:
> > >>>>>
> > >>>>>   mdadm  8af530b07fce "Enhance incremental removal."
> > >>>>>   kernel 30b8feb730f9 "md/raid5: avoid deadlock when raid5 array has
> > >>>>>     unack badblocks during md_stop_writes"
> > >>>>>
> > >>>>> With these two changes:
> > >>>>>
> > >>>>> 1 - On the user side, mdadm is trying to set the array_state to read-auto
> > >>>>>     on incremental removal (as invoked by the UDEV rule).
> > >>>>>
> > >>>>> 2 - Kernel side, md_set_readonly() will set the MD_RECOVERY_FROZEN flag,
> > >>>>>     wake up the mddev->thread and, if there is a sync_thread, it will set
> > >>>>>     MD_RECOVERY_INTR and then wait until the sync_thread is set to NULL.
> > >>>>>
> > >>>>>     When md_check_recovery() gets a chance to run as part of the
> > >>>>>     raid1d() mddev->thread, it may or may not ever get to an invocation
> > >>>>>     of remove_and_add_spares(), for there are *many* conditional early
> > >>>>>     exits along the way -- for example, if MD_RECOVERY_FROZEN is set,
> > >>>>>     the following condition will bounce out of the routine:
> > >>>>>
> > >>>>>       if (!test_and_clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
> > >>>>>           test_bit(MD_RECOVERY_FROZEN, &mddev->recovery))
> > >>>>>               goto unlock;
> > >>>>>
> > >>>>>     The next time around, MD_RECOVERY_NEEDED will have been cleared, so
> > >>>>>     all future tests will return 0 and the negation will always take the
> > >>>>>     early exit path.
> > >>>>>
> > >>>>>     Back in md_set_readonly(), it may notice that the MD is still in use,
> > >>>>>     so it clears MD_RECOVERY_FROZEN and then returns -EBUSY, without
> > >>>>>     setting mddev->ro.  But the damage has been done, as conditions have
> > >>>>>     been set such that md_check_recovery() will never call
> > >>>>>     remove_and_add_spares().
> > >>>>>
> > >>>>> This would also explain why an "idle" sync_action clears the wedge: it
> > >>>>> sets MD_RECOVERY_NEEDED, allowing md_check_recovery() to continue
> > >>>>> executing to remove_and_add_spares().
> > >>>>>
> > >>>>> As far as I can tell, this is what is happening to prevent the "remove"
> > >>>>> write to /sys/block/md3/md/dev-sdr5/state from succeeding.  There are
> > >>>>> certainly a lot of little bit-states between disk removal, UDEV mdadm,
> > >>>>> and various MD kernel threads, so apologies if I missed an important
> > >>>>> transition.
> > >>>>>
> > >>>>> Would you consider writing "idle" to the MD array sync_action file as a
> > >>>>> safe and reasonable intermediate workaround step for our script?
> > >>>>>
> > >>>>> And of course, any suggestions as to whether this is intended behavior
> > >>>>> (ie, the removed component disk is failed, but stuck in the array)?
> > >>>>>
> > >>>>> This is fairly easy for us to reproduce with multiple MD arrays per disk
> > >>>>> (one per partition) and interrupting a raid check on all of them
> > >>>>> (especially when they are delayed waiting for the first to finish) by
> > >>>>> removing the component disk via sysfs PCI removal.  We can provide
> > >>>>> additional debug or testing if required.
> > >>>>>
> > >>>> Hi Joe,
> > >>>> thanks for the detailed analysis!!
> > >>>>
> > >>>> I think the correct fix would be that MD_RECOVERY_NEEDED should be set
> > >>>> after clearing MD_RECOVERY_FROZEN, like the patch below.
> > >>>> Can you confirm that it works for you?
> > >>>>
> > >>>> Writing 'idle' should in general be safe, so that could be used as an
> > >>>> interim.
> > >>>>
> > >>>> Thanks,
> > >>>> NeilBrown
> > >>>>
> > >>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
> > >>>> index c03d87b6890a..2c73fcb82593 100644
> > >>>> --- a/drivers/md/md.c
> > >>>> +++ b/drivers/md/md.c
> > >>>> @@ -5261,6 +5261,7 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
> > >>>>  		printk("md: %s still in use.\n",mdname(mddev));
> > >>>>  		if (did_freeze) {
> > >>>>  			clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
> > >>>> +			set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> > >>>>  			md_wakeup_thread(mddev->thread);
> > >>>>  		}
> > >>>>  		err = -EBUSY;
> > >>>> @@ -5275,6 +5276,8 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
> > >>>>  		mddev->ro = 1;
> > >>>>  		set_disk_ro(mddev->gendisk, 1);
> > >>>>  		clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
> > >>>> +		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> > >>>> +		md_wakeup_thread(mddev->thread);
> > >>>>  		sysfs_notify_dirent_safe(mddev->sysfs_state);
> > >>>>  		err = 0;
> > >>>>  	}
> > >>>> @@ -5318,6 +5321,7 @@ static int do_md_stop(struct mddev *mddev, int mode,
> > >>>>  	mutex_unlock(&mddev->open_mutex);
> > >>>>  	if (did_freeze) {
> > >>>>  		clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
> > >>>> +		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> > >>>>  		md_wakeup_thread(mddev->thread);
> > >>>>  	}
> > >>>>  	return -EBUSY;
> > >>> Hi Neil,
> > >>>
> > >>> In my tests, the UDEV "mdadm -If" invocation fails *and* removes the
> > >>> pulled disk from the MD array.  This is okay for our intentions, but I
> > >>> wanted to make sure that it's okay to skip any failed-but-not-removed
> > >>> state.
> > >>>
> > >>> Tested-by: Joe Lawrence <joe.lawrence@xxxxxxxxxxx>
> > >>>
> > >>> and should this have a
> > >>>
> > >>>   Fixes: 30b8feb730f9 ("md/raid5: avoid deadlock when raid5 array has
> > >>>                        unack badblocks during md_stop_writes")
> > >>>
> > >>> tag to mark for stable?
> > >>
> > >> Hi Neil,
> > >>
> > >> Would you like me to write up a proper patch, or is this one in the queue?
> > >>
> > > Several times over the last week I've thought that I should probably push
> > > that patch along ... but each time something else seemed more interesting.
> > > But it's a new week now.  I've just posted a pull request.
> > >
> > > Thanks for the prompt (and the report and testing of course).
> > >
> > > NeilBrown
> >
> > Hi Neil and Joe
> >
> > Any update on this?  I tried this with 3.18.2 and the problem still
> > exists.
> >
> > When it tries to remove the failed disk, it finds the Blocked flag in
> > rdev->flags is set.  So it can't remove the disk.  Is this the right
> > reason?
>
> Hi Xiao,
>
> It's been a while since I've looked at this patch, but it looks like it
> made it into 3.18, so it should be present on 3.18.2.
>
> What version of mdadm are you running?
>
> Does writing an "idle" sync_action clear this condition?
>
> Regards,
>
> -- Joe

Hi Joe

Thanks for reminding me.  I didn't do that.  Now the disk can be removed
successfully after writing "idle" to sync_action.

I wrongly thought that the patch referenced in this mail had fixed the
problem.

Best Regards
Xiao
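
To summarize the workaround that resolved this thread: writing "idle" to the
array's sync_action sets MD_RECOVERY_NEEDED, which lets md_check_recovery()
reach remove_and_add_spares() so the stuck component can finally be removed.
A minimal sketch of the removal sequence, reusing the md3/dev-sdr5 names from
Joe's report above (substitute the actual array and component device; on
kernels carrying Neil's md_set_readonly()/do_md_stop() fix, the intermediate
"idle" write should no longer be necessary):

  % echo fail > /sys/block/md3/md/dev-sdr5/state     # mark the pulled component Faulty (if "mdadm -If" hasn't already)
  % echo idle > /sys/block/md3/md/sync_action        # sets MD_RECOVERY_NEEDED so md_check_recovery() can reach remove_and_add_spares()
  % echo remove > /sys/block/md3/md/dev-sdr5/state   # should now succeed instead of returning EBUSY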