bug/race in md causing device to wedge in busy state

I'm seeing cases where an attempt to remove a manually faulted disk from an existing RAID array can fail with mdadm reporting "Device or resource busy". I've reduced the problem down to the smallest set of steps that reliably reproduces it:

Starting with 2 drives (a, b), each with at least 3 partitions (a rough command-level sketch of these steps follows the list):
1) create 3 raid1 md's on the drives using the 3 partitions
2) fault & remove drive b from each of the 3 md's
3) zero the superblock on b so it forgets where it came from (or use a third drive c...) and add drive b back to each of the 3 md's
4) fault & remove drive b from each of the 3 md's
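
For reference, those four steps boil down to roughly the following mdadm commands. This is only a sketch: /dev/sda and /dev/sdb stand in for drives a and b, and the partition numbers (1, 3, 4) and md numbers (0, 1, 2) are the ones the script at the end of this mail uses.

  # step 1: create the three raid1 md's
  mdadm --create /dev/md0 --force --run --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1 --assume-clean
  mdadm --create /dev/md1 --force --run --level=1 --raid-devices=2 /dev/sda3 /dev/sdb3 --assume-clean
  mdadm --create /dev/md2 --force --run --level=1 --raid-devices=2 /dev/sda4 /dev/sdb4 --assume-clean

  # step 2: fault & remove drive b from each md
  mdadm --manage /dev/md0 --fail /dev/sdb1 ; mdadm --manage /dev/md0 --remove /dev/sdb1
  mdadm --manage /dev/md1 --fail /dev/sdb3 ; mdadm --manage /dev/md1 --remove /dev/sdb3
  mdadm --manage /dev/md2 --fail /dev/sdb4 ; mdadm --manage /dev/md2 --remove /dev/sdb4

  # step 3: forget the old membership and add b back (this kicks off recovery)
  mdadm --zero-superblock /dev/sdb1 ; mdadm --manage /dev/md0 --add /dev/sdb1
  mdadm --zero-superblock /dev/sdb3 ; mdadm --manage /dev/md1 --add /dev/sdb3
  mdadm --zero-superblock /dev/sdb4 ; mdadm --manage /dev/md2 --add /dev/sdb4

  # step 4: fault & remove drive b again -- one of these removes fails
  # with "Device or resource busy" (my script also sleeps a couple of
  # seconds between each fail and remove here)
  mdadm --manage /dev/md0 --fail /dev/sdb1 ; mdadm --manage /dev/md0 --remove /dev/sdb1
  mdadm --manage /dev/md1 --fail /dev/sdb3 ; mdadm --manage /dev/md1 --remove /dev/sdb3
  mdadm --manage /dev/md2 --fail /dev/sdb4 ; mdadm --manage /dev/md2 --remove /dev/sdb4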

The problem was originally seen sporadically during the remove part of step 2, but it is *very* reproducible in the remove part of step 4. I attribute this to the fact that there is guaranteed to be I/O in flight during that step (the recovery kicked off by the re-add in step 3).

Now here's the catch.  If I change step 4 to:
4a) fault drive b from each of the 3 md's
4b) remove drive b from each of the 3 md's
then the removes have not yet been seen to fail with BUSY (i.e. no issues so far).

But my scripts currently do this instead for each md:
4a) fault drive b from md
4b) sleep 0-10 seconds
4c) remove drive b from md
and the remove is almost guaranteed to fail with BUSY on one of the md's. It seems odd that no amount of sleeping between the fail and the remove lets me reliably remove a faulted member of an array. (The working ordering is spelled out as commands just below.)
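
Spelled out with the same placeholder names as the sketch above, the ordering that has worked for me so far is:

  # fault b in all three md's first...
  mdadm --manage /dev/md0 --fail /dev/sdb1
  mdadm --manage /dev/md1 --fail /dev/sdb3
  mdadm --manage /dev/md2 --fail /dev/sdb4
  # ...and only then remove it from all three
  mdadm --manage /dev/md0 --remove /dev/sdb1
  mdadm --manage /dev/md1 --remove /dev/sdb3
  mdadm --manage /dev/md2 --remove /dev/sdb4

whereas the per-md fail / sleep / remove ordering is the one that hits BUSY.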

Some other info:
- I haven't yet seen the problem when I use fewer than 3 partitions.
- I've tried different create options, to no avail.
- I'm running a Fedora 2.6.31.6 kernel.
- I've hooked the machine up with kgdb and can poke around further if that would help.

I will use "fault all, then remove all" as my workaround if it continues to prove reliable, but I thought it would be a good idea to report this as a bug.

To that end, below are 3 things:

1) the simplest possible run of a script that reliably triggers the bug
2) the system logs from this run
3) the script used to reproduce the bug


1) console output from running the simplest use case of the script (note md110 is not part of this exercise):

mxmx# ./sim_md_busy.sh -dg1 /dev/sdg -dg2 /dev/sdp -df /dev/sdp -dr /dev/sdp -create -m bug3part
mdadm --create /dev/md0  --force --run --level=1 --raid-devices=2 /dev/sdg1  /dev/sdp1 --assume-clean
mdadm: /dev/sdg1 appears to be part of a raid array:
    level=raid1 devices=2 ctime=Wed Dec 16 19:43:50 2009
mdadm: /dev/sdp1 appears to be part of a raid array:
    level=raid1 devices=2 ctime=Wed Dec 16 19:43:50 2009
mdadm: array /dev/md0 started.
Succeeded after 0 retries
mdadm --create /dev/md1  --force --run --level=1 --raid-devices=2 /dev/sdg3  /dev/sdp3 --assume-clean
mdadm: /dev/sdg3 appears to be part of a raid array:
    level=raid1 devices=2 ctime=Wed Dec 16 19:43:50 2009
mdadm: /dev/sdp3 appears to be part of a raid array:
    level=raid1 devices=2 ctime=Wed Dec 16 19:43:50 2009
mdadm: array /dev/md1 started.
Succeeded after 0 retries
mdadm --create /dev/md2  --force --run --level=1 --raid-devices=2 /dev/sdg4  /dev/sdp4 --assume-clean
mdadm: /dev/sdg4 appears to be part of a raid array:
    level=raid1 devices=2 ctime=Wed Dec 16 19:43:50 2009
mdadm: /dev/sdp4 appears to be part of a raid array:
    level=raid1 devices=2 ctime=Wed Dec 16 19:43:50 2009
mdadm: array /dev/md2 started.
Succeeded after 0 retries
mdadm --verbose  --manage /dev/md0 --fail   /dev/sdp1
mdadm: set /dev/sdp1 faulty in /dev/md0
Succeeded after 0 retries
mdadm --verbose  --manage /dev/md0 --remove   /dev/sdp1
mdadm: hot removed /dev/sdp1
Succeeded after 0 retries
mdadm --verbose  --manage /dev/md1 --fail   /dev/sdp3
mdadm: set /dev/sdp3 faulty in /dev/md1
Succeeded after 0 retries
mdadm --verbose  --manage /dev/md1 --remove   /dev/sdp3
mdadm: hot removed /dev/sdp3
Succeeded after 0 retries
mdadm --verbose  --manage /dev/md2 --fail   /dev/sdp4
mdadm: set /dev/sdp4 faulty in /dev/md2
Succeeded after 0 retries
mdadm --verbose  --manage /dev/md2 --remove   /dev/sdp4
mdadm: hot removed /dev/sdp4
Succeeded after 0 retries
mdadm --verbose  --manage /dev/md0 --add   /dev/sdp1
mdadm: added /dev/sdp1
Succeeded after 0 retries
mdadm --verbose  --manage /dev/md1 --add   /dev/sdp3
mdadm: added /dev/sdp3
Succeeded after 0 retries
mdadm --verbose  --manage /dev/md2 --add   /dev/sdp4
mdadm: added /dev/sdp4
Succeeded after 0 retries
Personalities : [raid1] [raid10] [raid0] [linear]
md2 : active raid1 sdp4[2] sdg4[0]
      4200896 blocks [2/1] [U_]
        resync=DELAYED

md1 : active raid1 sdp3[2] sdg3[0]
      4200896 blocks [2/1] [U_]
        resync=DELAYED

md0 : active raid1 sdp1[2] sdg1[0]
      4200896 blocks [2/1] [U_]
      [>....................]  recovery =  1.5% (65472/4200896) finish=1.0min speed=65472K/sec

md110 : active raid1 sda6[0] sdn6[1]
      1060248 blocks super 1.0 [2/2] [UU]
      bitmap: 1/1 pages [4KB], 1024KB chunk

unused devices: <none>
mdadm --verbose  --manage /dev/md0 --fail   /dev/sdp1
mdadm: set /dev/sdp1 faulty in /dev/md0
Succeeded after 0 retries
mdadm --verbose  --manage /dev/md0 --remove   /dev/sdp1
mdadm: hot removed /dev/sdp1
Succeeded after 0 retries
mdadm --verbose  --manage /dev/md1 --fail   /dev/sdp3
mdadm: set /dev/sdp3 faulty in /dev/md1
Succeeded after 0 retries
mdadm --verbose  --manage /dev/md1 --remove   /dev/sdp3
mdadm: hot remove failed for /dev/sdp3: Device or resource busy
mdadm: hot remove failed for /dev/sdp3: Device or resource busy
mdadm: hot remove failed for /dev/sdp3: Device or resource busy
ERROR: Command failed after 3 retries (mdadm --verbose  --manage /dev/md1 --remove   /dev/sdp3)
mdadm --verbose  --manage /dev/md2 --fail   /dev/sdp4
mdadm: set /dev/sdp4 faulty in /dev/md2
Succeeded after 0 retries
mdadm --verbose  --manage /dev/md2 --remove   /dev/sdp4
mdadm: hot removed /dev/sdp4



2) the logs from that run:

Dec 16 19:44:34 mxmx user.info kernel: md: bind<sdg1>
Dec 16 19:44:34 mxmx user.info kernel: md: bind<sdp1>
Dec 16 19:44:34 mxmx user.info kernel: raid1: raid set md0 active with 2 out of 2 mirrors
Dec 16 19:44:34 mxmx user.info kernel: md0: detected capacity change from 0 to 4301717504
Dec 16 19:44:34 mxmx user.info kernel: md: bind<sdg3>
Dec 16 19:44:34 mxmx user.info kernel: md: bind<sdp3>
Dec 16 19:44:34 mxmx user.info kernel: raid1: raid set md1 active with 2 out of 2 mirrors
Dec 16 19:44:34 mxmx user.info kernel: md1: detected capacity change from 0 to 4301717504
Dec 16 19:44:34 mxmx user.info kernel: md: bind<sdg4>
Dec 16 19:44:34 mxmx user.info kernel: md: bind<sdp4>
Dec 16 19:44:34 mxmx user.info kernel: raid1: raid set md2 active with 2 out of 2 mirrors
Dec 16 19:44:34 mxmx user.info kernel: md2: detected capacity change from 0 to 4301717504
Dec 16 19:44:35 mxmx user.info kernel:  md0:
Dec 16 19:44:35 mxmx user.info kernel:  unknown partition table
Dec 16 19:44:36 mxmx user.alert kernel: raid1: Disk failure on sdp1, disabling device.
Dec 16 19:44:36 mxmx user.alert kernel: raid1: Operation continuing on 1 devices.
Dec 16 19:44:36 mxmx user.warn kernel: RAID1 conf printout:
Dec 16 19:44:36 mxmx user.warn kernel:  --- wd:1 rd:2
Dec 16 19:44:36 mxmx user.warn kernel:  disk 0, wo:0, o:1, dev:sdg1
Dec 16 19:44:36 mxmx user.warn kernel:  disk 1, wo:1, o:0, dev:sdp1
Dec 16 19:44:36 mxmx user.warn kernel: RAID1 conf printout:
Dec 16 19:44:36 mxmx user.warn kernel:  --- wd:1 rd:2
Dec 16 19:44:36 mxmx user.warn kernel:  disk 0, wo:0, o:1, dev:sdg1
Dec 16 19:44:38 mxmx user.info kernel: md: unbind<sdp1>
Dec 16 19:44:38 mxmx user.info kernel: md: export_rdev(sdp1)
Dec 16 19:44:38 mxmx user.info kernel:  md1:
Dec 16 19:44:38 mxmx user.info kernel:  unknown partition table
Dec 16 19:44:38 mxmx user.alert kernel: raid1: Disk failure on sdp3, disabling device.
Dec 16 19:44:38 mxmx user.alert kernel: raid1: Operation continuing on 1 devices.
Dec 16 19:44:38 mxmx user.warn kernel: RAID1 conf printout:
Dec 16 19:44:38 mxmx user.warn kernel:  --- wd:1 rd:2
Dec 16 19:44:38 mxmx user.warn kernel:  disk 0, wo:0, o:1, dev:sdg3
Dec 16 19:44:38 mxmx user.warn kernel:  disk 1, wo:1, o:0, dev:sdp3
Dec 16 19:44:38 mxmx user.warn kernel: RAID1 conf printout:
Dec 16 19:44:38 mxmx user.warn kernel:  --- wd:1 rd:2
Dec 16 19:44:38 mxmx user.warn kernel:  disk 0, wo:0, o:1, dev:sdg3
Dec 16 19:44:40 mxmx user.info kernel: md: unbind<sdp3>
Dec 16 19:44:40 mxmx user.info kernel: md: export_rdev(sdp3)
Dec 16 19:44:40 mxmx user.info kernel:  md2:
Dec 16 19:44:40 mxmx user.info kernel:  unknown partition table
Dec 16 19:44:40 mxmx user.alert kernel: raid1: Disk failure on sdp4, disabling device.
Dec 16 19:44:40 mxmx user.alert kernel: raid1: Operation continuing on 1 devices.
Dec 16 19:44:40 mxmx user.warn kernel: RAID1 conf printout:
Dec 16 19:44:40 mxmx user.warn kernel:  --- wd:1 rd:2
Dec 16 19:44:40 mxmx user.warn kernel:  disk 0, wo:0, o:1, dev:sdg4
Dec 16 19:44:40 mxmx user.warn kernel:  disk 1, wo:1, o:0, dev:sdp4
Dec 16 19:44:40 mxmx user.warn kernel: RAID1 conf printout:
Dec 16 19:44:40 mxmx user.warn kernel:  --- wd:1 rd:2
Dec 16 19:44:40 mxmx user.warn kernel:  disk 0, wo:0, o:1, dev:sdg4
Dec 16 19:44:42 mxmx user.info kernel: md: unbind<sdp4>
Dec 16 19:44:42 mxmx user.info kernel: md: export_rdev(sdp4)
Dec 16 19:44:42 mxmx user.info kernel: md: bind<sdp1>
Dec 16 19:44:42 mxmx user.warn kernel: RAID1 conf printout:
Dec 16 19:44:42 mxmx user.warn kernel:  --- wd:1 rd:2
Dec 16 19:44:42 mxmx user.warn kernel:  disk 0, wo:0, o:1, dev:sdg1
Dec 16 19:44:42 mxmx user.warn kernel:  disk 1, wo:1, o:1, dev:sdp1
Dec 16 19:44:42 mxmx user.info kernel: md: recovery of RAID array md0
Dec 16 19:44:42 mxmx user.info kernel: md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Dec 16 19:44:42 mxmx user.info kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Dec 16 19:44:42 mxmx user.info kernel: md: using 128k window, over a total of 4200896 blocks.
Dec 16 19:44:42 mxmx user.info kernel: md: bind<sdp3>
Dec 16 19:44:42 mxmx user.warn kernel: RAID1 conf printout:
Dec 16 19:44:42 mxmx user.warn kernel:  --- wd:1 rd:2
Dec 16 19:44:42 mxmx user.warn kernel:  disk 0, wo:0, o:1, dev:sdg3
Dec 16 19:44:42 mxmx user.warn kernel:  disk 1, wo:1, o:1, dev:sdp3
Dec 16 19:44:42 mxmx user.info kernel: md: delaying recovery of md1 until md0 has finished (they share one or more physical units)
Dec 16 19:44:42 mxmx user.info kernel: md: bind<sdp4>
Dec 16 19:44:42 mxmx user.warn kernel: RAID1 conf printout:
Dec 16 19:44:42 mxmx user.warn kernel:  --- wd:1 rd:2
Dec 16 19:44:42 mxmx user.warn kernel:  disk 0, wo:0, o:1, dev:sdg4
Dec 16 19:44:42 mxmx user.warn kernel:  disk 1, wo:1, o:1, dev:sdp4
Dec 16 19:44:42 mxmx user.info kernel: md: delaying recovery of md2 until md0 has finished (they share one or more physical units)
Dec 16 19:44:42 mxmx user.info kernel: md: delaying recovery of md1 until md2 has finished (they share one or more physical units)
Dec 16 19:44:43 mxmx user.alert kernel: raid1: Disk failure on sdp1, disabling device.
Dec 16 19:44:43 mxmx user.alert kernel: raid1: Operation continuing on 1 devices.
Dec 16 19:44:43 mxmx user.info kernel: md: md0: recovery done.
Dec 16 19:44:44 mxmx user.info kernel: md: delaying recovery of md1 until md2 has finished (they share one or more physical units)
Dec 16 19:44:44 mxmx user.info kernel: md: recovery of RAID array md2
Dec 16 19:44:44 mxmx user.info kernel: md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Dec 16 19:44:44 mxmx user.info kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Dec 16 19:44:44 mxmx user.info kernel: md: using 128k window, over a total of 4200896 blocks.
Dec 16 19:44:44 mxmx user.warn kernel: RAID1 conf printout:
Dec 16 19:44:44 mxmx user.warn kernel:  --- wd:1 rd:2
Dec 16 19:44:44 mxmx user.warn kernel:  disk 0, wo:0, o:1, dev:sdg1
Dec 16 19:44:44 mxmx user.warn kernel:  disk 1, wo:1, o:0, dev:sdp1
Dec 16 19:44:44 mxmx user.warn kernel: RAID1 conf printout:
Dec 16 19:44:44 mxmx user.warn kernel:  --- wd:1 rd:2
Dec 16 19:44:44 mxmx user.warn kernel:  disk 0, wo:0, o:1, dev:sdg1
Dec 16 19:44:45 mxmx user.info kernel: md: unbind<sdp1>
Dec 16 19:44:45 mxmx user.info kernel: md: export_rdev(sdp1)
Dec 16 19:44:45 mxmx user.alert kernel: raid1: Disk failure on sdp3, disabling device.
Dec 16 19:44:45 mxmx user.alert kernel: raid1: Operation continuing on 1 devices.
Dec 16 19:44:48 mxmx user.warn kernel: md: cannot remove active disk sdp3 from md1 ...
Dec 16 19:44:49 mxmx user.warn kernel: md: cannot remove active disk sdp3 from md1 ...
Dec 16 19:44:50 mxmx user.warn kernel: md: cannot remove active disk sdp3 from md1 ...
Dec 16 19:44:51 mxmx user.alert kernel: raid1: Disk failure on sdp4, disabling device.
Dec 16 19:44:51 mxmx user.alert kernel: raid1: Operation continuing on 1 devices.
Dec 16 19:44:51 mxmx user.info kernel: md: md2: recovery done.
Dec 16 19:44:51 mxmx user.info kernel: md: recovery of RAID array md1
Dec 16 19:44:51 mxmx user.info kernel: md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Dec 16 19:44:51 mxmx user.info kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Dec 16 19:44:51 mxmx user.warn kernel: RAID1 conf printout:
Dec 16 19:44:51 mxmx user.info kernel: md: using 128k window, over a total of 4200896 blocks.
Dec 16 19:44:51 mxmx user.info kernel: md: resuming recovery of md1 from checkpoint.
Dec 16 19:44:51 mxmx user.info kernel: md: md1: recovery done.
Dec 16 19:44:51 mxmx user.warn kernel:  --- wd:1 rd:2
Dec 16 19:44:51 mxmx user.warn kernel:  disk 0, wo:0, o:1, dev:sdg4
Dec 16 19:44:51 mxmx user.warn kernel: RAID1 conf printout:
Dec 16 19:44:51 mxmx user.warn kernel:  --- wd:1 rd:2
Dec 16 19:44:51 mxmx user.warn kernel:  disk 0, wo:0, o:1, dev:sdg3
Dec 16 19:44:51 mxmx user.warn kernel:  disk 1, wo:1, o:0, dev:sdp3
Dec 16 19:44:51 mxmx user.warn kernel:  disk 1, wo:1, o:0, dev:sdp4
Dec 16 19:44:51 mxmx user.warn kernel: RAID1 conf printout:
Dec 16 19:44:51 mxmx user.warn kernel:  --- wd:1 rd:2
Dec 16 19:44:51 mxmx user.warn kernel: RAID1 conf printout:
Dec 16 19:44:51 mxmx user.warn kernel:  --- wd:1 rd:2
Dec 16 19:44:51 mxmx user.warn kernel:  disk 0, wo:0, o:1, dev:sdg3
Dec 16 19:44:51 mxmx user.warn kernel:  disk 0, wo:0, o:1, dev:sdg4
Dec 16 19:44:53 mxmx user.info kernel: md: unbind<sdp4>
Dec 16 19:44:53 mxmx user.info kernel: md: export_rdev(sdp4)



3) the script. Note that if you run it, adjust the partition (and, if needed, md) numbers to match your system; actual drive names are passed in as arguments (see the usage text and the example invocation above). The modes that fail are "bug" and "bug3part"; the mode that reliably works with any number of partitions is "fault_all_first".
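
As an example of the working mode, the same invocation as the failing run above, with just the mode swapped, would be (illustrative only; drive names are whatever you pass in):

./sim_md_busy.sh -dg1 /dev/sdg -dg2 /dev/sdp -df /dev/sdp -dr /dev/sdp -create -m fault_all_first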

#!/bin/sh

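# do_md <md#> <op> <disk> [<disk2>]
#   Run one mdadm operation against /dev/md<md#>, using the partition of
#   <disk> mapped to that md number below.  "create" and "stop" are run
#   once; the manage ops (fail/remove/add) get up to 3 attempts, 1s apart.
#   "add" zeroes the superblock on the partition before adding it.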
do_md()
{
    md=$1
    shift
    op=$1
    shift
    opdisk=$1
    shift
    opdisk2=$1

    # customize the 'pp' partitions used for each of the 'md' devices here:
    case $md in
        0) pp=1 ;;
        1) pp=3 ;;
        2) pp=4 ;;
        3) pp=5 ;;
        *)
            echo "unsupported md dev: $md"
            return
            ;;
    esac

    retry="y"
    case $op in
        "stop")
	    retry="n"
            cmd="mdadm --verbose  --stop /dev/md$md"
            ;;
        "create")
	    retry="n"
            cmd="mdadm --create /dev/md$md  --force --run --level=1 \
--raid-devices=2 ${opdisk}$pp  ${opdisk2}$pp --assume-clean "
            ;;
        *)
            if [ "$op" = "add" ]; then
                mdadm --zero-superblock $opdisk$pp
            fi
            cmd="mdadm --verbose  --manage /dev/md$md --$op   ${opdisk}$pp"
            ;;
    esac
    echo "$cmd"
    retries=3
    i=$retries
    while [ $i -gt 0 ]; do
        $cmd
	cmdstatus=$?
	if [ $cmdstatus -eq 0 ]; then
	    break
	fi
	[ "$retry" != "y" ] && break
        i=$((i-1))
	sleep 1
    done
    if [ $cmdstatus -eq 0 ]; then
        echo "Succeeded after $((retries-i)) retries"
    else
        echo "ERROR: Command failed after $retries retries ($cmd)"
    fi
}

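# do_md_all <op> <disk> [<disk2>]: apply one operation to every md in $mdlist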
do_md_all()
{
    op=$1
    shift
    opdisk=$1
    shift
    opdisk2=$1

    for md in $mdlist ; do
	do_md $md $op $opdisk $opdisk2
    done
}

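# do_mode_bug: the ordering that hits the BUSY failure -- per-md
# fail/sleep/remove, then add the replacement disk back to every md
# (starting recovery), then per-md fail/sleep/remove again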
do_mode_bug()
{
    for md in $mdlist ; do
	do_md $md fail $fdisk
	sleep 2
	do_md $md remove $fdisk
    done

    for md in $mdlist ; do
	do_md $md add $rdisk
    done

    sleep 1
    cat /proc/mdstat

    for md in $mdlist ; do
	do_md $md fail $rdisk
	sleep 2
	do_md $md remove $rdisk
    done
}

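# do_mode_fault_all_first: the workaround ordering -- fail the disk in all
# md's first, then remove it from all of them; so far this has not failed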
do_mode_fault_all_first()
{
    do_md_all "fail" $fdisk
    do_md_all "remove" $fdisk

    # regen to spare:

    do_md_all "add" $rdisk

    # failover spare:

    sleep 1
    cat /proc/mdstat

    do_md_all "fail" $rdisk
#    sleep 1
    cat /proc/mdstat
#    sleep 1
    do_md_all "remove" $rdisk
}

usage()
{
    cat <<EOF
Usage: $0 [OPTIONS]
-m <arg>  : run in mode <arg>; supported are:
            "bug", "fault_all_first", "bug1part",
            "bug2part", "bug3part"
-stop     : first stop any running arrays previously setup by
            this program on the given disks
-create   : create needed running md's on the given disks (otherwise
            we assume they're already running)
-dg1 <arg> : use <arg> as the 1st "good" disk
-dg2 <arg> : use <arg> as the 2nd "good" disk
-df <arg>  : use <arg> as the "failed" disk
-dr <arg>  : use <arg> as the "replacement" disk
EOF
}

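# md numbers to operate on by default; the bug1part/bug2part/bug3part modes narrow this list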
mdlist="0 1 2 3"

while [ $# -gt 0 ]; do
    case "$1" in
	-m)
	    case "$2" in
		bug1part)
		    mdlist="0"
		    mode=$2 ;;
		bug2part)
		    mdlist="0 1"
		    mode=$2 ;;
		bug3part)
		    mdlist="0 1 2"
		    mode=$2 ;;
		bug|\
		fault_all_first)
		    mode=$2 ;;
		*)
		    usage
		    exit 1 ;;
	    esac
            shift
            ;;
	-dg1)
            gdisk1="$2"
            shift
            ;;
	-dg2)
            gdisk2="$2"
            shift
            ;;
	-df)
            fdisk="$2"
            shift
            ;;
	-dr)
            rdisk="$2"
            shift
            ;;
	-h|--help|-help)
            usage
            exit 0
            ;;
	-stop)
            v_stop=1
            ;;
	-create)
            v_create=1
            ;;
	*)
            usage
	    echo "Unknown argument: $1"
            exit 1
            ;;
    esac
    shift;
done


[ -n "$gdisk1" -a -n "$gdisk2" -a -n "$fdisk" -a -n "$rdisk" ] || {
    usage
    echo "Need all 4 disks supplied (good[12]/failed/replacement)"
    exit 1
}

if [ -n "$v_stop" ]; then
    for md in $mdlist ; do
	do_md $md "stop"
    done
fi

sleep 1

if [ -n "$v_create" ]; then
    for md in $mdlist ; do
	do_md $md "create" $gdisk1 $gdisk2
    done
fi

sleep 1

if [ -z "$mode" ]; then
    echo "No mode provided; exiting"
    exit 0
fi

case $mode in
    bug*) do_mode_bug ;;
    *) do_mode_$mode ;;
esac



Thanks,
Brett