> Can you share your exact test scripts?  I'm having a hard time
> reproducing this with something like:
>
> echo 100000 > /proc/sys/dev/raid/speed_limit_min
> mdadm --add /dev/md0 /dev/sd[bc]; dd if=urandom.dump of=/dev/md0 bs=1024M oflag=sync

Below is the script that simulates a two-drive failure in a RAID 6 array. For running I/O we used Dit32, but any data-verification tool will do, i.e. one that writes a known pattern, reads the data back and verifies it (see the P.S. below the script for a minimal sketch of such a verify loop).

#-------------------- Script which simulates a 2-drive failure in a RAID 6 array - input: md array name [ e.g. /dev/md0 ] --------------------
#!/bin/bash

if [ -e /volumes/RAIDCONF ]
then
    echo "Debug directory is present."
else
    echo "Debug directory is not present. Creating one."
    mkdir -p /volumes/RAIDCONF
fi

# md array to test; defaults to /dev/md0 if no argument is given
ld_name="${1:-/dev/md0}"

# Check if the LD exists
if [ -e $ld_name ]
then
    echo "LD $ld_name exists"
else
    echo "LD $ld_name does not exist"
    exit 1
fi

one=1
two=2
three=3
four=4
six=6    # partition suffix of the data partition on each member disk

echo "`date` : Initial State" >> /volumes/RAIDCONF/raid_conf_info.txt
echo "`date` : `mdadm -D $ld_name`" >> /volumes/RAIDCONF/raid_conf_info.txt

md_name=`basename $ld_name`

# Check if the raid is initializing
md_init=`cat /proc/mdstat | grep $md_name -A 3 | grep -o resync`
if [ -z "$md_init" ]
then
    echo "$md_name is online"
else
    echo "RAID is initializing. Speeding up the process"
    echo 100000 > /sys/devices/virtual/block/$md_name/md/sync_speed_max
    echo 100000 > /sys/devices/virtual/block/$md_name/md/sync_speed_min
    # Wait till initialization completes
    while [ 1 ]
    do
        init_still=`cat /proc/mdstat | grep $md_name -A 3 | grep -o '[0-9]*\.[0-9]*%'`
        if [ -z "$init_still" ]
        then
            echo "Initializing of $md_name completed"
            break
        else
            echo "Initializing completed so far: $init_still"
            sleep 5
        fi
    done
fi

# Reset sync speed back
echo 1000 > /sys/devices/virtual/block/$md_name/md/sync_speed_max
echo 1000 > /sys/devices/virtual/block/$md_name/md/sync_speed_min

COUNT=10
i=1

# Loop forever
while [ 1 ]
do
    echo "###########################################################"
    echo "Rolling Hot Spare Test two drive removal Running for Iteration : $i" >> Iterations.txt
    echo "Rolling Hot Spare Test two drive removal Running for Iteration : $i"
    echo "###########################################################"

    # Calculate the length of the array (zero-based index of the last active device)
    arrlen=`mdadm -D $ld_name | grep "Active Devices" | awk '{print $4}'`
    echo "Original length of array: $arrlen"
    let arrlen=$arrlen-1
    echo "Length of array: $arrlen"

    # PD list in an array
    pds=`cat /proc/mdstat | grep $md_name | grep -o 'sd[a-z]' | awk '{print "/dev/"$1}'`
    arr=($pds)

    # Random number generation (two distinct member indexes)
    ran1=`grep -m1 -ao '[0-'$arrlen']' /dev/urandom | head -1`
    echo "Random number 1: $ran1"
    ran2=`grep -m1 -ao '[0-'$arrlen']' /dev/urandom | head -1`
    while [ $ran1 -eq $ran2 ]
    do
        ran2=`grep -m1 -ao '[0-'$arrlen']' /dev/urandom | head -1`
    done
    echo "Random number 2: $ran2"

    ########## REMOVING TWO RANDOM DRIVES ##########
    echo "Random drive 1 to be removed: ${arr[$ran1]}"
    echo "Random drive 1 to be removed: ${arr[$ran1]}" >> Iterations.txt

    # Removing drive 1 randomly
    echo "`date` : Iteration : $COUNT" >> /volumes/RAIDCONF/raid_conf_info.txt
    # Find the scsi address of the current disk
    scsi_address1=`lsscsi | grep ${arr[$ran1]} | grep -o '[0-9]*:[0-9]*:[0-9]*:[0-9]*'`
    disk1=`basename ${arr[$ran1]}`
    echo "Disk name: $disk1"
    td_name=`echo "$disk1" | cut -c 3`

    # Removing data partition
    echo "Removing data partition.."
    mdadm --manage $ld_name --set-faulty ${arr[$ran1]}$six
    sleep 10
    faulty=`mdadm -D $ld_name | grep "faulty spare" | wc -l`
    echo " faulty = $faulty"
    if [ $faulty -ne 1 ]; then
        echo "Number of failed disks is more than expected"
        exit 1
    fi
    faulty=0
    isremoved=1
    while [[ $isremoved -ne 0 ]]
    do
        echo "in while"
        faulty=`mdadm -D $ld_name | grep "faulty spare" | grep ${arr[$ran1]}$six | wc -l`
        echo " in - faulty = $faulty"
        if [ $faulty -eq 1 ]; then
            mdadm --manage $ld_name --remove ${arr[$ran1]}$six
            sleep 3
        fi
        isremoved=$faulty
        echo "isremoved = $isremoved"
    done
    sleep 2
    echo "`date` : After removing the disk : Slot [ $slot1 ] Name [ ${arr[$ran1]} ] scsi_address [ $scsi_address1 ]" >> /volumes/RAIDCONF/raid_conf_info.txt
    echo "`date` : `mdadm -D $ld_name`" >> /volumes/RAIDCONF/raid_conf_info.txt

    echo "Random drive 2 to be removed: ${arr[$ran2]}"
    echo "Random drive 2 to be removed: ${arr[$ran2]}" >> Iterations.txt

    # Removing drive 2 randomly
    echo "`date` : Iteration : $COUNT" >> /volumes/RAIDCONF/raid_conf_info.txt
    # Find the scsi address of the current disk
    scsi_address2=`lsscsi | grep ${arr[$ran2]} | grep -o '[0-9]*:[0-9]*:[0-9]*:[0-9]*'`
    disk2=`basename ${arr[$ran2]}`
    echo "Disk name: $disk2"
    td_name=`echo "$disk2" | cut -c 3`

    # Removing data partition
    echo "Removing data partition.."
    mdadm --manage $ld_name --set-faulty ${arr[$ran2]}$six
    sleep 10
    faulty=`mdadm -D $ld_name | grep "faulty spare" | wc -l`
    echo " faulty = $faulty"
    if [ $faulty -ne 1 ]; then
        echo "Number of failed disks is more than expected"
        exit 1
    fi
    faulty=0
    isremoved=1
    while [[ $isremoved -ne 0 ]]
    do
        echo "in while"
        faulty=`mdadm -D $ld_name | grep "faulty spare" | grep ${arr[$ran2]}$six | wc -l`
        echo " in - faulty = $faulty"
        if [ $faulty -eq 1 ]; then
            mdadm --manage $ld_name --remove ${arr[$ran2]}$six
            sleep 3
        fi
        isremoved=$faulty
        echo "isremoved = $isremoved"
    done
    echo "`date` : After removing the disk : Slot [ $slot2 ] Name [ ${arr[$ran2]} ] scsi_address [ $scsi_address2 ]" >> /volumes/RAIDCONF/raid_conf_info.txt
    echo "`date` : `mdadm -D $ld_name`" >> /volumes/RAIDCONF/raid_conf_info.txt

    ########## ADDING BACK THE FIRST REMOVED DRIVE ##########
    # Add back the device that was removed first; the second one is re-added
    # only after this rebuild completes
    sleep 5
    mdadm $ld_name -a ${arr[$ran1]}$six
    # Wait for some time to get it added as a spare
    sleep 5
    echo "`date` : After disk added at Slot [ $slot1 ] scsi_address [ $scsi_address1 ]" >> /volumes/RAIDCONF/raid_conf_info.txt
    echo "`date` : `mdadm -D $ld_name`" >> /volumes/RAIDCONF/raid_conf_info.txt

    # Check if md starts rebuilding
    while [ 1 ]
    do
        md_recovery=`cat /proc/mdstat | grep $md_name -A 3 | grep -o recovery`
        if [ -z "$md_recovery" ]
        then
            echo "$md_name did not start rebuilding. Sleeping and checking again"
            sleep 5
        else
            break
        fi
    done
    sleep 5
    echo "`date` : After rebuild started" >> /volumes/RAIDCONF/raid_conf_info.txt
    echo "`date` : `mdadm -D $ld_name`" >> /volumes/RAIDCONF/raid_conf_info.txt

    echo "RAID is rebuilding. Speeding up the process"
    echo 100000 > /sys/devices/virtual/block/$md_name/md/sync_speed_max
    echo 100000 > /sys/devices/virtual/block/$md_name/md/sync_speed_min
    # Wait till rebuilding completes
    while [ 1 ]
    do
        rb_still=`cat /proc/mdstat | grep $md_name -A 3 | grep -o '[0-9]*\.[0-9]*%'`
        if [ -z "$rb_still" ]
        then
            echo "`date +%H:%M:%S` : Rebuild of $md_name completed"
            break
        else
            echo "Rebuild completed so far: $rb_still"
            sleep 5
        fi
    done
    sleep 5
    echo "`date` : After rebuild complete" >> /volumes/RAIDCONF/raid_conf_info.txt
    echo "`date` : `mdadm -D $ld_name`" >> /volumes/RAIDCONF/raid_conf_info.txt

    # Reset sync speed back
    echo 1000 > /sys/devices/virtual/block/$md_name/md/sync_speed_max
    echo 1000 > /sys/devices/virtual/block/$md_name/md/sync_speed_min

    echo "Wait after rebuilding disk 1"
    sleep 5

    ########## ADDING BACK THE SECOND REMOVED DRIVE ##########
    mdadm $ld_name -a ${arr[$ran2]}$six
    # Wait for some time to get it added as a spare
    sleep 5
    echo "`date` : After disk added at Slot [ $slot2 ] scsi_address [ $scsi_address2 ]" >> /volumes/RAIDCONF/raid_conf_info.txt
    echo "`date` : `mdadm -D $ld_name`" >> /volumes/RAIDCONF/raid_conf_info.txt

    # Check if md starts rebuilding
    while [ 1 ]
    do
        md_recovery=`cat /proc/mdstat | grep $md_name -A 3 | grep -o recovery`
        if [ -z "$md_recovery" ]
        then
            echo "$md_name did not start rebuilding. Sleeping and checking again"
            sleep 5
        else
            break
        fi
    done
    sleep 5
    echo "`date` : After rebuild started" >> /volumes/RAIDCONF/raid_conf_info.txt
    echo "`date` : `mdadm -D $ld_name`" >> /volumes/RAIDCONF/raid_conf_info.txt

    echo "RAID is rebuilding. Speeding up the process"
    echo 100000 > /sys/devices/virtual/block/$md_name/md/sync_speed_max
    echo 100000 > /sys/devices/virtual/block/$md_name/md/sync_speed_min
    # Wait till rebuilding completes
    while [ 1 ]
    do
        rb_still=`cat /proc/mdstat | grep $md_name -A 3 | grep -o '[0-9]*\.[0-9]*%'`
        if [ -z "$rb_still" ]
        then
            echo "`date +%H:%M:%S` : Rebuild of $md_name completed"
            break
        else
            echo "Rebuild completed so far: $rb_still"
            sleep 5
        fi
    done
    sleep 5
    echo "`date` : After rebuild complete" >> /volumes/RAIDCONF/raid_conf_info.txt
    echo "`date` : `mdadm -D $ld_name`" >> /volumes/RAIDCONF/raid_conf_info.txt

    # Reset sync speed back
    echo 1000 > /sys/devices/virtual/block/$md_name/md/sync_speed_max
    echo 1000 > /sys/devices/virtual/block/$md_name/md/sync_speed_min

    echo "=============================================================" >> /volumes/RAIDCONF/raid_conf_info.txt
    ((COUNT=$COUNT+1))
    echo "Rolling Hot Spare Test two drive removal iteration $i complete...." >> Iterations.txt
    echo "Rolling Hot Spare Test two drive removal iteration $i complete...."
    let i=$i+1
    echo "Sleeping for 20 seconds.."
    sleep 20
done # while
#--------------------------------------------------------------------------------------------------

Manibalan.
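P.S. In case Dit32 is not handy, below is a minimal sketch of the kind of verify loop we mean: write a known pattern to the array, read it back, compare. It is not our exact tool; the device name, the /tmp file paths and the 1 GiB region size are placeholders, and the loop should be left running against the md device while the script above fails and rebuilds drives.

#!/bin/bash
# Minimal pattern write / read-back / verify loop (sketch only).
# /dev/md0, the /tmp paths and the 1 GiB size below are placeholders.
ld_name="/dev/md0"
pattern=/tmp/pattern.bin
readback=/tmp/readback.bin

# Create a known 1 GiB pattern once.
dd if=/dev/urandom of=$pattern bs=1M count=1024

while [ 1 ]
do
    # Write the known pattern to the start of the array, bypassing the page cache.
    dd if=$pattern of=$ld_name bs=1M oflag=direct conv=fsync
    # Read the same region back.
    dd if=$ld_name of=$readback bs=1M count=1024 iflag=direct
    # Any difference is a data mis-compare.
    if ! cmp $pattern $readback
    then
        echo "`date` : DATA MIS-COMPARE detected on $ld_name"
        exit 1
    fi
    echo "`date` : verify pass OK"
done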
-----Original Message-----
From: dan.j.williams@xxxxxxxxx [mailto:dan.j.williams@xxxxxxxxx] On Behalf Of Dan Williams
Sent: Tuesday, May 20, 2014 5:52 AM
To: NeilBrown
Cc: Manibalan P; linux-raid
Subject: Re: raid6 - data integrity issue - data mis-compare on rebuilding RAID 6 - with 100 Mb resync speed.

On Fri, May 16, 2014 at 11:11 AM, Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
> On Mon, May 5, 2014 at 12:21 AM, NeilBrown <neilb@xxxxxxx> wrote:
>> On Wed, 23 Apr 2014 10:02:00 -0700 Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
>>
>>> On Wed, Apr 23, 2014 at 12:07 AM, NeilBrown <neilb@xxxxxxx> wrote:
>>> > On Fri, 11 Apr 2014 17:41:12 +0530 "Manibalan P" <pmanibalan@xxxxxxxxxxxxxx> wrote:
>>> >
>>> >> Hi Neil,
>>> >>
>>> >> Also, I found the data corruption issue on RHEL 6.5.
>>> >>
>>> >> For your kind attention, I up-ported the md code [raid5.c +
>>> >> raid5.h] from FC11 kernel to CentOS 6.4, and there is no
>>> >> mis-compare with the up-ported code.
>>> >
>>> > This narrows it down to between 2.6.29 and 2.6.32 - is that correct?
>>> >
>>> > So it is probably the change to RAID6 to support async parity calculations.
>>> >
>>> > Looking at the code always makes my head spin.
>>> >
>>> > Dan: have you any ideas?
>>> >
>>> > It seems that writing to a double-degraded RAID6 while it is
>>> > recovering to a spare can trigger data corruption.
>>> >
>>> > 2.6.29 works
>>> > 2.6.32 doesn't
>>> > 3.8.0 still doesn't.
>>> >
>>> > I suspect async parity calculations.
>>>
>>> I'll take a look.  I've had cleanups of that code on my backlog for
>>> "a while now (TM)".
>>
>> Hi Dan,
>> did you have a chance to have a look?
>>
>> I've been consistently failing to find anything.
>>
>> I have a question though.
>> If we set up a chain of async dma handling via:
>>    ops_run_compute6_2 then ops_bio_drain then ops_run_reconstruct
>>
>> is it possible for the ops_complete_compute callback set up by
>> ops_run_compute6_2 to be called before ops_run_reconstruct has been
>> scheduled or run?
>
> In the absence of a dma engine we never run asynchronously, so we will
> *always* call ops_complete_compute() before ops_run_reconstruct() in
> the synchronous case.  This looks confused.  We're certainly leaking
> an uptodate state prior to the completion of the write.
>
>> If so, there seems to be some room for confusion over the setting for
>> R5_UPTODATE on blocks that are being computed and then drained to.
>> Both will try to set the flag, so it could get set before reconstruction has run.
>>
>> I can't see that this would cause a problem, but then I'm not
>> entirely sure why we clear R5_UPTODATE when we set R5_Wantdrain.
>
> Let me see what problems this could be causing.  I'm thinking we
> should be protected by the global ->reconstruct_state, but something
> is telling me we do depend on R5_UPTODATE being consistent with the
> ongoing stripe operation.
>

Can you share your exact test scripts?  I'm having a hard time
reproducing this with something like:

echo 100000 > /proc/sys/dev/raid/speed_limit_min
mdadm --add /dev/md0 /dev/sd[bc]; dd if=urandom.dump of=/dev/md0 bs=1024M oflag=sync

This is a 7-drive raid6 array.
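The key difference from the above is that in our test the array is doubly degraded and verified writes keep running while it rebuilds. Condensed, the sequence the script exercises is roughly the following; this is only a sketch (/dev/sdb6 and /dev/sdc6 stand in for the data partitions of two randomly chosen members), and the pattern-write/read-back/verify I/O has to stay running against /dev/md0 through both rebuilds:

# Fail and remove two members so the RAID 6 array is doubly degraded.
mdadm --manage /dev/md0 --set-faulty /dev/sdb6 /dev/sdc6
mdadm --manage /dev/md0 --remove /dev/sdb6 /dev/sdc6

# Let recovery run at full speed.
echo 100000 > /proc/sys/dev/raid/speed_limit_min

# Re-add the first drive; rebuild starts while the second drive is still
# missing. Keep the verified I/O running during this window.
mdadm --manage /dev/md0 --add /dev/sdb6

# After the first rebuild completes, re-add the second drive and keep
# the verified I/O running through the second rebuild as well.
mdadm --manage /dev/md0 --add /dev/sdc6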