> Can you share your exact test scripts?  I'm having a hard time
> reproducing this with something like:
>
> echo 100000 > /proc/sys/dev/raid/speed_limit_min
> mdadm --add /dev/md0 /dev/sd[bc]; dd if=urandom.dump of=/dev/md0 bs=1024M oflag=sync

Below is the script that simulates a two-drive failure in a RAID 6 array. For running I/O we used Dit32, but any data-verification tool will do, i.e. one that writes a known pattern, reads the data back and verifies it (see the P.S. below the script for a minimal sketch of such a verify loop).

#-------------------- Script which simulates a 2-drive failure in a RAID 6 array - input: md array name [ e.g. /dev/md0 ] --------------------
#!/bin/bash

if [ -e /volumes/RAIDCONF ]
then
    echo "Debug directory is present."
else
    echo "Debug directory is not present. Creating one."
    mkdir -p /volumes/RAIDCONF
fi

# md array to test; defaults to /dev/md0 if no argument is given
ld_name="${1:-/dev/md0}"

# Check if the LD exists
if [ -e $ld_name ]
then
    echo "LD $ld_name exists"
else
    echo "LD $ld_name does not exist"
    exit 1
fi

one=1
two=2
three=3
four=4
six=6    # partition suffix of the data partition on each member disk

echo "`date` : Initial State" >> /volumes/RAIDCONF/raid_conf_info.txt
echo "`date` : `mdadm -D $ld_name`" >> /volumes/RAIDCONF/raid_conf_info.txt

md_name=`basename $ld_name`

# Check if the raid is initializing
md_init=`cat /proc/mdstat | grep $md_name -A 3 | grep -o resync`
if [ -z "$md_init" ]
then
    echo "$md_name is online"
else
    echo "RAID is initializing. Speeding up the process"
    echo 100000 > /sys/devices/virtual/block/$md_name/md/sync_speed_max
    echo 100000 > /sys/devices/virtual/block/$md_name/md/sync_speed_min
    # Wait till initialization completes
    while [ 1 ]
    do
        init_still=`cat /proc/mdstat | grep $md_name -A 3 | grep -o '[0-9]*\.[0-9]*%'`
        if [ -z "$init_still" ]
        then
            echo "Initializing of $md_name completed"
            break
        else
            echo "Initializing completed so far: $init_still"
            sleep 5
        fi
    done
fi

# Reset sync speed back
echo 1000 > /sys/devices/virtual/block/$md_name/md/sync_speed_max
echo 1000 > /sys/devices/virtual/block/$md_name/md/sync_speed_min

COUNT=10
i=1

# Loop forever
while [ 1 ]
do
    echo "###########################################################"
    echo "Rolling Hot Spare Test two drive removal Running for Iteration : $i" >> Iterations.txt
    echo "Rolling Hot Spare Test two drive removal Running for Iteration : $i"
    echo "###########################################################"

    # Calculate the length of the array (zero-based index of the last active device)
    arrlen=`mdadm -D $ld_name | grep "Active Devices" | awk '{print $4}'`
    echo "Original length of array: $arrlen"
    let arrlen=$arrlen-1
    echo "Length of array: $arrlen"

    # PD list in an array
    pds=`cat /proc/mdstat | grep $md_name | grep -o 'sd[a-z]' | awk '{print "/dev/"$1}'`
    arr=($pds)

    # Random number generation (two distinct member indexes)
    ran1=`grep -m1 -ao '[0-'$arrlen']' /dev/urandom | head -1`
    echo "Random number 1: $ran1"
    ran2=`grep -m1 -ao '[0-'$arrlen']' /dev/urandom | head -1`
    while [ $ran1 -eq $ran2 ]
    do
        ran2=`grep -m1 -ao '[0-'$arrlen']' /dev/urandom | head -1`
    done
    echo "Random number 2: $ran2"

    ########## REMOVING TWO RANDOM DRIVES ##########
    echo "Random drive 1 to be removed: ${arr[$ran1]}"
    echo "Random drive 1 to be removed: ${arr[$ran1]}" >> Iterations.txt

    # Removing drive 1 randomly
    echo "`date` : Iteration : $COUNT" >> /volumes/RAIDCONF/raid_conf_info.txt
    # Find the scsi address of the current disk
    scsi_address1=`lsscsi | grep ${arr[$ran1]} | grep -o '[0-9]*:[0-9]*:[0-9]*:[0-9]*'`
    disk1=`basename ${arr[$ran1]}`
    echo "Disk name: $disk1"
    td_name=`echo "$disk1" | cut -c 3`

    # Removing data partition
    echo "Removing data partition.."
    mdadm --manage $ld_name --set-faulty ${arr[$ran1]}$six
    sleep 10
    faulty=`mdadm -D $ld_name | grep "faulty spare" | wc -l`
    echo " faulty = $faulty"
    if [ $faulty -ne 1 ]; then
        echo "Number of failed disks is more than expected"
        exit 1
    fi
    faulty=0
    isremoved=1
    while [[ $isremoved -ne 0 ]]
    do
        echo "in while"
        faulty=`mdadm -D $ld_name | grep "faulty spare" | grep ${arr[$ran1]}$six | wc -l`
        echo " in - faulty = $faulty"
        if [ $faulty -eq 1 ]; then
            mdadm --manage $ld_name --remove ${arr[$ran1]}$six
            sleep 3
        fi
        isremoved=$faulty
        echo "isremoved = $isremoved"
    done
    sleep 2
    echo "`date` : After removing the disk : Slot [ $slot1 ] Name [ ${arr[$ran1]} ] scsi_address [ $scsi_address1 ]" >> /volumes/RAIDCONF/raid_conf_info.txt
    echo "`date` : `mdadm -D $ld_name`" >> /volumes/RAIDCONF/raid_conf_info.txt

    echo "Random drive 2 to be removed: ${arr[$ran2]}"
    echo "Random drive 2 to be removed: ${arr[$ran2]}" >> Iterations.txt

    # Removing drive 2 randomly
    echo "`date` : Iteration : $COUNT" >> /volumes/RAIDCONF/raid_conf_info.txt
    # Find the scsi address of the current disk
    scsi_address2=`lsscsi | grep ${arr[$ran2]} | grep -o '[0-9]*:[0-9]*:[0-9]*:[0-9]*'`
    disk2=`basename ${arr[$ran2]}`
    echo "Disk name: $disk2"
    td_name=`echo "$disk2" | cut -c 3`

    # Removing data partition
    echo "Removing data partition.."
    mdadm --manage $ld_name --set-faulty ${arr[$ran2]}$six
    sleep 10
    faulty=`mdadm -D $ld_name | grep "faulty spare" | wc -l`
    echo " faulty = $faulty"
    if [ $faulty -ne 1 ]; then
        echo "Number of failed disks is more than expected"
        exit 1
    fi
    faulty=0
    isremoved=1
    while [[ $isremoved -ne 0 ]]
    do
        echo "in while"
        faulty=`mdadm -D $ld_name | grep "faulty spare" | grep ${arr[$ran2]}$six | wc -l`
        echo " in - faulty = $faulty"
        if [ $faulty -eq 1 ]; then
            mdadm --manage $ld_name --remove ${arr[$ran2]}$six
            sleep 3
        fi
        isremoved=$faulty
        echo "isremoved = $isremoved"
    done
    echo "`date` : After removing the disk : Slot [ $slot2 ] Name [ ${arr[$ran2]} ] scsi_address [ $scsi_address2 ]" >> /volumes/RAIDCONF/raid_conf_info.txt
    echo "`date` : `mdadm -D $ld_name`" >> /volumes/RAIDCONF/raid_conf_info.txt

    ########## ADDING BACK THE FIRST REMOVED DRIVE ##########
    # Add back the device that was removed first; the second one is re-added
    # only after this rebuild completes
    sleep 5
    mdadm $ld_name -a ${arr[$ran1]}$six
    # Wait for some time to get it added as a spare
    sleep 5
    echo "`date` : After disk added at Slot [ $slot1 ] scsi_address [ $scsi_address1 ]" >> /volumes/RAIDCONF/raid_conf_info.txt
    echo "`date` : `mdadm -D $ld_name`" >> /volumes/RAIDCONF/raid_conf_info.txt

    # Check if md starts rebuilding
    while [ 1 ]
    do
        md_recovery=`cat /proc/mdstat | grep $md_name -A 3 | grep -o recovery`
        if [ -z "$md_recovery" ]
        then
            echo "$md_name did not start rebuilding. Sleeping and checking again"
            sleep 5
        else
            break
        fi
    done
    sleep 5
    echo "`date` : After rebuild started" >> /volumes/RAIDCONF/raid_conf_info.txt
    echo "`date` : `mdadm -D $ld_name`" >> /volumes/RAIDCONF/raid_conf_info.txt

    echo "RAID is rebuilding. Speeding up the process"
    echo 100000 > /sys/devices/virtual/block/$md_name/md/sync_speed_max
    echo 100000 > /sys/devices/virtual/block/$md_name/md/sync_speed_min
    # Wait till rebuilding completes
    while [ 1 ]
    do
        rb_still=`cat /proc/mdstat | grep $md_name -A 3 | grep -o '[0-9]*\.[0-9]*%'`
        if [ -z "$rb_still" ]
        then
            echo "`date +%H:%M:%S` : Rebuild of $md_name completed"
            break
        else
            echo "Rebuild completed so far: $rb_still"
            sleep 5
        fi
    done
    sleep 5
    echo "`date` : After rebuild complete" >> /volumes/RAIDCONF/raid_conf_info.txt
    echo "`date` : `mdadm -D $ld_name`" >> /volumes/RAIDCONF/raid_conf_info.txt

    # Reset sync speed back
    echo 1000 > /sys/devices/virtual/block/$md_name/md/sync_speed_max
    echo 1000 > /sys/devices/virtual/block/$md_name/md/sync_speed_min

    echo "Wait after rebuilding disk 1"
    sleep 5

    ########## ADDING BACK THE SECOND REMOVED DRIVE ##########
    mdadm $ld_name -a ${arr[$ran2]}$six
    # Wait for some time to get it added as a spare
    sleep 5
    echo "`date` : After disk added at Slot [ $slot2 ] scsi_address [ $scsi_address2 ]" >> /volumes/RAIDCONF/raid_conf_info.txt
    echo "`date` : `mdadm -D $ld_name`" >> /volumes/RAIDCONF/raid_conf_info.txt

    # Check if md starts rebuilding
    while [ 1 ]
    do
        md_recovery=`cat /proc/mdstat | grep $md_name -A 3 | grep -o recovery`
        if [ -z "$md_recovery" ]
        then
            echo "$md_name did not start rebuilding. Sleeping and checking again"
            sleep 5
        else
            break
        fi
    done
    sleep 5
    echo "`date` : After rebuild started" >> /volumes/RAIDCONF/raid_conf_info.txt
    echo "`date` : `mdadm -D $ld_name`" >> /volumes/RAIDCONF/raid_conf_info.txt

    echo "RAID is rebuilding. Speeding up the process"
    echo 100000 > /sys/devices/virtual/block/$md_name/md/sync_speed_max
    echo 100000 > /sys/devices/virtual/block/$md_name/md/sync_speed_min
    # Wait till rebuilding completes
    while [ 1 ]
    do
        rb_still=`cat /proc/mdstat | grep $md_name -A 3 | grep -o '[0-9]*\.[0-9]*%'`
        if [ -z "$rb_still" ]
        then
            echo "`date +%H:%M:%S` : Rebuild of $md_name completed"
            break
        else
            echo "Rebuild completed so far: $rb_still"
            sleep 5
        fi
    done
    sleep 5
    echo "`date` : After rebuild complete" >> /volumes/RAIDCONF/raid_conf_info.txt
    echo "`date` : `mdadm -D $ld_name`" >> /volumes/RAIDCONF/raid_conf_info.txt

    # Reset sync speed back
    echo 1000 > /sys/devices/virtual/block/$md_name/md/sync_speed_max
    echo 1000 > /sys/devices/virtual/block/$md_name/md/sync_speed_min

    echo "=============================================================" >> /volumes/RAIDCONF/raid_conf_info.txt
    ((COUNT=$COUNT+1))
    echo "Rolling Hot Spare Test two drive removal iteration $i complete...." >> Iterations.txt
    echo "Rolling Hot Spare Test two drive removal iteration $i complete...."
    let i=$i+1
    echo "Sleeping for 20 seconds.."
    sleep 20
done # while
#--------------------------------------------------------------------------------------------------

Manibalan.
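P.S. In case Dit32 is not handy, below is a minimal sketch of the kind of verify loop we mean: write a known pattern to the array, read it back, compare. It is not our exact tool; the device name, the /tmp file paths and the 1 GiB region size are placeholders, and the loop should be left running against the md device while the script above fails and rebuilds drives.

#!/bin/bash
# Minimal pattern write / read-back / verify loop (sketch only).
# /dev/md0, the /tmp paths and the 1 GiB size below are placeholders.
ld_name="/dev/md0"
pattern=/tmp/pattern.bin
readback=/tmp/readback.bin

# Create a known 1 GiB pattern once.
dd if=/dev/urandom of=$pattern bs=1M count=1024

while [ 1 ]
do
    # Write the known pattern to the start of the array, bypassing the page cache.
    dd if=$pattern of=$ld_name bs=1M oflag=direct conv=fsync
    # Read the same region back.
    dd if=$ld_name of=$readback bs=1M count=1024 iflag=direct
    # Any difference is a data mis-compare.
    if ! cmp $pattern $readback
    then
        echo "`date` : DATA MIS-COMPARE detected on $ld_name"
        exit 1
    fi
    echo "`date` : verify pass OK"
done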
-----Original Message-----
From: dan.j.williams@xxxxxxxxx [mailto:dan.j.williams@xxxxxxxxx] On Behalf Of Dan Williams
Sent: Tuesday, May 20, 2014 5:52 AM
To: NeilBrown
Cc: Manibalan P; linux-raid
Subject: Re: raid6 - data integrity issue - data mis-compare on rebuilding RAID 6 - with 100 Mb resync speed.

On Fri, May 16, 2014 at 11:11 AM, Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
> On Mon, May 5, 2014 at 12:21 AM, NeilBrown <neilb@xxxxxxx> wrote:
>> On Wed, 23 Apr 2014 10:02:00 -0700 Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
>>
>>> On Wed, Apr 23, 2014 at 12:07 AM, NeilBrown <neilb@xxxxxxx> wrote:
>>> > On Fri, 11 Apr 2014 17:41:12 +0530 "Manibalan P" <pmanibalan@xxxxxxxxxxxxxx> wrote:
>>> >
>>> >> Hi Neil,
>>> >>
>>> >> Also, I found the data corruption issue on RHEL 6.5.
>>> >>
>>> >> For your kind attention, I up-ported the md code [raid5.c +
>>> >> raid5.h] from FC11 kernel to CentOS 6.4, and there is no
>>> >> mis-compare with the up-ported code.
>>> >
>>> > This narrows it down to between 2.6.29 and 2.6.32 - is that correct?
>>> >
>>> > So it is probably the change to RAID6 to support async parity calculations.
>>> >
>>> > Looking at the code always makes my head spin.
>>> >
>>> > Dan: have you any ideas?
>>> >
>>> > It seems that writing to a double-degraded RAID6 while it is
>>> > recovering to a spare can trigger data corruption.
>>> >
>>> > 2.6.29 works
>>> > 2.6.32 doesn't
>>> > 3.8.0 still doesn't.
>>> >
>>> > I suspect async parity calculations.
>>>
>>> I'll take a look.  I've had cleanups of that code on my backlog for
>>> "a while now (TM)".
>>
>> Hi Dan,
>> did you have a chance to have a look?
>>
>> I've been consistently failing to find anything.
>>
>> I have a question though.
>> If we set up a chain of async dma handling via:
>>    ops_run_compute6_2 then ops_bio_drain then ops_run_reconstruct
>>
>> is it possible for the ops_complete_compute callback set up by
>> ops_run_compute6_2 to be called before ops_run_reconstruct has been
>> scheduled or run?
>
> In the absence of a dma engine we never run asynchronously, so we will
> *always* call ops_complete_compute() before ops_run_reconstruct() in
> the synchronous case.  This looks confused.  We're certainly leaking
> an uptodate state prior to the completion of the write.
>
>> If so, there seems to be some room for confusion over the setting for
>> R5_UPTODATE on blocks that are being computed and then drained to.
>> Both will try to set the flag, so it could get set before reconstruction has run.
>>
>> I can't see that this would cause a problem, but then I'm not
>> entirely sure why we clear R5_UPTODATE when we set R5_Wantdrain.
>
> Let me see what problems this could be causing.  I'm thinking we
> should be protected by the global ->reconstruct_state, but something
> is telling me we do depend on R5_UPTODATE being consistent with the
> ongoing stripe operation.
>

Can you share your exact test scripts?  I'm having a hard time
reproducing this with something like:

echo 100000 > /proc/sys/dev/raid/speed_limit_min
mdadm --add /dev/md0 /dev/sd[bc]; dd if=urandom.dump of=/dev/md0 bs=1024M oflag=sync

This is a 7-drive raid6 array.
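The key difference from the above is that in our test the array is doubly degraded and verified writes keep running while it rebuilds. Condensed, the sequence the script exercises is roughly the following; this is only a sketch (/dev/sdb6 and /dev/sdc6 stand in for the data partitions of two randomly chosen members), and the pattern-write/read-back/verify I/O has to stay running against /dev/md0 through both rebuilds:

# Fail and remove two members so the RAID 6 array is doubly degraded.
mdadm --manage /dev/md0 --set-faulty /dev/sdb6 /dev/sdc6
mdadm --manage /dev/md0 --remove /dev/sdb6 /dev/sdc6

# Let recovery run at full speed.
echo 100000 > /proc/sys/dev/raid/speed_limit_min

# Re-add the first drive; rebuild starts while the second drive is still
# missing. Keep the verified I/O running during this window.
mdadm --manage /dev/md0 --add /dev/sdb6

# After the first rebuild completes, re-add the second drive and keep
# the verified I/O running through the second rebuild as well.
mdadm --manage /dev/md0 --add /dev/sdc6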