Lockup: 4.18, raid6 scrub vs raid6check

Chris Dunlop <chris@xxxxxxxxxxxx> · Thu, 25 Oct 2018 23:44:30 +1100

Hi,

kernel: v4.18.16
mdadm: current HEAD, 5d518de

It looks like there's some issue between scrubbing and writing to 
suspend_lo and suspend_hi, e.g. an inverted lock or a missed wakeup.

I had an md lockup on a production machine when running a scrub and 
raid6check on the same md device at the same time. I eventually had to 
reset the box to recover.

At the point of the hang, md1_raid6 was chewing up a lot of CPU but not 
making any progress (per /sys/block/md1/md/sync_completed). The 
raid6check process was unkillable (kill -9 didn't work), and lsof showed 
it had /sys/devices/virtual/block/md1/md/suspend_lo open for write.
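
A rough way to confirm the "no progress" symptom is to sample 
sync_completed twice and compare. This is only a sketch: sync_stalled is 
a hypothetical helper name and the interval is arbitrary. An idle array 
(which I'd expect to report "none") will also look stalled, so it only 
means something while a check is running.

```shell
# sync_stalled FILE SECONDS
# Exit status 0 if FILE reads the same before and after SECONDS,
# 1 if it changed, 2 if FILE could not be read.
# The body runs in a subshell so no variables leak into the caller.
sync_stalled() (
  before=$(cat "$1") || exit 2
  sleep "$2"
  after=$(cat "$1") || exit 2
  [ "$before" = "$after" ]
)

# Example use on the affected box (path as in the report above):
#   sync_stalled /sys/block/md1/md/sync_completed 10 && echo "md1: no progress"
```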

The raid6check code writes to suspend_lo and suspend_hi in its 
lock_stripe() routine, to lock each stripe in turn as it works its way 
through the md device.
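
For illustration, the window written for each stripe can be sketched as 
follows. This mirrors the arithmetic used in the reproducer below rather 
than the raid6check source itself. stripe_window is a hypothetical helper 
name, and the 512 KiB chunk size is just an example value (9 data disks 
corresponds to an 11-device raid6).

```shell
# stripe_window STRIPE CHUNK_BYTES DATA_DISKS
# Print "<suspend_lo> <suspend_hi>" in bytes for one stripe.
# Assumption: one stripe covers chunk_size * data_disks bytes of array
# address space, as in the reproducer's arithmetic.
stripe_window() (
  stripe=$1 chunk_size=$2 data_disks=$3
  stripe_bytes=$((chunk_size * data_disks))
  echo "$((stripe * stripe_bytes)) $(((stripe + 1) * stripe_bytes))"
)

stripe_window 3 524288 9   # prints: 14155776 18874368
```

The two numbers are what would be echoed into suspend_lo and suspend_hi 
respectively for that stripe.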

I'm able to reproduce the lockup on a Debian 9 single-CPU KVM virtual 
machine within 2-10 rounds of the reproducer below.

The reproducer prints dots at intervals on the order of a few seconds. If 
the problem is hit, the dots stop coming. At that point the shell should 
have suspend_lo or suspend_hi open for write, and will be unkillable.

Cheers,

Chris

----------------------------------------------------------------------
#
# Setup
#
# Create 6 x 11-dev raid6, wait for sync to finish
#
function test_setup
{
 for md in md{1..6}; do
   echo "creating ${md}"
   for i in {1..11}; do
     f=/var/tmp/${md}-vdev${i}
     truncate -s 2G "${f}"
     loop[$i]=$(losetup -f)
     losetup "${loop[$i]}" "${f}"
   done
   mdadm --create "/dev/${md}" --level=6 --raid-disks=11 "${loop[@]}"
 done
 while grep resync /proc/mdstat; do sleep 2; done
 cat /proc/mdstat
}

#
# Reproducer
#
# Continuous scrub of all mds, and lock successive stripes of md1 per 
# raid6check:lock_stripe()
#
function test_run
{
 declare -i component_size=$(($(</sys/block/md1/md/component_size) * 1024)) # KB to bytes
 declare -i chunk_size=$(</sys/block/md1/md/chunk_size)
 declare -i stripes=$((component_size / chunk_size))
 declare -i data_disks=$(($(</sys/block/md1/md/raid_disks) - 2))
 declare -i i=0 j stripe

 while : ; do
   i=$((i + 1))
   date +"%F-%T Round $i"

   #
   # Start scrub on all mds
   #
   for md in md{1..6}; do
     echo check > "/sys/block/${md}/md/sync_action"
   done
   sleep 2

   #
   # keep writing to md1 suspend_{lo,hi} as raid6check does
   #
   j=0
   while grep -q check /proc/mdstat; do
     j=$((j + 1))
     echo  -e "  $j \c"
     stripe=0
     while [[ stripe -le stripes ]] ; do
       [[ $((stripe % 10)) -eq 0 ]] && echo -e '.\c'
       echo $((stripe * chunk_size * data_disks)) > /sys/devices/virtual/block/md1/md/suspend_lo
       echo $(((stripe + 1) * chunk_size * data_disks)) > /sys/devices/virtual/block/md1/md/suspend_hi
       sleep 0.2
       stripe+=1
     done
     echo
   done
 done
}
----------------------------------------------------------------------