On Mon, Nov 14, 2022 at 01:34:12PM -0700, Jonathan Derrick wrote:
> +	echo "Running ${TEST_NAME}"
> +
> +	local sysfs
> +	local attr
> +	local m
> +
> +	sysfs="$TEST_DEV_SYSFS/device"

That's not the correct directory when the device is using native
nvme-multipath.

> +	timeout=$(($(cat /proc/sys/kernel/hung_task_timeout_secs) / 2))
> +
> +	sleep 5
> +
> +	if [[ ! -d "$sysfs" ]]; then
> +		echo "$sysfs doesn't exist"
> +	fi
> +
> +	# do reset controller/format loops
> +	# don't check status now because a timing race is desired
> +	i=0
> +	start=0
> +	timing_out=false
> +	while [[ $i -le 1000 ]]; do
> +		start=$SECONDS
> +		if [[ -f "$sysfs/reset_controller" ]]; then
> +			echo 1 > "$sysfs/reset_controller" 2>/dev/null &
> +			i=$((i+1))
> +		fi
> +		nvme format -l 0 -f $TEST_DEV 2>/dev/null &
> +
> +		#Assume the controller is hung and unrecoverable
> +		if [[ $(($SECONDS - $start)) -gt $timeout ]]; then
> +			echo "nvme controller timing out"
> +			timing_out=true
> +			break
> +		fi
> +	done

If the controller is already undergoing a reset, then writing to the
reset_controller file becomes a no-op. Unless your "reset_controller"
completes near instantaneously, I find that this loop tears through 1000
iterations, forks 1000 formats, and only 1 reset_controller actually
gets through.

If I remove the upper limit, then I can also see the stalled task, but
it is only temporary and gets itself out of it after the admin timeout
(1 minute). Is that also your observation, or is it stuck forever?
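
For illustration only, here is a rough sketch of one way to keep later
reset_controller writes from being dropped: poll the controller's "state"
attribute and only issue the next reset once it reads "live" again. The
helper name wait_for_live is hypothetical and not part of the patch, and
the sketch assumes its first argument is the controller's own sysfs
directory (which, as noted above, "$TEST_DEV_SYSFS/device" may not be
under native nvme-multipath).

```shell
# Hypothetical helper, not part of the patch under review.
# Polls "$1/state" until it reads "live"; gives up after roughly $2 seconds.
# Returns 0 if the controller reached "live", 1 otherwise.
wait_for_live() {
	local ctrl_sysfs
	local tries
	local state

	ctrl_sysfs="$1"
	tries=$(($2 * 10))

	while :; do
		# A missing or unreadable attribute is treated as "not live".
		state="$(cat "$ctrl_sysfs/state" 2>/dev/null)"
		[ "$state" = "live" ] && return 0
		[ "$tries" -le 0 ] && return 1
		tries=$((tries - 1))
		sleep 0.1
	done
}
```

Inside the loop this could be used as
`wait_for_live "$sysfs" 5 && echo 1 > "$sysfs/reset_controller"`, with the
format still forked in the background. Whether waiting like this narrows
the race the test is trying to hit is a separate question.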