Re: [PATCH] tests/nvme: Add admin-passthru+reset race test

Keith Busch <kbusch@xxxxxxxxxx> · Tue, 15 Nov 2022 16:07:24 -0700

On Mon, Nov 14, 2022 at 01:34:12PM -0700, Jonathan Derrick wrote:
> +	echo "Running ${TEST_NAME}"
> +
> +	local sysfs
> +	local attr
> +	local m
> +
> +	sysfs="$TEST_DEV_SYSFS/device"

That's not the correct directory when the device is using native
nvme-multipath.
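
For illustration only, here is a rough sketch of one way the test could find
the reset attribute in both cases. It assumes that with native multipath the
namespace's "device" link points at the subsystem, with one nvme<N> entry per
controller underneath it. find_reset_attrs is a hypothetical helper, not an
existing blktests function.

```shell
# Hypothetical helper: print every reset_controller attribute reachable
# from a namespace's sysfs "device" directory. Returns non-zero if none.
find_reset_attrs() {
	local dev_dir="$1"
	local attr
	local found=1

	# Non-multipath: "device" is the controller itself.
	if [[ -f "$dev_dir/reset_controller" ]]; then
		echo "$dev_dir/reset_controller"
		return 0
	fi

	# Assumed multipath layout: "device" is the subsystem and each
	# controller shows up as an nvme<N> entry below it.
	for attr in "$dev_dir"/nvme*/reset_controller; do
		if [[ -f "$attr" ]]; then
			echo "$attr"
			found=0
		fi
	done
	return "$found"
}
```

The test could then call find_reset_attrs "$TEST_DEV_SYSFS/device" and fail
early if nothing is printed, rather than silently skipping the reset.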

> +	timeout=$(($(cat /proc/sys/kernel/hung_task_timeout_secs) / 2))
> +
> +	sleep 5
> +
> +	if [[ ! -d "$sysfs" ]]; then
> +		echo "$sysfs doesn't exist"
> +	fi
> +
> +	# do reset controller/format loops
> +	# don't check status now because a timing race is desired
> +	i=0
> +	start=0
> +	timing_out=false
> +	while [[ $i -le 1000 ]]; do
> +		start=$SECONDS
> +		if [[ -f "$sysfs/reset_controller" ]]; then
> +			echo 1 > "$sysfs/reset_controller" 2>/dev/null &
> +			i=$((i+1))
> +		fi
> +		nvme format -l 0 -f $TEST_DEV 2>/dev/null &
> +
> +		#Assume the controller is hung and unrecoverable
> +		if [[ $(($SECONDS - $start)) -gt $timeout ]]; then
> +			echo "nvme controller timing out"
> +			timing_out=true
> +			break
> +		fi
> +	done

If the controller is already undergoing a reset, then writing to the
reset_controller file becomes a no-op. Unless your "reset_controller"
completes near instantaneously, I find that this loop tears through 1000
iterations, forks 1000 formats, and only 1 reset_controller actually
gets through.
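
One possible way to make every iteration count is to wait for the controller
to leave the resetting state before issuing the next reset. The sketch below
polls the controller's "state" sysfs attribute for "live". wait_for_live is a
hypothetical helper, and whether pacing the loop like this still hits the race
the test is after is an open question.

```shell
# Hypothetical helper: poll the controller's "state" attribute until it
# reads "live", giving up after roughly max_secs seconds.
wait_for_live() {
	local ctrl_dir="$1"
	local max_secs="$2"
	local deadline=$((SECONDS + max_secs))
	local state

	while (( SECONDS <= deadline )); do
		# The attribute may be briefly unreadable during a reset.
		state=$(cat "$ctrl_dir/state" 2>/dev/null) || state=""
		if [[ "$state" == "live" ]]; then
			return 0
		fi
		sleep 0.1
	done
	return 1
}
```

Calling wait_for_live "$sysfs" "$timeout" at the end of each loop iteration,
and breaking out when it fails, would also cap the number of formats left
running in the background.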

If I remove the upper limit, then I can also see the stalled task, but
the stall is only temporary: the task recovers on its own once the admin
timeout (1 minute) expires. Is that also your observation, or is it stuck
forever?
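
To tell the two cases apart from within the test, something like the sketch
below could wait on the backgrounded commands for longer than the admin
timeout before declaring the controller hung. wait_pids_within is a
hypothetical helper, and the caller would have to record each "$!" itself.

```shell
# Hypothetical helper: return 0 once none of the given PIDs is alive,
# or 1 if any is still running after roughly max_secs seconds.
wait_pids_within() {
	local max_secs="$1"
	shift
	local deadline=$((SECONDS + max_secs))
	local pid alive

	while :; do
		alive=0
		for pid in "$@"; do
			# kill -0 only probes for existence, it sends no signal.
			if kill -0 "$pid" 2>/dev/null; then
				alive=1
			fi
		done
		if (( alive == 0 )); then
			return 0
		fi
		if (( SECONDS > deadline )); then
			return 1
		fi
		sleep 0.1
	done
}
```

A max_secs somewhat above the admin timeout (which I assume can be read from
/sys/module/nvme_core/parameters/admin_timeout) would separate "recovers after
the timeout" from "stuck forever".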