[PATCH] scsi_debug: fix scp is NULL errors

Douglas Gilbert <dgilbert@xxxxxxxxxxxx> · Thu, 13 Aug 2020 11:57:38 -0400

John Garry reported 'sdebug_q_cmd_complete: scp is NULL' failures
that were mainly seen on aarch64 machines (e.g. RPi 4 with four
A72 CPUs). The problem was tracked down to a missing critical
section on a "short circuit" path. That path is taken when the
time spent processing the current command has already exceeded
the requested command duration (i.e. the number of nanoseconds
in the ndelay parameter).

The random=1 parameter setting was pivotal in finding this error.
The failure scenario involved first taking that "short circuit"
path (due to a very short command duration) and then taking the
more likely hrtimer_start() path (due to a longer command
duration). With random=1, each command's duration is drawn
uniformly from the [0..ndelay) interval. The fio utility also
helped by reliably generating the error scenario about once per
minute on a RPi 4 (64-bit OS).

Reported-by: John Garry <john.garry@xxxxxxxxxx>
Signed-off-by: Douglas Gilbert <dgilbert@xxxxxxxxxxxx>
---
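For illustration only, here is a user-space sketch of the pattern
this patch applies: the three slot-release updates on the short
circuit path are made under the same lock that guards slot
allocation. It is not driver code. The pthread mutex stands in for
sqp->qc_lock, and struct queue, struct slot, SLOTS, claim_slot() and
release_if_elapsed() are hypothetical names made up for this sketch.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SLOTS 8			/* hypothetical queue depth */

struct slot {
	void *a_cmnd;		/* command owning this slot, NULL when free */
};

struct queue {
	pthread_mutex_t lock;	/* stand-in for sqp->qc_lock */
	struct slot slots[SLOTS];
	unsigned long in_use_bm;	/* bit k set: slot k is in use */
	int num_in_q;		/* number of claimed slots */
};

/* Claim a free slot for cmnd; return its index, or -1 if none is free. */
static int claim_slot(struct queue *q, void *cmnd)
{
	int k, ret = -1;

	pthread_mutex_lock(&q->lock);
	for (k = 0; k < SLOTS; k++) {
		if (!(q->in_use_bm & (1UL << k))) {
			q->in_use_bm |= 1UL << k;
			q->slots[k].a_cmnd = cmnd;
			q->num_in_q++;
			ret = k;
			break;
		}
	}
	pthread_mutex_unlock(&q->lock);
	return ret;
}

/*
 * Short circuit test: if the elapsed time already covers the requested
 * duration, release slot k under the lock and return true. Otherwise
 * leave the slot claimed and return false, so the caller can arm a
 * timer for the remaining time instead.
 */
static bool release_if_elapsed(struct queue *q, int k,
			       uint64_t kt_ns, uint64_t elapsed_ns)
{
	if (kt_ns > elapsed_ns)
		return false;

	pthread_mutex_lock(&q->lock);
	q->slots[k].a_cmnd = NULL;
	q->num_in_q--;
	q->in_use_bm &= ~(1UL << k);
	pthread_mutex_unlock(&q->lock);
	return true;
}
```

In this sketch the release is a single locked step, so any other
thread that takes the same lock sees the slot either fully claimed
or fully free, never a partial update.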
 drivers/scsi/scsi_debug.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index d95822dceeb6..4b4e31af22bd 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -5471,9 +5471,11 @@ static int schedule_resp(struct scsi_cmnd *cmnd, struct sdebug_dev_info *devip,
 				u64 d = ktime_get_boottime_ns() - ns_from_boot;
 
 				if (kt <= d) {	/* elapsed duration >= kt */
+					spin_lock_irqsave(&sqp->qc_lock, iflags);
 					sqcp->a_cmnd = NULL;
 					atomic_dec(&devip->num_in_q);
 					clear_bit(k, sqp->in_use_bm);
+					spin_unlock_irqrestore(&sqp->qc_lock, iflags);
 					if (new_sd_dp)
 						kfree(sd_dp);
 					/* call scsi_done() from this thread */
-- 
2.25.1