On Wed, Jan 20, 2021 at 07:45:48PM +0100, mwilck@xxxxxxxx wrote:
> From: Martin Wilck <mwilck@xxxxxxxx>
>
> Donald: please give this patch a try.
>
> Commit 6eb045e092ef ("scsi: core: avoid host-wide host_busy counter for scsi_mq")
> contained this hunk:
>
> -	busy = atomic_inc_return(&shost->host_busy) - 1;
>  	if (atomic_read(&shost->host_blocked) > 0) {
> -		if (busy)
> +		if (scsi_host_busy(shost) > 0)
>  			goto starved;
>
> The previous code would increase the busy count before checking host_blocked.
> With 6eb045e092ef, the busy count would be increased (by setting the
> SCMD_STATE_INFLIGHT bit) after the if clause for host_blocked above.
>
> Users have reported a regression with the smartpqi driver [1] which has been
> shown to be caused by this commit [2].
>
> It seems that by moving the increase of the busy counter further down, it could
> happen that the can_queue limit of the controller could be exceeded if several
> CPUs were executing this code in parallel on different queues.

The can_queue limit should never be exceeded, because blk-mq respects it:
each hw queue's queue depth is .can_queue.

smartpqi's issue is that its .can_queue does not represent each hw queue's
depth; instead, .can_queue represents the queue depth of the whole HBA.

As John mentioned, smartpqi should have switched to hosttags.

BTW, it looks like the following code has a soft lockup risk:

pqi_alloc_io_request():
	while (1) {
		io_request = &ctrl_info->io_request_pool[i];
		if (atomic_inc_return(&io_request->refcount) == 1)
			break;
		atomic_dec(&io_request->refcount);
		i = (i + 1) % ctrl_info->max_io_slots;
	}

>
> This patch attempts to fix it by moving setting the SCMD_STATE_INFLIGHT before
> the host_blocked test again. It also inserts barriers to make sure
> scsi_host_busy() on once CPU will notice the increase of the count from another.
>
> [1]: https://marc.info/?l=linux-scsi&m=160271263114829&w=2
> [2]: https://marc.info/?l=linux-scsi&m=161116163722099&w=2

If the above is true wrt. smartpqi's can_queue usage, your patch may not
completely fix the issue in which you think '.can_queue is exceeded'.

>
> Fixes: 6eb045e092ef ("scsi: core: avoid host-wide host_busy counter for scsi_mq")
>
> Cc: Ming Lei <ming.lei@xxxxxxxxxx>
> Cc: Don Brace <Don.Brace@xxxxxxxxxxxxx>
> Cc: Kevin Barnett <Kevin.Barnett@xxxxxxxxxxxxx>
> Cc: Donald Buczek <buczek@xxxxxxxxxxxxx>
> Cc: John Garry <john.garry@xxxxxxxxxx>
> Cc: Paul Menzel <pmenzel@xxxxxxxxxxxxx>
> Signed-off-by: Martin Wilck <mwilck@xxxxxxxx>
> ---
>  drivers/scsi/hosts.c    | 2 ++
>  drivers/scsi/scsi_lib.c | 8 +++++---
>  2 files changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
> index 2f162603876f..1c452a1c18fd 100644
> --- a/drivers/scsi/hosts.c
> +++ b/drivers/scsi/hosts.c
> @@ -564,6 +564,8 @@ static bool scsi_host_check_in_flight(struct request *rq, void *data,
>  	int *count = data;
>  	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq);
>
> +	/* This pairs with set_bit() in scsi_host_queue_ready() */
> +	smp_mb__before_atomic();

So the above barrier orders atomic_read(&shost->host_blocked) and
test_bit()?

>  	if (test_bit(SCMD_STATE_INFLIGHT, &cmd->state))
>  		(*count)++;
>
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index b3f14f05340a..0a9a36c349ee 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1353,8 +1353,12 @@ static inline int scsi_host_queue_ready(struct request_queue *q,
>  	if (scsi_host_in_recovery(shost))
>  		return 0;
>
> +	set_bit(SCMD_STATE_INFLIGHT, &cmd->state);
> +	/* This pairs with test_bit() in scsi_host_check_in_flight() */
> +	smp_mb__after_atomic();
> +
>  	if (atomic_read(&shost->host_blocked) > 0) {
> -		if (scsi_host_busy(shost) > 0)
> +		if (scsi_host_busy(shost) > 1)
>  			goto starved;
>
>  		/*
> @@ -1379,8 +1383,6 @@ static inline int scsi_host_queue_ready(struct request_queue *q,
>  		spin_unlock_irq(shost->host_lock);
>  	}
>
> -	__set_bit(SCMD_STATE_INFLIGHT, &cmd->state);
> -

This patch looks fine. However, I'd suggest confirming smartpqi's
.can_queue usage first, which looks like one big issue.

--
Ming
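
To illustrate the soft lockup concern with the pqi_alloc_io_request() loop
quoted above: if every slot is already in use, a while (1) search never
terminates. Below is a minimal userspace sketch of a bounded variant, using
C11 atomics in place of the kernel's atomic_t. It is not smartpqi code and
not a proposed fix; struct slot_pool, struct io_slot, MAX_IO_SLOTS,
slot_alloc_bounded() and slot_free() are hypothetical names made up for the
example.

```c
#include <stdatomic.h>
#include <stddef.h>

#define MAX_IO_SLOTS 4	/* hypothetical pool size, for the sketch only */

struct io_slot {
	atomic_int refcount;	/* 0 = free, 1 = owned by one caller */
};

struct slot_pool {
	struct io_slot slots[MAX_IO_SLOTS];
};

/*
 * Probe each slot at most once, starting at index 'start'.
 * Returns NULL when all slots are busy, so the caller can back off
 * (e.g. report "busy") instead of spinning forever.
 */
static struct io_slot *slot_alloc_bounded(struct slot_pool *pool,
					  unsigned int start)
{
	unsigned int i = start % MAX_IO_SLOTS;

	for (unsigned int tries = 0; tries < MAX_IO_SLOTS; tries++) {
		struct io_slot *slot = &pool->slots[i];

		/* atomic_fetch_add() returns the old value; 0 means we won */
		if (atomic_fetch_add(&slot->refcount, 1) == 0)
			return slot;

		/* Slot was taken: undo our increment and try the next one */
		atomic_fetch_sub(&slot->refcount, 1);
		i = (i + 1) % MAX_IO_SLOTS;
	}

	return NULL;
}

/* Release a slot previously returned by slot_alloc_bounded() */
static void slot_free(struct io_slot *slot)
{
	atomic_fetch_sub(&slot->refcount, 1);
}
```

With a bounded search the caller has to handle the NULL case, which is the
point: running out of slots becomes a visible condition rather than a CPU
stuck in the loop. Whether that situation can actually arise in the driver
depends on how .can_queue relates to max_io_slots, which is the thing to
confirm first.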