On Tue, 2021-01-19 at 10:33 +0000, John Garry wrote:
> > >
> > > Am 10.12.20 um 21:35 schrieb Don Brace:
> > > > From: Mahesh Rajashekhara <mahesh.rajashekhara@xxxxxxxxxxxxx>
> > > >
> > > > * Correct scsi-mid-layer sending more requests than
> > > >   exposed host Q depth causing firmware ASSERT issue.
> > > > * Add host Qdepth counter.
> > >
> > > This supposedly fixes the regression between Linux 5.4 and 5.9,
> > > which we reported in [1].
> > >
> > >     kernel: smartpqi 0000:89:00.0: controller is offline: status code 0x6100c
> > >     kernel: smartpqi 0000:89:00.0: controller offline
> > >
> > > Thank you for looking into this issue and fixing it. We are going
> > > to test this.
> > >
> > > For easily finding these things in the git history or the WWW, it
> > > would be great if these log messages could be included (in the
> > > future).
> > >
> > > DON> Thanks for your suggestion. We'll add them next time.
> > >
> > > Also, that means that the regression is still present in Linux
> > > 5.10, released yesterday, and this commit does not apply to these
> > > versions.
> > >
> > > DON> They have started 5.10-RC7 now. So possibly 5.11 or 5.12,
> > > depending on when all of the patches are applied. The patch in
> > > question is among 28 other patches.
> > >
> > > Mahesh, do you have any idea what commit caused the regression
> > > and why the issue started to show up?
> > >
> > > DON> The smartpqi driver sets two scsi_host_template member
> > > fields: .can_queue and .nr_hw_queues. But we have not yet
> > > converted to host_tagset. So the queue_depth becomes
> > > nr_hw_queues * can_queue, which is more than the hw can support.
> > > That can be verified by looking at scsi_host.h:
> > >
> > >     /*
> > >      * In scsi-mq mode, the number of hardware queues supported
> > >      * by the LLD.
> > >      *
> > >      * Note: it is assumed that each hardware queue has a queue
> > >      * depth of can_queue. In other words, the total queue depth
> > >      * per host is nr_hw_queues * can_queue. However, when
> > >      * host_tagset is set, the total queue depth is can_queue.
> > >      */
> > >
> > > So, until we make this change, the queue_depth change prevents
> > > the above issue from happening.
> >
> > can_queue and nr_hw_queues have been set like this for as long as
> > the driver has existed. Why did Paul observe a regression with 5.9?
> >
> > And why can't you simply set can_queue to
> > (ctrl_info->scsi_ml_can_queue / nr_hw_queues)?
> >
> > Don: I did this in an internal patch, but this patch seemed to work
> > the best for our driver. HBA performance remained steady when
> > running benchmarks.

That was a stupid suggestion on my part. Sorry.

> I guess that this is a fallout from commit 6eb045e092ef ("scsi:
> core: avoid host-wide host_busy counter for scsi_mq"). But that
> commit is correct.

It would be good if someone (Paul?) could verify whether that commit
actually caused the regression they saw.

Looking at 6eb045e092ef, I notice this hunk:

-	busy = atomic_inc_return(&shost->host_busy) - 1;
 	if (atomic_read(&shost->host_blocked) > 0) {
-		if (busy)
+		if (scsi_host_busy(shost) > 0)
 			goto starved;

Before 6eb045e092ef, the busy count was incremented with a memory
barrier before looking at "host_blocked". The new code does this
instead:

@@ -1403,6 +1400,8 @@ static inline int scsi_host_queue_ready(struct request_queue *q,
 		spin_unlock_irq(shost->host_lock);
 	}

+	__set_bit(SCMD_STATE_INFLIGHT, &cmd->state);
+

but it happens *after* the "host_blocked" check. Could that perhaps
have caused the regression?

Thanks,
Martin