On 1/12/24 08:00, Ming Lei wrote:
> Inside scsi_eh_wakeup(), scsi_host_busy() is called and checked under the
> host lock every time to decide whether the error handler kthread needs to
> be woken up.
>
> This can be too heavy during recovery, for example:
>
> - N hardware queues
> - queue depth is M for each hardware queue
> - each scsi_host_busy() call iterates over (N * M) tags/requests
>
> If recovery is triggered while all requests are in flight, the
> scsi_eh_wakeup() calls are strictly serialized. By the time
> scsi_eh_wakeup() is called for the last in-flight request,
> scsi_host_busy() has run (N * M - 1) times, and requests have been
> iterated (N * M - 1) * (N * M) times.
>
> If both N and M are big enough, a hard lockup can be triggered on
> acquiring the host lock; this was observed on mpi3mr (128 hw queues,
> queue depth 8169).
>
> Fix the issue by calling scsi_host_busy() outside the host lock. The host
> lock is not needed for getting the busy count because the host lock never
> covers it.

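To put rough numbers on the quoted cost estimate, here is a minimal sketch in plain C that evaluates the (N * M - 1) * (N * M) figure. The function names are made up for illustration and are not kernel APIs, and it assumes every tag slot is in flight, as in the scenario described:

```c
#include <stdint.h>

/*
 * Requests visited by a single busy-count pass: every tag on every
 * hardware queue, i.e. N * M. (Hypothetical helper, not a kernel API.)
 */
static uint64_t tags_per_pass(uint64_t nr_hw_queues, uint64_t queue_depth)
{
	return nr_hw_queues * queue_depth;
}

/*
 * Requests visited in total before the wakeup for the last in-flight
 * request, following the (N * M - 1) * (N * M) figure from the quoted
 * commit message: (N * M - 1) passes of (N * M) tags each.
 */
static uint64_t total_tag_visits(uint64_t nr_hw_queues, uint64_t queue_depth)
{
	uint64_t per_pass = tags_per_pass(nr_hw_queues, queue_depth);

	if (per_pass == 0)
		return 0;
	return (per_pass - 1) * per_pass;
}
```

With the mpi3mr values quoted above (N = 128, M = 8169), one pass covers 1,045,632 tags and the total comes to roughly 1.09e12 tag visits.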
Can you share details for the hard lockup?
I do agree that scsi_host_busy() is an expensive operation, so it
might not be ideal to call it under a spinlock.
But I wonder where the lockup comes in here.
Care to explain?
And if it leads to a lockup, aren't other instances calling
scsi_host_busy() under a spinlock affected, as well?
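To make the two call patterns in question concrete, here is a userspace sketch only, not kernel code. Everything named demo_* is hypothetical, a C11 atomic_flag stands in for the host lock, and comparing the busy count against a failed-command counter is an assumption about what the check looks like:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a SCSI host; not the kernel's struct. */
struct demo_host {
	atomic_flag lock;      /* toy spinlock playing the host lock */
	unsigned int nr_tags;  /* N * M tag slots */
	const bool *in_flight; /* one flag per tag slot */
	unsigned int failed;   /* assumed to be protected by the lock */
};

static void demo_host_init(struct demo_host *h, const bool *flags,
			   unsigned int nr_tags)
{
	atomic_flag_clear(&h->lock);
	h->nr_tags = nr_tags;
	h->in_flight = flags;
	h->failed = 0;
}

static void demo_lock(struct demo_host *h)
{
	while (atomic_flag_test_and_set(&h->lock))
		; /* spin until the flag is released */
}

static void demo_unlock(struct demo_host *h)
{
	atomic_flag_clear(&h->lock);
}

/* O(N * M) walk over all tag slots, standing in for the busy count. */
static unsigned int demo_host_busy(const struct demo_host *h)
{
	unsigned int i, busy = 0;

	for (i = 0; i < h->nr_tags; i++)
		if (h->in_flight[i])
			busy++;
	return busy;
}

/* Pattern A: the full walk runs while the lock is held. */
static bool demo_should_wake_locked(struct demo_host *h)
{
	bool wake;

	demo_lock(h);
	wake = (demo_host_busy(h) == h->failed);
	demo_unlock(h);
	return wake;
}

/* Pattern B: walk first, then hold the lock only for the comparison. */
static bool demo_should_wake_prefetched(struct demo_host *h)
{
	unsigned int busy = demo_host_busy(h);
	bool wake;

	demo_lock(h);
	wake = (busy == h->failed);
	demo_unlock(h);
	return wake;
}
```

In pattern A every waiter spins for the duration of a full walk per lock holder; in pattern B the critical section shrinks to a single comparison, at the price of using a count taken before the lock was acquired.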
Cheers,
Hannes