On 14/05/2022 10:49, John Garry wrote:
It could be an issue with the SCSI hba driver.
That seems likely to me.
Actually it is an LLDD problem. Sometimes it takes 45 minutes to trigger,
though – not nice to bisect.
This looks to be the problematic patch:
author John Garry <john.garry@xxxxxxxxxx> 2022-02-10 18:43:24 +0800
committer Martin K. Petersen <martin.petersen@xxxxxxxxxx> 2022-02-11 17:02:50 -0500
commit 26fc0ea74fcb9b76b41f5e9b89728cd1c01559cd (patch)
scsi: libsas: Drop SAS_TASK_AT_INITIATOR
If interested, this looks like the issue:
void hisi_sas_task_deliver(struct hisi_hba *hisi_hba,
break;
}
- spin_lock_irqsave(&task->task_state_lock, flags);
- task->task_state_flags |= SAS_TASK_AT_INITIATOR;
- spin_unlock_irqrestore(&task->task_state_lock, flags);
-
WRITE_ONCE(slot->ready, 1);
Removing the spinlock also removes the barrier semantics it implied, so
this looks like a memory ordering issue.
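
To make the ordering concern concrete, here is a minimal user-space sketch of the general "write the data, then publish a ready flag" pattern, using C11 atomics and pthreads rather than kernel primitives. It is not the hisi_sas code and not a proposed fix: struct demo_slot, demo_producer(), demo_consume() and demo_run() are hypothetical names made up for illustration. The point is only that the flag store needs release ordering, paired with an acquire on the reader side, for the earlier plain writes to be guaranteed visible.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-in for a delivery slot; not the hisi_sas structure. */
struct demo_slot {
	int payload;      /* plain data written before publication */
	atomic_int ready; /* publication flag */
};

static void *demo_producer(void *arg)
{
	struct demo_slot *slot = arg;

	slot->payload = 42;
	/* Release store: the payload write cannot be reordered after it. */
	atomic_store_explicit(&slot->ready, 1, memory_order_release);
	return NULL;
}

static int demo_consume(struct demo_slot *slot)
{
	/* Acquire load pairs with the release store in demo_producer(). */
	while (!atomic_load_explicit(&slot->ready, memory_order_acquire))
		;
	return slot->payload;
}

/* Runs one producer/consumer round and returns the payload observed. */
int demo_run(void)
{
	struct demo_slot slot = { .payload = 0 };
	pthread_t tid;
	int seen;

	atomic_init(&slot.ready, 0);
	if (pthread_create(&tid, NULL, demo_producer, &slot) != 0)
		return -1;
	seen = demo_consume(&slot);
	pthread_join(tid, NULL);
	return seen;
}
```

With the release/acquire pair in place, demo_run() returns the payload written by the producer. If both accesses to the flag were relaxed instead, the language would no longer guarantee that the consumer sees the payload write, which is the kind of gap suspected above.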
Sure, that would be common wisdom. However, the commit just before any
driver-related changes were added for 5.18 is also bad. It could be
pre-existing, but that starts to seem unlikely. Or it could still be an
IOMMU issue - we already have a performance issue there.
This issue can take more than 15 minutes to occur, so it is pretty painful
to bisect...