On 3/29/22 11:06, Wenchao Hao wrote:
> A SCSI timeout calls scsi_eh_scmd_add() under some conditions, which sets the host to the SHOST_RECOVERY state. Once the host enters SHOST_RECOVERY, I/Os submitted to any device on this host do not succeed until scsi_error_handler() has finished. scsi_error_handler() might take a long time to complete, which is unbearable when the host has a large number of devices. Is anyone using a different error handler flow to address this? I think we could move some operations (such as SCSI get sense, SCSI send start unit, and SCSI device reset) out of scsi_unjam_host() and perform them without setting the host to SHOST_RECOVERY. That would reduce the time during which the whole host is blocked. Looking forward to your discussion.
We already have "async" aborts before even entering scsi_eh. So your use case seems to imply that those aborts fail and we enter scsi_eh?
There's eh_deadline, which limits the time spent escalating through scsi_eh; once it expires, scsi_eh skips the remaining escalation steps and goes directly to host reset. Would this help?
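For illustration, here is a minimal sketch of setting eh_deadline per host from user space. It assumes the attribute is exposed as /sys/class/scsi_host/hostN/eh_deadline and accepts either a number of seconds or "off"; please check that against your kernel. The helper names (eh_deadline_path, read_eh_deadline, set_eh_deadline) and the `root` parameter are hypothetical and exist only for this example.

```python
from pathlib import Path

# Assumed sysfs location of the per-host attribute; verify on your kernel.
SYSFS_SCSI_HOST = Path("/sys/class/scsi_host")


def eh_deadline_path(host_no, root=SYSFS_SCSI_HOST):
    """Return the path of the eh_deadline attribute for hostN."""
    return Path(root) / f"host{host_no}" / "eh_deadline"


def read_eh_deadline(host_no, root=SYSFS_SCSI_HOST):
    """Return the current setting as a string (e.g. "off" or "10")."""
    return eh_deadline_path(host_no, root).read_text().strip()


def set_eh_deadline(host_no, seconds, root=SYSFS_SCSI_HOST):
    """Write a deadline in seconds, or "off" when seconds is None.

    Returns the string that was written. Writing to the real sysfs
    file requires root privileges.
    """
    if seconds is None:
        value = "off"
    else:
        # Reject bools and non-positive values to keep the sketch strict.
        if isinstance(seconds, bool) or not isinstance(seconds, int) or seconds < 1:
            raise ValueError("seconds must be a positive integer or None")
        value = str(seconds)
    eh_deadline_path(host_no, root).write_text(value + "\n")
    return value
```

For example, set_eh_deadline(0, 10) would ask scsi_eh on host0 to give up escalating after roughly 10 seconds, and set_eh_deadline(0, None) would switch the deadline off again. Whether a short deadline is appropriate depends on how disruptive a host reset is for the particular LLDD.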
-- 
Mit freundlichen Gruessen / Kind regards
Steffen Maier

Linux on IBM Z and LinuxONE

https://www.ibm.com/privacy/us/en/
IBM Deutschland Research & Development GmbH
Chairman of the Supervisory Board: Gregor Pillen
Management: David Faller
Registered office: Boeblingen
Registration court: Amtsgericht Stuttgart, HRB 243294