On Mon, 26 May 2014 17:12:27 +0200 Bart Van Assche <bvanassche@xxxxxxx> wrote: > Every now and then someone asks how it is avoided that the SCSI error > handler and the SCSI completion handler are invoked concurrently for > the same SCSI command. Hence this patch series that should make the SCSI > error handler code a little easier to understand. Hi Bart, We talked about REQ_ATOM_COMPLETE a while back (so I may be one of those people who periodically ask about SCSI error handling / completion), and I just thought I'd chime in here to clarify the scenario we were worried about. Before d555a2ab "[SCSI] Fix spurious request sense in error handling" we saw the following sequence of events: [ 1394.345991] sd 2:0:1:0: [sdw] Done: [ 1394.361790] TIMEOUT [ 1394.374148] sd 2:0:1:0: [sdw] [ 1394.388775] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK [ 1394.409068] sd 2:0:1:0: [sdw] CDB: [ 1394.424782] Read(10): 28 00 00 00 00 18 00 00 08 00 [ 1394.443772] sd 2:0:1:0: [sdw] scmd ffff88104dce02c0 abort scheduled [ 1394.474026] sd 2:0:1:0: [sdw] aborting command ffff88104dce02c0 [ 1394.573338] qla2xxx [0000:46:00.0]-801c:2: Abort command issued nexus=2:1:0 -- 1 2002. [ 1394.609313] sd 2:0:1:0: [sdw] scmd ffff88104dce02c0 retry aborted command [ 1425.397434] sd 2:0:1:0: [sdw] Done: [ 1425.417802] TIMEOUT [ 1425.431914] sd 2:0:1:0: [sdw] [ 1425.448247] Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK [ 1425.470736] sd 2:0:1:0: [sdw] CDB: [ 1425.488005] Read(10): 28 00 00 00 00 18 00 00 08 00 [ 1425.508472] sd 2:0:1:0: [sdw] scmd ffff88104dce02c0 previous abort failed [ 1425.533299] scsi_eh_2: waking up 0/1/1 [ 1425.551189] sd 2:0:1:0: scsi_eh_prt_fail_stats: cmds failed: 1, cancel: 0 [ 1425.575897] Total of 1 commands on 1 devices require eh work [ 1425.597970] sd 2:0:1:0: [sdw] scsi_eh_2: requesting sense ie, something like: CMD -> .queuecommand ... timeout ABORT -> .eh_abort_handler CMD (retry) -> .queuecommand ... timeout ABORT -> scsi_eh_scmd_add(scmd, SCSI_EH_CANCEL_CMD) ... into scsi_error_handler ... REQUEST SENSE -> .queuecommand We eventually found our way into scsi_unjam_host, which "*knows* that all commands on the bus have either completed, failed or timed out.", and then scsi_send_eh_cmnd, which "hijacks" a SCSI command structure. Presumably the second comment follows from the first. But in the case of timeout, when was the LLDD informed to abort or forget about that scsi command structure? In our testing, it appeared that the LLDD's .queuecommand was called back-to-back with the same scsi_cmnd pointer. The driver in question was qla2xxx, who keeps a driver-private structure to scsi_cmnd mapping (assumed to be one-to-one) that got confused by this sequence of events. After d555a2ab, in this scenario scsi_unjam_host skips the request sense and escalates to the next level (scsi_eh_abort_cmds), skipping straight to the abort as we would have expected... This avoids the scsi_cmnd "hijack"... but was it ever safe to do so in the first place? Or should LLDD expect to gracefully handle the same scsi_cmnd pointer coming into .queuecommand w/o an intermediate abort or other scsi_host_template callback? Regards, -- Joe -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html