On 8/10/22 5:46 AM, Martin Wilck wrote: > Hi Mike, > > On Tue, 2022-08-09 at 22:41 -0500, Mike Christie wrote: >> In several places like LU setup and pr_ops we will hit the noretry >> code >> path because we do not retry any passthrough commands that hit device >> errors even though scsi-ml thinks the command is retryable. >> >> This has us only fast fail commands that hit device errors that have >> been >> submitted through the block layer directly and not via scsi_execute. >> This >> allows SG IO and other users to continue to get fast failures and all >> device errors returned to them, and scsi_execute users will get their >> retries they had requested. >> >> Signed-off-by: Mike Christie <michael.christie@xxxxxxxxxx> > > Thanks a lot. I like the general approach, but I am concerned that by > treating every command sent by scsi_execute() or scsi_execute_req() as > a retryable command, we may break some callers, or at least modify > their semantics in unexpected ways. For example, spi_execute(), > scsi_probe_lun(), scsi_report_lun_scan() currently retry only on UA. > With this change, these commands will be retried in additional cases, > without the caller noticing (see 949bf797595fc ("[SCSI] fix command > retries in spi_transport class"). It isn't obvious to me that this is > correct in all affected cases. Let's make sure we are on the same page, because I might agree with you. But for possible solutions we need to agree what this patch actually changes. We currently have 3 places we get retries from: 1. scsi_decide_disposition - For passthrough commands the patch only changes the behavior when scsi_decide_disposition gets NEED_RETRY, retries < allowed, and REQ_FAILFAST_DEV is not set. It's really specific and not as general as I think you thought it was. 2. scsi_io_completion - Passthrough commands are never retried here. 3. scsi_execute users driving retries. For your examples above: - The scan/probe functions ask for 3 retries and so with patch1 we will now get 3 x 3 retries for errors that hit #1. So I agree this is really wrong for DID_TIME_OUT. - There is no behavior change for spi because it uses REQ_FAILFAST_DEV. > > Note that the retry logic of the mid level may change depending on the > installed device handler. For example, ALUA devices will endlessly > retry UA with ASC=29, which some callers explicitly look for. There is no behavior change with my patch for ASC=29 case, because it uses ADD_TO_MLQUEUE and we don't run scsi_noretry_cmd for that error. It could change how 0x04/0x0a is handled because it uses NEEDS_RETRY. However, scsi_dh_alua uses REQ_FAILFAST_DEV so we do not retry in scsi_noretry_cmd like before. > > I believe we need to review all callers that have their own retry loop > and /or error handling logic. This includes sd_spinup_disk(), > sd_sync_cache(), scsi_test_unit_ready(). SCSI device handlers treat > some sense keys in special ways, or may retry commands on another > member of a port group (see alua_rtpg()). There is no change in behavior for the alua one but agree with the general comment. > > DID_TIME_OUT is a general concern - no current caller of scsi_execute() > expects timed-out commands to be retried, and doing so has the > potential of slowing down operations a lot. I am aware that my recent > patch changed exactly this for scsi_probe_lun(), but doing the same > thing for every scsi_execute() invocation sounds at least somewhat > dangerous. Agree this patch is wrong: - With patch1 that fixes scsi_cmnd->allowed we can end up with N and M DID_TIME_OUT retries and that can get crazy. So agree with that. - For the general idea of do we always want to retry DID_TIME_OUT, I can I also see your point. - After reading your mail and thinking about patch 4, I was thinking that this is wrong for patch 4 as well. For the pr_ops case we want the opposite of what you were mentioning in here. I actually want scsi-ml to retry all UAs for me or allow me to retry all UAs and not just handle the specific PGR related ones. I'm not sure where to go from here: 1. Just have the callers drive extra retries like we do now. I guess I've come around and are ok with this. With patch1, scsi_cmnd->allowed and scsi_execute works for me, so my previous comment about scsi_probe_lun needing to retry that case does not apply since scsi-ml will do it for me. I do think we need to retry your case in other places though. 2. Instead of trying to make it general for all scsi_execute_users, we can add SCMD bits for specific cases like DID_TIME_OUT or a SCMD bit that tells scsi_noretry_cmd to not always fail passthrough commands just because they are passthrough. It would work the opposite of the FASTFAIL bits where instead of failing fast, we retry. I think because the cases scsi_noretry_cmd is used for are really specific cases (scsi_decide_disposition sees NEEDS_RETRY, retries < allowed, and REQ_FAILFAST_DEV is not set) that might not be very useful. I'm not sure we want to add a bunch of cases specific to scsi_execute callers to scsi_check_sense? I don't think we do.