On Fri, Dec 2, 2022 at 10:17 PM James Bottomley <jejb@xxxxxxxxxxxxx> wrote: > > On Fri, 2022-12-02 at 21:58 +0800, Wenchao Hao wrote: > > > > On 2022/11/29 2:12, James Bottomley wrote: > > > On Tue, 2022-11-29 at 01:38 +0800, Wenchao Hao wrote: > > > > On Mon, Nov 28, 2022 at 11:24 PM James Bottomley > > > > <jejb@xxxxxxxxxxxxx> > > > > wrote: > > > > > > > > > > On Mon, 2022-11-28 at 22:41 +0800, Wenchao Hao wrote: > > > > > > On 2022/11/28 20:52, James Bottomley wrote: > > > > > > > On Mon, 2022-11-28 at 11:58 +0800, Wenchao Hao wrote: > > > > > [...] > > > > > > > > We found some firmware or drivers would return status > > > > > > > > which > > > > > > > > did not defined in SAM. Now SCSI middle level do not have > > > > > > > > an > > > > > > > > uniform behavior for these undefined status. I want to > > > > > > > > change > > > > > > > > the logic to return error for all status which did not > > > > > > > > defined in SAM or define a method in host template to let > > > > > > > > drivers to judge what to do in this condition. > > > > > > > > > > > > > > Why? The general rule of thumb is be strict in what you > > > > > > > emit > > > > > > > and generous in what you receive (which is why reserved > > > > > > > bits > > > > > > > are ignored). Is the drive you refer to above not working > > > > > > > in > > > > > > > some way, in which case detail it so people can understand > > > > > > > the > > > > > > > actual problem. > > > > > > > > > > > > > > James > > > > > > > > > > > > > > . > > > > > > > > > > > > > > > > > > We come with an issue with megaraid_sas driver. Where > > > > > > scsi_cmnd > > > > > > is completed with result's status byte set to 1, > > > > > > > > > > Megaraid_sas is an emulation driver for the most part, so it > > > > > sounds > > > > > like this is in the RAID emulation firmware, correct? The > > > > > driver > > > > > can correct for emulation failures, if you can figure out what > > > > > it's > > > > > trying to signal and convert it to the correct SAM error code. > > > > > There's no need to change anything in the layers above. If you > > > > > can't figure out the translation and you want the transfer to > > > > > error, then add DID_ERROR, which is a nice catch all. If this > > > > > is > > > > > transient and could be fixed by a retry, then do DID_SOFT_ERROR > > > > > instead. > > > > > > > > > > James > > > > > > > > > > > > > Thanks for your answer, Of curse we can recognize these undefined > > > > status and map to an error which can be handled by SCSI middle > > > > level > > > > now. But I still confused why shouldn't we change the > > > > scsi_status_is_good() to respect to SAM? > > > > > > Because it wouldn't be backwards compatible and something might > > > break. > > > Under SCSI-1, devices were allowed to set this bit to signal vendor > > > unique status and a lot of manufacturers continued doing this for > > > SCSI- > > > 2, even though it was flagged as reserved instead of vendor > > > specific in > > > that standard, hence the mask. Since this was over 20 years ago, > > > it is > > > possible there is no remaining functional device that does this, > > > but if > > > it's not causing a problem, why take the risk? > > > > > > James > > > > > > . > > > > Hi James, thank you very much for your answer. > > > > I think we should think about the following functions of megaraid > > driver: > > > > megasas_complete_cmd() defined in > > drivers/scsi/megaraid/megaraid_sas_base.c, > > megasas_complete_cmd > > ... > > switch (hdr->cmd) { > > ... > > case MFI_CMD_LD_READ: > > case MFI_CMD_LD_WRITE: > > switch (hdr->cmd_status) { > > case MFI_STAT_SCSI_DONE_WITH_ERROR: > > cmd->scmd->result = (DID_OK << 16) | hdr- > > >scsi_status; > > break; > > ... > > } > > ... > > } > > > > map_cmd_status() defined in > > drivers/scsi/megaraid/megaraid_sas_fusion.c > > map_cmd_status > > ... > > switch (status) { > > ... > > case MFI_STAT_SCSI_DONE_WITH_ERROR: > > scmd->result = (DID_OK << 16) | ext_status; > > break; > > ... > > } > > > > Both of these functions did not check the status byte, which can not > > make sure the status byte is defined in SAM. > > Right, but the first one should be returning actual status from the > drive, so should be OK. The second one looks to be returning > manufactured raid status, which is likely the problem. > > in either case, just fix the code to return DID_ERROR<<16 if the status > is non SAM conforming. > > > What we meet is the status byte set to 1, and the host_byte is set to > > DID_OK. > > > > In this condition, the scsi_cmnd would be finished by scsi middle > > layer with BLK_STS_OK if the kernel version is before 3d45cefc8edd7 > > (scsi: core: Drop obsolete Linux-specific SCSI status codes). > > I don't believe it does: that commit should produce identical code > before and after; it merely replaced the shifted status conditions with > the unshifted ones. > Here is how the commit affect the flow: When the command finished with status set to 1, and the actual finished byte is less than requested. Before this commit, scsi_io_completion_nz_result would check result like following: if (status_byte(result) && scsi_status_is_good(result)) { result = 0; *blk_statp = BLK_STS_OK; } If the lowest byte of result is 1, the status_byte() returned false, the result would not be updated. So scsi_io_completion_nz_result() returned a non zero value. We would call scsi_io_completion_action() to handle this command after scsi_end_request() failed. In scsi_io_completion_action(), the command should be finished with BLK_STS_IOERR, while it is finished with BLK_STS_OK. After this commit, scsi_io_completion_nz_result would check result like following: if (result & 0xff && scsi_status_is_good(result)) { result = 0; *blk_statp = BLK_STS_OK; } Where result && 0xff is 1, the result would be 0, and scsi_io_completion_nz_result() returned 0. We would call scsi_mq_requeue_cmd() to requeue this command after scsi_end_request() failed.