Re: [QUESTION]: Why did we clear the lowest bit of SCSI command's status in scsi_status_is_good

James Bottomley <jejb@xxxxxxxxxxxxx> · Fri, 02 Dec 2022 09:17:43 -0500

On Fri, 2022-12-02 at 21:58 +0800, Wenchao Hao wrote:
> 
> On 2022/11/29 2:12, James Bottomley wrote:
> > On Tue, 2022-11-29 at 01:38 +0800, Wenchao Hao wrote:
> > > On Mon, Nov 28, 2022 at 11:24 PM James Bottomley
> > > <jejb@xxxxxxxxxxxxx>
> > > wrote:
> > > > 
> > > > On Mon, 2022-11-28 at 22:41 +0800, Wenchao Hao wrote:
> > > > > On 2022/11/28 20:52, James Bottomley wrote:
> > > > > > On Mon, 2022-11-28 at 11:58 +0800, Wenchao Hao wrote:
> > > > [...]
> > > > > > > We found some firmware or drivers would return status
> > > > > > > which
> > > > > > > did not defined in SAM. Now SCSI middle level do not have
> > > > > > > an
> > > > > > > uniform behavior for these undefined status. I want to
> > > > > > > change
> > > > > > > the logic to return error for all status which did not
> > > > > > > defined in SAM or define a method in host template to let
> > > > > > > drivers to judge what to do in this condition.
> > > > > > 
> > > > > > Why? The general rule of thumb is be strict in what you
> > > > > > emit
> > > > > > and generous in what you receive (which is why reserved
> > > > > > bits
> > > > > > are ignored). Is the drive you refer to above not working
> > > > > > in
> > > > > > some way, in which case detail it so people can understand
> > > > > > the
> > > > > > actual problem.
> > > > > > 
> > > > > > James
> > > > > > 
> > > > > > .
> > > > > 
> > > > > 
> > > > > We come with an issue with megaraid_sas driver. Where
> > > > > scsi_cmnd
> > > > > is completed with result's status byte set to 1,
> > > > 
> > > > Megaraid_sas is an emulation driver for the most part, so it
> > > > sounds
> > > > like this is in the RAID emulation firmware, correct?  The
> > > > driver
> > > > can correct for emulation failures, if you can figure out what
> > > > it's
> > > > trying to signal and convert it to the correct SAM error code.
> > > > There's no need to change anything in the layers above.  If you
> > > > can't figure out the translation and you want the transfer to
> > > > error, then add DID_ERROR, which is a nice catch all.  If this
> > > > is
> > > > transient and could be fixed by a retry, then do DID_SOFT_ERROR
> > > > instead.
> > > > 
> > > > James
> > > > 
> > > 
> > > Thanks for your answer, Of curse we can recognize these undefined
> > > status and map to an error which can be handled by SCSI middle
> > > level
> > > now. But I still confused why shouldn't we change the
> > > scsi_status_is_good() to respect to SAM?
> > 
> > Because it wouldn't be backwards compatible and something might
> > break.
> > Under SCSI-1, devices were allowed to set this bit to signal vendor
> > unique status and a lot of manufacturers continued doing this for
> > SCSI-
> > 2, even though it was flagged as reserved instead of vendor
> > specific in
> > that standard, hence the mask.  Since this was over 20 years ago,
> > it is
> > possible there is no remaining functional device that does this,
> > but if
> > it's not causing a problem, why take the risk?
> > 
> > James
> > 
> > .
> 
> Hi James, thank you very much for your answer.
> 
> I think we should think about the following functions of megaraid
> driver:
> 
> megasas_complete_cmd() defined in
> drivers/scsi/megaraid/megaraid_sas_base.c,
> megasas_complete_cmd
>         ...
>         switch (hdr->cmd) {
>         ...
>         case MFI_CMD_LD_READ:
>         case MFI_CMD_LD_WRITE:
>                 switch (hdr->cmd_status) {
>                 case MFI_STAT_SCSI_DONE_WITH_ERROR:
>                         cmd->scmd->result = (DID_OK << 16) | hdr-
> >scsi_status;
>                         break;
>                 ...
>                 }
>         ...
>         }
> 
> map_cmd_status() defined in
> drivers/scsi/megaraid/megaraid_sas_fusion.c 
> map_cmd_status
>         ...
>         switch (status) {
>         ...
>         case MFI_STAT_SCSI_DONE_WITH_ERROR:
>                 scmd->result = (DID_OK << 16) | ext_status;
>                 break;
>         ...
>         }
> 
> Both of these functions did not check the status byte, which can not
> make sure the status byte is defined in SAM.

Right, but the first one should be returning actual status from the
drive, so should be OK.  The second one looks to be returning
manufactured raid status, which is likely the problem.

in either case, just fix the code to return DID_ERROR<<16 if the status
is non SAM conforming.

> What we meet is the status byte set to 1, and the host_byte is set to
> DID_OK.
> 
> In this condition, the scsi_cmnd would be finished by scsi middle
> layer with BLK_STS_OK if the kernel version is before 3d45cefc8edd7
> (scsi: core: Drop obsolete Linux-specific SCSI status codes).

I don't believe it does: that commit should produce identical code
before and after; it merely replaced the shifted status conditions with
the unshifted ones.

> Because scsi_io_completion_nz_result() would return a non zero value,
> so we call scsi_io_completion_action() to handle this command.
> While in scsi_io_completion_action(), the blk_stat mapped by
> scsi_result_to_blk_status()
> is BLK_STS_OK which finally result in the command finished with
> BLK_STS_OK.

So as I said previously, get the driver to return DID_ERROR<<16 for the
bogus SAM conditions.

James