Re: [PATCH v2 12/20] block,nvme,scsi,dm: Add blk_status to pr_ops callouts.

Mike Christie <michael.christie@xxxxxxxxxx> · Tue, 9 Aug 2022 22:34:13 -0500

On 8/9/22 2:33 PM, Bart Van Assche wrote:
> On 8/9/22 11:08, Mike Christie wrote:
>> On 8/9/22 2:21 AM, Christoph Hellwig wrote:
>>> On Mon, Aug 08, 2022 at 07:04:11PM -0500, Mike Christie wrote:
>>>> To handle both cases, this patch adds a blk_status_t arg to the pr_ops
>>>> callouts. The lower levels will convert their device specific error to
>>>> the blk_status_t then the upper levels can easily check that code
>>>> without knowing the device type. It also allows us to keep userspace
>>>> compat where it expects a negative -Exyz error code if the command fails
>>>> before it's sent to the device or a device/tranport specific value if the
>>>> error is > 0.
>>>
>>> Why do we need two return values here?
>>
>> I know the 2 return values are gross :) I can do it in one, but I wasn't sure
>> what's worse. See below for the other possible solutions. I think they are all
>> bad.
>>
>>
>> 0. Convert device specific conflict error to -EBADE then back:
>>
>> sd_pr_command()
>>
>> .....
>>
>> /* would add similar check for NVME_SC_RESERVATION_CONFLICT in nvme */
>> if (result == SAM_STAT_CHECK_CONDITION)
>>     return -EBADE;
>> else
>>     return result;
>>
>>
>> LIO then just checks for -EBADE but when going to userspace we have to
>> convert:
>>
>>
>> blkdev_pr_register()
>>
>> ...
>>     result = ops->pr_register()
>>     if (result < 0) {
>>         /* For compat we must convert back to the nvme/scsi code */
>>         if (result == -EBADE) {
>>             /* need some helper for this that calls down the stack */
>>             if (bdev == SCSI)
>>                 return SAM_STAT_RESERVATION_CONFLICT
>>             else
>>                 return NVME_SC_RESERVATION_CONFLICT
>>         } else
>>             return blk_status_to_str(result)
>>     } else
>>         return result;
>>
>>
>> The conversion is kind of gross and I was thinking in the future it's going
>> to get worse. I'm going to want to have more advanced error handling in LIO
>> and dm-multipath. Like dm-multipath wants to know if an pr_op failed because
>> of a path failure, so it can retry another one, or a hard device/target error.
>> It would be nice for LIO if an PGR had bad/illegal values and the device
>> returned an error than I could detect that.
>>
>>
>> 1. Drop the -Exyz error type and use blk_status_t in the kernel:
>>
>> sd_pr_command()
>>
>> .....
>> if (result < 0)
>>     return -errno_to_blk_status(result);
>> else if (result == SAM_STAT_CHECK_CONDITION)
>>     return -BLK_STS_NEXUS;
>> else
>>     return result;
>>
>> blkdev_pr_register()
>>
>> ...
>>     result = ops->pr_register()
>>     if (result < 0) {
>>         /* For compat we must convert back to the nvme/scsi code */
>>         if (result == -BLK_STS_NEXUS) {
>>             /* need some helper for this that calls down the stack */
>>             if (bdev == SCSI)
>>                 return SAM_STAT_RESERVATION_CONFLICT
>>             else
>>                 return NVME_SC_RESERVATION_CONFLICT
>>         } else
>>             return blk_status_to_str(result)
>>     } else
>>         return result;
>>
>> This has similar issues as #0 where we have to convert before returning to
>> userspace.
>>
>>
>> Note: In this case, if the block layer uses an -Exyz error code there's not
>> BLK_STS for then we would return -EIO to userspace now. I was thinking
>> that might not be ok but I could also just add a BLK_STS error code
>> for errors like EINVAL, EWOULDBLOCK, ENOMEM, etc so that doesn't happen.
>>
>>
>> 2. We could do something like below where the low levels are not changed but the
>> caller converts:
>>
>> sd_pr_command()
>>     /* no changes */
>>
>> lio()
>>     result = ops->pr_register()
>>     if (result > 0) {
>>         /* add some stacked helper again that goes through dm and
>>          * to the low level device
>>          */
>>         if (bdev == SCSI) {
>>             result = scsi_result_to_blk_status(result)
>>         else
>>             result = nvme_error_status(result)
>>
>>
>> This looks simple, but it felt wrong having upper layers having to
>> know the device type and calling conversion functions.
> 
> Has it been considered to introduce a new enumeration type instead of choosing (0), (1) or (2)?
> 

The problem is that userspace currently gets the nvme status value or the
scsi_cmnd->result which can be host/status byte values like with SG IO.
So you could you just do a new enum or add every possible error to blk_status_t
but before passing back to userspace you still have to then convert to what
format userspace is getting today. So for scsi devices, you have to mimic
the host_byte.