Re: [PATCH v11 04/16] scsi: core: Introduce a mechanism for reordering requests in the error handler

Damien Le Moal <dlemoal@xxxxxxxxxx> · Thu, 24 Aug 2023 08:22:59 +0900

On 8/24/23 00:15, Bart Van Assche wrote:
> On 8/22/23 23:26, Hannes Reinecke wrote:
>> On 8/22/23 21:16, Bart Van Assche wrote:
>>> +/*
>>> + * Comparison function that allows to sort SCSI commands by ULD driver.
>>> + */
>>> +static int scsi_cmp_uld(void *priv, const struct list_head *_a,
>>> +            const struct list_head *_b)
>>> +{
>>> +    struct scsi_cmnd *a = list_entry(_a, typeof(*a), eh_entry);
>>> +    struct scsi_cmnd *b = list_entry(_b, typeof(*b), eh_entry);
>>> +
>>> +    /* See also the comment above the list_sort() definition. */
>>> +    return scsi_cmd_to_driver(a) > scsi_cmd_to_driver(b);
>>
>> I have to agree with Christoph here.
>> Comparing LBA numbers at the SCSI level is really the wrong place.
>> SCSI commands might be anything, and quite some of these commands don't
>> even have LBA numbers. So trying to order them will be pointless.
>>
>> The reordering mechanism really has to go into the block layer, with
>> the driver failing the request and the block layer resubmitting in-order.
> 
> Hi Hannes,
> 
> Please take another look at patches 04/16 and 05/16. As one can see no
> LBA numbers are being compared in the SCSI core - comparing LBA numbers
> happens in the sd (SCSI disk) driver. The code that you replied to
> compares ULD pointers, a well-defined concept in the SCSI core.
> 
> Your request to move the functionality from patches 04/16 and 05/16 into
> the block layer would involve the following:
> * Report the unaligned write errors (because a write did not happen at the
>    write pointer) to the block layer (BLK_STS_WP_MISMATCH?).
> * Introduce a mechanism in the block layer for postponing error handling
>    until all outstanding commands have failed. The approach from the SCSI
>    core (tracking the number of failed and the number of busy commands
>    and only waking up the error handler after these counters are equal)
>    would be unacceptable because of the runtime overhead this mechanism
>    would introduce in the block layer hot path. Additionally, I strongly
>    doubt that it is possible to introduce any mechanism for postponing
>    error handling in the block layer without introducing additional
>    overhead in the hot path.
> * Christoph's opinion is that NVMe software should use zone append
>    (REQ_OP_ZONE_APPEND) instead of regular writes (REQ_OP_WRITE) when
>    writing to a zoned namespace. So the SCSI subsystem would be the only
>    user of the new mechanism introduced in the block layer. The reason we
>    chose REQ_OP_WRITE for zoned UFS devices is because the SCSI standard
>    does not support a zone append command and introducing a zone append
>    command in the SCSI standards is not something that can be realized in
>    time for the first generation of zoned UFS devices.

The sd driver does have zone append emulation using regular writes. The
emulation relies on zone write locking to avoid issues with adapters that do not
have strong ordering guarantees, but that could be adapted to be removed for UFS
devices with write ordering guarantees. This solution would greatly simplify
your series since zone append requests are not subject to zone write locking at
the block layer. So no changes would be needed at that layer.

However, that implies that your preferred use case (f2fs) must be adapted to use
zone append instead of regular writes. That in itself may be a bigger-ish
change, but in the long run, likely a better one I think as that would be
compatible with NVMe ZNS and also future UFS standards defining a zone append
command.

> 
> Because I assume that both Jens and Christoph disagree strongly with your
> request: I have no plans to move the code for sorting zoned writes into
> the block layer core.
> 
> Jens and Christoph, please correct me if I misunderstood something.
> 
> Thanks,
> 
> Bart.

-- 
Damien Le Moal
Western Digital Research