Re: [PATCH v3 4/4] io_uring: add support for zone-append

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 2020/07/21 10:15, Matthew Wilcox wrote:
> On Tue, Jul 21, 2020 at 12:59:59AM +0000, Damien Le Moal wrote:
>> On 2020/07/21 5:17, Kanchan Joshi wrote:
>>> On Mon, Jul 20, 2020 at 10:44 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>>>>  struct io_uring_cqe {
>>>>         __u64   user_data;      /* sqe->data submission passed back */
>>>> -       __s32   res;            /* result code for this event */
>>>> -       __u32   flags;
>>>> +       union {
>>>> +               struct {
>>>> +                       __s32   res;    /* result code for this event */
>>>> +                       __u32   flags;
>>>> +               };
>>>> +               __s64           res64;
>>>> +       };
>>>>  };
>>>>
>>>> Return the value in bytes in res64, or a negative errno.  Done.
>>>
>>> I concur. Can do away with bytes-copied. It's either in its entirety
>>> or not at all.
>>>
>>
>> SAS SMR drives may return a partial completion. So the size written may be less
>> than requested, but not necessarily 0, which would be an error anyway since any
>> condition that would lead to 0B being written will cause the drive to fail the
>> command with an error.
> 
> Why might it return a short write?  And, given how assiduous programmers
> are about checking for exceptional conditions, is it useful to tell
> userspace "only the first 512 bytes of your 2kB write made it to storage"?
> Or would we rather just tell userspace "you got an error" and _not_
> tell them that the first N bytes made it to storage?

If the write hits a bad sector on disk half-way through, a SAS drive may return
a short write with a non 0 residual. SATA drives will fail the entire command
and libata will retry the failed command. That said, if the drive fails to remap
a bad sector and return an error to the host, it is generally an indicator that
one should go to the store to get a new drive :)

Yes, you have a good point. Returning an error for the entire write would be
fine. The typical error handling for a failed write to a zone is for the user to
first do a zone report to inspect the zone condition and WP position, resync its
view of the zone state and restart the write in the same zone or somewhere else.
So failing the entire write is OK.
I am actually not 100% sure what the bio interface does if the "restart
remaining" of a partially failed request fails again after all retry attempts.
The entire BIO is I think failed. Need to check. So the high level user would
not see the partial failure as that stays within the scsi layer.

>> Also, the completed size should be in res in the first cqe to follow io_uring
>> current interface, no ?. The second cqe would use the res64 field to return the
>> written offset. Wasn't that the plan ?
> 
> two cqes for one sqe seems like a bad idea to me.

Yes, this is not very nice. I got lost in the thread. I thought that was the plan.


-- 
Damien Le Moal
Western Digital Research




[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux