On 2020/07/21 10:15, Matthew Wilcox wrote:
> On Tue, Jul 21, 2020 at 12:59:59AM +0000, Damien Le Moal wrote:
>> On 2020/07/21 5:17, Kanchan Joshi wrote:
>>> On Mon, Jul 20, 2020 at 10:44 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>>>>  struct io_uring_cqe {
>>>>         __u64   user_data;      /* sqe->data submission passed back */
>>>> -       __s32   res;            /* result code for this event */
>>>> -       __u32   flags;
>>>> +       union {
>>>> +               struct {
>>>> +                       __s32   res;    /* result code for this event */
>>>> +                       __u32   flags;
>>>> +               };
>>>> +               __s64   res64;
>>>> +       };
>>>>  };
>>>>
>>>> Return the value in bytes in res64, or a negative errno.  Done.
>>>
>>> I concur. Can do away with bytes-copied. It's either in its entirety
>>> or not at all.
>>>
>>
>> SAS SMR drives may return a partial completion. So the size written may be less
>> than requested, but not necessarily 0, which would be an error anyway since any
>> condition that would lead to 0B being written will cause the drive to fail the
>> command with an error.
>
> Why might it return a short write?  And, given how assiduous programmers
> are about checking for exceptional conditions, is it useful to tell
> userspace "only the first 512 bytes of your 2kB write made it to storage"?
> Or would we rather just tell userspace "you got an error" and _not_
> tell them that the first N bytes made it to storage?

If the write hits a bad sector on disk half-way through, a SAS drive may return
a short write with a non-zero residual. SATA drives will fail the entire command
and libata will retry the failed command. That said, if the drive fails to remap
a bad sector and returns an error to the host, it is generally an indicator that
one should go to the store to get a new drive :)

Yes, you have a good point. Returning an error for the entire write would be
fine.
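
To make the "bytes in res64, or a negative errno" convention concrete, here is
a minimal userspace sketch of how a consumer might decode such a completion.
This is only an illustration under assumptions: struct sketch_cqe and
sketch_cqe_result() are hypothetical names, not the kernel header or any
liburing API, and the <stdint.h> types stand in for __u64/__s32/__u32/__s64.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical userspace mirror of the layout proposed in the quoted diff.
 * Not the real io_uring header: the name and the stdint types are
 * assumptions made for this sketch.
 */
struct sketch_cqe {
	uint64_t user_data;		/* sqe->data submission passed back */
	union {
		struct {
			int32_t  res;	/* result code for this event */
			uint32_t flags;
		};
		int64_t res64;		/* bytes, or a negative errno */
	};
};

/* The union must not grow the entry: 8 bytes user_data + 8 bytes union. */
_Static_assert(sizeof(struct sketch_cqe) == 16, "cqe must stay 16 bytes");

/*
 * Hypothetical helper: returns 0 and stores the 64-bit value in *bytes on
 * success, or returns the negative errno carried in res64 on failure.
 * With "all or nothing" semantics there is no partial count to report.
 */
static int sketch_cqe_result(const struct sketch_cqe *cqe, uint64_t *bytes)
{
	if (cqe->res64 < 0)
		return (int)cqe->res64;

	*bytes = (uint64_t)cqe->res64;
	return 0;
}
```

A caller would check the return value first and only look at *bytes when it is
0, which matches failing the entire write instead of reporting a short one.
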
The typical error handling for a failed write to a zone is for the user to
first do a zone report to inspect the zone condition and write pointer (WP)
position, resync its view of the zone state and restart the write in the same
zone or somewhere else. So failing the entire write is OK.

I am actually not 100% sure what the bio interface does if the "restart
remaining" of a partially failed request fails again after all retry attempts.
The entire BIO is, I think, failed. Need to check. So the high-level user would
not see the partial failure as that stays within the SCSI layer.

>> Also, the completed size should be in res in the first cqe to follow io_uring
>> current interface, no ?. The second cqe would use the res64 field to return the
>> written offset. Wasn't that the plan ?
>
> two cqes for one sqe seems like a bad idea to me.

Yes, this is not very nice. I got lost in the thread. I thought that was the
plan.

-- 
Damien Le Moal
Western Digital Research