Re: sd: Unaligned partial completion

Damien Le Moal <damien.lemoal@xxxxxxxxxxxxxxxxxx> · Mon, 21 Feb 2022 09:13:06 +0900

On 2022/02/20 16:16, Douglas Gilbert wrote:
> On 2022-02-19 20:35, Damien Le Moal wrote:
>> On 2/20/22 09:56, Douglas Gilbert wrote:
>>> On 2022-02-19 17:46, Martin K. Petersen wrote:
>>>>
>>>> Douglas,
>>>>
>>>>> What should the sd driver do when it gets the error in the subject
>>>>> line? Try again, and again, and again, and again ...?
>>>>>
>>>>> sd 2:0:1:0: [sdb] Unaligned partial completion (resid=3584, sector_sz=4096)
>>>>> sd 2:0:1:0: [sdb] tag#407 CDB: Read(16) 88 00 00 00 00 00 00 00 00 00 00 00 00 01 00
>>>>>
>>>>> Not very productive, IMO. Perhaps, after say 3 retries getting the
>>>>> _same_ resid, it might rescan that disk. There is a big hint in the
>>>>> logged data shown above: trying to READ 1 block (sector_sz=4096) and
>>>>> getting a resid of 3584. So it got back 512 bytes (again and again
>>>>> ...). The disk isn't mounted so perhaps it is being prepared. And
>>>>> maybe that preparation involved a MODE SELECT which changed the LB
>>>>> size in its block descriptor, prior to a FORMAT UNIT.
>>>>
>>>> The kernel doesn't inspect passthrough commands to track whether an
>>>> application is doing MODE SELECT or FORMAT UNIT. The burden is generally
>>>> on the application to do the right thing.
>>>
>>> No, of course not. But the kernel should inspect all UAs especially the one
>>> that says: CAPACITY DATA HAS CHANGED !
>>>
>>>> I'm assuming we're trying to read the partition table. Did the device
>>>> somehow get closed between the MODE SELECT and the FORMAT UNIT?
>>>
>>> Nope, look up "format corrupt" state in SBC, there is an asc/ascq code for
>>> that, and it was _not_ reported in this case. The disk was fine after those
>>> two commands, it was sd or the scsi mid-level that didn't observe the UAs,
>>> hence the snafu. Sending a READ command after a CAPACITY DATA HAS CHANGED
>>> UA is "undefined behaviour" as they say in the C/C++ spec.
>>>
>>> Also more and more settings in SCSI *** are giving the option to return an
>>> error (even MEDIUM ERROR) if the initiator is reading a block that has never
>>> been written. So if the sd driver is looking for a partition table (LBA 0 ?)
>>> then you have a chicken and egg problem that retrying will not solve.
>>
>> It is not the scsi driver looking for partitions. This is generic block
>> layer code rescanning the partition table together with disk revalidate
>> after the bdev is closed. The disk revalidate should have caught the
>> change in LBA size, so it may be that the partition scan is before
>> revalidate instead of after... That would need checking.
>>
>>>>> Another issue with that error message: what does "unaligned" mean in
>>>>> this context? Surely it is superfluous and "Partial completion" is
>>>>> more accurate (unless the resid is negative).
>>>>
>>>> The "unaligned" term comes from ZBC.
>>>
>>> The sd driver should take its lead from SBC, not ZBC.
>>
>> It was observed in the past that some HBAs (Broadcom I think it was)
>> returned a resid not aligned to the LBA size with 4Kn disks, making it
>> impossible to restart the command to process the remainder of the data.
> 
> But restarting the READ of one "logical block" at LBA 0 when the kernel
> thought that was 4096 bytes and the drive returned 512 bytes is exactly
> what I observed; again and again.

As I said, it may be because the block layer disk revalidate call and partition
scan are reversed, or not synchronized, causing the partition scan read to be
dealt with without the sector size yet being updated in the sd driver. We should
check the block layer. Will have a look.

> 
> IMO the kernel should be prepared for surprises when reading LBA 0,
> such as:
>    - the block size is not what it was expecting [as in this case]
>    - that block has never been written and the disk has been told to
>      return an (IO) error in that case
> 
> It is a pity that a SCSI pass-through like the bsg or sg driver cannot
> establish its own I_T nexus, separate from the I_T nexus that the
> sd driver uses. The reason is that if an I_T nexus causes a UA (e.g.
> MODE SELECT change LB size) then the next command (apart from
> INQUIRY, REPORT LUNS and friends) will _not_ receive that UA. [Other
> I_T nexi will receive that UA.]
> 
>> This problem was especially apparent with ZBC disks writes.
>> So unaligned here is not just for ZBC disks.
> 
> SCSI data-out and data-in transfers are inherently unaligned (or byte
> aligned) but I suppose the DMA silicon in the HBA may have some
> alignment requirements.

Sure, I know that. But the kernel never asks for unaligned reads/writes and the
disk will certainly never return a half-baked sector for reads or partially
write sectors. So getting back a resid that is not aligned on the LBA size is a
gross bug from the HBA and we should not allow that to go unnoticed.
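
To make the check concrete, here is a minimal userspace sketch of the
alignment test behind the "Unaligned partial completion" message. It is not
the actual sd driver code: the helper names resid_is_aligned() and
bytes_transferred() are made up for illustration, and the only assumption is
that a resid is "aligned" when it is a whole multiple of the logical block
size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from sd.c): a residual count is usable for
 * restarting a command only if it is a whole number of logical blocks.
 */
static bool resid_is_aligned(uint32_t resid, uint32_t sector_size)
{
	assert(sector_size != 0);	/* a zero block size is meaningless */
	return (resid % sector_size) == 0;
}

/*
 * Hypothetical helper: bytes actually moved, given the requested transfer
 * length and the residual reported by the HBA. A resid larger than the
 * request is treated as "nothing transferred".
 */
static uint32_t bytes_transferred(uint32_t xfer_len, uint32_t resid)
{
	return resid > xfer_len ? 0 : xfer_len - resid;
}
```

With the values from the log above (one 4096-byte block requested,
resid=3584), resid_is_aligned(3584, 4096) is false and
bytes_transferred(4096, 3584) is 512, which matches a device that is now
returning 512-byte logical blocks.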

> 
>>
>>>
>>> Doug Gilbert
>>>
>>>
>>> *** for example, FORMAT UNIT (FFMT=2)
>>>
>>
>>
> 

-- 
Damien Le Moal
Western Digital Research