Re: sd: Unaligned partial completion

Douglas Gilbert <dgilbert@xxxxxxxxxxxx> · Sun, 20 Feb 2022 02:16:54 -0500

On 2022-02-19 20:35, Damien Le Moal wrote:
On 2/20/22 09:56, Douglas Gilbert wrote:
On 2022-02-19 17:46, Martin K. Petersen wrote:

Douglas,

What should the sd driver do when it gets the error in the subject
line? Try again, and again, and again, and again ...?

sd 2:0:1:0: [sdb] Unaligned partial completion (resid=3584, sector_sz=4096)
sd 2:0:1:0: [sdb] tag#407 CDB: Read(16) 88 00 00 00 00 00 00 00 00 00 00 00 00 01 00

Not very productive, IMO. Perhaps, after say 3 retries getting the
_same_ resid, it might rescan that disk. There is a big hint in the
logged data shown above: trying to READ 1 block (sector_sz=4096) and
getting a resid of 3584. So it got back 512 bytes (again and again
...). The disk isn't mounted so perhaps it is being prepared. And
maybe that preparation involved a MODE SELECT which changed the LB
size in its block descriptor, prior to a FORMAT UNIT.
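
To make that hint concrete, here is the arithmetic as a minimal Python sketch. It is an illustration only, not the sd driver's code; the function name and its parameters are hypothetical.

```python
def analyse_partial_completion(requested_bytes, resid, sector_size):
    """Return (transferred_bytes, resid_is_aligned) for a short transfer.

    requested_bytes: bytes the initiator asked for (1 block of 4096 here)
    resid:           bytes NOT transferred, as reported in the log line
    sector_size:     logical block size the kernel believes the disk has
    """
    transferred = requested_bytes - resid
    resid_is_aligned = (resid % sector_size == 0)
    return transferred, resid_is_aligned


# Values taken from the logged message above.
transferred, aligned = analyse_partial_completion(4096, 3584, 4096)
print(transferred, aligned)  # 512 False
```

A transfer of exactly 512 bytes for a one-block READ is what one would expect if the drive now considers its logical block size to be 512 bytes.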

The kernel doesn't inspect passthrough commands to track whether an
application is doing MODE SELECT or FORMAT UNIT. The burden is generally
on the application to do the right thing.

No, of course not. But the kernel should inspect all UAs especially the one
that says: CAPACITY DATA HAS CHANGED !

I'm assuming we're trying to read the partition table. Did the device
somehow get closed between the MODE SELECT and the FORMAT UNIT?

Nope, look up "format corrupt" state in SBC, there is an asc/ascq code for
that, and it was _not_ reported in this case. The disk was fine after those
two commands, it was sd or the scsi mid-level that didn't observe the UAs,
hence the snafu. Sending a READ command after a CAPACITY DATA HAS CHANGED
UA is "undefined behaviour" as they say in the C/C++ spec.

Also more and more settings in SCSI *** are giving the option to return an
error (even MEDIUM ERROR) if the initiator is reading a block that has never
been written. So if the sd driver is looking for a partition table (LBA 0 ?)
then you have a chicken and egg problem that retrying will not solve.

It is not the scsi driver looking for partitions. This is generic block
layer code rescanning the partition table together with disk revalidate
after the bdev is closed. The disk revalidate should have caught the
change in LBA size, so it may be that the partition scan is before
revalidate instead of after... That would need checking.

Another issue with that error message: what does "unaligned" mean in
this context? Surely it is superfluous and "Partial completion" is
more accurate (unless the resid is negative).

The "unaligned" term comes from ZBC.

The sd driver should take its lead from SBC, not ZBC.

It was observed in the past that some HBAs (Broadcom I think it was)
returned a resid not aligned to the LBA size with 4Kn disks, making it
impossible to restart the command to process the remainder of the data.

But restarting the READ of one "logical block" at LBA 0 when the kernel
thought that was 4096 bytes and the drive returned 512 bytes is exactly
what I observed; again and again.
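
The retry cap suggested earlier (give up after about 3 attempts that return the _same_ resid, then rescan) could look roughly like the Python sketch below. The function name, the threshold parameter and the "retry"/"rescan" labels are hypothetical; this is not how sd currently behaves.

```python
def next_action(resid_history, max_same_resid=3):
    """Decide what to do after another partial completion.

    resid_history:  resid values seen so far for this command, oldest first
    max_same_resid: how many identical resids in a row to tolerate
    Returns "rescan" once the last max_same_resid resids are identical,
    otherwise "retry".
    """
    if len(resid_history) >= max_same_resid:
        recent = resid_history[-max_same_resid:]
        if all(r == recent[0] for r in recent):
            return "rescan"
    return "retry"


print(next_action([3584, 3584]))        # retry
print(next_action([3584, 3584, 3584]))  # rescan
```

The point of comparing resids rather than just counting retries is that an identical short transfer every time suggests a changed block size, not a transient fault.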

IMO the kernel should be prepared for surprises when reading LBA 0,
such as:
  - the block size is not what it was expecting [as in this case]
  - that block has never been written and the disk has been told to
    return an (IO) error in that case

It is a pity that a SCSI pass-through like the bsg or sg driver cannot
establish its own I_T nexus, separate from the I_T nexus that the
sd driver uses. The reason is that if a command on an I_T nexus causes
a UA (e.g. MODE SELECT changing the LB size) then the next command on
that same nexus (apart from INQUIRY, REPORT LUNS and friends) will _not_
receive that UA. [Other I_T nexi will receive that UA.]
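
A toy model of that UA delivery rule, as described above, in Python. It is a sketch under simplifying assumptions: the class, method and nexus names are hypothetical, and it ignores the commands (INQUIRY, REPORT LUNS and friends) that never report a UA.

```python
class LogicalUnit:
    """Toy logical unit that queues unit attentions (UAs) per I_T nexus."""

    def __init__(self, nexuses):
        # One pending-UA queue per I_T nexus.
        self.pending = {nexus: [] for nexus in nexuses}

    def mode_select_change_block_size(self, issuing_nexus):
        # The UA is established for every nexus EXCEPT the one that
        # issued the command which caused it.
        for nexus, queue in self.pending.items():
            if nexus != issuing_nexus:
                queue.append("CAPACITY DATA HAS CHANGED")

    def next_command(self, nexus):
        # Return the pending UA for this nexus, or None if there is none.
        queue = self.pending[nexus]
        return queue.pop(0) if queue else None


# nexus_A is shared by sd and the pass-through driver; nexus_B is another
# initiator's path to the same logical unit.
lu = LogicalUnit(["nexus_A", "nexus_B"])
lu.mode_select_change_block_size("nexus_A")
print(lu.next_command("nexus_A"))  # None
print(lu.next_command("nexus_B"))  # CAPACITY DATA HAS CHANGED
```

Because sd and the pass-through share nexus_A in this model, sd's next READ sees no UA at all, which matches what was observed.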

This problem was especially apparent with ZBC disk writes.
So unaligned here is not just for ZBC disks.

SCSI data-out and data-in transfers are inherently unaligned (or byte
aligned) but I suppose the DMA silicon in the HBA may have some
alignment requirements.

Doug Gilbert

*** for example, FORMAT UNIT (FFMT=2)