Re: sd: Unaligned partial completion

Douglas Gilbert <dgilbert@xxxxxxxxxxxx> · Wed, 23 Feb 2022 18:58:14 -0500

On 2022-02-23 17:47, Damien Le Moal wrote:
On 2/24/22 06:37, Douglas Gilbert wrote:
On 2022-02-22 22:27, Martin K. Petersen wrote:

Douglas,

No, of course not. But the kernel should inspect all UAs especially
the one that says: CAPACITY DATA HAS CHANGED !

It does. And uses it to emit an event to userland.

In most cases when capacity has changed it is because the user grew
their LUN. And doing the right thing in that case is to let userland
decide how to deal with it.

You could argue that the kernel should do something if the device
capacity shrinks. But it is unclear to me what "the right thing" is in
all cases. What if there is not a mounted filesystem in the affected
block range? Maybe the reduced block range is not even described by
an entry in the partition table? Should we do something? How does SCSI
know how much of the capacity is actively in use, if any? Also, what's a
partition?

In addition, given our experience with NVMe devices which, for $OTHER_OS
reasons, truncated their capacity when they experienced media problems,
I am not sure we want to encourage anybody ever going down this
path. What a mess!

But this misses my point. sbc5r01.pdf says this:

    "If the device server supports changing the block descriptor parameters
     by a MODE SELECT command and the number of logical blocks or the
     logical block length is changed, then the device server establishes
     a unit attention condition of:
        a) CAPACITY DATA HAS CHANGED as described in 4.10; and
        b) MODE PARAMETERS CHANGED as described in SPC-6."

My point is: if "the logical block length is changed" then the sd driver
can NOT reliably issue any IO commands (READ or WRITE) until it does a
READ CAPACITY command to find out whether the LB size has changed, and
if so, to what.
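
For readers following along, here is a minimal Python sketch (not kernel
code) of the check being argued for: pull the sense key and ASC/ASCQ out of
the sense data and flag CAPACITY DATA HAS CHANGED (unit attention,
0x2A/0x09) as "do a READ CAPACITY before any further READ or WRITE". The
function names parse_sense() and must_revalidate() are made up for this
illustration, and treating only that one UA as the trigger is an assumption
of the sketch.

```python
# Illustrative sketch only; this is not how sd.c is structured.
UNIT_ATTENTION = 0x6
CAPACITY_DATA_HAS_CHANGED = (0x2A, 0x09)  # ASC, ASCQ


def parse_sense(sense: bytes):
    """Return (sense_key, asc, ascq) from fixed or descriptor format sense data."""
    response_code = sense[0] & 0x7F
    if response_code in (0x70, 0x71):      # fixed format
        return sense[2] & 0x0F, sense[12], sense[13]
    if response_code in (0x72, 0x73):      # descriptor format
        return sense[1] & 0x0F, sense[2], sense[3]
    raise ValueError(f"unknown sense response code 0x{response_code:02x}")


def must_revalidate(sense: bytes) -> bool:
    """True if the cached capacity and block length can no longer be trusted."""
    key, asc, ascq = parse_sense(sense)
    return key == UNIT_ATTENTION and (asc, ascq) == CAPACITY_DATA_HAS_CHANGED


# Fixed format sense carrying UNIT ATTENTION, ASC/ASCQ 0x2A/0x09
fixed = bytes([0x70, 0, 0x06, 0, 0, 0, 0, 0x0A, 0, 0, 0, 0, 0x2A, 0x09, 0, 0, 0, 0])
if must_revalidate(fixed):
    print("issue READ CAPACITY before the next READ or WRITE")
```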

Also more and more settings in SCSI *** are giving the option to
return an error (even MEDIUM ERROR) if the initiator is reading a
block that has never been written. So if the sd driver is looking for
a partition table (LBA 0 ?)  then you have a chicken and egg problem
that retrying will not solve.

For a general purpose OS it is completely unreasonable to expect that
the OS has prior knowledge about which blocks were written. How is that
even supposed to work if you plug in a USB drive from a different
machine/OS? It also breaks the notion of array disks being
self-describing which is now effectively an industry requirement.

I am very happy to take patches that prevent us from retrying forever
when a device is being disagreeable. But I am also very comfortable with
the notion of not bothering to support devices that behave in a
nonsensical way. Just because the SCSI spec allows something doesn't
mean we should support it.

The sd driver should take its lead from SBC, not ZBC.

The sd driver is the driver for both protocols.

This "unaligned" usage seems to come from ZBC and first appeared in
SPC-4, ASC/ASCQ code [0x21,0x4]: "Unaligned WRITE command". It is
the only use of the word "unaligned" in SPC-4, SPC-5 and spc6r06.pdf
and it is not defined (in those documents) or in the SBC specs.
Surprisingly it is used, but not defined, in zbc2r12.pdf.

To me "unaligned" means some sort of transport issue which this is
not ***. It simply means the WRITE was not issued with a starting
LBA which corresponded to that zone's write pointer. This is
for "sequential write required" (swr) zones. IMO the ASC message
should be akin to: "Sequential write requirement violated".
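
To make the rule concrete, a toy Python model (an illustration, not taken
from any spec text or driver) of a sequential write required zone: a WRITE
is accepted only if its starting LBA equals the zone's write pointer,
otherwise it is rejected with ILLEGAL REQUEST and ASC/ASCQ 0x21/0x04. The
SwrZone class and submit_write() are hypothetical names, and zone boundary
and zone full handling are deliberately left out.

```python
from dataclasses import dataclass

ILLEGAL_REQUEST = 0x5
UNALIGNED_WRITE_COMMAND = (0x21, 0x04)  # ASC, ASCQ


@dataclass
class SwrZone:
    """Toy model of one 'sequential write required' zone."""
    start_lba: int
    length: int
    write_pointer: int


def submit_write(zone: SwrZone, lba: int, nblocks: int):
    """Return None if the WRITE is accepted, else (sense_key, asc, ascq)."""
    if lba != zone.write_pointer:
        # Nothing is misaligned in the transport sense: the WRITE simply
        # did not start at the zone's write pointer.
        return (ILLEGAL_REQUEST, *UNALIGNED_WRITE_COMMAND)
    zone.write_pointer += nblocks  # accepted: the write pointer advances
    return None


zone = SwrZone(start_lba=1000, length=100, write_pointer=1000)
print(submit_write(zone, 1000, 8))  # accepted, write pointer moves to 1008
print(submit_write(zone, 1000, 8))  # rejected: starts behind the write pointer
```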

Until Linux utilities catch up with zoned disks, users of zoned
disks are going to see a lot of that "unaligned" error! Currently
you can't partition a zoned disk because those utilities try to
WRITE shadow copies further out on the disk and violate the
write pointer settings of swr zones (then crash and burn).
You can create a btrfs file system on a whole zoned disk (e.g. /dev/sdb)
but only if you have a recent enough btrfs-progs package ****. Any
Debian user caught in this bind, try using the binary Sid package at:
      https://packages.debian.org/sid/btrfs-progs

Life is a little easier for ZBC/ZAC zoned disks which typically
start with conventional (normal random WRITE capable) zones (for 1%
of the available storage) before the swr zones commence. ZNS (for
NVMe) doesn't support conventional zones.

Doug Gilbert

***  well where sd.c generated that "unaligned" error it was because
       it tried to READ one block at LBA 0 and thought it was 4096
       bytes long. It wasn't (due to a MODE SELECT) so it got back
       512 bytes. Is that an alignment error ??

Personally, I consider it as such because the retry to process the
remainder will necessarily fail, or worse, do bad things to the drive
sectors, since the addressing is off by a factor of 8. Retrying the
remainder of any of these "unaligned" commands is dangerous. For a read,
this can lead to data leaks, and for a write, that can destroy the FS on
the disk.
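
The factor of 8 can be shown with a few lines of Python arithmetic. This is
only a sketch, assuming the LBA is computed from a byte offset using a
cached block size of 4096 while the device now uses 512-byte blocks;
lba_for_offset() is a hypothetical helper, not a kernel function.

```python
def lba_for_offset(byte_offset: int, block_size: int) -> int:
    """LBA addressing byte_offset on a device with the given block size."""
    if byte_offset % block_size:
        raise ValueError("offset is not a multiple of the block size")
    return byte_offset // block_size


STALE_BLOCK_SIZE = 4096   # what the host still believes
ACTUAL_BLOCK_SIZE = 512   # what the device uses after the reformat

offset = 1024 * 1024      # the host wants the data at byte offset 1 MiB
stale_lba = lba_for_offset(offset, STALE_BLOCK_SIZE)     # 256
correct_lba = lba_for_offset(offset, ACTUAL_BLOCK_SIZE)  # 2048

# The device interprets the stale LBA in 512-byte units, so the command
# lands at a different place on the medium than the host intended.
landed_at = stale_lba * ACTUAL_BLOCK_SIZE                # 131072
print(correct_lba // stale_lba, landed_at, offset)
```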

Here are the error messages I saw after the MODE_SELECT+FORMAT_UNIT
commands that changed the LB size from 4096 to 512 bytes. No command
was entered on the command line (after the format). The disk had no
mounted file systems on it.

[10490.819058] sd 2:0:1:0: [sdb] Unaligned partial completion (resid=3584, sector_sz=4096)
[10490.819189] sd 2:0:1:0: [sdb] tag#392 CDB: Read(16) 88 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00
[10490.820349] sd 2:0:1:0: [sdb] Unaligned partial completion (resid=3584, sector_sz=4096)
[10490.820356] sd 2:0:1:0: [sdb] tag#393 CDB: Read(16) 88 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00
[10490.820609] sd 2:0:1:0: [sdb] Unaligned partial completion (resid=3584, sector_sz=4096)
[10490.820612] sd 2:0:1:0: [sdb] tag#394 CDB: Read(16) 88 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00
[10490.820768] sd 2:0:1:0: [sdb] Unaligned partial completion (resid=3584, sector_sz=4096)
[10490.820769] sd 2:0:1:0: [sdb] tag#395 CDB: Read(16) 88 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00

That continued and the machine became unusable so I rebooted it.

The log shows that it is trying to read the partition table; that failed,
so let's try it again (ad infinitum).
Surely to goodness that is a BUG. And the information it needs is there:
wanted 4096 bytes, got 512, try again ... same result ... does that look
like a transport error? Not IMO.
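
The numbers in that log can be checked by hand, or with a short Python
sketch like the one below. It decodes the logged Read(16) CDB (opcode 0x88,
LBA in bytes 2 to 9, transfer length in bytes 10 to 13) and works out how
many bytes actually came back from the logged resid and sector_sz. The
helper names decode_read16() and bytes_transferred() are invented for this
illustration.

```python
def decode_read16(cdb_hex: str):
    """Return (lba, nblocks) from a READ(16) CDB given as a hex string."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")       # bytes 2..9
    nblocks = int.from_bytes(cdb[10:14], "big")  # bytes 10..13
    return lba, nblocks


def bytes_transferred(nblocks: int, sector_sz: int, resid: int) -> int:
    """Bytes actually received: bytes requested minus the residual count."""
    return nblocks * sector_sz - resid


# Values copied from the log above
lba, nblocks = decode_read16("88 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00")
got = bytes_transferred(nblocks, sector_sz=4096, resid=3584)
print(lba, nblocks, got)  # LBA 0, one block requested, 512 bytes returned
```

So each retry asks for one "4096-byte" block at LBA 0 and gets exactly one
512-byte block back, every time.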

What should it do? Well doing a READ CAPACITY would be a great start.

Doug Gilbert