Re: sd: Unaligned partial completion

Damien Le Moal <damien.lemoal@xxxxxxxxxxxxxxxxxx> · Thu, 24 Feb 2022 07:47:25 +0900

On 2/24/22 06:37, Douglas Gilbert wrote:
> On 2022-02-22 22:27, Martin K. Petersen wrote:
>>
>> Douglas,
>>
>>> No, of course not. But the kernel should inspect all UAs especially
>>> the one that says: CAPACITY DATA HAS CHANGED !
>>
>> It does. And uses it to emit an event to userland.
>>
>> In most cases when capacity has changed it is because the user grew
>> their LUN. And doing the right thing in that case is to let userland
>> decide how to deal with it.
>>
>> You could argue that the kernel should do something if the device
>> capacity shrinks. But it is unclear to me what "the right thing" is in
>> all cases. What if there is not a mounted filesystem in the affected
>> block range? Maybe the reduced block range is not even described by
>> an entry in the partition table? Should we do something? How does SCSI
>> know how much of the capacity is actively in use, if any? Also, what's a
>> partition?
>>
>> In addition, given our experience with NVMe devices which, for $OTHER_OS
>> reasons, truncated their capacity when they experienced media problems,
>> I am not sure we want to encourage anybody ever going down this
>> path. What a mess!
> 
> But this misses my point. sbc5r01.pdf says this:
> 
>    "If the device server supports changing the block descriptor parameters
>     by a MODE SELECT command and the number of logical blocks or the
>     logical block length is changed, then the device server establishes
>     a unit attention condition of:
>        a) CAPACITY DATA HAS CHANGED as described in 4.10; and
>        b) MODE PARAMETERS CHANGED as described in SPC-6."
> 
> My point is: if "the logical block length is changed" then the sd driver
> can NOT reliably issue any IO commands (READ or WRITE) until it does a
> READ CAPACITY command to find out whether the LB size has changed, and
> if so, to what.
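
To make the ordering constraint above concrete, here is a minimal sketch,
not sd.c code. The Disk class and the helper names are hypothetical, and
treating either unit attention as a trigger is an assumption made for
illustration. The ASC/ASCQ pairs are the ones SPC assigns to the two
conditions quoted from SBC.

```python
# Sketch only: hypothetical names, not the sd driver's data structures.
UNIT_ATTENTION = 0x6                      # sense key
CAPACITY_DATA_HAS_CHANGED = (0x2A, 0x09)  # ASC, ASCQ
MODE_PARAMETERS_CHANGED = (0x2A, 0x01)    # ASC, ASCQ


class Disk:
    def __init__(self, block_size):
        self.block_size = block_size   # logical block size we believe in
        self.geometry_stale = False    # set when the belief may be wrong


def note_sense(disk, sense_key, asc, ascq):
    """Mark the cached geometry stale on either unit attention (assumption)."""
    if sense_key == UNIT_ATTENTION and (asc, ascq) in (
        CAPACITY_DATA_HAS_CHANGED,
        MODE_PARAMETERS_CHANGED,
    ):
        disk.geometry_stale = True


def may_issue_io(disk):
    """READ/WRITE is only safe while the cached block size is trusted."""
    return not disk.geometry_stale


def revalidate(disk, read_capacity):
    """read_capacity is a caller-supplied callable returning the block size."""
    disk.block_size = read_capacity()
    disk.geometry_stale = False
```

With this shape, a disk believed to have 4096-byte blocks refuses I/O after
the unit attention until revalidate() has stored whatever READ CAPACITY
reports.
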
> 
>>> Also more and more settings in SCSI *** are giving the option to
>>> return an error (even MEDIUM ERROR) if the initiator is reading a
>>> block that has never been written. So if the sd driver is looking for
>>> a partition table (LBA 0 ?)  then you have a chicken and egg problem
>>> that retrying will not solve.
>>
>> For a general purpose OS it is completely unreasonable to expect that
>> the OS has prior knowledge about which blocks were written. How is that
>> even supposed to work if you plug in a USB drive from a different
>> machine/OS? It also breaks the notion of array disks being
>> self-describing which is now effectively an industry requirement.
>>
>> I am very happy to take patches that prevent us from retrying forever
>> when a device is being disagreeable. But I am also very comfortable with
>> the notion of not bothering to support devices that behave in a
>> nonsensical way. Just because the SCSI spec allows something doesn't
>> mean we should support it.
>>
>>> The sd driver should take its lead from SBC, not ZBC.
>>
>> The sd driver is the driver for both protocols.
> 
> This "unaligned" usage seems to come from ZBC and first appeared in
> SPC-4, ASC/ASCQ code [0x21,0x4]: "Unaligned WRITE command". It is
> the only use of the word "unaligned" in SPC-4, SPC-5 and spc6r06.pdf
> and it is not defined (in those documents) or in the SBC specs.
> Surprisingly, it is used but not defined in zbc2r12.pdf.
> 
> To me "unaligned" means some sort of transport issue which this is
> not ***. It simply means the WRITE was not issued with a starting
> LBA which corresponded to that zone's write pointer. This is
> for "sequential write required" (swr) zones. IMO the ASC message
> should be akin to: "Sequential write requirement violated".
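
The write pointer rule described above can be sketched as follows. This is
a toy model under stated assumptions, not device or kernel code: SwrZone
and its fields are hypothetical names, and the end-of-zone check is added
only to keep the model self-consistent. The [0x21,0x4] pair is the one
quoted above.

```python
# Sketch only: a toy model of one "sequential write required" zone.
UNALIGNED_WRITE_COMMAND = (0x21, 0x04)  # ASC, ASCQ as quoted above


class SwrZone:
    def __init__(self, start_lba, num_blocks):
        self.start_lba = start_lba
        self.num_blocks = num_blocks
        self.write_pointer = start_lba  # an empty zone starts at its first LBA

    def write(self, start_lba, num_blocks):
        """Return None on success, or the ASC/ASCQ pair on a violation."""
        if start_lba != self.write_pointer:
            # The WRITE did not start at the zone's write pointer.
            return UNALIGNED_WRITE_COMMAND
        if start_lba + num_blocks > self.start_lba + self.num_blocks:
            # Hypothetical handling; a real device reports its own error.
            raise ValueError("write crosses the end of the zone")
        self.write_pointer += num_blocks
        return None
```

A utility that rewrites an LBA it has already written, or that writes a
shadow copy further out in the zone, takes the first branch.
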
> 
> Until Linux utilities catch up with zoned disks, users of zoned
> disks are going to see a lot of that "unaligned"  error! Currently
> you can't partition a zoned disk because those utilities try to
> WRITE shadow copies further out on the disk and violate the
> write pointer settings of swr zones (then crash and burn).
> You can create a Btrfs file system on a whole zoned disk (e.g. /dev/sdb)
> but only if you have a recent enough btrfs-progs package ****. Any
> Debian user caught in this bind, try using the binary Sid package at:
>      https://packages.debian.org/sid/btrfs-progs
> 
> 
> Life is a little easier for ZBC/ZAC zoned disks which typically
> start with conventional (normal random WRITE capable) zones (for 1%
> of the available storage) before the swr zones commence. ZNS (for
> NVMe) doesn't support conventional zones.
> 
> Doug Gilbert
> 
> 
> ***  well where sd.c generated that "unaligned" error it was because
>       it tried to READ one block at LBA 0 and thought it was 4096
>       bytes long. It wasn't (due to a MODE SELECT) so it got back
>       512 bytes. Is that an alignment error ??

Personally, I consider it as such because retrying the remainder of the
command will necessarily fail or, worse, do bad things to the drive
sectors, since the addressing is off by a factor of 8. Retrying the
remainder of any of these "unaligned" commands is dangerous: for a read,
it can lead to data leaks, and for a write, it can destroy the FS on
the disk.
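
To spell out the factor of 8, here is a small arithmetic sketch. The
function names are made up for illustration and nothing here is sd.c code.
It assumes the driver still believes in 4096-byte blocks while the device
has switched to 512-byte blocks, as in the footnote above.

```python
# Sketch only: arithmetic for a stale 4096-byte view of a 512-byte device.
BELIEVED_BLOCK_SIZE = 4096  # what the driver still thinks
ACTUAL_BLOCK_SIZE = 512     # what the device uses after the MODE SELECT


def lba_sent(byte_offset, believed=BELIEVED_BLOCK_SIZE):
    """LBA the driver puts in the CDB for a given byte offset."""
    return byte_offset // believed


def byte_reached(lba, actual=ACTUAL_BLOCK_SIZE):
    """Byte offset the device actually accesses for that LBA."""
    return lba * actual


def is_whole_blocks(good_bytes, believed=BELIEVED_BLOCK_SIZE):
    """True if a partial completion ends on a believed block boundary."""
    return good_bytes % believed == 0


offset = 1024 * 1024               # the driver wants the data at 1 MiB
lba = lba_sent(offset)             # 256
reached = byte_reached(lba)        # 131072, i.e. 128 KiB
ratio = offset // reached          # 8

# The READ of one block at LBA 0: 4096 bytes expected, 512 returned.
residual = BELIEVED_BLOCK_SIZE - ACTUAL_BLOCK_SIZE  # 3584 bytes "missing"
aligned = is_whole_blocks(ACTUAL_BLOCK_SIZE)        # False
```

The 512 bytes that came back are not a whole block in the driver's view,
and any LBA computed for the remaining 3584 bytes uses the same wrong
divisor.
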

> 
> **** building btrfs-progs from its GitHub source is not a pleasant
>       experience, IMO

-- 
Damien Le Moal
Western Digital Research