Re: [v4 00/11] Add persistent durable identifier to storage log messages

Douglas Gilbert <dgilbert@xxxxxxxxxxxx> · Mon, 27 Jul 2020 15:17:58 -0400

On 2020-07-27 1:42 p.m., Tony Asleson wrote:
On 7/27/20 11:46 AM, Hannes Reinecke wrote:
On 7/27/20 5:45 PM, Tony Asleson wrote:
On 7/26/20 10:10 AM, Christoph Hellwig wrote:
FYI, I think these identifiers are absolutely horrible and have no
business in dmesg:

The identifiers are structured data, they're not visible unless you go
looking for them.

I'm open to other suggestions on how we can positively identify storage
devices over time, across reboots, replacement, and dynamic
reconfiguration.

My home system has 4 disks, 2 are identical except for serial number.
Even with this simple configuration, it's not trivial to identify which
message goes with which disk across reboots.

Well; the more important bits would be to identify the physical location
where these disks reside.
If one goes bad it doesn't really help you if you have a persistent
identification in the OS; what you really need is a physical indicator
or a physical location allowing you to identify which disk to pull.

In my use case I have no slot information.  I have no SCSI enclosure
services to toggle identification LEDs or fault LEDs for the drive sled.
For some users the device might be a virtual one in a storage server,
vpd helps.

In my case the in kernel vpd (WWN) data can be used to correlate with
the sticker on the disk as the disks have the WWN printed on them.  I
would think this is true for most disks/storage devices, but obviously I
can't make that statement with 100% certainty as I have a small sample size.

Which isn't addressed at all with this patchset (nor should it; the
physical location is typically found via other means).

And for the other use-cases: We do have persistent device links, do we
not?

How does /dev/disk/by-* help when you are looking at the journal from 1
or more reboots ago and the only thing you have in your journal is
something like:

blk_update_request: critical medium error, dev sde, sector 43578 op
0x0:(READ) flags 0x0 phys_seg 1 prio class 0

The links are only valid for right now.

Does:
   lsscsi -U
or
   lsscsi -UU

solve your problem, or come close?

Example:
# lsscsi -UU
[1:0:0:0]    disk    naa.5000cca02b38d0b8  /dev/sda
[1:0:1:0]    disk    naa.5000c5003011cb2b  /dev/sdb
[1:0:2:0]    enclosu naa.5001b4d516ecc03f  -
[N:0:1:1]    disk    eui.e8238fa6bf530001001b448b46bd5525    /dev/nvme0n1

The first two (SAS SSDs) NAAs are printed on the disk labels. I don't
think either that enclosure or the M2 NVMe SSD have their numbers
visible (i.e. the last two lines of output).

If it is what you want, then perhaps you could arrange for its output
to be sent to the log when the system has stabilized after a reboot. That
would only leave disk hotplug events exposed.
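
A minimal sketch of that idea, assuming lsscsi and logger (util-linux) are
installed. The function name format_id_map and the log tag disk-id-map are
made up for illustration. It keeps only lines that end in a /dev node and
writes one "dev=... id=..." record per device, so a later search of the
journal for "dev=/dev/sde" finds the identifier that name had on that boot:

```shell
#!/bin/sh
# Reduce 'lsscsi -UU' output (read from stdin) to one record per device.
# Lines without a /dev node (e.g. enclosures, shown as '-') are dropped,
# as are lines with fewer than four fields (no identifier column).
format_id_map() {
    awk 'NF >= 4 && $NF ~ /^\/dev\// { print "dev=" $NF " id=" $(NF-1) }'
}

# Only touch the system log when explicitly asked to.
if [ "${1:-}" = "--log" ]; then
    lsscsi -UU | format_id_map | logger -t disk-id-map
fi
```

Run it with --log once the system has settled after boot, by whatever
late-boot hook your distribution provides.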

Faced with the above medium error I would try:
   dd if=<all_possibles> bs=512 skip=43578 iflag=direct of=/dev/null count=1
and look for noise in the logs. Change 'bs=512' up to 4096 if that is
the logical block size. For <all_possibles> use /dev/sde (and /dev/sdf and
/dev/sdg or whatever), IOW the _whole_ disk device name.
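
The probe above can be sketched as a small loop over the candidate disks.
This is an illustration, not a tested tool: the helper names skip_for_bs
and probe_sector are hypothetical, and it assumes the sector number in the
kernel message is in 512-byte units, so dd's skip count (which is counted
in bs-sized blocks) is scaled down when bs is raised to 4096:

```shell
#!/bin/sh
# Usage: <script> SECTOR BS DEV [DEV...]   e.g.  43578 512 /dev/sde /dev/sdf

# Convert a logged sector (assumed 512-byte units) into a dd 'skip' count
# for the chosen block size.
skip_for_bs() {
    echo $(( $1 * 512 / $2 ))
}

# Read exactly one block at that position, bypassing the page cache.
probe_sector() {
    skip=$(skip_for_bs "$1" "$2")
    dd if="$3" of=/dev/null bs="$2" skip="$skip" count=1 iflag=direct
}

# Do nothing unless a sector, a block size and at least one device are given.
if [ "$#" -ge 3 ]; then
    sector=$1
    bs=$2
    shift 2
    for dev in "$@"; do
        echo "probing $dev" >&2
        probe_sector "$sector" "$bs" "$dev" || echo "read failed on $dev" >&2
    done
fi
```

A read error from dd on one device, together with a fresh medium error in
the log, points at the disk behind the old message.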

Doug Gilbert