On Sun, Dec 13, 2020 at 4:42 PM Eyal Lebedinsky <fedora@xxxxxxxxxxxxxx> wrote:
>
> I am not sure which list this should go to, so I am starting here.
>
> I run f32 fully updated
>     5.9.13-100.fc32.x86_64
> on relatively new hardware
>     kernel: DMI: Gigabyte Technology Co., Ltd. Z390 UD/Z390 UD, BIOS F8 05/24/2019
> boot/root/swap/data is on nvme
>     WD Blue SN550 1TB M.2 2280 NVMe SSD WDS100T2B0C

I can't tell from WD's website if there's any newer firmware
available. They seem to hide this information within the Windows-only
software "Western Digital Dashboard". If you have Windows already
installed, it's straightforward to install this and find out if the
firmware is up to date.

There is a boot parameter 'nvme_core.default_ps_max_latency_us' which
takes a value in usec, but I can't find a value specific to this
make/model NVMe. My gut instinct is that it's a hack put in by
upstream kernel developers to work around the lack of a proper
autodetect solution between the PCIe interface and the drive. I would
sooner return the drive and get one known to work. I can vouch for
Crucial, Seagate, and Samsung SSD and NVMe for the most part.

Oh, here's a bug report:
https://bugzilla.redhat.com/show_bug.cgi?id=1844905

That leads here:
https://bugzilla.kernel.org/show_bug.cgi?id=208123

Comment 1 is a more solid lead than comment 2, because the value in
comment 2 is based on what? A guess? Reading the rest of the thread,
it's still uncertain.
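FWIW, a couple of things can be checked from Linux without the
Dashboard. This is just a sketch, assuming nvme-cli and smartmontools
are installed and the drive shows up as /dev/nvme0; it only reports
the firmware revision currently on the drive and the current APST
latency cap, and shows one way to change that cap:

  # firmware revision currently flashed on the drive ("fr" field)
  sudo nvme id-ctrl /dev/nvme0 | grep -i '^fr '

  # smartctl reports the same thing on its "Firmware Version" line
  sudo smartctl -a /dev/nvme0 | grep -i firmware

  # current APST latency cap, in usec (nvme_core module parameter)
  cat /sys/module/nvme_core/parameters/default_ps_max_latency_us

  # one way to set it persistently on Fedora; 0 disables APST
  # entirely, which is only useful as a test, not a recommendation
  sudo grubby --update-kernel=ALL --args="nvme_core.default_ps_max_latency_us=0"

None of that tells you whether newer firmware exists, or what value,
if any, is right for this drive.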
> For the second time this disk stopped working (first was about two months ago).
> It seems that the disk failed hard and could not be reset, the machine was powered off/on.
> I think (not sure) that last time I just hit the reset button but it did not boot.
>
> The machine was booted (after dnf update) around 8pm, and crashed at 4am.
>
> Following the earlier crash a serial console was set up which is how I can see the failure messages.
>
> == nvme related messages
> [    7.488638] nvme nvme0: pci function 0000:06:00.0
> [    7.536593] nvme nvme0: allocated 32 MiB host memory buffer.
> [    7.541819] nvme nvme0: 8/0/0 default/read/poll queues
> [    7.558122] nvme0n1: p1 p2 p3 p4
> [   19.590010] EXT4-fs (nvme0n1p3): mounted filesystem with ordered data mode. Opts: (null)
> [   20.653500] Adding 16777212k swap on /dev/nvme0n1p2. Priority:-2 extents:1 across:16777212k SSFS
> [   20.820539] EXT4-fs (nvme0n1p3): re-mounted. Opts: (null)
> [   23.137206] EXT4-fs (nvme0n1p1): mounted filesystem with ordered data mode. Opts: (null)
> [   23.210717] EXT4-fs (nvme0n1p4): mounted filesystem with ordered data mode. Opts: (null)
> ## nothing unusual for 8 hours, then
> [28972.459036] nvme nvme0: I/O 840 QID 6 timeout, aborting
> [28972.464757] nvme nvme0: I/O 565 QID 7 timeout, aborting
> [28972.470277] nvme nvme0: I/O 566 QID 7 timeout, aborting
> [28973.291025] nvme nvme0: I/O 989 QID 1 timeout, aborting
> [28978.603061] nvme nvme0: I/O 990 QID 1 timeout, aborting
> [29002.667243] nvme nvme0: I/O 840 QID 6 timeout, reset controller
> [29032.875421] nvme nvme0: I/O 24 QID 0 timeout, reset controller
> [29074.097644] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
> [29074.110354] nvme nvme0: Abort status: 0x371
> [29074.114953] nvme nvme0: Abort status: 0x371
> [29074.119523] nvme nvme0: Abort status: 0x371
> [29074.124114] nvme nvme0: Abort status: 0x371
> [29074.128710] nvme nvme0: Abort status: 0x371
> [29096.645478] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
> [29096.652210] nvme nvme0: Removing after probe failure status: -19
> [29119.165921] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
> ## many I/O errors on nvme0 (p2/p3/p4) repeating until a reboot at 8:30am
> ## one different message, appearing just once:
> [29123.800844] nvme nvme0: failed to set APST feature (-19)

I'd take the position that it's defective, and permit the manufacturer
a short leash to convince me otherwise via a tech support call or
email. But I really wouldn't just wait around for another two months
not knowing if it's going to fail again. I'd like some kind of answer
for this problem from support folks, and if they can't give support,
get rid of it.

The time frame for a repeat of the problem is why I'm taking a
slightly different view than the tinker-with-firmware view earlier.
It's not horrible to update firmware and give it a go if this problem
happens once a week or more often. But every two months? Forget it.
Make it their problem. And seriously, I give them one chance. If they
b.s. me and it flakes out again in a month or two, no more chances.

So the quandary is: what's your return policy window? If it's about to
end, just return it now. It should just work out of the box. WDC does
contribute to the kernel; whether this is a product supported on Linux
I don't know.

--
Chris Murphy
_______________________________________________
users mailing list -- users@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to users-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/users@xxxxxxxxxxxxxxxxxxxxxxx