On Sun, Dec 13, 2020 at 4:42 PM Eyal Lebedinsky <fedora@xxxxxxxxxxxxxx> wrote:
>
> I am not sure which list this should go to, so I am starting here.
>
> I run f32 fully updated
>     5.9.13-100.fc32.x86_64
> on relatively new hardware
>     kernel: DMI: Gigabyte Technology Co., Ltd. Z390 UD/Z390 UD, BIOS F8 05/24/2019
> boot/root/swap/data is on nvme
>     WD Blue SN550 1TB M.2 2280 NVMe SSD WDS100T2B0C

I can't tell from WD's website if there's any newer firmware
available. They seem to hide this information within the Windows-only
software "Western Digital Dashboard". If you have Windows already
installed, it's straightforward to install this and find out if the
firmware is up to date.

There is a boot parameter 'nvme_core.default_ps_max_latency_us' which
takes a value in usec, but I can't find a value specific to this
make/model NVMe. My gut instinct is that it's a hack put in by
upstream kernel developers to work around the lack of a proper
autodetect solution between the PCIe interface and the drive. I would
sooner return the drive and get one known to work. I can vouch for
Crucial, Seagate, and Samsung SSD and NVMe for the most part.

Oh, here's a bug report:
https://bugzilla.redhat.com/show_bug.cgi?id=1844905

That leads here:
https://bugzilla.kernel.org/show_bug.cgi?id=208123

Comment 1 is a more solid lead than comment 2, because the value in
comment 2 is based on what? A guess? Reading the rest of the thread,
it's still uncertain.
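FWIW, a couple of things can be checked from Linux without the
Dashboard. This is just a sketch, assuming nvme-cli and smartmontools
are installed and the drive shows up as /dev/nvme0; it only reports
the firmware revision currently on the drive and the current APST
latency cap, and shows one way to change that cap:

  # firmware revision currently flashed on the drive ("fr" field)
  sudo nvme id-ctrl /dev/nvme0 | grep -i '^fr '

  # smartctl reports the same thing on its "Firmware Version" line
  sudo smartctl -a /dev/nvme0 | grep -i firmware

  # current APST latency cap, in usec (nvme_core module parameter)
  cat /sys/module/nvme_core/parameters/default_ps_max_latency_us

  # one way to set it persistently on Fedora; 0 disables APST
  # entirely, which is only useful as a test, not a recommendation
  sudo grubby --update-kernel=ALL --args="nvme_core.default_ps_max_latency_us=0"

None of that tells you whether newer firmware exists, or what value,
if any, is right for this drive.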
> For the second time this disk stopped working (first was about two months ago).
> It seems that the disk failed hard and could not be reset, the machine was powered off/on.
> I think (not sure) that last time I just hit the reset button but it did not boot.
>
> The machine was booted (after dnf update) around 8pm, and crashed at 4am.
>
> Following the earlier crash a serial console was set up which is how I can see the failure messages.
>
> == nvme related messages
> [    7.488638] nvme nvme0: pci function 0000:06:00.0
> [    7.536593] nvme nvme0: allocated 32 MiB host memory buffer.
> [    7.541819] nvme nvme0: 8/0/0 default/read/poll queues
> [    7.558122] nvme0n1: p1 p2 p3 p4
> [   19.590010] EXT4-fs (nvme0n1p3): mounted filesystem with ordered data mode. Opts: (null)
> [   20.653500] Adding 16777212k swap on /dev/nvme0n1p2. Priority:-2 extents:1 across:16777212k SSFS
> [   20.820539] EXT4-fs (nvme0n1p3): re-mounted. Opts: (null)
> [   23.137206] EXT4-fs (nvme0n1p1): mounted filesystem with ordered data mode. Opts: (null)
> [   23.210717] EXT4-fs (nvme0n1p4): mounted filesystem with ordered data mode. Opts: (null)
> ## nothing unusual for 8 hours, then
> [28972.459036] nvme nvme0: I/O 840 QID 6 timeout, aborting
> [28972.464757] nvme nvme0: I/O 565 QID 7 timeout, aborting
> [28972.470277] nvme nvme0: I/O 566 QID 7 timeout, aborting
> [28973.291025] nvme nvme0: I/O 989 QID 1 timeout, aborting
> [28978.603061] nvme nvme0: I/O 990 QID 1 timeout, aborting
> [29002.667243] nvme nvme0: I/O 840 QID 6 timeout, reset controller
> [29032.875421] nvme nvme0: I/O 24 QID 0 timeout, reset controller
> [29074.097644] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
> [29074.110354] nvme nvme0: Abort status: 0x371
> [29074.114953] nvme nvme0: Abort status: 0x371
> [29074.119523] nvme nvme0: Abort status: 0x371
> [29074.124114] nvme nvme0: Abort status: 0x371
> [29074.128710] nvme nvme0: Abort status: 0x371
> [29096.645478] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
> [29096.652210] nvme nvme0: Removing after probe failure status: -19
> [29119.165921] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
> ## many I/O errors on nvme0 (p2/p3/p4) repeating until a reboot at 8:30am
> ## one different message, appearing just once:
> [29123.800844] nvme nvme0: failed to set APST feature (-19)

I'd take the position that it's defective, and permit the manufacturer
a short leash to convince me otherwise via a tech support call or
email. But I really wouldn't just wait around for another two months
not knowing if it's going to fail again. I'd like some kind of answer
for this problem from support folks, and if they can't give support,
get rid of it.

The time frame for a repeat of the problem is why I'm taking a
slightly different view than the tinker-with-firmware view earlier.
It's not horrible to update firmware and give it a go if this problem
happens once a week or more often. But every two months? Forget it.
Make it their problem. And seriously, I give them one chance. If they
b.s. me and it flakes out again in a month or two, no more chances.

So the quandary is: what's your return policy window? If it's about to
end, just return it now. It should just work out of the box. WDC does
contribute to the kernel; whether this is a product supported on Linux
I don't know.

--
Chris Murphy
_______________________________________________
users mailing list -- users@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to users-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/users@xxxxxxxxxxxxxxxxxxxxxxx