I replaced the 2 Crucial BX drives I kept losing with an older, smaller MX drive, and that one has been working just fine for a couple of months. Thinking about my issue and Neal's issue, here is what springs to mind.

In my case, if it was a power supply issue, it would have to be that something about the new SSDs is excessively sensitive to power or ground loops. The thought that a power supply or SATA problem was burning out the device did occur to me. The failure I have is heavily reported in the 1-star reviews for the Crucial device, with several people having more than 1 failure and returning the device for a refund. The people who have the failure seem to be able to repeat it, and I assume the drives work just fine for others.

So it would seem that some component used in recent SSDs may be super sensitive to something on either the power supply side or the SATA port side, or the design has an internal grounding issue and is sensitive to ground loops in a way that the older devices are not (I have 2 older SSDs and 8 hard drives that have been running in said machine for months to years just fine).

I would think that an NVMe device would be well grounded to the motherboard/case. In my case the SSDs were in a plastic drive holder, so the only ground would have been via the SATA connection and the power supply. If the drive design has components that expect a screw-hole ground, which won't exist in some cases, they could see floating voltages, and that might damage something.

How was your NVMe drive mounted in your case? On mine the normal screw holes were not connected to ground (plastic drive case), so the "chassis" of the drive would not have been externally grounded. If the drive's chassis did not have a direct connection to power or SATA ground, that could end up with floating voltages on the drive chassis and any components tied to it internally. And ground loops are tricky.
I have a wind meter on my roof hooked to a device that counts its rotations, and that serial port device would randomly stop working, requiring a reset of the USB-to-serial communication to get it to function again (I had a cron job to reload/reset the USB nightly because it was happening often enough). I guessed it was a ground loop, so years ago I ran a ground wire to house ground and grounded the hardware device doing the counting, and that solved the issue.

On Tue, Feb 22, 2022 at 9:47 AM George N. White III <gnwiii@xxxxxxxxx> wrote:
>
> On Tue, 22 Feb 2022 at 10:04, Neal Becker <ndbecker2@xxxxxxxxx> wrote:
>>
>> Thanks Richard. Yes, I talked with Titan; they suggested trying the pcie-m.2 adapter. I will try them again.
>> I have not checked for BIOS updates. Not sure how to go about that (last time I did that it required an MS-DOS floppy disc).
>>
>> Haven't tried the SSDs in another device because I don't have one. But the fact that replacing the SSD causes it to work, where it wasn't working before, tells me they were damaged. I have at least once powered the workstation off and on, and the BIOS did not find any SSD to boot from. So a power cycle didn't fix it, but replacing the SSD did.
>>
>> I will try Titan again later today, but just looking for ideas.
>
>
> With this history, I'd probably replace the workstation power supply. I would also scan
> the system board for capacitors with bulging tops or overheated components.
>
> Are there any externally powered devices connected to the workstation (other than the monitor)?
>
> Are you in an area with frequent lightning storms? How stable is your power? Is the system
> connected to a UPS?
>
> I had a similar experience with spinning disks in a system that contained a drive-bay radio receiver
> and was connected to a satellite dish and GPS receiver on the roof, and an antenna controller. Everything
> was powered by a high quality UPS.
> I added a heavy wire connecting the antenna controller case to the
> workstation case and the failures stopped.
>
> I gather you now have space for two M.2 SSDs. If you haven't discarded the non-working devices,
> it would be interesting to see if any are detected and what smartmontools says about them, but
> you also have the option to put /var on a separate drive. Smartmontools can monitor a drive and
> report any problems it detects, but you may also want to run self-tests periodically.
>
>
>>
>>
>> Thanks,
>> Neal
>>
>> On Tue, Feb 22, 2022 at 8:44 AM Richard Shaw <hobbes1069@xxxxxxxxx> wrote:
>>>
>>> On Tue, Feb 22, 2022 at 7:34 AM Neal Becker <ndbecker2@xxxxxxxxx> wrote:
>>>>
>>>> I know this is a bit OT, but you guys are great at answering all questions.
>>>>
>>>> I bought a workstation from Titan computers around 1/2020 (dual EPYC cpu). After about 1 year it stopped working. I could ssh to it, and almost any command would return Input/Output error. Unfortunately journalctl gave input/output error so I can't see logs. cat /proc/partitions did not show any nvme device (the root device) on which the OS was installed.
>>>>
>>>> I replaced the SSD with a Samsung 980 Pro. Reinstalled Fedora. It then worked a few weeks, then the exact same symptoms.
>>>>
>>>> I replaced the SSD with another Samsung 980 Pro, this time with heatsink. Reinstalled Fedora. It worked a few weeks. Then same symptoms.
>>>>
>>>> Then I replaced with a 4th Samsung 980 Pro, but this time instead of using the M.2 socket I used a pcie-m.2 adapter (in case something was wrong with the M.2 socket). Also added a surge protector outlet for good measure. Reinstalled. Watched smartctl. No errors. Temperature was always low.
>>>>
>>>> Now it's failed again, exactly same symptoms.
>>>>
>>>> Any ideas?
>>>
>>>
>>> I remember your other email about a month or so ago and thought it was really strange. Have you tried the drives in another system to confirm they're truly dead?
>>>
>>> I would check for BIOS updates just for good measure. Other than that, have you had any communication with Titan about it?
>>>
>>> Thanks,
>>> Richard
>>> _______________________________________________
>>> users mailing list -- users@xxxxxxxxxxxxxxxxxxxxxxx
>>> To unsubscribe send an email to users-leave@xxxxxxxxxxxxxxxxxxxxxxx
>>> Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
>>> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
>>> List Archives: https://lists.fedoraproject.org/archives/list/users@xxxxxxxxxxxxxxxxxxxxxxx
>>> Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure
>>
>>
>>
>> --
>> Those who don't understand recursion are doomed to repeat it
>
>
>
> --
> George N. White III