Re: ssd keeps dying (OT)

I replaced the 2 Crucial BX drives I kept losing with an older, smaller
MX drive, and that one has been working just fine for a couple of
months. Thinking about my issue and Neal's issue, here is what springs
to mind.

So in my case, if mine was a power supply issue, it would have to be
that something about the new SSDs is excessively sensitive to power or
ground loops.  The thought that a power supply/SATA issue was burning
out the device did occur to me.  The failure I have is heavily
reported in the 1-star reviews for the Crucial device, with several
people having more than 1 failure and returning the device for a
refund.  The people who have the failure seem to be able to repeat
it, and I assume the drive works just fine for others.  So it would
seem that some component used in recent SSDs may be super sensitive
to something on the power supply or SATA port side, or the design has
an internal grounding issue and is sensitive to ground loops, in a way
that does not cause an issue with the older devices (I have 2 older
SSDs and 8 hard drives that have been running in said machine for
months to years just fine).  I would think an NVMe device would be
well grounded to the motherboard/case.  In my case my SSDs were in a
plastic drive holder, so the only ground would have been via the SATA
connection and the power supply.  If the drive design had components
expecting a screw-hole ground, which won't exist in some cases, those
components could see floating voltages, and that might damage
something.

How was your NVMe drive mounted in your case?  On mine the normal
screw holes were not connected to ground (plastic drive case), so the
"chassis" of the drive would not have been externally grounded.  If
the drive chassis did not have a direct connection to power or SATA
ground, that could leave floating voltages on the drive chassis and
on any components tied to it internally.

And ground loops are tricky.  I have a wind meter on my roof hooked to
a device that counts its rotations, and that serial port device would
randomly stop working, requiring a reset of the usb-to-serial
communication to get it to function again (I had a cron job to
reload/reset the usb nightly because it was happening often enough).
I guessed it was a ground loop, so years ago I ran a ground wire to
house ground and grounded the hw device doing the counting, and that
solved the issue.

On Tue, Feb 22, 2022 at 9:47 AM George N. White III <gnwiii@xxxxxxxxx> wrote:
>
> On Tue, 22 Feb 2022 at 10:04, Neal Becker <ndbecker2@xxxxxxxxx> wrote:
>>
>> Thanks Richard.  Yes, I talked with Titan; they suggested trying the pcie-m.2 adapter.  I will try them again.
>> I have not checked for bios updates.  Not sure how to go about that (last time I did that it required an msdos floppy disc).
>>
>> Haven't tried the SSDs in another device because I don't have one.  But the fact that replacing the SSD causes it to work, where it wasn't working before, tells me they were damaged.  I have at least once powered the workstation off and on, and the bios did not find any ssd to boot from.  So power cycle didn't fix it, but replace ssd did fix it.
>>
>> I will try Titan again later today, but just looking for ideas.
>
>
> With this history, I'd probably replace the workstation power supply.   I would also scan
> the system board for capacitors with bulging tops or overheated components.
>
> Are there any externally powered devices connected to the workstation (other than the monitor)?
>
> Are you in an area with frequent lightning storms?  How stable is your power?  Is the system
> connected to a UPS?
>
> I had a similar experience with spinning disks in a system that contained a drive-bay radio receiver
> and was connected to a satellite dish and GPS receiver on the roof, and an antenna controller.  Everything
> was powered by a high quality UPS.  I added a heavy wire connecting the antenna controller case to the
> workstation case and the failures stopped.
>
> I gather you now have space for two m.2 SSD's.   If you haven't discarded the non-working devices,
> it would be interesting to see if any are detected and what smartmontools says about them, but
> you also have the option to put /var on a separate drive.  Smartmon tools can monitor a drive and
> report any problems it detects, but you may also want to run self-tests periodically.
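
As a rough sketch of the periodic check suggested above: smartctl (from
smartmontools 7.0 and later) can emit its report as JSON with --json, which a
small script can scan for warning signs. The key names below are the ones I
understand smartctl 7.x to use for NVMe drives and should be checked against
real output; the device path /dev/nvme0 and the 70 C threshold are
assumptions for illustration only.

```python
import json
import subprocess


def read_smart_report(device: str) -> dict:
    """Run `smartctl --json -a DEVICE` and return the parsed JSON report."""
    # smartctl's exit status is a bit mask, so a non-zero code does not
    # mean the JSON is missing; deliberately no check=True here.
    result = subprocess.run(
        ["smartctl", "--json", "-a", device],
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def health_problems(report: dict, max_temp_c: int = 70) -> list[str]:
    """Return human-readable warnings found in a parsed smartctl report."""
    problems = []

    # A missing smart_status (e.g. drive not detected) is treated as a failure.
    if not report.get("smart_status", {}).get("passed", False):
        problems.append("overall SMART health check did not pass")

    # NVMe health log; key names assumed from smartctl 7.x JSON output.
    log = report.get("nvme_smart_health_information_log", {})
    if log.get("critical_warning", 0) != 0:
        problems.append(f"critical_warning = {log['critical_warning']}")
    if log.get("media_errors", 0) > 0:
        problems.append(f"media_errors = {log['media_errors']}")
    if log.get("temperature", 0) > max_temp_c:  # assumed threshold, in Celsius
        problems.append(f"temperature = {log['temperature']} C")

    return problems


if __name__ == "__main__":
    # Hypothetical device path; usually needs root.
    for line in health_problems(read_smart_report("/dev/nvme0")):
        print("WARNING:", line)
```

Run from cron, this would only report what the drive itself admits to. A
drive that drops off the bus entirely shows up here as a failed health
check, not as a specific error.
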
>
>
>>
>>
>> Thanks,
>> Neal
>>
>> On Tue, Feb 22, 2022 at 8:44 AM Richard Shaw <hobbes1069@xxxxxxxxx> wrote:
>>>
>>> On Tue, Feb 22, 2022 at 7:34 AM Neal Becker <ndbecker2@xxxxxxxxx> wrote:
>>>>
>>>> I know this is a bit OT, but you guys are great at answering all questions.
>>>>
>>>> I bought a workstation from Titan computers around 1/2020 (dual EPYC cpu).  After about 1 year it stopped working.  I could ssh to it, and almost any command would return Input/Output error.  Unfortunately journalctl gave input/output error so I can't see logs.  cat /proc/partitions did not show any nvme device (the root device) on which the OS was installed.
>>>>
>>>> I replaced the SSD with a samsung 980 pro.  Reinstalled fedora.  It then worked a few weeks, then the exact same symptoms.
>>>>
>>>> I replaced the SSD with another samsung 980 pro, this time with heatsink.  Reinstalled fedora.  It worked a few weeks.  Then same symptoms.
>>>>
>>>> Then I replaced with a 4th samsung 980 pro, but this time instead of using the M.2 socket I used a pcie-m.2 adapter (in case something was wrong with the m.2 socket).  Also added a surge protector outlet for good measure. Reinstalled.  Watched the smartctl.  No errors.  Temperature was always low.
>>>>
>>>> Now it's failed again, exactly same symptoms.
>>>>
>>>> Any ideas?
>>>
>>>
>>> I remember your other email about a month or so ago and thought it was really strange. Have you tried the drives in another system to confirm they're truly dead?
>>>
>>> I would check for BIOS updates just for good measure. Other than that, have you had any communication with Titan about it?
>>>
>>> Thanks,
>>> Richard
>>> _______________________________________________
>>> users mailing list -- users@xxxxxxxxxxxxxxxxxxxxxxx
>>> To unsubscribe send an email to users-leave@xxxxxxxxxxxxxxxxxxxxxxx
>>> Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
>>> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
>>> List Archives: https://lists.fedoraproject.org/archives/list/users@xxxxxxxxxxxxxxxxxxxxxxx
>>> Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure
>>
>>
>>
>> --
>> Those who don't understand recursion are doomed to repeat it
>
>
>
> --
> George N. White III
>


