I've had a workstation (dual AMD Rome) for about 2 years. The M.2 SSD died after about 1 year. I replaced it with a Samsung 980 Pro, which then lasted almost 1 more year. Then I replaced it with a 1TB Samsung 980 Pro, this time with a heat sink. This lasted a few weeks. I had been looking at the NVMe SMART data and saw no problems; the temperature was fine.
Now it's dead again. I can ssh to the machine. I can cat /proc/partitions and no NVMe device is shown. I can only issue a couple of commands (I guess whatever is built into bash?), but almost all just give an I/O error. No sudo or journalctl.
The machine is only used for compute, not heavy I/O so not caused by ssd wear (and smartctl showed no wear at all).
Any ideas?
Thanks, Neal
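For anyone wanting to watch the same SMART data between failures, here is a minimal sketch, not a definitive tool. It assumes nvme-cli is installed and that `nvme smart-log` prints `name : value` lines with fields such as `percentage_used` and `media_errors`; the exact layout varies between nvme-cli versions. The helper names `smart_field` and `smart_summary` and the device path `/dev/nvme0` are made up for illustration.

```bash
#!/usr/bin/env bash
# Sketch only: parses "name : value" text as printed by `nvme smart-log`.
# The field layout is an assumption and differs between nvme-cli versions.

# Print the leading integer of one field read from stdin.
smart_field() {   # $1 = field name, e.g. percentage_used
    local key value
    while IFS=: read -r key value; do
        key=${key//[[:space:]]/}                    # drop padding around the name
        if [[ $key == "$1" ]]; then
            value=${value#"${value%%[![:space:]]*}"}  # trim leading whitespace
            printf '%s\n' "${value%%[!0-9]*}"         # keep only the leading digits
            return 0
        fi
    done
    return 1                                        # field not present
}

# One-line summary of wear and media errors from smart-log text on stdin.
smart_summary() {
    local log used errs
    log=$(cat)
    used=$(smart_field percentage_used <<<"$log")
    errs=$(smart_field media_errors <<<"$log")
    echo "wear=${used}% media_errors=${errs}"
}
```

On a healthy system this could be run as `sudo nvme smart-log /dev/nvme0 | smart_summary`; `sudo smartctl -a /dev/nvme0` reports similar information. A drive that has dropped off the bus entirely, as described here, will not answer either command.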
On 1/16/22 11:44 AM, Neal Becker wrote:
Now it's dead again. I can ssh to the machine. I can cat /proc/partitions and no NVMe device is shown. I can only issue a couple of commands (I guess whatever is built into bash?), but almost all just give an I/O error. No sudo or journalctl.
I presume that you're logged in as yourself, so you might try this:
source .bashrc
Source is a bash builtin so will work and this should get back your normal CLI environment. It shouldn't really be needed, but might help.
The machine is only used for compute, not heavy I/O so not caused by ssd wear (and smartctl showed no wear at all).
I'm no hardware geek, but it sounds like the connections to the drive are bad.
It just occurred to me that /home might be on that drive. If so, source won't do any good and that will explain why you can't get much done through ssh.
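Since only bash builtins are likely to keep working once the root device has vanished, here is a rough sketch of how one could inspect /proc/partitions without any external binary. The function names `bcat`, `list_partitions` and `has_nvme` are made up for illustration, and the sketch assumes the usual four-column layout of /proc/partitions (major, minor, #blocks, name).

```bash
#!/usr/bin/env bash
# Sketch only: uses bash builtins (read, printf, [[ ]]) so nothing
# has to be loaded from the failed disk.

# Print a file without needing the external `cat` binary.
bcat() {
    local line
    while IFS= read -r line || [[ -n $line ]]; do
        printf '%s\n' "$line"
    done < "$1"
}

# Print the device names from a /proc/partitions-style file.
# Header and blank lines are skipped because their first column is not numeric.
list_partitions() {
    local major minor blocks name
    while read -r major minor blocks name; do
        if [[ $major =~ ^[0-9]+$ ]]; then
            printf '%s\n' "$name"
        fi
    done < "${1:-/proc/partitions}"
}

# Succeed if the kernel still lists at least one NVMe device.
has_nvme() {
    local name
    while read -r name; do
        if [[ $name == nvme* ]]; then
            return 0
        fi
    done < <(list_partitions "${1:-/proc/partitions}")
    return 1
}
```

On the broken machine, `has_nvme || echo "no NVMe device visible"` would confirm what cat already showed. To see whether a given command is a builtin, `type -t source` prints `builtin`, while `type -t cat` prints `file`.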
Well I guess I can try reseating it, good idea. Unfortunately this server is remote from me, so might as well collect ideas before driving over.
On Sun, Jan 16, 2022, 3:00 PM Joe Zeff joe@zeff.us wrote:
On 1/16/22 11:44 AM, Neal Becker wrote:
Now it's dead again. I can ssh to the machine. I can cat /proc/partitions and no NVMe device is shown. I can only issue a couple of commands (I guess whatever is built into bash?), but almost all just give an I/O error. No sudo or journalctl.
I presume that you're logged in as yourself, so you might try this:
source .bashrc
Source is a bash builtin so will work and this should get back your normal CLI environment. It shouldn't really be needed, but might help.
The machine is only used for compute, not heavy I/O so not caused by ssd wear (and smartctl showed no wear at all).
I'm no hardware geek, but it sounds like the connections to the drive are bad.
It just occurred to me that /home might be on that drive. If so, source won't do any good and that will explain why you can't get much done through ssh.
On Sun, 16 Jan 2022 at 16:32, Neal Becker ndbecker2@gmail.com wrote:
Well I guess I can try reseating it, good idea. Unfortunately this server is remote from me, so might as well collect ideas before driving over.
Have you seen "Fix your dead SSD with the power cycle method" on The Silicon Underground? https://dfarq.homeip.net/fix-dead-ssd/
I wonder if some workloads designed for rotating disks end up rewriting the same storage location, resulting in early death of SSDs.
I would check for reports of similar issues from other users with the same hardware. Also consider whether the system is in a harsh environment: high humidity, frequent thunderstorms, extreme temperatures. CPU cooling can be an issue with dual AMD CPUs in cases intended for CPUs with lower heat output. Overheating of CPUs and chipsets can cause all sorts of failures.
Contact enhancer (Contact Enhancer Fluid ECH CE1, NAPA Auto Parts: https://www.napaonline.com/en/p/ECHCE1) can help with harsh environments or marginal quality connectors. At work we bought a bunch of systems that had low-quality drive cables that would fall off when a system was moved due to low spring tension. It is possible some cheap M.2 sockets could have similar issues.
--
George N. White III
On Sun, 2022-01-16 at 13:44 -0500, Neal Becker wrote:
I've had a workstation (dual AMD Rome) for about 2 years. The M.2 SSD died after about 1 year. I replaced it with a Samsung 980 Pro, which then lasted almost 1 more year. Then I replaced it with a 1TB Samsung 980 Pro, this time with a heat sink. This lasted a few weeks. I had been looking at the NVMe SMART data and saw no problems; the temperature was fine.
With a plethora of killed hardware, I would be inclined to suspect a power supply fault. I'd swap the power supply, presuming yours has a fault that the drives tolerate less well than the motherboard does. Power supplies do fail, sometimes all by themselves, sometimes due to external surges (even if you go through a UPS, that's still not a guarantee that the UPS protects your PSU; it would depend on how your UPS works).
On the latter note, surge protectors cannot be relied upon to protect equipment from being killed by surges. They may once or twice, but there are transient surges all day long, every day. So, they have a hard job to do. I put it this way: rather than protecting your device from being killed by a surge, they're more about protecting your house from being burnt down by a surge that could otherwise set fire to a device. It'd be great if they could do both, but I'll settle for the latter as a bare minimum.
NB: My background is in electronics engineering and servicing.
On Sun, Jan 16, 2022 at 6:11 PM George N. White III gnwiii@gmail.com wrote:
On Sun, 16 Jan 2022 at 16:32, Neal Becker ndbecker2@gmail.com wrote:
Well I guess I can try reseating it, good idea. Unfortunately this server is remote from me, so might as well collect ideas before driving over.
Have you seen: Fix your dead SSD with the power cycle method - The Silicon Underground (dfarq.homeip.net
I wonder if some workloads designed for rotating disks end up rewriting the same storage location, resulting in early death of SSDs.
The SSD firmware plays games to make wear leveling work. Because of that, it is very unlikely that two writes to the same "block" at the filesystem level are going to land on the same physical block on the SSD.
Make sure the SSD has some cache memory. I bought (and have since returned) a Crucial SSD that did not have RAM/cache. I killed 2 of them (the original and its replacement) in under 2 weeks before I returned it. I killed them such that they would no longer even answer on the SATA bus. However the firmware for these works, they seem to be a lot less reliable. And there are a lot of similar reviews saying that under some set of conditions these devices are unreliable. Whatever algorithm choices a few-dollars-cheaper drive with no cache RAM forces seem to be a problem.
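To illustrate the remapping idea, here is a toy sketch in bash. It is not how any real SSD firmware works (real flash translation layers also handle garbage collection, erase blocks and static data); it only assumes a simple round-robin policy where every write goes to the next physical block. All names (`PHYS_BLOCKS`, `l2p`, `write_count`, `ftl_init`, `ftl_write`) are made up for the example.

```bash
#!/usr/bin/env bash
# Toy model only: a round-robin logical-to-physical block map.
PHYS_BLOCKS=8
declare -a l2p=()          # logical block -> physical block holding its latest data
declare -a write_count=()  # number of writes each physical block has absorbed
next_free=0

# Reset the map and zero all per-block counters.
ftl_init() {
    l2p=()
    write_count=()
    next_free=0
    local i
    for ((i = 0; i < PHYS_BLOCKS; i++)); do
        write_count[i]=0
    done
}

# Write logical block $1: place it on the next physical block and remap.
ftl_write() {
    l2p[$1]=$next_free
    write_count[next_free]=$(( write_count[next_free] + 1 ))
    next_free=$(( (next_free + 1) % PHYS_BLOCKS ))
}

# Rewrite the same logical block 16 times, as a disk-oriented workload might.
ftl_init
for ((n = 0; n < 16; n++)); do
    ftl_write 0
done
echo "logical block 0 now lives at physical block ${l2p[0]}"
echo "writes per physical block: ${write_count[*]}"
```

In this toy model the 16 rewrites of one logical block are spread evenly, two per physical block, instead of hitting a single location 16 times.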
I have also been known to carefully and lightly use a pencil eraser on the contacts to clean off anything that could cause it to not have a good contact, and then wipe it with alcohol.