I know this is a bit OT, but you guys are great at answering all questions.
I bought a workstation from Titan Computers around 1/2020 (dual EPYC CPU). After about 1 year it stopped working. I could ssh to it, but almost any command would return Input/Output error. Unfortunately journalctl also gave Input/Output error, so I can't see the logs. cat /proc/partitions did not show any NVMe device (the root device, on which the OS was installed).
I replaced the SSD with a Samsung 980 Pro. Reinstalled Fedora. It then worked a few weeks, then the exact same symptoms.
I replaced the SSD with another Samsung 980 Pro, this time with a heatsink. Reinstalled Fedora. It worked a few weeks. Then the same symptoms.
Then I replaced it with a 4th Samsung 980 Pro, but this time instead of using the M.2 socket I used a PCIe-to-M.2 adapter (in case something was wrong with the M.2 socket). Also added a surge protector outlet for good measure. Reinstalled. Watched smartctl. No errors. Temperature was always low.
Now it's failed again, exactly same symptoms.
Any ideas?
Thanks, Neal
I remember your other email about a month or so ago and thought it was really strange. Have you tried the drives in another system to confirm they're truly dead?
I would check for BIOS updates just for good measure. Other than that, have you had any communication with Titan about it?
Thanks, Richard
Thanks Richard. Yes, I talked with Titan; they suggested trying the PCIe-to-M.2 adapter. I will try them again. I have not checked for BIOS updates. Not sure how to go about that (last time I did that it required an MS-DOS floppy disk).
Haven't tried the SSDs in another device because I don't have one. But the fact that replacing the SSD makes the machine work, where it wasn't working before, tells me they were damaged. I have power-cycled the workstation at least once, and the BIOS did not find any SSD to boot from. So a power cycle didn't fix it, but replacing the SSD did.
I will try Titan again later today, but just looking for ideas.
Thanks, Neal
_______________________________________________
users mailing list -- users@lists.fedoraproject.org
To unsubscribe send an email to users-leave@lists.fedoraproject.org
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/users@lists.fedoraproject.org
Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure
By dead you mean it just quits answering on the bus at all?
I had a recent issue with a Crucial 2TB SSD. The first one failed in under 10 days. I got a replacement, and the second one did pretty much the same thing at about the same time, so I returned it for a refund.
It makes me think that whatever is going on with them is related in some way to the SSD controller, and maybe both have the same controller. I have previously (5+ years ago) had an SSD firmware bug cause consistent failures at a given power-on-hours value, and I also know there have been defects of that same sort in other brands, which "die" at some POH value.
I would check to see if there are any firmware updates for the drive. Some of the POH failures leave the drive permanently dead, and some stay up long enough (after hitting the magic POH number) to get the firmware updated (the self-test after xxx POH would fail, but it did not run until the drive had been powered on for an hour after being reset).
There seem to be a lot of ways to screw up SSD firmware such that the devices die.
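To compare drives on firmware version and power-on hours, as suggested above, something like the following Python sketch could pull both values out of saved `smartctl -a` text. The sample text and the function name `parse_smart_fields` are made up for illustration; the field labels follow the layout smartctl generally prints for NVMe drives, but treat that as an assumption and check it against your own output.

```python
# Illustrative text in the general layout of `smartctl -a` for an NVMe drive.
# The values are invented for this example; this is not real output.
SAMPLE = """\
Model Number:                       Example NVMe SSD 1TB
Firmware Version:                   EXAMPLE01
Temperature:                        36 Celsius
Power On Hours:                     1,234
"""


def parse_smart_fields(text):
    """Return the firmware version and power-on hours found in the text."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "label: value" line
        fields[key.strip()] = value.strip()

    hours = fields.get("Power On Hours")
    return {
        "firmware": fields.get("Firmware Version"),
        # Digits may be grouped with commas (e.g. "1,234"), so strip them.
        "power_on_hours": int(hours.replace(",", "")) if hours else None,
    }
```

Calling `parse_smart_fields(SAMPLE)` gives the firmware string and the hours as an integer. Recording these for each drive while it still works would show whether the failures cluster around a similar POH value.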
On Tue, 22 Feb 2022 at 10:04, Neal Becker ndbecker2@gmail.com wrote:
Thanks Richard. Yes, I talked with Titan; they suggested trying the pcie-m.2 adapter. I will try them again. I have not checked for bios updates. Not sure how to go about that (last time I did that it required an msdos floppy disc).
Haven't tried the SSDs in another device because I don't have one. But the fact that replacing the SSD causes it to work, where it wasn't working before, tells me they were damaged. I have at least once power off/on the workstation, and the bios did not find any ssd to boot from. So power cycle didn't fix it, but replace ssd did fix it.
I will try Titan again later today, but just looking for ideas.
With this history, I'd probably replace the workstation power supply. I would also scan the system board for capacitors with bulging tops or for overheated components.
Are there any externally powered devices connected to the workstation (other than the monitor)?
Are you in an area with frequent lightning storms? How stable is your power? Is the system connected to a UPS?
I had a similar experience with spinning disks in a system that contained a drive-bay radio receiver and was connected to a satellite dish and GPS receiver on the roof, and an antenna controller. Everything was powered by a high quality UPS. I added a heavy wire connecting the antenna controller case to the workstation case and the failures stopped.
I gather you now have space for two M.2 SSDs. If you haven't discarded the non-working devices, it would be interesting to see if any are detected and what smartmontools says about them. You also have the option to put /var on a separate drive. smartmontools can monitor a drive and report any problems it detects, but you may also want to run self-tests periodically.
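Since the journal lived on the drive that vanished, one way to keep evidence is to append periodic SMART snapshots to a file on a different drive. Below is a minimal Python sketch of that idea. The function names and the JSON-lines layout are assumptions made for illustration; the only external command used is `smartctl -a`, and the log path is whatever you choose on the second drive.

```python
import json
import os
import subprocess
from datetime import datetime, timezone


def capture_smartctl(device):
    """Run `smartctl -a` on the device and return the text it printed."""
    proc = subprocess.run(
        ["smartctl", "-a", device],
        capture_output=True,
        text=True,
        check=False,  # smartctl uses non-zero exit codes for warnings too
    )
    return proc.stdout + proc.stderr


def append_snapshot(log_path, device, report, when=None):
    """Append one timestamped record to a JSON-lines file and sync it to disk."""
    when = when or datetime.now(timezone.utc)
    record = {"time": when.isoformat(), "device": device, "report": report}
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
        fh.flush()
        os.fsync(fh.fileno())  # make sure the record has left the page cache
    return record
```

Run from cron or a systemd timer as root, a call such as `append_snapshot(path, "/dev/nvme0", capture_smartctl("/dev/nvme0"))` would leave the last readings before a failure on the surviving drive. The device name here is only an example.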
-- George N. White III
Well I suspected the PS, but the guy I spoke with at Titan said some other things would fail before the SSD if that was the problem.
The power should be pretty stable, and I did connect to a good transient suppressor strip. Anyway there was no lightning when it died, which was in the past 24 hours. I had been watching smartmon every few days and it showed no error and temps <37C.
Titan has suggested installing a SATA SSD (to eliminate the M.2) and I'm going to try that. He suggested it might be a software issue, that something might be e.g. erasing the partition table on the drive (I don't have another machine handy to verify this), but this seems really unlikely. I just installed F35 and a moderate set of scientific packages, no proprietary software. The only access is via ssh inside a VPN, and I have the only account.
I replaced the 2 Crucial BX drives I was losing with an older, smaller MX drive, and that one has been working just fine for a couple of months. Thinking about my issue and Neal's, here is what springs to mind.
So in my case, if mine was a power supply issue, something about the new SSDs would have to be excessively sensitive to power or ground loops. The thought that a power supply or SATA issue was burning the device did occur to me. The failure I saw is heavily reported in the 1-star reviews for the Crucial device, with several people having more than one failure and returning the device for a refund. The people who have the failure seem to be able to repeat it, and I assume other units work just fine. So it would seem that some component used in recent SSDs is super sensitive to something on the power supply or SATA port side, or that the design has an internal grounding issue and is sensitive to ground loops, in a way that does not affect older devices (I have 2 older SSDs and 8 hard drives that have been running in said machine for months to years just fine). I would think an NVMe device would be well grounded to the motherboard/case. In my case the SSDs were in a plastic drive holder, so the only ground would have been via the SATA connection and the power supply. If the drive design has components expecting a screw-hole ground that won't exist in some installations, they could see floating voltages, and that might damage something.
How was your NVMe drive mounted in your case? On mine the normal screw holes were not connected to ground (plastic drive case), so the "chassis" of the drive would not have been externally grounded. If that drive chassis had no direct connection to power or SATA ground, it could end up with floating voltages on the chassis and on any components tied to it internally.
And ground loops are tricky. I have a wind meter on my roof hooked to a device that counts its rotations, and that serial port device would randomly stop working, requiring a reset of the usb-to-serial communication to get it to function again (I had a cron job to reload/reset the USB nightly because it was happening often enough). Years ago I guessed it was a ground loop, ran a ground wire to house ground, and grounded the hardware device doing the counting, and that solved the issue.
On Tue, Feb 22, 2022 at 11:19 AM Neal Becker ndbecker2@gmail.com wrote:
[...] e.g. erasing the partition table on the drive (I don't have another machine handy to verify this), [...]
If it was just erasing the partition table the drive would still be visible using lsblk, and you could re-partition it with fdisk, etc.
You mentioned a surge suppressor strip - any chance it has already suppressed a surge in the past? If so, it might not be functioning as a surge suppressor anymore.
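The distinction above, between an erased partition table and a drive that has dropped off the bus, can be checked from /proc/partitions. Here is a rough Python sketch. The function name `classify_nvme` and the status strings are made up for illustration, and it assumes the usual nvmeXnY / nvmeXnYpZ device naming.

```python
import re

WHOLE_DISK = re.compile(r"^nvme\d+n\d+$")        # e.g. nvme0n1
PARTITION = re.compile(r"^(nvme\d+n\d+)p\d+$")   # e.g. nvme0n1p1


def classify_nvme(partitions_text):
    """Map each NVMe disk in /proc/partitions text to a short status string.

    An empty result means no NVMe device is visible at all, which matches
    the symptom described at the start of this thread.
    """
    names = []
    for line in partitions_text.splitlines():
        cols = line.split()
        # Data rows have four columns: major, minor, #blocks, name.
        if len(cols) == 4 and cols[0].isdigit():
            names.append(cols[3])

    status = {}
    for disk in names:
        if not WHOLE_DISK.match(disk):
            continue
        has_parts = False
        for name in names:
            match = PARTITION.match(name)
            if match and match.group(1) == disk:
                has_parts = True
                break
        status[disk] = (
            "partitions present" if has_parts else "device present, no partitions"
        )
    return status
```

Feeding it the contents of /proc/partitions returns one entry per NVMe disk. A disk listed with no partitions points toward a wiped table; no entry at all means the kernel no longer sees the device, which re-partitioning cannot fix.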
On Tue, 2022-02-22 at 10:31 -0600, Roger Heflin wrote:
I have a wind meter on my roof hooked to a device that counts its rotations, and that serial port device would randomly stop working, requiring a reset of the usb-to-serial communication to get it to function again (I had a cron job to reload/reset the USB nightly because it was happening often enough). Years ago I guessed it was a ground loop, ran a ground wire to house ground, and grounded the hardware device doing the counting, and that solved the issue.
That one may have been static build-up. Things with wind blowing over them often have such problems.
Ground loops are where there's a loop formed by several grounds that connect together. When you have faults caused by ungrounded things that go away when you add a ground, that's not a ground loop fault, that's a lack of grounding problem.
If you find that strapping the cabinets of equipment together helps reduce faults, it could be the house mains wiring has an earthing fault (or the wall socket, or power strip, you're using).
With a lot of modern equipment, you'll find they don't ground the internals, and it becomes susceptible to static build-up and discharge faults.
Mains spikes can cause failure because modern groundless power supplies capacitively couple active and neutral to the common rail of the power supply's output. Now, instead of current going to the mains power earth, it goes through the equipment. Even without any spikes, the output floats at a high voltage, and when you connect things together sparks fly between the two (hence the recommendation to hook equipment together *before* plugging the mains power in). That kind of thing was the cause of death of CD player outputs, home video camera DV ports, etc. They didn't like a sudden 400 volt charge through something that only worked with very low voltages.
On Tue, 2022-02-22 at 12:21 -0500, Go Canes wrote:
If it was just erasing the partition table the drive would still be visible using lsblk, and you could re-partition it with fdisk, etc.
I do wonder if the devices were ruined, or just had their data scrambled? Neal didn't say whether he'd tried reformatting the failed ones. I presume he would have, but he just mentioned replacing them.
You mentioned a surge suppressor strip - any chance it has already suppressed a surge in the past? If so, it might not be functioning as a surge suppressor anymore.
Every day your house receives lots of surges. Most you'll never notice. It's not just the times you notice the lights suddenly glowing brighter. There are very fast and large spikes from the continual switching of loads across the grid, and from how the stations manage them. Because of this, surge protectors are always working, and will eventually die, probably without you being aware of it.
It's worth remembering that a surge protector may not protect your equipment from being wrecked, primarily it should blow a house fuse on a large surge to prevent some equipment catching fire and burning your house down.
If you have noisy mains causing you a problem you want mains filtering, possibly a UPS as well (a constantly running one, like a power conditioner, not a changeover one that supplies raw mains until it kicks in as a backup supply).
For what it's worth, as my opening paragraph said, the mains has spikes on it all day, every day; that's normal. Any equipment that requires external protection has not been built correctly. Anything that plugs directly into the mains should be able to handle what's normally on the mains, for its entire operational life.
The exception I make to that rule is when you want to minimise noise on something that can handle it without damage, but the effect is noticeably annoying and you want to reduce it. But again, the equipment really should have been designed better. Stereo systems, for instance, shouldn't crackle along with mains pops.
On Wed, 23 Feb 2022 at 06:49, Tim via users users@lists.fedoraproject.org wrote:
On Tue, 2022-02-22 at 12:21 -0500, Go Canes wrote:
If it was just erasing the partition table the drive would still be visible using lsblk, and you could re-partition it with fdisk, etc.
I do wonder if the devices were ruined, or just had their data scrambled? Neal didn't say whether he'd tried reformatting the failed ones. I presume he would have, but he just mentioned replacing them.
Various magical incantations do sometimes seem to work:
Solid State Drive SSD not seen in computer bios data recovery repair - YouTube https://www.youtube.com/watch?v=BgddtpSEhQQ
Fix your dead SSD with the power cycle method - The Silicon Underground (dfarq.homeip.net) https://dfarq.homeip.net/fix-dead-ssd/
fmadio | Recover Bricked SSD with JTAG https://fmad.io/blog-ssd-bricked-restore.html
You mentioned a surge suppressor strip - any chance it has already suppressed a surge in the past? If so, it might not be functioning as a surge suppressor anymore.
Every day your house receives lots of surges. Most you'll never notice. There's not just the times you notice the lights suddenly glowing brighter. There's very fast and large spikes from the continual switching of loads across the grid, and how the stations manage them. Because of this, surge protectors are always working, and will eventually die without you probably being aware of it.
I used to add surge protection to power bars. We had a tree fall on the cable coax that they had installed without a proper anchor, just a zip tie to the mast for the AC power. The mast was pulled off the house, which meant the neutral line got disconnected first, and lightbulbs went off like flash bulbs. My Victor 9000 PC with the home-made surge protection survived, but we lost the doorbell transformer and a radio (and the thyristors in the surge protector).
My father in law also used home-made surge protectors. A truck hit a high-voltage distribution line, causing the high voltage wires to fall onto the lower voltage feed to the house. His computer survived, but some appliances died.
It's worth remembering that a surge protector may not protect your equipment from being wrecked, primarily it should blow a house fuse on a large surge to prevent some equipment catching fire and burning your house down.
My power bar had a circuit breaker which did trip.
If you have noisy mains causing you a problem you want mains filtering, possibly a UPS as well (a constantly running one, like a power conditioner, not a changeover one that supplies raw mains until it kicks in as a backup supply).
For what it's worth, considering my opening paragraph about the mains always has spikes on it all day every day, that's normal. Any equipment that requires external protection has not been built correctly. Anything that plugs directly into the mains should be able to handle what's normally on the mains, for its entire operational life.
I grew up in an area with frequent power outages. My parents were careful to unplug computers when not in use. We often put an overhand knot in power cords. One day I had just turned on an electric kettle when lightning hit near the house. The kettle had a metal base with a hole where the cord passed thru and a 2-conductor cord. The cord burned off at the hole -- I assume an induced current took a shortcut from one conductor to the other. Nothing else was damaged.
The exception I make about that rule is when you want to minimise noise on something that can handle it without damage, but the effect is noticeably annoying and you want to reduce it. But again, the equipment really should have been designed better. Stereo systems, for instance, shouldn't crackle along with mains pops.
At work we had a small machine room with a window in the door. One day during a storm I was walking past the room when lightning hit the building. I saw a bright trail down the corner of the room and across one of the SGI Octane workstations. Those systems had a heavy metal chassis under a plastic cover, but there was a light bar outside the chassis. One of the incandescent bulbs in the light bar burned out, but that was the only damage.
Better designed equipment costs more. I generally try to buy gear that has been on the market a couple years, which gives time for drivers to make it into the kernel and for design flaws to be noticed. Vendors often reduce prices just before introducing new models. I once scored a PowerEdge server with full complement of ECC memory for the price of the memory days before a newer model was announced.
OK, status update.
1. The workstation is located in the offices of a large satellite ISP. Clean power is probably not an issue.
2. Attempt to reboot machine. BIOS boot options does not show the M.2 SSD existing.
3. Boot F35 from USB live. Go to install to disk. SSD is not shown as an option for installation. (I didn't try lsblk, but I'm sure it would have shown the SSD didn't exist).
4. Install SATA SSD drive. That was fun, I didn't know how to get to the drive bays and didn't have screws to mount it, so used double backed sticky tape.
5. Again BIOS boot options. Oh look, F35 samsung pro 980 is back! My best guess is that some part has a thermal issue and while I installed the SATA drive it had cooled down?? Doesn't really explain that the same symptoms occurred both with a SSD plugged into the MB M.2 socket and when I got a pcie-m.2 adapter and plugged it in there, since there would have been 2 different controller chips (I guess).
6. Anyway I perform install to sata drive. Everything is fine (for now).
7. The m.2 ssd that didn't exist previously is still plugged in. I can mount it; everything seems fine. Run smartctl -a /dev/nvme0. No errors recorded. No problems. 100% spare.
I'm still baffled.
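The "is the SSD visible at all?" check from steps 3 and 7 can be scripted, so that the result can be written somewhere other than the disk that disappears. What follows is only a minimal sketch, assuming the usual four-column layout of /proc/partitions (major, minor, #blocks, name). The function name nvme_devices is made up for illustration and is not from the thread.

```python
def nvme_devices(partitions_text):
    """Return whole NVMe namespaces (e.g. 'nvme0n1') found in /proc/partitions text."""
    names = []
    for line in partitions_text.splitlines():
        fields = line.split()
        # Data rows have four columns: major, minor, #blocks, name.
        # The header row and blank lines are skipped by this check.
        if len(fields) != 4 or not fields[0].isdigit():
            continue
        name = fields[3]
        # Keep namespaces such as nvme0n1, skip partitions such as nvme0n1p2.
        if name.startswith("nvme") and "p" not in name:
            names.append(name)
    return names
```

Feeding it the contents of /proc/partitions from a periodic job, and logging the result to the SATA drive or to another host, would at least timestamp the moment the device vanishes. An empty list corresponds to the symptom described at the start of the thread.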
users mailing list -- users@lists.fedoraproject.org To unsubscribe send an email to users-leave@lists.fedoraproject.org Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/ List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines List Archives: https://lists.fedoraproject.org/archives/list/users@lists.fedoraproject.org Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure
On Wed, 23 Feb 2022 at 12:20, Neal Becker ndbecker2@gmail.com wrote:
OK, status update.
- The workstation is located in the offices of a large satellite ISP. Clean power is probably not an issue.
- Attempt to reboot machine. BIOS boot options does not show the M.2 SSD existing.
- Boot F35 from USB live. Go to install to disk. SSD is not shown as an option for installation. (I didn't try lsblk, but I'm sure it would have shown the SSD didn't exist).
- Install SATA SSD drive. That was fun, I didn't know how to get to the drive bays and didn't have screws to mount it, so used double backed sticky tape.
- Again BIOS boot options. Oh look, F35 samsung pro 980 is back! My best guess is that some part has a thermal issue and while I installed the SATA drive it had cooled down?? Doesn't really explain that the same symptoms occurred both with a SSD plugged into the MB M.2 socket and when I got a pcie-m.2 adapter and plugged it in there, since there would have been 2 different controller chips (I guess).
- Anyway I perform install to sata drive. Everything is fine (for now).
- The m.2 ssd that didn't exist previously is still plugged in. I can mount it; everything seems fine. Run smartctl -a /dev/nvme0. No errors recorded. No problems. 100% spare.
I'm still baffled.
Some articles mention drives taking a very long time to respond while running internal "grooming" -- I presume moving data off areas with high "wear" to areas of lower wear. There are also reports that frantically moving the drive around to different external adapters eventually allows the drive to mount. All of this could be explained by the drive taking too long to respond while grooming is in progress. Once grooming is finished, the drive works normally.
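One rough way to probe the "slow to respond" idea is to time how long the device node takes to appear. This is a minimal sketch, not a diagnosis tool: the function name wait_for_device and the example path /dev/nvme0n1 are assumptions, and the kernel may not re-probe a drive that missed enumeration at boot, so a timeout here does not rule the hypothesis out.

```python
import os
import time


def wait_for_device(path, timeout_s=60.0, poll_s=1.0):
    """Poll until `path` exists; return seconds waited, or None on timeout."""
    start = time.monotonic()
    while True:
        if os.path.exists(path):
            return time.monotonic() - start
        if time.monotonic() - start >= timeout_s:
            return None
        time.sleep(poll_s)
```

For example, wait_for_device("/dev/nvme0n1", timeout_s=600) run from the live USB session would report either the delay in seconds or None if the node never showed up within ten minutes.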
On Tue, 2022-02-22 at 08:34 -0500, Neal Becker wrote:
I replaced the SSD with a samsung 980 pro. Reinstalled fedora. It then worked a few weeks, then the exact same symptoms.
Rinse, lather, repeat...
A thought just occurred to me, did you take proper (*) anti-static precautions when handling the drives?
Remembering back to my days in college studying electronics, one thing drummed into us was proper handling of components. Static shock (charge and/or discharge) doesn't always immediately kill a product, but it usually weakens it and shortens its lifespan.
Ages after some device fails you're never going to attribute failure to how you built it, but laboratory testing confirms this kind of failure.
We were told that might mean something only lasts one year instead of a decade, or a decade instead of your lifespan. While you might think 10 years of product life is adequate, there's plenty of situations where that is not, and I might say you've got accustomed to the low standards of modern manufacturing.
And it's not impossible that mishandling could result in something only lasting a couple of weeks.
It doesn't have to be you, either. For instance, if the supplier is in the habit of walking across the carpet with drives outside of anti-static protection (or even inside anti-static bags, if they happen to generate enough static). Or moving things generating static past where they store the drives. Or destroying them in transit. /They/ could be the cause of your problems, and I'd be inclined to try another supplier when you've had several strange failures in a row.
* While "proper" precautions traditionally meant anti-static floor matting, high resistance desk matting, wrist straps, etc, it's not essential to do things like you're in a research laboratory.
Simple habits go a long way towards ensuring that no large static differences build up between things and then suddenly discharge when they come into contact with each other:
- Put all your components on the desk, still bagged, with the computer, so they all reach the same potential slowly.
- Sit down and stay seated while you open the bags.
- Keep in contact with the desk or the PC as you pick things up and fit them to the PC.
- If you have to get up in the middle of the job, keep your hand on the desk when you sit down again before touching anything else.
- If you work standing up, lean against the desk the entire time.
Staying in contact with your work area is easier than wearing wrist straps, safer if you have to step away from the desk and forget to unplug (you can yank things off the desk), and safer for those of us who work with personally dangerous voltages (not being attached means you can back off and not take it with you).
On Wed, 2022-02-23 at 10:41 -0400, George N. White III wrote:
I used to add surge protection to power bars. We had a tree fall on the cable coax that they had installed without a proper anchor, just a zip tie to the mast for the AC power. The mast was pulled off the house, which meant the neutral line got disconnected first, and lightbulbs went off like flash bulbs. My Victor 9000 PC with the home-made surge protection survived, but we lost the doorbell transformer and a radio (and the thyristors in the surge protector).
I've always been a bit wary of ones in power boards (the multi-socket adaptors on short leads). Ideally surge protection should be at the distribution box (to protect the whole house from surges from the street). You really want the house to disconnect from power under dangerous conditions, not just a board with potentially (now) hazardous wiring still energised between it and the wall.
If you pull apart power boards, you often notice how thin the wiring is, and the metal strips used to form the sockets. You've got a good chance at weakening or blowing the board instead of the main fuse. Even under good wiring circumstances the fuses and breakers may not be quick enough to break the circuit.
Some of the boards use inappropriate over-voltage clamping that's always being driven warm by triggering too close to the normal mains voltage (possibly this is manufacturers of 110 volt equipment selling their products into 240 volt markets). I've come across some that are always too warm for my liking, and I wouldn't like them being buried and out of sight. I've seen them with discoloured and deformed plastic.
People have a habit of daisy chaining power boards. EMI filters and surge protection at the end of the line puts heavier workload on ones nearer to the socket, and all the joins between. Advice was that if you have to have more outlets than available on one board, the first one plugged into the wall should be the one with filters and protection, then plug other plain unprotected boards directly into it, rather than string a series of boards through each other. The first board protects the rest, rather than the rest stressing out everything.
Well the fact that the failed ssd is working again tells me it wasn't zapped by static or power transients. The comment about taking a long time for the ssd to reorganize itself is interesting, but here it failed 1 day, and I went to fix it the next day, where it still was not detected by F35 live usb stick. Came back to life after I installed the sata drive.
The only thing I can think of is a motherboard component that fails when warm, and came back to life after it cooled while I installed the sata? But since the same failure was seen with an SSD plugged into the m.2 socket on the motherboard, and a different SSD plugged into the m.2 socket on a pcie->m.2 adapter, I don't think there is any common component? I don't know enough about this motherboard's architecture to say for sure. There must be something in common, though, to explain these symptoms.
On Thu, 2022-02-24 at 07:42 -0500, Neal Becker wrote:
Well the fact that the failed ssd is working again tells me it wasn't zapped by static or power transients. The comment about taking a long time for the ssd to reorganize itself is interesting, but here it failed 1 day, and I went to fix it the next day, where it still was not detected by F35 live usb stick. Came back to life after I installed the sata drive.
The only thing I can think of is a motherboard component that fails when warm, came back to life after it cooled while I installed the sata? But since the same failure was seen with SSD plugged into m.2 socket on motherboard, and different SSD plugged into m.2 socket on pcie->m.2 adapter; I don't think there is any common component? Don't know enough about these motherboard architecture to say for sure. Must be something in common though to explain these symptoms.
Is cooling working properly? No clogged fins, heatsinks seated well, fan speeds go faster when things heat up?
If the drive went unusable due to scrambled data, perhaps reseat RAM and CPU (and any other plug & socket connection) and do a memory check.
Power dips, not just spikes, at inopportune moments can cause random behaviour that might mess things up. Do you use a UPS?
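The cooling question can be checked from software as well as by eye. Below is a minimal sketch, assuming the standard Linux hwmon sysfs layout, where each /sys/class/hwmon/hwmon*/temp*_input file holds a temperature in millidegrees Celsius. The function name read_hwmon_temps is illustrative only, and which chips show up (an NVMe sensor, for instance) depends on the kernel and drivers in use.

```python
import glob
import os


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {(chip_name, sensor_file): degrees_C} for every readable temp input."""
    temps = {}
    for chip_dir in sorted(glob.glob(os.path.join(root, "hwmon*"))):
        try:
            with open(os.path.join(chip_dir, "name")) as f:
                chip = f.read().strip()
        except OSError:
            # No readable name file: fall back to the directory name.
            chip = os.path.basename(chip_dir)
        for sensor in sorted(glob.glob(os.path.join(chip_dir, "temp*_input"))):
            try:
                with open(sensor) as f:
                    # hwmon reports temperatures in millidegrees Celsius.
                    temps[(chip, os.path.basename(sensor))] = int(f.read()) / 1000.0
            except (OSError, ValueError):
                continue  # sensor present but unreadable; skip it
    return temps
```

Logging these readings every minute or so to a disk other than the NVMe drive would show whether anything was running hot shortly before the drive dropped out, which is what the thermal theory predicts.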
Did you fully pull power and/or turn off power at the power supply prior to installing the sata drive? Ie was the sata drive install the first time the machine was actually off and not just quickly rebooted/reset?
If that was the first full power-off, it is likely that the drive was stuck in some sort of internal firmware loop, and that it was never actually reset until power was fully removed.
I had a spinning sata disk that I was forcing badblocks in an attempt to clean it up to be usable, and 1 out of 50 times or so when it was handling the bad blocks it would stop responding, and a simple reboot never fixed it, but a turn off power for a few seconds fixed it every time.
On Thu, Feb 24, 2022 at 6:42 AM Neal Becker ndbecker2@gmail.com wrote:
Well the fact that the failed ssd is working again tells me it wasn't zapped by static or power transients. The comment about taking a long time for the ssd to reorganize itself is interesting, but here it failed 1 day, and I went to fix it the next day, where it still was not detected by F35 live usb stick. Came back to life after I installed the sata drive.
The only thing I can think of is a motherboard component that fails when warm, came back to life after it cooled while I installed the sata? But since the same failure was seen with SSD plugged into m.2 socket on motherboard, and different SSD plugged into m.2 socket on pcie->m.2 adapter; I don't think there is any common component? Don't know enough about these motherboard architecture to say for sure. Must be something in common though to explain these symptoms.
On Wed, Feb 23, 2022 at 8:20 PM Tim via users users@lists.fedoraproject.org wrote:
On Wed, 2022-02-23 at 10:41 -0400, George N. White III wrote:
I used to add surge protection to power bars. We had a tree fall on the cable coax that they had installed without a proper anchor, just a zip tie to the mast for the AC power. The mast was pulled off the house, which meant the neutral line got disconnected first, and lightbulbs went off like flash bulbs. My Victor 9000 PC with the home-made surge protection survived, but we lost the doorbell transformer and a radio (and the thyristers in the surge protector).
I've always been a bit wary of ones in power boards (the multi-socket adaptors on short leads. Ideally surge protection should be at the distribution box (to protect the whole house from surges from the street). You really want the house to disconnect from power under dangerous conditions, not just a board with potentially (now) hazardous wiring still energised between it and the wall.
If you pull apart power boards, you often notice how thin the wiring is, and the metal strips used to form the sockets. You've got a good chance at weakening or blowing the board instead of the main fuse. Even under good wiring circumstances the fuses and breakers may not be quick enough to break the circuit.
Some of the boards use inappropriate over-voltage clamping that's always being driven warm by triggering too close to the normal mains voltage (possibly this is manufacturers of 110 volt equipment selling their products into 240 volt markets). I've come across some that are always too warm for my liking, and wouldn't like them being buried and out of sight. I've seen them discoloured and deformed plastic.
People have a habit of daisy chaining power boards. EMI filters and surge protection at the end of the line puts heavier workload on ones nearer to the socket, and all the joins between. Advice was that if you have to have more outlets than available on one board, the first one plugged into the wall should be the one with filters and protection, then plug other plain unprotected boards directly into it, rather than string a series of boards through each other. The first board protects the rest, rather than the rest stressing out everything.
--
uname -rsvp Linux 5.11.22-100.fc32.x86_64 #1 SMP Wed May 19 18:58:25 UTC 2021 x86_64
Boilerplate: All unexpected mail to my mailbox is automatically deleted. I will only get to see the messages that are posted to the mailing list.
users mailing list -- users@lists.fedoraproject.org To unsubscribe send an email to users-leave@lists.fedoraproject.org Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/ List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines List Archives: https://lists.fedoraproject.org/archives/list/users@lists.fedoraproject.org Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure
-- Those who don't understand recursion are doomed to repeat it
On 2022-02-24 10:21, Roger Heflin wrote:
Did you fully pull power and/or turn off power at the power supply prior to installing the sata drive? I.e., was the sata drive install the first time the machine was actually off and not just quickly rebooted/reset?
If that was the first full power off, it is likely that the drive was stuck in some sort of internal firmware loop, and until power was fully removed it was never actually rebooted and/or reset.
I had a spinning sata disk that I was forcing badblocks in an attempt to clean it up to be usable, and 1 out of 50 times or so when it was handling the bad blocks it would stop responding, and a simple reboot never fixed it, but a turn off power for a few seconds fixed it every time.
On Thu, Feb 24, 2022 at 6:42 AM Neal Becker ndbecker2@gmail.com wrote:
Well the fact that the failed ssd is working again tells me it wasn't zapped by static or power transients. The comment about taking a long time for the ssd to reorganize itself is interesting, but here it failed 1 day, and I went to fix it the next day, where it still was not detected by F35 live usb stick. Came back to life after I installed the sata drive.
The only thing I can think of is a motherboard component that fails when warm, and came back to life after it cooled while I installed the sata? But since the same failure was seen with an SSD plugged into the m.2 socket on the motherboard, and a different SSD plugged into the m.2 socket on a pcie->m.2 adapter, I don't think there is any common component? I don't know enough about this motherboard architecture to say for sure. Must be something in common though to explain these symptoms.
On Wed, Feb 23, 2022 at 8:20 PM Tim via users users@lists.fedoraproject.org wrote:
On Wed, 2022-02-23 at 10:41 -0400, George N. White III wrote:
I used to add surge protection to power bars. We had a tree fall on the cable coax that they had installed without a proper anchor, just a zip tie to the mast for the AC power. The mast was pulled off the house, which meant the neutral line got disconnected first, and lightbulbs went off like flash bulbs. My Victor 9000 PC with the home-made surge protection survived, but we lost the doorbell transformer and a radio (and the thyristors in the surge protector).
These symptoms look an awful lot like a failing power supply. Do you have a BIOS page that can show you the voltages and whether they are in spec? It's also possible that you have a bad stick of memory or a bad motherboard. I had a spinning disk that was corrupted every other day, and I finally threw the machine out because it was something on the motherboard. It's cheaper to pick up a refurb machine at Staples or your local computer store than it is to start swapping parts to identify the real problem.
--
John Mellor
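For anyone wanting to turn "whether they are in spec" into something checkable, here is a rough Python sketch. It assumes the usual ATX tolerance of plus or minus 5% on the positive rails; the rail names, the `check_rails` helper and the readings are all made up for illustration, and real values would have to be copied from the BIOS page or a sensor tool.

```python
# Nominal ATX rail voltages. The 5% tolerance is the commonly quoted ATX
# figure for the positive rails; treat it as an assumption for this sketch.
ATX_RAILS = {"+12V": 12.0, "+5V": 5.0, "+3.3V": 3.3}


def rail_in_spec(nominal, measured, tolerance=0.05):
    """Return True if measured is within +/- tolerance of nominal."""
    return abs(measured - nominal) <= nominal * tolerance


def check_rails(readings, tolerance=0.05):
    """Map each known rail name to True (in spec) or False (out of spec)."""
    return {
        name: rail_in_spec(ATX_RAILS[name], volts, tolerance)
        for name, volts in readings.items()
        if name in ATX_RAILS
    }


# Hypothetical readings, not taken from the machine in this thread.
readings = {"+12V": 11.2, "+5V": 5.08, "+3.3V": 3.31}
result = check_rails(readings)
for name, ok in result.items():
    print(f"{name}: {'ok' if ok else 'OUT OF SPEC'}")
```

A single in-spec reading at idle does not rule out a supply that sags under load, so it would be worth repeating the check while the machine is busy.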
On Thu, 24 Feb 2022 10:35:11 -0500 John Mellor wrote:
I finally threw the machine out
Yep. I had a cursed computer that something different broke on every few days. I eventually dumped it in the recycle bin at the local electronics recycling place and built a whole new system. It was impossible to track down what was really wrong with the cursed one (for all I know it really was a curse :-).
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* It would be good if people replying did not quote the entire    *
* prior message, especially all the bits that have absolutely     *
* nothing to do with the reply that they're writing.              *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
On Thu, 2022-02-24 at 10:35 -0500, John Mellor wrote:
I had a spinning disk that was corrupted every other day, and I finally threw the machine out because it was something on the motherboard.
Many years ago I had an overheating hard drive that would corrupt its data, putting a fan on it made it behave nicely. I'd be less inclined to think an SSD would overheat, but it's not impossible, especially in compact cases with bad ventilation, and in hot countries.
And not related... I've had a keyboard go wonky in my Dell laptop ever since I stressed it out running Folding @ Home for a year. It was running ... hot... and the thermal stress has caused a permanent, intermittent flakiness.
On Fri, 2022-02-25 at 02:35 +1030, Tim via users wrote:
It would be good if people replying did not quote the entire prior message, especially all the bits that have absolutely nothing to do with the reply that they're writing.
+1
(not to mention the increasing use of top-posting).
poc
On Thu, 24 Feb 2022 at 09:07, Tim via users users@lists.fedoraproject.org wrote:
On Thu, 2022-02-24 at 07:42 -0500, Neal Becker wrote:
Well the fact that the failed ssd is working again tells me it wasn't zapped by static or power transients. The comment about taking a long time for the ssd to reorganize itself is interesting, but here it failed 1 day, and I went to fix it the next day, where it still was not detected by F35 live usb stick. Came back to life after I installed the sata drive.
This is similar to the pattern reported on other forums. The drive appears to be dead, then after trying various external adapters it starts working, so we don't know if there is something special about a particular adapter or a change in the internal status of the drive. One recovery recipe is to apply power without the data connection. This could allow the drive to get its act together without the diversion of commands on the data connection. It makes me wonder if the drive doesn't have a "whoops" mode where it reinstalls firmware and rescans the drive to get its own internal mapping into a consistent state. I assume vendors try to design this (if it even is a real thing) in a way that most users won't notice, but there may well be edge case usage patterns that the vendor didn't anticipate.
On Thu, 2022-02-24 at 12:55 -0400, George N. White III wrote:
It makes me wonder if the drive doesn't have a "whoops" mode where it reinstalls firmware and rescans the drive to get its own internal mapping into a consistent state. I assume vendors try to design this (if it even is a real thing) in a way that most users won't notice, but there may well be edge case usage patterns that the vendor didn't anticipate.
Like a self-check and repair when idle, but Linux never leaves the drive idle? (Thinking back to the days of the "Linux destroys hard drives" story, where the drive would unload and reload every few seconds, all the time, unless you issued a hdparm command to turn off power saving mode.)
On Thu, 2022-02-24 at 11:26 -0500, Fulko Hew wrote:
And not related... I've had a keyboard go wonky in my Dell laptop ever since I stressed it out running Folding @ Home for a year. It was running ... hot... and the thermal stress has caused a permanent, intermittent flakiness.
Not too surprised. My background's electronics engineering and servicing. Any solid state component that runs too hot to keep your fingers on is really stressing its tolerance. Not just itself, but also its soldering to the board. And anything else it's close enough to heat up.
Even warm to the touch is going to mean shortened life for things like capacitors.
Endgame:
So now the system is running from the sata ssd drive, but the nvme ssd is plugged into the M.2->pcie adapter with the failed ssd now working again. So I ask, what's the difference in speed between the nvme ssd plugged into the pcie and the sata? I run bonnie++ on the nvme and then:
Feb 24 16:59:11 nbecker8 kernel: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Oh, look at that! Just as I suspected, the nvme controller fails when heavily used. The thing that surprised me is that I would have thought the nvme controller which just went down would have been on the pcie interface card. But now it seems that the nvme controller is on the motherboard and even when the nvme is plugged into the pcie the same motherboard nvme controller is still being used. That's the only explanation I can think of.
Sorry for the top posting and excessive quoting: I've given up on running my own mail server and am using gmail, which makes it too easy to follow bad habits.
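The kernel line above carries two register values that can be picked apart. A small Python sketch follows, on the assumption that CSTS is the NVMe Controller Status register (bit 0 ready, bit 1 fatal status) and that bit 4 of the PCI status word means "capabilities list present". The function name and the returned keys are invented for this example. A CSTS read of all ones is usually taken to mean the device did not answer the register read at all, rather than that it reported a specific fault, but that reading is an interpretation and not something the log states.

```python
import re

# Bit positions in the NVMe CSTS (Controller Status) register.
CSTS_RDY = 1 << 0  # controller ready
CSTS_CFS = 1 << 1  # controller fatal status
# PCI status register bit 4: the device implements a capabilities list.
PCI_STATUS_CAP_LIST = 1 << 4

LINE_RE = re.compile(
    r"nvme (?P<ctrl>nvme\d+): controller is down; will reset: "
    r"CSTS=(?P<csts>0x[0-9a-fA-F]+), PCI_STATUS=(?P<pci>0x[0-9a-fA-F]+)"
)


def decode_reset_line(line):
    """Parse a 'controller is down' kernel line; return None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    csts = int(m.group("csts"), 16)
    pci = int(m.group("pci"), 16)
    # All ones means the read did not return real register contents,
    # so the individual bits are not meaningful in that case.
    all_ones = csts == 0xFFFFFFFF
    return {
        "controller": m.group("ctrl"),
        "csts_all_ones": all_ones,
        "ready": (not all_ones) and bool(csts & CSTS_RDY),
        "fatal": (not all_ones) and bool(csts & CSTS_CFS),
        "pci_cap_list": bool(pci & PCI_STATUS_CAP_LIST),
    }


sample = (
    "Feb 24 16:59:11 nbecker8 kernel: nvme nvme0: controller is down; "
    "will reset: CSTS=0xffffffff, PCI_STATUS=0x10"
)
info = decode_reset_line(sample)
print(info)
```

Feeding it saved journal lines would at least separate "device stopped answering" events from ones where the controller itself flagged a fatal status.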
On Fri, 2022-02-25 at 06:59 -0500, Neal Becker wrote:
Sorry for the top posting and excessive quoting: I've given up on running my own mail server and am using gmail, which makes it too easy to follow bad habits.
I also use Gmail, but for list messages I access it through Evolution. The Gmail web interface and apps are not good with mailing lists, but you don't have to live with that.
(And since you mention it, you didn't quote anything in your last message, which is not really recommended either).
poc
On Fri, Feb 25, 2022 at 6:20 AM Patrick O'Callaghan pocallaghan@gmail.com wrote:
On Fri, 2022-02-25 at 06:59 -0500, Neal Becker wrote:
Sorry for the top posting and excessive quoting: I've given up on running my own mail server and am using gmail, which makes it too easy to follow bad habits.
I also use Gmail, but for list messages I access it through Evolution. The Gmail web interface and apps are not good with mailing lists, but you don't have to live with that.
I've used Gmail exclusively for all the mailing lists I'm on and it doesn't bother me anymore but I have some automatic keystrokes :)
Click reply
Ctrl-a (unhides everything)
Ctrl-Home, delete, delete (goes to the top, deletes extra lines on top)
Then I just navigate through and reply to what I want and delete what's unneeded (shift-up/down, delete).
Works well enough for me.
Thanks, Richard
On Fri, 2022-02-25 at 12:19 +0000, Patrick O'Callaghan wrote:
(And since you mention it, you didn't quote anything in your last message, which is not really recommended either).
Better that, than the mass quoting.
Another option is just word your reply so it makes sense on its own.
On Fri, 2022-02-25 at 06:59 -0500, Neal Becker wrote:
Oh, look at that! Just as I suspected, the nvme controller fails when heavily used. The thing that surprised me is that I would have thought the nvme controller which just went down would have been on the pcie interface card. But now it seems that the nvme controller is on the motherboard and even when the nvme is plugged into the pcie the same motherboard nvme controller is still being used. That's the only explanation I can think of.
That kind of thing is probably something we should have asked, before: What else is plugged in?
On various motherboards when you use one kind of special drive interface you lose another port. Which could be a hard either/or usage of only one port, or perhaps just some weird failure mode. Mine is either/or: you can use the Ultra M.2 socket or one of the SATA ports, but not both.
Or there are speed reductions on one interface when one of the others is used. Mine runs the Ultra M.2 socket slower if it's being used in PCIe mode, but not SATA3 mode, when PCIE slot 2 is being used.
It's maddening to buy a board with multiple connectors to be told you can only use some of them at a time. Imagine buying a board with three PCIE slots only to be told you can't plug in three cards, they're just there so you can position one card in the most convenient place.
On Fri, 25 Feb 2022 at 08:00, Neal Becker ndbecker2@gmail.com wrote:
Endgame:
So now the system is running from the sata ssd drive, but the nvme ssd is plugged into the M.2->pcie adapter with the failed ssd now working again. So I ask, what's the difference in speed between the nvme ssd plugged into the pcie and the sata? I run bonnie++ on the nvme and then:
Feb 24 16:59:11 nbecker8 kernel: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Oh, look at that! Just as I suspected, the nvme controller fails when heavily used. The thing that surprised me is that I would have thought the nvme controller which just went down would have been on the pcie interface card. But now it seems that the nvme controller is on the motherboard and even when the nvme is plugged into the pcie the same motherboard nvme controller is still being used. That's the only explanation I can think of.
The NVMe controller is the big chip on the SSD that gets hot. It has a PCIe controller and a CPU. Google the above kernel message and you should get to: <https://bugzilla.kernel.org/show_bug.cgi?id=195039>
On my Fedora box with a Samsung NVMe:
% sudo lspci | grep NVMe
03:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
The bugzilla thread is long and has lots of detail. One post mentions connector issues. Auto parts stores sell "contact enhancer" that is used on connectors in recent model highly computerized cars.
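If you would rather not eyeball the lspci output, the same filtering can be sketched in Python. This assumes the default one-line lspci format of "slot class: description" as in the line above. The function name is made up, and the SATA line in the sample is invented filler so there is something to filter out.

```python
import re

# Default lspci line shape: "<slot> <class name>: <vendor and device text>"
LSPCI_RE = re.compile(
    r"^(?P<slot>[0-9a-fA-F:.]+)\s+(?P<cls>[^:]+):\s+(?P<desc>.+)$"
)


def find_nvme_controllers(lspci_output):
    """Return (slot, description) pairs for NVMe-class devices in lspci text."""
    found = []
    for line in lspci_output.splitlines():
        m = LSPCI_RE.match(line.strip())
        if m and m.group("cls") == "Non-Volatile memory controller":
            found.append((m.group("slot"), m.group("desc")))
    return found


# First line is a hypothetical non-NVMe device; second is the line quoted above.
sample = """\
00:1f.2 SATA controller: Example Vendor Example SATA Controller
03:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
"""
controllers = find_nvme_controllers(sample)
for slot, desc in controllers:
    print(slot, desc)
```

The point it illustrates is the one made above: the controller lspci reports belongs to the SSD itself, so it shows up as the same kind of device whether the drive sits in the motherboard socket or on an adapter card.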
On Fri, 2022-02-25 at 07:09 -0600, Richard Shaw wrote:
I also use Gmail, but for list messages I access it through Evolution. The Gmail web interface and apps are not good with mailing lists, but you don't have to live with that.
I've used Gmail exclusively for all the mailing lists I'm on and it doesn't bother me anymore but I have some automatic keystrokes :)
Click reply
Ctrl-a (unhides everything)
Ctrl-Home, delete, delete (goes to the top, deletes extra lines on top)
It's easier to just mark the passage of interest before hitting Reply (in most desktop MUAs), thus getting that passage quoted and the cursor placed underneath. The Gmail web interface actually used to have an option to do this, under the Lab extensions, but it was sadly removed a few years ago.
poc