Well I suspected the PS, but the guy I spoke with at Titan said some other
things would fail before the SSD if that was the problem.
The power should be pretty stable, and I did connect to a good transient
suppressor strip. Anyway there was no lightning when it died, which was in
the past 24 hours. I had been watching smartmon every few days and it
showed no errors and temps <37C.
Titan has suggested installing a SATA SSD (eliminating the M.2), and I'm going
to try that. He suggested it might be a software issue, that something
might be e.g. erasing the partition table on the drive (I don't have
another machine handy to verify this), but this seems really unlikely. I
just installed F35 and a moderate set of scientific packages, no
proprietary software. The only access is via ssh inside a VPN, and I have
the only account.
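If one of the failed drives is ever detected again, the partition-table theory can be checked without a second machine by reading the first two sectors and looking for the standard signatures. The following is a minimal Python sketch, not something Titan supplied. It assumes 512-byte logical sectors, the function names are made up for illustration, and the device path in the usage note is only an example. It looks for the MBR boot signature (bytes 0x55 0xAA at offset 510) and the GPT header signature ("EFI PART" at the start of LBA 1).

```python
SECTOR_SIZE = 512  # assumption: 512-byte logical sectors (4Kn drives differ)


def partition_table_kind(first_bytes: bytes, sector_size: int = SECTOR_SIZE) -> str:
    """Classify the partition table found in the first two sectors of a disk.

    Returns "gpt", "mbr", or "none".
    """
    # GPT header starts at LBA 1 with the 8-byte signature "EFI PART".
    if first_bytes[sector_size:sector_size + 8] == b"EFI PART":
        return "gpt"
    # MBR (including a protective MBR) ends sector 0 with 0x55 0xAA.
    if first_bytes[510:512] == b"\x55\xaa":
        return "mbr"
    return "none"


def read_first_sectors(device_path: str, sector_size: int = SECTOR_SIZE) -> bytes:
    """Read the first two sectors of a block device (usually needs root)."""
    with open(device_path, "rb") as dev:
        return dev.read(2 * sector_size)
```

Run as root, it could be called as `partition_table_kind(read_first_sectors("/dev/nvme0n1"))`. A GPT disk normally also carries a protective MBR, which is why the GPT check comes first. If the device node is missing altogether, this check cannot run, and the problem lies below the partition table.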
On Tue, Feb 22, 2022 at 10:47 AM George N. White III <gnwiii(a)gmail.com>
wrote:
On Tue, 22 Feb 2022 at 10:04, Neal Becker <ndbecker2(a)gmail.com>
wrote:
> Thanks Richard. Yes, I talked with Titan; they suggested trying the
> pcie-m.2 adapter. I will try them again.
> I have not checked for bios updates. Not sure how to go about that (last
> time I did that it required an msdos floppy disc).
>
> Haven't tried the SSDs in another device because I don't have one. But
> the fact that replacing the SSD causes it to work, where it wasn't working
> before, tells me they were damaged. I have at least once powered the
> workstation off and on, and the BIOS did not find any SSD to boot from. So a
> power cycle didn't fix it, but replacing the SSD did.
>
> I will try Titan again later today, but just looking for ideas.
>
With this history, I'd probably replace the workstation power supply. I
would also scan the system board for capacitors with bulging tops or
overheated components.
Are there any externally powered devices connected to the workstation
(other than the monitor)?
Are you in an area with frequent lightning storms? How stable is your
power? Is the system connected to a UPS?
I had a similar experience with spinning disks in a system that contained
a drive-bay radio receiver and was connected to a satellite dish and GPS
receiver on the roof, and an antenna controller. Everything was powered by
a high-quality UPS. I added a heavy wire connecting the antenna controller
case to the workstation case, and the failures stopped.
I gather you now have space for two M.2 SSDs. If you haven't discarded
the non-working devices, it would be interesting to see if any are
detected and what smartmontools says about them, but you also have the
option to put /var on a separate drive. Smartmontools can monitor a drive
and report any problems it detects, but you may also want to run
self-tests periodically.
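For the monitoring side, smartctl can emit its report as JSON (`--json`, in smartmontools 7.0 and later), which makes periodic checks easy to script. The sketch below is only an illustration under assumptions. The key names follow smartctl's JSON output for NVMe drives and should be checked against the installed version, the 70 C threshold is arbitrary, and the helper names are hypothetical.

```python
import json
import subprocess


def summarize_smart(report: dict) -> dict:
    """Pull a few health fields out of a parsed `smartctl --json -a` report."""
    health = report.get("nvme_smart_health_information_log", {})
    return {
        "passed": report.get("smart_status", {}).get("passed"),
        "temperature_c": report.get("temperature", {}).get("current"),
        "critical_warning": health.get("critical_warning"),
        "media_errors": health.get("media_errors"),
        "percentage_used": health.get("percentage_used"),
    }


def looks_healthy(summary: dict, max_temp_c: int = 70) -> bool:
    """Rough go/no-go check; max_temp_c is an arbitrary example threshold."""
    if summary["passed"] is False:
        return False
    if summary["critical_warning"] or summary["media_errors"]:
        return False
    temp = summary["temperature_c"]
    return temp is None or temp <= max_temp_c


def read_smart(device: str) -> dict:
    """Run smartctl and parse its JSON output (usually needs root)."""
    # smartctl's exit status is a bitmask, so a non-zero code does not
    # necessarily mean that no report was produced; hence check=False.
    result = subprocess.run(
        ["smartctl", "--json", "-a", device],
        capture_output=True, text=True, check=False,
    )
    return json.loads(result.stdout)
```

Something like `looks_healthy(summarize_smart(read_smart("/dev/nvme0")))` could be run from a timer and logged to a different drive, so that the last readings survive if the root device disappears again.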
>
> Thanks,
> Neal
>
> On Tue, Feb 22, 2022 at 8:44 AM Richard Shaw <hobbes1069(a)gmail.com>
> wrote:
>
>> On Tue, Feb 22, 2022 at 7:34 AM Neal Becker <ndbecker2(a)gmail.com> wrote:
>>
>>> I know this is a bit OT, but you guys are great at answering all
>>> questions.
>>>
>>> I bought a workstation from Titan computers around 1/2020 (dual EPYC
>>> cpu). After about 1 year it stopped working. I could ssh to it, and
>>> almost any command would return Input/Output error. Unfortunately
>>> journalctl gave input/output error so I can't see logs. cat
>>> /proc/partitions did not show any nvme device (the root device) on which
>>> the OS was installed.
>>>
>>> I replaced the SSD with a samsung 980 pro. Reinstalled fedora. It
>>> then worked a few weeks, then the exact same symptoms.
>>>
>>> I replaced the SSD with another samsung 980 pro, this time with
>>> heatsink. Reinstalled fedora. It worked a few weeks. Then same symptoms.
>>>
>>> Then I replaced with a 4th samsung 980 pro, but this time instead of
>>> using the M.2 socket I used a pcie-m.2 adapter (in case something was wrong
>>> with the m.2 socket). Also added a surge protector outlet for good
>>> measure. Reinstalled. Watched the smartctl. No errors. Temperature was
>>> always low.
>>>
>>> Now it's failed again, exactly same symptoms.
>>>
>>> Any ideas?
>>>
>>
>> I remember your other email about a month or so ago and thought it was
>> really strange. Have you tried the drives in another system to confirm
>> they're truly dead?
>>
>> I would check for BIOS updates just for good measure. Other than that,
>> have you had any communication with Titan about it?
>>
>> Thanks,
>> Richard
>> _______________________________________________
>> users mailing list -- users(a)lists.fedoraproject.org
>> To unsubscribe send an email to users-leave(a)lists.fedoraproject.org
>> Fedora Code of Conduct:
>>
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
>> List Guidelines:
https://fedoraproject.org/wiki/Mailing_list_guidelines
>> List Archives:
>>
https://lists.fedoraproject.org/archives/list/users@lists.fedoraproject.org
>> Do not reply to spam on the list, report it:
>>
https://pagure.io/fedora-infrastructure
>>
>
>
> --
> *Those who don't understand recursion are doomed to repeat it*
--
George N. White III