In a first for me, my system froze solid doing an update today. It was a combination of plasma updates and kernel updates.
Anyway, it took a bit of reinstalling and cleaning up to get a working system. Not too bad for me.
This probably would not have happened if the updates were done offline. Pretty sure it was doing a plasma component update when it froze. Going to have to rethink my procedures.
Hi.
On Fri, 30 Jul 2021 08:45:24 +0800 Ed Greshko wrote:
> In a first for me, my system froze solid doing an update today. It was a combination of plasma updates and kernel updates.
Was it a complete freeze or only a freeze of the graphical session?
I always update with "nohup dnf -y update &" and have never had a total freeze, even when using the proprietary NVIDIA graphics driver.
Using systemd-run instead of nohup would be even safer.
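For anyone wanting to try the systemd-run route, here is a minimal sketch rather than a tested recipe. The function name and the unit name dnf-update are made up for this example. The idea is that the update runs as a transient system service, so it should keep going even if the terminal or the graphical session goes away.

```shell
# Start the update as a transient system service, detached from the
# desktop session. "dnf-update" is a hypothetical unit name.
run_detached_update() {
    unit="${1:-dnf-update}"
    sudo systemd-run --unit="$unit" --description="dnf update" dnf -y update
}

# Usage:
#   run_detached_update
#   journalctl -f -u dnf-update     # follow the output
#   systemctl status dnf-update     # check whether it is still running
```

The output ends up in the journal instead of nohup.out, which also makes it easier to look at after a reboot.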
On 30/07/2021 13:42, Francis.Montagnac@inria.fr wrote:
> Was it a complete freeze or only a freeze of the graphical session?
Total freeze. Unable to switch to an alternate console. ssh into the system was non-responsive.
> I always update with "nohup dnf -y update &" and have never had a total freeze, even when using the proprietary NVIDIA graphics driver.
> Using systemd-run instead of nohup would be even safer.
I've not had a freeze of any sort for years, either during updates or during normal workloads.
On Fri, 2021-07-30 at 11:34 +0200, Roberto Ragusa wrote:
> On 7/30/21 7:57 AM, Ed Greshko wrote:
> > Total freeze. Unable to switch to an alternate console. ssh into the system was non-responsive.
> No ssh? Looks like a total system lockup that could have happened in offline mode as well.
I'd be inclined to agree.
poc
Patrick O'Callaghan writes:
> I'd be inclined to agree.
I don't even know how to do an offline upgrade. Most of my systems predate dnf-dragora, which I think is the current desktop upgrade app. It's not even installed.
Besides, the last time I did a system-upgrade, I watched the boot messages. I saw lots of userspace stuff getting started. Looks to me like offline really only means no user sessions or desktop. The only thing that you take out of the picture is X. Although this would mitigate problems updating X or applications, all the low level stuff is still running. And that's the stuff that would cause the greatest deal of chaos if it blows up during an update.
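Since the question of how to do an offline update from the command line came up, here is a minimal sketch. It assumes the offline-upgrade command from dnf-plugins-core is available on your release (check with "dnf offline-upgrade --help" first). The function name is invented for the example.

```shell
# Stage the updates now, then reboot into the offline update environment.
# Assumption: dnf-plugins-core provides the "offline-upgrade" command.
offline_update() {
    sudo dnf offline-upgrade download -y &&
    sudo dnf offline-upgrade reboot
}

# Usage:
#   offline_update
```

The reboot step only runs if the download step succeeded, so a failed download leaves the running system alone.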
On 30/07/2021 19:04, Sam Varshavchik wrote:
> I don't even know how to do an offline upgrade. Most of my systems predate dnf-dragora, which I think is the current desktop upgrade app. It's not even installed.
The KDE Discover app. Called by the packagekit icon on the systray.
Anyway. One time since 2017. Not so bad. Only had to reinstall 13 packages and the new kernel.
Total lockup would have to be a complete kernel crash during the update. As others have said, offline would probably not reduce the risk of this sort of crash.
The updates typically don't add/remove modules and/or otherwise change the live running kernel components.
On the enterprise side there is some live in-memory patching software that does update the running kernel's code, and I have seen it crash systems.
_______________________________________________
users mailing list -- users@lists.fedoraproject.org
To unsubscribe send an email to users-leave@lists.fedoraproject.org
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/users@lists.fedoraproject.org
Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure
On 31/07/2021 01:43, Roger Heflin wrote:
> Total lockup would have to be a complete kernel crash during the update. As others have said, offline would probably not reduce the risk of this sort of crash.
I think it is unwise to assume it was a "kernel" crash.
As I mentioned, the screen showed a plasma component being updated at the time of the freeze.
No matter. As I said this is my first system freeze since probably 2017. Heck, it could have been caused by a voltage spike for all I know since I spent 0 time looking at logs. I just spent time cleaning up.
I'm sorry I even mentioned this. All it does, and I should have known better, is bring up speculation.
My participation in this thread I started is at an end. :-)
If it was just a plasma crash, then ssh and/or the alt keys would have worked to switch terminals.
Details said neither worked. The kernel and/or a significant part of userspace was deadlocked and/or crashed.
On Fri, Jul 30, 2021 at 2:00 PM Roger Heflin rogerheflin@gmail.com wrote:
> If it was just a plasma crash, then ssh and/or the alt keys would have worked to switch terminals.
> Details said neither worked. The kernel and/or a significant part of userspace was deadlocked and/or crashed.
I wonder if logs contain anything... i.e. from the boot following the failed update, use journalctl -b-1 and if it's 5 boots back use -b-5
It might have the start of the problem anyway. I also suspect a deadlock. It can make it seem like ssh is dead when it's just super slow, or ssh may even time out unless a session has already started. Workstation edition and KDE spin have improved resource control, which is a work in progress (also on KDE you will need to install uresourced). This attempts to ensure minimum resources are available for the desktop to be responsive. One possible limitation is IO pressure; we're not quite there yet implementing IO isolation. A deadlock though is a different problem, so the resource control work wouldn't help.
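For the log check suggested above, a small sketch of how one might pull the interesting part out of an earlier boot. The function name is made up, and the priority filter (warning and worse) is just a guess at what is worth reading first. This only works if the journal is persistent, so that earlier boots are still on disk.

```shell
# Show warnings and worse from an earlier boot.
# $1 is the boot offset: 1 = previous boot, 5 = five boots back.
prev_boot_log() {
    offset="${1:-1}"
    journalctl -b "-$offset" -p warning --no-pager
}

# Usage:
#   journalctl --list-boots        # find the boot with the failed update
#   prev_boot_log 1 | tail -n 50   # last messages before the freeze
```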
If you ever see "task xxx:yyy blocked for more than 120 seconds" it's best to issue sysrq+w (i.e. echo w > /proc/sysrq-trigger) to dump extra debugging information into the kernel message buffer, and then file a bug attaching dmesg.
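A rough sketch of that procedure as a script, in case it helps. Both function names and the output file name are made up for this example, and it has to run as root. It only triggers the sysrq dump if a hung-task message is already in the kernel log.

```shell
# Succeeds if the text on stdin contains a kernel hung-task warning.
has_hung_task() {
    grep -q 'blocked for more than [0-9][0-9]* seconds'
}

# Run as root: if a hung task was reported, dump blocked tasks
# (same as sysrq+w) and save the kernel log for the bug report.
collect_hung_task_info() {
    out="${1:-dmesg-hung-task.txt}"     # hypothetical output file name
    if dmesg | has_hung_task; then
        echo w > /proc/sysrq-trigger
        dmesg > "$out"
    fi
}

# Usage (as root):
#   collect_hung_task_info /root/dmesg-hung-task.txt
```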
Adding to what Chris suggests.
When ssh fails, always ping the IP address. If the ping responds, then the kernel is up in some state (during heavy paging or deadlocks, ping generally still responds as long as the kernel is running and has not crashed). If ping does not respond, either the network has died or the kernel has crashed because of something. The network does not usually stop responding unless someone screws up and takes it down, though I do know of at least one network card crash that I have seen drop the network many times; that one is easy to diagnose since it logs the issue.
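The check above, run from a second machine, could be sketched like this. The function name, the three result labels and the timeout values are all arbitrary choices for the example, and the ping options assume the Linux iputils version.

```shell
# Classify a host that stopped answering over ssh.
# Prints one of: ssh-ok, kernel-alive, no-ping
triage_host() {
    host="$1"
    if ssh -o ConnectTimeout=10 -o BatchMode=yes "$host" true 2>/dev/null; then
        echo "ssh-ok"
    elif ping -c 3 -W 2 "$host" >/dev/null 2>&1; then
        # The kernel still answers; userspace may be stuck or very slow.
        echo "kernel-alive"
    else
        # Either the network is down or the kernel has crashed.
        echo "no-ping"
    fi
}

# Usage:
#   triage_host 192.168.1.10
```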
Enabling crash dumps might be a good idea. If no dump is collected (or even attempted) and the node boots back up, then that is often a sign of a hardware fault that forced an immediate reset of the hardware.
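A sketch of what enabling crash dumps might look like on Fedora, with loud assumptions: that the kexec-tools package provides kdump.service on your release, and that a 256M crashkernel reservation is adequate for the machine. Neither is verified here, and the function name is made up.

```shell
# Sketch of enabling kdump on a Fedora-like system.
# Assumptions: kexec-tools provides kdump.service, and 256M is an
# adequate crashkernel reservation for this machine.
enable_kdump() {
    sudo dnf install -y kexec-tools &&
    sudo grubby --update-kernel=ALL --args="crashkernel=256M" &&
    sudo systemctl enable kdump.service
    # Reboot afterwards so the memory reservation takes effect.
}

# Usage:
#   enable_kdump
```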