Fedora 31, fully upgraded (I think...) Summary: twice in the last three days (since getting kernel 5.6.6, though I have no idea if that's related), this machine has hung completely after running "sudo dnf upgrade" while logged in over ssh. It becomes completely unresponsive to the network and, when I go to it to reboot, to the keyboard, mouse, etc. I've had to use the power button. The only strange things I see in the logs are long strings of null characters (that show as ^@) in the dnf.log when the dnf transactions crash the machine.
Here are more details. This is a Dell Precision T1700 that is my university office workstation as well as an IMAP server and web server for my use. Since we're working remotely and discouraged from going on campus, I hadn't been to the office since March 16, but in addition to reading and sending mail, I've been putting stuff on my course web pages, etc., as well as doing regular "sudo dnf upgrade" runs.
I got kernel 5.6.6 on Saturday, April 25. On Sunday, I did a dnf upgrade on my home machine, which is almost identically configured to the one in my office. That upgraded some git stuff, python3, samba, webkit2gtk3, etc., a total of 24 packages. That went fine, so I logged into my office machine and did dnf upgrade there. It offered pretty much the same list of packages, I said "y" and did work in another virtual desktop for a while. Then I got an error message saying claws-mail couldn't connect to the account on my office machine. It was dead to the network--I logged in on another machine in our department network and still couldn't connect to my office one. Eventually, I went to campus. The machine was unresponsive and I had to push the power button to get it to fully shut down. It then rebooted apparently normally, and ran fine. I didn't have time then to investigate very carefully, but I didn't see anything in the journalctl output that indicated what the problem was--there was just a complete gap in entries from the dnf transaction to when I rebooted.
The next day (yesterday) dnf upgrade worked normally and upgraded 6 packages: darktable, libmwaw, libstaroffice, libwps, openvpn, and python3-click-7.1.1-1.fc31.noarch.
Today, I did an upgrade on the home machine again, successfully, and then on the office machine. It offered to upgrade akmods-wireguard, which I realized I should have removed (since wireguard is in the 5.6 kernel). So I removed it and then did the upgrade. dnf wanted to upgrade google-chrome-stable, kde-print-manager and kde-print-manager-libs, libappindicator, libappindicator-gtk3, libuv, some net-snmp stuff, and python[23]-beautifulsoup4, all of which had been upgraded successfully on the home machine. I said yes and it did some deltarpm stuff, downloaded a couple of things, and then apparently froze completely, just like the last time.
Again, I had to go in to the office and use the power button to shut down and restart. But now it seems ok and the rpms that it was supposedly upgrading do seem to have been upgraded--they're at the same versions as on my home machine.
I don't see anything very strange in any of the logs. Both times, the basic journalctl output just stops and doesn't start again until I reboot. However, in the dnf.log file after both of the bad upgrade transactions started, there are long strings of null characters (150 or so). These don't show up in the dnf.log on the home machine that has upgraded fine.
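For anyone wanting to check their own log for the same symptom, here is a minimal sketch that counts null bytes in a file. The function name count_nuls is made up for illustration, and /var/log/dnf.log is assumed to be the log location (the Fedora default).

```shell
#!/bin/sh
# Count NUL bytes in a file and print the count.
count_nuls() {
    # tr -cd '\000' deletes every byte except NUL; wc -c counts what is left.
    tr -cd '\000' < "$1" | wc -c | tr -d ' '
}

# Usage (path is an assumption; reading the log may need root):
#   count_nuls /var/log/dnf.log
```

A nonzero count on a text log usually just means the file was being written when the machine went down, so it is a symptom of the hang rather than an explanation for it.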
I've run fsck and rpm -V. There's plenty of disk space available in all partitions.
I don't have any idea what's going on and it's very inconvenient (not to mention strongly discouraged by the powers that be) to have to keep going on campus to restart the machine. So I'd be very grateful for suggestions about how to figure this out, or at least stop it from happening again.
Thanks.
George
George Avrunin writes:
[..]
I don't have any idea what's going on and it's very inconvenient (not to mention strongly discouraged by the powers that be) to have to keep going on campus to restart the machine. So I'd be very grateful for suggestions about how to figure this out, or at least stop it from happening again.
The capsule summary here is that the system appears to lock up under high I/O; either disk or network I/O. Doing a dnf upgrade puts a heavy load on both disk and network I/O. Network I/O only in the case of the update itself having to go out and download the updates from the repos. If all stuff's already downloaded, it's mostly just disk I/O.
You can prove that theory by simulating some load yourself. Something like
dd if=/dev/urandom of=/tmp/junk$$ bs=1M count=100 &
Kick this off a dozen times, or so, to write a gig worth of junk into /tmp (presuming there's space for it).
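A minimal sketch of the "kick this off a dozen times" idea, assuming GNU dd. The function name disk_load and the junk file names are made up for illustration. One caveat: on Fedora /tmp is usually a RAM-backed tmpfs, so writing there may not touch the disk at all. The sketch defaults to /var/tmp instead, which is an assumption about where a disk-backed filesystem lives on this machine.

```shell
#!/bin/sh
# Start several parallel dd writers and wait for all of them.
# Arguments: target directory, number of writers, MiB per writer.
disk_load() {
    dir=${1:-/var/tmp}      # assumed to be on a real disk, unlike a tmpfs /tmp
    writers=${2:-12}
    mib=${3:-100}
    i=1
    while [ "$i" -le "$writers" ]; do
        # iflag=fullblock avoids short reads from /dev/urandom;
        # conv=fsync makes dd flush the data to the device before it exits.
        dd if=/dev/urandom of="$dir/junk$$.$i" bs=1M count="$mib" \
           iflag=fullblock conv=fsync 2>/dev/null &
        i=$((i + 1))
    done
    wait    # block until every background dd has finished
}

# Usage: disk_load /var/tmp 12 100    (about 1.2 GB; remove the junk files afterwards)
```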
If this locks up the machine, there you go. If not, and you think your dnf upgrade was downloading stuff, try generating some network load. You'll have to have some bandwidth available yourself. You can take the dozen files of junk, put them in /var/www/html (presuming that apache is running), and wget them all, in parallel, off this machine from some other place.
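The parallel download could be sketched like this, run from the other machine. Everything named here is hypothetical: the helper names, the junk.N file naming (it has to match whatever was actually copied into /var/www/html), and the host in the usage line.

```shell
#!/bin/sh
# Print one URL per junk file: junk_urls BASE_URL COUNT
junk_urls() {
    i=1
    while [ "$i" -le "$2" ]; do
        printf '%s/junk.%s\n' "$1" "$i"
        i=$((i + 1))
    done
}

# Fetch all of them at once, throwing the data away: fetch_all BASE_URL COUNT
fetch_all() {
    # xargs -P runs up to COUNT wget processes in parallel, one URL each.
    junk_urls "$1" "$2" | xargs -n 1 -P "$2" wget -q -O /dev/null
}

# Usage (office.example.edu is a placeholder host name):
#   fetch_all http://office.example.edu 12
```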
For extra credit you can try generating both disk and network load.
If this turns out to reliably lock up this particular bit of hardware, there you go. What can you do about it? Very little. It's going to be either failing hardware (hard drive, power supply, or RAM), or a kernel bug. Looking up the spec sheet for your box, looks like both spinning rust and SSDs are an available option. You didn't say which one you have, but if your hard drives are spinning rust, that's the most likely point of failure. Pretty much the only easily-accessible clue would be SMART diagnostics on the hard drive(s). See if there's anything there that tells you that the hard drive is on its last breath. The next easiest accessible clue is only available if you're physically at the machine, that would be a RAM tester. Do Fedora live images still include a memtest option, does anyone know?
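For the SMART check, a rough sketch assuming the smartmontools package is installed and the drive is an ATA disk at /dev/sda (both assumptions; NVMe drives report different fields). The helper smart_warnings is a made-up name; it only filters the attribute table for three counters that often go nonzero before a spinning disk fails.

```shell
#!/bin/sh
# Read `smartctl -A` output on stdin and print the name and raw value
# (10th column) of selected attributes when that raw value is nonzero.
smart_warnings() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ && $10 + 0 > 0 { print $2, $10 }'
}

# Usage (needs root; /dev/sda is an assumed device name):
#   sudo smartctl -H /dev/sda                    # overall health verdict
#   sudo smartctl -A /dev/sda | smart_warnings   # empty output means no hits
```

An empty result does not prove the drive is healthy; it only means these particular counters are still zero.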
You could be hitting a kernel bug. In the old days, I would rig up a crossover cable on my PC's serial port, configure the kernel with a serial console, and capture kernel OOPSes on the other machine over the serial port. RS-232 ports are long gone. I have some vague recollection of serial over USB being an option. Another option worth exploring would be to look into remote syslogging. Maybe the kernel can eke out an extra packet or two, to a remote syslog, before crashing.
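One way to try the "extra packet or two" idea is the kernel's netconsole module, which sends kernel messages over UDP to another machine. This is a swapped-in technique, not something tested in this thread. The sketch below only builds the module parameter; the helper name, the IP addresses, and the interface name are placeholders, and 6665/6666 are the module's default source and target ports.

```shell
#!/bin/sh
# Build the netconsole= parameter:
#   netconsole_param SRC_IP DEVICE TARGET_IP [TARGET_MAC]
# An empty target MAC makes the kernel fall back to the broadcast address.
netconsole_param() {
    printf 'netconsole=6665@%s/%s,6666@%s/%s\n' "$1" "$2" "$3" "${4:-}"
}

# Usage on the machine that hangs (root required; all values are placeholders):
#   modprobe netconsole "$(netconsole_param 192.0.2.5 eth0 192.0.2.10)"
# On the receiving machine, listen for the UDP packets, for example:
#   nc -u -l 6666
```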
But at least confirming that you can reliably reproduce a lockup by simulating high disk or network I/O is better than nothing.
On Tue, 28 Apr 2020 22:46:58 -0400, Sam Varshavchik wrote:
The capsule summary here is that the system appears to lock up under high I/O; either disk or network I/O. Doing a dnf upgrade puts a heavy load on both disk and network I/O. Network I/O only in the case of the update itself having to go out and download the updates from the repos. If all stuff's already downloaded, it's mostly just disk I/O.
Thanks. It does seem to be an issue of load, but it looks like the problem is due to the scheduler bug mentioned in the message from Jerry James. So far kernel 5.6.7 seems ok. (I switched schedulers before running dnf to install the kernel...)
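For reference, switching schedulers at runtime goes through sysfs. A minimal sketch, with assumptions stated up front: sda is an assumed device name, and the scheduler names bfq and mq-deadline are only examples, since this thread does not say which scheduler was in use or which one was switched to. The helper active_scheduler is a made-up name.

```shell
#!/bin/sh
# Print the active scheduler from a sysfs scheduler file. The kernel marks
# the active entry with square brackets, e.g. "mq-deadline kyber [bfq] none".
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$1"
}

# Usage (device and scheduler names are assumptions):
#   active_scheduler /sys/block/sda/queue/scheduler
#   echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler
```

A change made this way lasts only until the next reboot.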
George
On Tue, Apr 28, 2020 at 6:58 PM George Avrunin avrunin@math.umass.edu wrote:
Fedora 31, fully upgraded (I think...) Summary: twice in the last three days (since getting kernel 5.6.6, though I have no idea if that's related), this machine has hung completely after running "sudo dnf upgrade" while logged in over ssh. It becomes completely unresponsive to the network and, when I go to it to reboot, to the keyboard, mouse, etc. I've had to use the power button. The only strange things I see in the logs are long strings of null characters (that show as ^@) in the dnf.log when the dnf transactions crash the machine.
Andre Robatino posted a link to this bug awhile ago:
https://bugzilla.redhat.com/show_bug.cgi?id=1826091
That sounds like the symptoms you are seeing. It's a bug in the 5.6.6 kernel. It should be fixed in the 5.6.7 kernel, which will be available soon:
https://bodhi.fedoraproject.org/updates/FEDORA-2020-64c805f706
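To see whether a machine is already running a kernel at or past 5.6.7, a small sketch assuming GNU sort (for its -V version ordering). The helper name version_at_least is made up for illustration.

```shell
#!/bin/sh
# Succeed if version $1 is at least version $2, using GNU sort -V ordering.
version_at_least() {
    # The smaller of the two versions sorts first; if that is $2, then $1 >= $2.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage:
#   version_at_least "$(uname -r)" 5.6.7 && echo "running 5.6.7 or later"
```

Note that this checks the running kernel, not the installed ones; a newly installed kernel only takes effect after a reboot.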
On Tue, 28 Apr 2020 21:04:10 -0600, Jerry James wrote:
Andre Robatino posted a link to this bug awhile ago:
https://bugzilla.redhat.com/show_bug.cgi?id=1826091
That sounds like the symptoms you are seeing. It's a bug in the 5.6.6 kernel. It should be fixed in the 5.6.7 kernel, which will be available soon:
https://bodhi.fedoraproject.org/updates/FEDORA-2020-64c805f706
Thanks! I'll look for 5.6.7 (and be careful when dnf installs it. ;-)
George
George Avrunin wrote on 29/04/20 at 13:40:
On Tue, 28 Apr 2020 21:04:10 -0600, Jerry James wrote:
Andre Robatino posted a link to this bug awhile ago:
https://bugzilla.redhat.com/show_bug.cgi?id=1826091
That sounds like the symptoms you are seeing. It's a bug in the 5.6.6 kernel. It should be fixed in the 5.6.7 kernel, which will be available soon:
https://bodhi.fedoraproject.org/updates/FEDORA-2020-64c805f706
Thanks! I'll look for 5.6.7 (and be careful when dnf installs it. ;-)
George
I installed the 5.6.7 kernel on F32 machines and I think that bug has been solved. Please check on your side as well.
On Wed, 29 Apr 2020 15:08:44 +0200, antonio montagnani wrote:
I installed the 5.6.7 kernel on F32 machines and I think that bug has been solved. Please check on your side as well.
Thanks. I did install 5.6.7 this morning (after switching disk schedulers to avoid crashing while installing the kernel). Seems ok so far, but I'll know more after the next big dnf transaction.
George