I recently upgraded my wife's laptop to an Acer with a fairly stock i5-6200U system using UEFI boot (that was an adventure).
About a month ago I started getting strange issues. Not lockups per se, but it seems the drive is somehow going read-only. Once that happens, nothing gets written to disk (including journal entries).
Also the virtual terminals are useless:
https://lh4.googleusercontent.com/-L6zfYtxT5y8/V283nqZgbEI/AAAAAAAABys/jhiEH...
Interestingly enough, a program that is already open will kind of work, but any attempt to launch a new program from GNOME Shell fails.
Because of the lack of log entries, this has been extremely difficult to troubleshoot.
Any ideas?
Thanks, Richard
On Sat, Jun 25, 2016, 8:26 PM Richard Shaw hobbes1069@gmail.com wrote:
> I recently upgraded my wife's laptop to an Acer with a fairly stock i5-6200U system using UEFI boot (that was an adventure).
> About a month ago I started getting strange issues. Not lockups per se, but it seems the drive is somehow going read-only. Once that happens, nothing gets written to disk (including journal entries).
Include rd.break=pre-mount as a boot parameter. Then manually mount the root volume to /sysroot.
There may be messages in the same console, or you may need to use dmesg. Screen capture those error messages.
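Roughly, from the dracut emergency shell that gives you (the root device below is an assumption; confirm it with blkid first):

mount /dev/sda2 /sysroot   ## assumed root partition
dmesg | less               ## look for ata/ext4 errors around the mount attempt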
Chris Murphy
On Sun, Jun 26, 2016 at 12:33 AM, Chris Murphy lists@colorremedies.com wrote:
> On Sat, Jun 25, 2016, 8:26 PM Richard Shaw hobbes1069@gmail.com wrote:
>> I recently upgraded my wife's laptop to an Acer with a fairly stock i5-6200U system using UEFI boot (that was an adventure).
>> About a month ago I started getting strange issues. Not lockups per se, but it seems the drive is somehow going read-only. Once that happens, nothing gets written to disk (including journal entries).
> Include rd.break=pre-mount as a boot parameter. Then manually mount the root volume to /sysroot.
> There may be messages in the same console, or you may need to use dmesg. Screen capture those error messages.
That laptop actually boots, and about half the time it makes it through the whole day before it goes on the fritz... It looks like your suggestion assumes the volume is read-only from the start.
An additional quirk I discovered: most of the time I can attempt an ssh connection, but no matter which user I try, it always says the password is bad even though I'm sure it's correct.
Thanks, Richard
On Tue, Jun 28, 2016 at 6:24 AM, Richard Shaw hobbes1069@gmail.com wrote:
> On Sun, Jun 26, 2016 at 12:33 AM, Chris Murphy lists@colorremedies.com wrote:
>> On Sat, Jun 25, 2016, 8:26 PM Richard Shaw hobbes1069@gmail.com wrote:
>>> I recently upgraded my wife's laptop to an Acer with a fairly stock i5-6200U system using UEFI boot (that was an adventure).
>>> About a month ago I started getting strange issues. Not lockups per se, but it seems the drive is somehow going read-only. Once that happens, nothing gets written to disk (including journal entries).
>> Include rd.break=pre-mount as a boot parameter. Then manually mount the root volume to /sysroot.
>> There may be messages in the same console, or you may need to use dmesg. Screen capture those error messages.
> That laptop actually boots, and about half the time it makes it through the whole day before it goes on the fritz... It looks like your suggestion assumes the volume is read-only from the start.
At a minimum we need kernel messages for when it's misbehaving to get an idea why the fs is mounting read only. The journal is even better, but if at boot time the system will not mount rootfs rw, then the journal can't be written to stable media and is lost on a hard reset.
From a successful boot you could do 'systemctl enable debug-shell.service', which will get you a root console on tty9 much earlier in the boot process. If you end up in this read-only state again, you can switch to tty9 and write out the current boot journal to a thumb drive, or even to the /boot partition by mounting it somewhere.
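Roughly, once you're on the debug shell (Ctrl+Alt+F9), and assuming the thumb drive shows up as /dev/sdb1:

systemctl enable debug-shell.service               ## run this once from a good boot
mount /dev/sdb1 /mnt                               ## thumb drive; device name is a guess, check lsblk
journalctl -b --no-pager > /mnt/journal-ro.log     ## current boot's journal
dmesg > /mnt/dmesg-ro.log
umount /mnt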
Another option is to let us know what file system this is, and what results you get if you boot from Fedora install media, and run a file system check on the root file system.
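If it's ext4 on a plain partition, that check would look something like this from the live environment (device name assumed, filesystem not mounted):

lsblk -f                   ## identify the root partition and filesystem type
fsck.ext4 -fn /dev/sda2    ## -n reports problems without changing anything
e2fsck -f /dev/sda2        ## only if the read-only pass reports errors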
> An additional quirk I discovered: most of the time I can attempt an ssh connection, but no matter which user I try, it always says the password is bad even though I'm sure it's correct.
In that case it's file corruption. Could be a slowly imploding file system. Could be a slowly dying drive that's returning spurious data. So in addition to a file system check, you could supply the results from 'smartctl -x /dev/sdX'
And still another thing you could do is grab a few days' worth of the journal using 'sudo journalctl --since=-3days > journal_3days.log', put that up somewhere, and post the URL here. Maybe there are clues about the problem buried in a successful boot where there was a read-write root.
Ok, apparently I need to reboot this conversation :)
The filesystem is not corrupted (other than what occurs during a forced power off). Many days it boots and runs just fine for the whole day; when it does have a problem, it's not at boot but rather toward the latter part of the day.
So something is happening that, as far as I can tell, is not captured in the journal, and once it occurs it's not possible to get to a VT or ssh in to discover the problem. The picture I posted is more likely the result of whatever is going on, so it's not terribly helpful either.
Thanks, Richard
On Thu, Jun 30, 2016, 6:21 AM Richard Shaw hobbes1069@gmail.com wrote:
> Ok, apparently I need to reboot this conversation :)
> The filesystem is not corrupted (other than what occurs during a forced power off). Many days it boots and runs just fine for the whole day; when it does have a problem, it's not at boot but rather toward the latter part of the day.
Not at boot? You mean it becomes read-only after being read-write all day?
> So something is happening that, as far as I can tell, is not captured in the journal, and once it occurs it's not possible to get to a VT or ssh in to discover the problem. The picture I posted is more likely the result of whatever is going on, so it's not terribly helpful either.
The filesystem going read-only means it became confused and is stopping writes to hopefully avoid corrupting the filesystem.
If the revert to read-only happens well after boot, rather than a failure to remount read-write during boot, log in remotely and monitor with 'journalctl -f'. Chances are the remote machine will capture the crucial events just as it goes read-only, and before whatever else kills sshd and prevents you from logging in remotely after the fact.
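Something like this keeps a copy on the remote machine even if the laptop has to be power cycled (user and hostname here are placeholders):

ssh user@laptop 'journalctl -b --no-pager' > journal-so-far.log   ## everything up to now
ssh user@laptop journalctl -f | tee journal-follow.log            ## then follow it live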
Chris Murphy
> Thanks, Richard
On Thu, Jun 30, 2016 at 8:56 AM, Chris Murphy lists@colorremedies.com wrote:
> On Thu, Jun 30, 2016, 6:21 AM Richard Shaw hobbes1069@gmail.com wrote:
>> Ok, apparently I need to reboot this conversation :)
>> The filesystem is not corrupted (other than what occurs during a forced power off). Many days it boots and runs just fine for the whole day; when it does have a problem, it's not at boot but rather toward the latter part of the day.
> Not at boot? You mean it becomes read-only after being read-write all day?
>> So something is happening that, as far as I can tell, is not captured in the journal, and once it occurs it's not possible to get to a VT or ssh in to discover the problem. The picture I posted is more likely the result of whatever is going on, so it's not terribly helpful either.
> The filesystem going read-only means it became confused and is stopping writes to hopefully avoid corrupting the filesystem.
> If the revert to read-only happens well after boot, rather than a failure to remount read-write during boot, log in remotely and monitor with 'journalctl -f'. Chances are the remote machine will capture the crucial events just as it goes read-only, and before whatever else kills sshd and prevents you from logging in remotely after the fact.
> What I would do is ssh in before it face plants; use 'journalctl -b -o short-monotonic --no-pager' to capture everything up to this point, and then 'journalctl -f -o short-monotonic' to follow from this point. I've never seen a file system get forced read-only without a lot of kernel messages, including a call trace.
> If the problem totally obviates using sshd, then use netconsole. Clearly the kernel, systemd, and systemd-journald survive long enough to report the fs has gone read-only. So netconsole would allow a remote machine to capture what happened right before this, using the same commands as above.
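For reference, a minimal netconsole setup might look like this; the interface name, IP addresses, and MAC address below are placeholders, not values from this machine:

## on the laptop (sender):
modprobe netconsole netconsole=6665@192.168.1.10/enp1s0,6666@192.168.1.20/aa:bb:cc:dd:ee:ff
dmesg -n 8                             ## send all kernel messages to the console
## on the receiving machine (192.168.1.20):
nc -u -l 6666 | tee netconsole.log     ## syntax varies by netcat flavor, e.g. 'nc -u -l -p 6666'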
Ok, I think I was able to catch the error without having to get that desperate :)
I logged in from work, ran journalctl -f, and got the following:
Looks disk-hardware related, but it's not getting worse, so I doubt it's the disk itself. A driver problem instead?
Thanks, Richard
On 06/30/2016 12:40 PM, Richard Shaw wrote:
> Ok, I think I was able to catch the error without having to get that desperate :)
> I logged in from work, ran journalctl -f, and got the following:
> Looks disk-hardware related, but it's not getting worse, so I doubt it's the disk itself. A driver problem instead?
Could be the disk. What does "smartctl -x /dev/sda" show? It could also be temperature-related. Laptops can get hot and cause many weird things.
--
Rick Stevens, Systems Engineer, AllDigital    ricks@alldigital.com
On Thu, Jun 30, 2016 at 1:40 PM, Richard Shaw hobbes1069@gmail.com wrote:
> Ok, I think I was able to catch the error without having to get that desperate :)
> I logged in from work, ran journalctl -f, and got the following:
> Looks disk-hardware related, but it's not getting worse, so I doubt it's the disk itself. A driver problem instead?
> Jun 30 14:11:38 ladyhobbes kernel: ata1.00: exception Emask 0x0 SAct 0x800 SErr 0x50000 action 0x6 frozen
> Jun 30 14:12:38 ladyhobbes kernel: ata1: SError: { PHYRdyChg CommWake }
> Jun 30 14:12:38 ladyhobbes kernel: ata1.00: failed command: WRITE FPDMA QUEUED
> Jun 30 14:12:38 ladyhobbes kernel: ata1.00: cmd 61/20:58:00:80:3b/01:00:0e:00:00/40 tag 11 ncq 147456 out
> Jun 30 14:12:38 ladyhobbes kernel:          res 40/00:37:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
> Jun 30 14:12:38 ladyhobbes kernel: ata1.00: status: { DRDY }
> Jun 30 14:12:38 ladyhobbes kernel: ata1: hard resetting link
It's the drive itself. The drive is failing to respond on a write command, hangs, exceeds the SCSI command timer, which then times out and hard resets the link. This happens several times, and ext4 needs to write its journal superblock, can't, gets pissed, and gives up and goes read only.
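As an aside, you can check how ext4 is configured to react when it detects errors itself (device name assumed):

tune2fs -l /dev/sda2 | grep -i 'errors behavior'   ## Continue, Remount read-only, or Panic
dmesg | grep -iE 'ext4|jbd2'                       ## look for journal aborts and "Remounting filesystem read-only"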
So the short version is the drive needs to be replaced. Write failures are always disqualifying. If this were an md software raid device, md would immediately mark the drive faulty on a single one of these kinds of errors.
Due to arguably flawed engineering, the drive is apparently taking more than 30 seconds to figure out it can't write to this sector, which is pretty messed up. And the Linux kernel's default command timeout is arguably too short, so it gives up before we get a discrete, proper error message from the drive about what's going on. This is a common misconfiguration and a genuinely bad one, but also an edge case, since read errors are rare and write errors are rarer still.
Anyway if you really want to play with this more, you can do:
smartctl -l scterc /dev/sdX ## this will reveal the SCT support and setting for the drive, which must always be shorter than the kernel's command timer. I expect this to be disabled, which means the value is unknown but could be as high as 180 seconds.
cat /sys/block/sdX/device/timeout ## this will reveal the kernel command timer, which defaults to 30, so I expect it to be 30.
Proper configuration means the drive gives up on errors before the kernel does, i.e. the first value needs to be less than the second. Pretty much no one has this unless they're using enterprise or NAS drives. So, diplomatically, I'd call this situation totally fucked. Undiplomatically, well, that involves drinking games first.
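If the drive supports SCT ERC, fixing that (until the next power cycle, at least) looks roughly like this; /dev/sda and the 7-second value are the usual convention, not something read off this machine:

smartctl -l scterc /dev/sda                ## current ERC limits, or disabled/unsupported
smartctl -l scterc,70,70 /dev/sda          ## cap error recovery at 7.0 seconds (values in deciseconds)
cat /sys/block/sda/device/timeout          ## kernel command timer, 30 by default
echo 180 > /sys/block/sda/device/timeout   ## fallback if ERC can't be set: let the kernel wait longer than the drive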
For the latest in this saga, I posted this a few days ago on linux-raid@ which is upstream for all things Linux RAID but in particular md, which ends up being the hardest hit by this problem as it eventually results in things like total raid5 (even raid6) collapse when it should be able to survive. http://marc.info/?l=linux-raid&m=146704573129021&w=2
And yes, I understand this thread involves one drive, but the misconfiguration is a problem there too, because manufacturers expect consumer drives to do these "deep" recoveries for marginally bad sectors, which can take (seriously) upwards of 3 minutes to sort out, during which time the drive is unresponsive. And right now Linux will have none of that and just resets the drive. That's solvable for read errors by increasing the kernel command timer. It's probably not solvable for write errors. I think if you increase the kernel command timer to 180 by using 'echo 180 > /sys...' what will happen is you'll just eventually get a discrete write error from the drive.
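If you go the longer-timer route and want it to persist across reboots, the usual trick is a udev rule; the file name and the match below are illustrative:

## /etc/udev/rules.d/60-disk-timeout.rules
ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", RUN+="/bin/sh -c 'echo 180 > /sys/block/%k/device/timeout'"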
So yeah, replace the drive.
Thanks for the detailed reply, so forgive me for not quoting :)
I'm not ready to believe the drive is bad. I think what happened is that, without realizing it, while playing around with Cockpit I accidentally installed tuned, which is interesting in concept, but its aggressive power management did not play nice with the hard drive. I have since removed tuned and am going to monitor things for a while.
Thanks, Richard
On Fri, Jul 1, 2016 at 4:19 PM, Richard Shaw hobbes1069@gmail.com wrote:
> Thanks for the detailed reply, so forgive me for not quoting :)
> I'm not ready to believe the drive is bad. I think what happened is that, without realizing it, while playing around with Cockpit I accidentally installed tuned, which is interesting in concept, but its aggressive power management did not play nice with the hard drive. I have since removed tuned and am going to monitor things for a while.
It's actually a fair point that, without a discrete error message from the drive, it isn't necessarily the drive. That's the problem with the "hard resetting link" message: it obscures the actual problem. It could be the drive, connectors, cable, or controller - including something going into a power-save mode and not waking up (in time) and causing problems. All we know is that a write command was sent, there was no response, and the kernel command timer expired and started doing link resets. That clears the whole command queue (possibly 31 tags), and the ensuing no-op back to ext4 basically made it go WTF rather than just requeue (?). So you end up with a bunch of scary messages...
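If the power-management theory is right, one way to check and temporarily tame it (device assumed; APM values 1-127 allow spin-down, 254 is the most conservative level that leaves APM enabled):

hdparm -B /dev/sda                            ## show the current APM level
hdparm -B 254 /dev/sda                        ## back off aggressive power management for this session
smartctl -A /dev/sda | grep -i load_cycle     ## a fast-growing count means constant head parking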
On Sat, Jul 2, 2016 at 1:55 PM, Chris Murphy lists@colorremedies.com wrote:
> On Fri, Jul 1, 2016 at 4:19 PM, Richard Shaw hobbes1069@gmail.com wrote:
>> Thanks for the detailed reply, so forgive me for not quoting :)
>> I'm not ready to believe the drive is bad. I think what happened is that, without realizing it, while playing around with Cockpit I accidentally installed tuned, which is interesting in concept, but its aggressive power management did not play nice with the hard drive. I have since removed tuned and am going to monitor things for a while.
> It's actually a fair point that, without a discrete error message from the drive, it isn't necessarily the drive. That's the problem with the "hard resetting link" message: it obscures the actual problem. It could be the drive, connectors, cable, or controller - including something going into a power-save mode and not waking up (in time) and causing problems. All we know is that a write command was sent, there was no response, and the kernel command timer expired and started doing link resets. That clears the whole command queue (possibly 31 tags), and the ensuing no-op back to ext4 basically made it go WTF rather than just requeue (?). So you end up with a bunch of scary messages...
smartctl -x /dev/sdX might reveal something. Depending on what reporting features it has, it might record command errors. But in any case, the attributes list would show some suspicious counts, in particular reallocated sectors. If that's 0, or at least not very high (dozens), then the drive still has reserve sectors and the write error is bogus - it just happened to be a write command the drive didn't respond to, rather than an actual failure to write.
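A quick way to pull just those counters out of the full report (device name assumed):

smartctl -x /dev/sda | grep -iE 'reallocat|pending|uncorrect|crc'

Reallocated or pending sectors climbing over time point at the drive; a growing UDMA CRC error count points more at the cable or connector.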