Last January, a server I run (Red Hat 9) was rebooted (I don't remember why). When it came back up, it reported problems with one partition, dropped me to a shell prompt, and told me (as best I can remember) to run fsck manually. I did, and when it asked if I wanted to fix the problems it found, I said yes. Upon exiting and rebooting, the system was effectively dead. Oh, it would boot, but many of the files that were "fixed" by fsck were in /usr/sbin and would no longer run, including httpd, sendmail and sshd. It was a pain, but I decided this would be as good a time as any to upgrade from whatever old system I had to FC1. I had recently bought a new hard drive, so I installed FC1 onto that, then mounted the old drive and was able to recover pretty much everything without having to dig into my "real" backups. Things went relatively smoothly and I soon had a happily running FC1 server. Deep sigh.
This morning I was working near the server and needed to move the UPS it's plugged into. Unfortunately, I pressed the UPS power switch (which should be harder to press!) and the room went quiet! Yikes. Okay, don't panic. I turned things back on and the server started up. I got virtually the same message as last time, telling me to run fsck manually. Naturally, I was a bit worried about doing that, but I didn't see what choice I had. Knowing the worst that could happen was that I'd have to go to backups, I went ahead and fixed the errors it found. Fortunately for me, the system booted properly afterwards and everything seems to be running. There may be problems I haven't found yet, but at least the main things are working.
The two incidents were 11 months apart and on different physical drives (although the rest of the hardware is the same). The system was not rebooted from the time I finished the upgrade in January until today. Sure, it is time (at least) to upgrade from FC1 to FC3 but I'm always happier doing it in MY time and not because I have to.
My question...
Is there something I should be doing to prevent this sort of thing? On a system that doesn't get rebooted very often, should fsck be run manually from time to time? Or would this just cause the same sort of problem? Any suggestions so that I don't have a repeat of this next November would be appreciated.
On Fri, 3 Dec 2004 14:48:40 -0500, Henry Hartley henryhartley@westat.com wrote:
My question...
Is there something I should be doing to prevent this sort of thing? On a system that doesn't get rebooted very often, should fsck be run manually from time to time? Or would this just cause the same sort of problem? Any suggestions so that I don't have a repeat of this next November would be appreciated.
Disks that run continuously for months or years are most likely to fail when they're unexpectedly powered down and then powered up again. To prevent a single disk failure from taking down your server, you should always run (at a minimum) RAID-1 mirroring on two disks, or better yet, RAID-10 (i.e., a stripe across mirrored pairs) over more disks. If you need to economize on disks, you might consider RAID-5 instead.
If system failure simply isn't an option, you might consider a SAN. But then you're talking real money.
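For example, a basic two-disk software mirror with mdadm might look something like this (the device names and mount point here are only placeholders, not a prescription for your box):

  # create a two-disk RAID-1 mirror from two empty partitions (example devices)
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/hdb1 /dev/hdc1

  # put a filesystem on the mirror and mount it
  mkfs.ext3 /dev/md0
  mkdir -p /mnt/data
  mount /dev/md0 /mnt/data

  # watch the initial resync and check the array's health
  cat /proc/mdstat
  mdadm --detail /dev/md0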
fsck'ing a filesystem periodically probably isn't necessary; fsck is meant as a last resort to fix your filesystem when something bad has happened. Running fsck on a good filesystem probably wouldn't achieve much.
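One thing worth knowing: ext2/ext3 already schedules its own periodic checks (every so many mounts or so many days), and you can inspect or tune that with tune2fs rather than fsck'ing by hand. Roughly like this, with the device name being just an example:

  # show the current mount-count and time-based check settings
  tune2fs -l /dev/hda2 | grep -i -e 'mount count' -e 'check'

  # force a full check every 30 mounts or every 6 months, whichever comes first
  tune2fs -c 30 /dev/hda2
  tune2fs -i 6m /dev/hda2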
On Fri, 3 Dec 2004 14:48:40 -0500, Henry Hartley henryhartley@westat.com wrote:
Is there something I should be doing to prevent this sort of thing?
Are you running a journalled file system? Are you backing up?
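If the answer to the first question is "no" (plain ext2), adding a journal in place is easy; something like this, with the partition name being only an example:

  # add a journal to an existing ext2 filesystem (works even while it's mounted)
  tune2fs -j /dev/hda2

  # then change the filesystem type in /etc/fstab from ext2 to ext3, e.g.:
  #   /dev/hda2   /usr   ext3   defaults   1 2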
I'm not sure I agree with the suggestion about RAID. If the drives are failing because of age then when one fails, you're going to have to replace them all. But frankly, I don't believe that they're dying of age - I've had drives last for over 3 years with 24 hour a day usage with only one or two reboots.
I'm also curious as to why things would be corrupt in /usr/sbin - nothing should be written there except when you're installing software, so it's unlikely to have unflushed buffers. But perhaps you just have one big partition? I like to use lots of partitions (separate ones for /, /var, /usr, /tmp and /home at least) so that what goes on in one partition isn't as likely to affect the others.
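For example, an /etc/fstab for that kind of layout might look roughly like this (the devices, ordering and mount options are purely illustrative):

  # example layout only - adjust devices and sizes to your own disk
  /dev/hda1   /       ext3   defaults   1 1
  /dev/hda2   /usr    ext3   defaults   1 2
  /dev/hda3   /var    ext3   defaults   1 2
  /dev/hda5   /tmp    ext3   defaults   1 2
  /dev/hda6   /home   ext3   defaults   1 2
  /dev/hda7   swap    swap   defaults   0 0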
On Fri, 3 Dec 2004 18:14:15 -0500, Paul Tomblin ptomblin@gmail.com wrote:
I'm not sure I agree with the suggestion about RAID. If the drives are failing because of age then when one fails, you're going to have to replace them all. But frankly, I don't believe that they're dying of age - I've had drives last for over 3 years with 24 hour a day usage with only one or two reboots.
I work in a data centre; it's exceedingly rare for more than a single disk in an array to fail at a given time. That being said, I have seen situations where a single disk in a RAID 5 array has failed, and when the disk was replaced, another disk failed due to the age of the disk and the increased load caused by the resync operation. That's why backups are important.
Nonetheless, RAID offers a minimal level of protection that you just can't achieve by hopin' and prayin' nothing ever goes wrong with that single disk that has all your important stuff on it. :-)
Storage is like security -- layers of protection. A journalling filesystem, redundant disks, and backups are all parts of the equation, but none of them is sufficient in and of itself.
hmm, what arrays are you using? It seems "more than common" with both Sun arrays and LSI (or IBM re-badged) arrays.
Anyone else have experiences, good or bad?
--Harry
Ben Steeves wrote:
I work in a data centre; it's exceedingly rare for more than a single disk in an array to fail at a given time. That being said, I have seen situations where a single disk in a RAID 5 array has failed, and when the disk was replaced, another disk failed due to the age of the disk and the increased load caused by the resync operation. That's why backups are important.
On Fri, 03 Dec 2004 18:30:26 -0500, Harry Hoffman hhoffman@ip-solutions.net wrote:
hmm, what arrays are you using? It seems "more than common" with both Sun arrays and LSI (or IBM re-badged) arrays.
Anyone else have experiences, good or bad?
Our company is having a horrific failure rate on the RAIDs in our Dells. Our service guys say that in the 6 months we've had them at the customer sites, we've had to replace nearly 30% of them.
On Fri, 2004-12-03 at 18:38 -0500, Paul Tomblin wrote:
On Fri, 03 Dec 2004 18:30:26 -0500, Harry Hoffman hhoffman@ip-solutions.net wrote:
hmm, what arrays are you using? It seems "more than common" with both Sun arrays and LSI (or IBM re-badged) arrays.
Anyone else have experiences, good or bad?
Our company is having a horrific failure rate on the RAIDs in our Dells. Our service guys say that in the 6 months we've had them at the customer sites, we've had to replace nearly 30% of them.
If you are having that high a failure rate on drives that are only 6 months old, you should get your vendor to replace the drives with a different model. Look for a model that has a record of long life, not just the one that is the "latest thing".
I work for a computer hardware company, and the newest model is not always the most reliable.
On Fri, 2004-12-03 at 18:38 -0500, Paul Tomblin wrote:
On Fri, 03 Dec 2004 18:30:26 -0500, Harry Hoffman hhoffman@ip-solutions.net wrote:
hmm, what arrays are you using? It seems "more than common" with both Sun arrays and LSI (or IBM re-badged) arrays.
Anyone else have experiences, good or bad?
Our company is having a horrific failure rate on the RAIDs in our Dells. Our service guys say that in the 6 months we've had them at the customer sites, we've had to replace nearly 30% of them.
---- temperature matters - I had a problem at one office until I found that the building's server room went without air conditioning on summer weekends (this is Phoenix). I suspect that had a lot to do with a few disk failures this customer had. The room now has its own air conditioner.
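If you want to keep an eye on this, smartmontools will report drive temperature and can warn you when attributes start to slip; for instance (the device name is just an example):

  # one-off look at temperature and overall health
  smartctl -a /dev/hda | grep -i temperature
  smartctl -H /dev/hda

  # or let smartd watch the drive and mail root, with a line like this in /etc/smartd.conf:
  #   /dev/hda -a -m root@localhost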
Craig
On Sat, 2004-12-04 at 23:41 -0700, Craig White wrote:
Our company is having a horrific failure rate on the RAIDs in our Dells. Our service guys say that in the 6 months we've had them at the customer sites, we've had to replace nearly 30% of them.
We used to have this problem. It got solved when we installed add-on fans, specifically designed for cooling hard drives, into the PC cases.
Best regards
Marvin Dickens
On Fri, 03 Dec 2004 18:30:26 -0500, Harry Hoffman hhoffman@ip-solutions.net wrote:
hmm, what arrays are you using? It seems "more than common" with both Sun arrays and LSI (or IBM re-badged) arrays.
Anyone else have experiences, good or bad?
If more of your drives are failing before their rated MTBF than after it, I would say you should be talking to your vendor. There are a lot of components in a SCSI RAID subsystem (the controller, the backplane, the drives, and the software that runs on them all) -- it's a sticky puzzle to debug, no question.
Like I said before, there's no silver bullet -- you just need to have multiple levels of redundancy, and plans in place for when those redundancies fail.