Callum Lerwick wrote:
I would like to put in my +1 for this. Performance is pointless if
you can not trust that your data is safe. I have on many occasions run
fscks on my supposedly clean ext3 filesystems, only to find some mild
corruption. How can this happen? Isn't journaling supposed to prevent
this? One day I ran a fsck before doing some filesystem resizing, only
to find one of my irreplaceable personal photos had become corrupted. I
had no way to know when or why this file got corrupted, it had been
written to disk some time ago and never touched since. I trusted
journaling, and it failed me.
Filesystem corruption can happen for many reasons, and journaling cannot
save you from them all. Think about bad cables, memory, kernel bugs,
bad hardware, rogue writes to the block device, etc. Journaling doesn't
help you in the face of any of these problems. If you are talking about
data corruption, it could have been an application bug for example (did
a photo editor corrupt it when you wrote the edited version?) There is
a long line of things which can go wrong, unfortunately. Trusting
journaling to keep all your data safe now and forevermore is misguided.
(Yes, I have a backup. I think...) After
this, I now turn on autofsck on all my machines, so that corruption at
least can't go undetected for years. Which means after a power fail it
takes my primary desktop with a pretty full 250 GB drive 20-30 minutes to
come back up, which is incredibly irritating, but I have to know my data
is safe. I've even picked up a habit of obsessively checksumming all my
really important files. I wish the filesystem would help do this for me.
(ZFS...)
Knowing is half the battle. See, what can happen here, is a file can get
corrupted, and I may not notice until years later. By then I may have
cycled through several full backups, and long since lost the backup I
did have of the file...
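
The checksumming habit described above can be sketched as a small script. This is only an illustration, not something either poster used: the function names (file_sha256, build_manifest, verify_manifest) and the choice of SHA-256 are assumptions. It records a digest for every file under a directory and later reports any file that is missing or whose contents no longer match.

```python
import hashlib
import os


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = file_sha256(full)
    return manifest


def verify_manifest(root, manifest):
    """Return sorted relative paths that are missing or whose digest changed."""
    bad = []
    for rel, expected in manifest.items():
        full = os.path.join(root, rel)
        if not os.path.isfile(full) or file_sha256(full) != expected:
            bad.append(rel)
    return sorted(bad)
```

A check like this only makes sense for files that should never change, such as archived photos, because it cannot tell a deliberate edit from corruption. Run regularly, it at least flags a damaged file while a good backup copy is still likely to exist.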
This must be fixed. Only through a long painful process of losing faith
have I learned to not trust my filesystems. I suspect there are many
others out there who have been bitten by filesystem corruption and just
don't know it yet.
Only now do I learn the likely reason for this corruption. How would I
have reported this? I just assumed it was hardware glitches.
True, corruption from out-of-order writes due to lack of barriers is
hard to identify as such. But unfortunately there are a few other
things that could have gone wrong too. There are things in the works to
help on the integrity front, though (see
http://oss.oracle.com/projects/data-integrity/ for example).
-Eric