copy full system from old disk to a new one

Gordan Bobic gordan at bobich.net
Wed Feb 20 10:52:44 UTC 2013


On 20/02/2013 06:22, Steve Ellis wrote:
>
>
>
> On Tue, Feb 19, 2013 at 3:52 PM, Gordan Bobic <gordan at bobich.net> wrote:
>
>     On 02/19/2013 10:00 PM, Reindl Harald wrote:
>
>
>
>     No, my experience does not go as far back as 6 years for obvious
>     reasons. My experience with mechanical disks, however, goes as far
>     back as 25 years, and I can promise you, they are every bit as
>     unreliable as you fear the SSDs might be.
>
> So, my experience with mechanical disks dates back 25 years as well (my
> first was a 5.25" HH 20M in a PC I bought in 1986), but I've had more
> frightening experiences with SSDs (and yet I still use them) than I have
> with conventional drives.  I've had 3 complete and total failures of
> name-brand SSD (all from the same vendor, unfortunately) within the
> course of 1 year, all drives were less than 1 year old, and were
> deployed in fairly conventional desktop machines--one was a warranty
> replacement of an earlier failure.  I've had unpleasant experiences with
> conventional disks as well, but I don't believe I've ever had more than
> one conventional drive fail so completely that _no_ data could be
> recovered--all of my SSD failures were like that.

3 out of how many? Bad models happen all the time, for example the 
ST31000340AS. I originally bought 4, and out of those 4 drives I've had 
6 replacements under warranty so far (6 months of warranty left). Some 
suffered total media failure, some bricked completely after a 
secure-erase (often the only way to reliably get pending sectors to 
reallocate on drives with broken firmware). Some just ran out of spare 
sectors to reallocate (the softest failure I've seen on those).

A few years ago I got a pair of HD501LJ drives - both suffered massive 
media failure, and while no doubt some data would have been recoverable, 
it would have taken so long with the failing drives that restoring onto 
a fresh pair was more expedient. It took, IIRC, 8 replacement drives to 
actually get a pair that fully worked and passed all of their built-in 
SMART tests.

I wrote up some of the experience here
http://www.altechnative.net/2011/03/21/the-appalling-design-of-hard-disks/

along with other "shouldn't happen" failure modes.

I'm not saying SSDs are any better, but I don't think they are any 
worse, either.

>         data without a raid are useless
>
>
>     My point was that even RAID is next to useless because it doesn't
>     protect you against bit-rot.
>
> As we all know, both conventional drives (and I believe SSDs) use
> extensive error detection/correction so that the drive will know if a
> block is unreliable (most of the time the drive will manage to remap
> that block elsewhere before it becomes unrecoverable)

Drives simply do not do that in normal operation. Once a sector rots 
out, it'll get reallocated on the next write and you'll lose its contents.

The only case where the drive will automatically do any re-mapping 
before data loss occurs is when the Write-Read-Verify feature is enabled:

http://www.altechnative.net/2011/04/06/enabling-write-read-verify-feature-on-disks/

I upstreamed a patch to hdparm to toggle this (it has been in the 
mainline release for a year or so).

Unfortunately, very few disks have this feature. I've only found 
Seagates to have it, and not even all of them.
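
If you want to check whether a drive supports it and switch it on, 
something along these lines works as a starting point. It's only a 
rough Python sketch - the mode value and the exact wording hdparm -I 
prints are assumptions on my part, so check your hdparm man page before 
trusting it:

import subprocess
import sys

def supports_wrv(dev):
    """True if `hdparm -I` lists the Write-Read-Verify feature set."""
    out = subprocess.run(["hdparm", "-I", dev],
                         capture_output=True, text=True, check=True).stdout
    return "Write-Read-Verify" in out

def enable_wrv(dev, mode=2):
    """Attempt to enable Write-Read-Verify via hdparm's
    --write-read-verify option. Mode semantics come from the ATA spec
    and vary between drives - treat this value as illustrative."""
    subprocess.run(["hdparm", "--write-read-verify", str(mode), dev],
                   check=True)

if __name__ == "__main__":
    dev = sys.argv[1] if len(sys.argv) > 1 else "/dev/sda"
    if supports_wrv(dev):
        print(dev, "advertises Write-Read-Verify, enabling it")
        enable_wrv(dev)
    else:
        print(dev, "does not advertise Write-Read-Verify")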

> --individual drives
> only _very_ rarely manage to return the wrong data (I'm actually not
> sure I've _ever_ seen that happen).

I've seen it happen pretty regularly. Healing the pending sectors tends 
to have massive knock-on effects on performance, though, especially if 
there is more than one (and they usually come in sets).

I just stick with ZFS - I can run SMART tests to identify the bad 
sectors, then just clobber them to try to get them to reallocate. A 
scrub can then still use the checksums to establish which is the correct 
set of blocks, reassemble them, and restore the data to the one that was 
clobbered. Far better and more reliable than depending on traditional RAID.
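
To give a flavour of that workflow, here's a rough Python sketch of it. 
The device and pool names are placeholders, the smartctl log parsing is 
simplistic, and writing to the wrong LBA destroys data, so treat it as 
an illustration of the idea rather than a tool:

import re
import subprocess

DEV = "/dev/sdX"   # placeholder: the suspect member disk
POOL = "tank"      # placeholder: the pool that disk belongs to

def first_error_lba(dev):
    """Pull LBA_of_first_error out of the SMART selftest log."""
    out = subprocess.run(["smartctl", "-l", "selftest", dev],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        m = re.search(r"read failure\s.*?(\d+)\s*$", line)
        if m:
            return int(m.group(1))
    return None

def clobber_sector(dev, lba):
    """Overwrite one sector with zeros so the drive either rewrites it
    in place or reallocates it from the spare pool."""
    subprocess.run(["hdparm", "--yes-i-know-what-i-am-doing",
                    "--write-sector", str(lba), dev], check=True)

lba = first_error_lba(DEV)
if lba is not None:
    clobber_sector(DEV, lba)
    # The zeroed block will now fail its ZFS checksum; a scrub repairs
    # it from the redundant copy/parity.
    subprocess.run(["zpool", "scrub", POOL], check=True)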

> The problem with RAID is when no one is looking to see if the RAID
> system had to correct blocks--once you see more than a couple of RAID
> corrections happen, it is time to replace a disk--if no one looks at the
> logs, then eventually, there will be double (or in the case of RAID6,
> triple) failure, and you will lose data.

Replacing disks after only a couple of reallocated sectors is going to 
get expensive. Most disks today have a specified unrecoverable read 
error rate of 1 in 10^14 bits, which works out to an unrecoverable 
sector roughly every 11-12TB read. So if you have a 5+1 RAID5 array and 
you lose a disk, the chance of encountering an unrecoverable sector 
during the rebuild is about 50% - not good.
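
The back-of-the-envelope version of that calculation, crudely assuming 
independent bit errors at the quoted spec, puts the expectation at 
around 0.4 unrecoverable sectors per rebuild - a failure chance in the 
same unpleasant territory:

# Back-of-the-envelope check, assuming independent bit errors at the
# quoted spec of 1 unrecoverable error per 1e14 bits read.
URE = 1e-14          # unrecoverable errors per bit read
TB = 1e12            # bytes

# Mean data read between unrecoverable errors: ~12.5TB (about 11 TiB).
print((1 / URE) / 8 / TB)

# 5+1 RAID5 of 1TB drives: a rebuild reads the 5 surviving disks.
bits_read = 5 * TB * 8
print("expected UREs per rebuild:", bits_read * URE)          # ~0.4
print("P(at least one URE):", 1 - (1 - URE) ** bits_read)     # ~0.33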

> A further problem with RAID is
> when some of the blocks are never read.  Any reasonable RAID controller
> will not only make the log of RAID corrections available (mine helpfully
> emails me when corrections happen), but will also have the option of
> scanning the entire RAID volume periodically to look for undetected
> individual block failures (my system does this scan 2x per week).  I've
> never used software RAID, so I don't know if these options are available
> (but I assume they are).  It would be suicidal to rely on any RAID
> system that didn't offer both logs of corrections as well as an easy way
> to scan every single block (including unused blocks) looking for
> unnoticed issues.

Personally, I find that of late even this isn't good enough. ZFS can do 
the same job in a much more robust way.
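
The ZFS equivalent of that patrol read plus alert email is just a 
scheduled scrub and a look at the error counters afterwards. A minimal 
sketch in the same vein (pool name is a placeholder, scheduling left to 
cron or a systemd timer):

import subprocess

POOL = "tank"  # placeholder

# Kick off a scrub; it runs in the background, reading every allocated
# block and verifying it against its checksum.
subprocess.run(["zpool", "scrub", POOL], check=True)

# Later, check the scrub summary and per-device error counters.
status = subprocess.run(["zpool", "status", "-v", POOL],
                        capture_output=True, text=True, check=True).stdout
if "errors: No known data errors" not in status:
    print("ZFS found problems on", POOL)
    print(status)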

>         and in case of
>         RAID you have to have at least one full backup
>         and so it does not care me if disks are dying
>
>
>     Depends how many versioned backups you have, I suppose. It is
>     possible to not notice RAID silenced bit-rot for a long time,
>     especially with a lot of data.
>
>
> I have a 5x1TB RAID5 (plus 1 hot spare) system (I suppose this is no
> longer considered a lot of data, but it was to me when I built it) that
> has _never_ had an unrecoverable problem--and I've now replaced every
> drive at least once (and I just started a migration to a 3x3TB RAID 5 w/
> spare before any more fail)--I built my system in late 2003 (with 250GB
> drives), and the only time the RAID system has been down for more than a
> few minutes is when I migrate either drives or controller (or when I
> upgrade fedora).

3x3TB RAID5 is _brave_, IMO. But hey, it's your data. :)

Gordan

