Re: copy full system from old disk to a new one

On 20/02/2013 06:22, Steve Ellis wrote:



On Tue, Feb 19, 2013 at 3:52 PM, Gordan Bobic <gordan@xxxxxxxxxx> wrote:

    On 02/19/2013 10:00 PM, Reindl Harald wrote:



    No, my experience does not go as far back as 6 years, for obvious
    reasons. My experience with mechanical disks, however, goes as far
    back as 25 years, and I can promise you, they are every bit as
    unreliable as you fear the SSDs might be.

So, my experience with mechanical disks dates back 25 years as well (my
first was a 5.25" HH 20M in a PC I bought in 1986), but I've had more
frightening experiences with SSDs (and yet I still use them) than I have
with conventional drives.  I've had 3 complete and total failures of
name-brand SSDs (all from the same vendor, unfortunately) within the
course of 1 year; all of the drives were less than a year old and were
deployed in fairly conventional desktop machines--one was a warranty
replacement of an earlier failure.  I've had unpleasant experiences with
conventional disks as well, but I don't believe I've ever had more than
one conventional drive fail so completely that _no_ data could be
recovered--all of my SSD failures were like that.

3 out of how many? Bad models happen all the time - the ST31000340AS is a prime example. I originally bought 4 of those, and for those 4 drives I've had 6 warranty replacements so far (with 6 months of warranty left). Some suffered total media failure, some bricked completely after a secure-erase (often the only way to reliably get pending sectors to reallocate on drives with broken firmware), and some just ran out of spare sectors to reallocate (the softest failure I've seen on those).

A few years ago I got a pair of HD501LJ drives - both suffered massive media failure, and while no doubt some of the data would have been recoverable, it would have taken so long with the failing drives that restoring onto a fresh pair of drives was more expedient. It took, IIRC, 8 replacement drives to actually get a pair that fully worked and passed all of their built-in SMART tests.

I wrote up some of the experience, along with other "shouldn't happen"
failure modes, here:
http://www.altechnative.net/2011/03/21/the-appalling-design-of-hard-disks/

I'm not saying SSDs are any better, but I don't think they are any worse, either.

        data without a RAID is useless


    My point was that even RAID is next to useless because it doesn't
    protect you against bit-rot.

As we all know, conventional drives (and, I believe, SSDs) use
extensive error detection/correction, so the drive will know if a
block is unreliable (most of the time the drive will manage to remap
that block elsewhere before it becomes unrecoverable)

Drives simply do not do that in normal operation. Once a sector rots out, it'll get reallocated on the next write and you'll lose its contents.

The only case where the drive will automatically do any re-mapping before data loss occurs is when the Write-Read-Verify feature is enabled:

http://www.altechnative.net/2011/04/06/enabling-write-read-verify-feature-on-disks/

I upstreamed a patch to hdparm to toggle this (it has now been in the main release for a year or so).

Unfortunately, very few disks have this feature. I've only found Seagates to have it, and not even all of them.
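
If you want to check for it and enable it from a script, something along these lines should work. This is only a rough sketch - the -R get/set flag and the feature string in the -I output are from memory, so treat them as assumptions and check your hdparm man page first:

import subprocess

DISK = "/dev/sdX"  # hypothetical device name - substitute your own

# Ask the drive for its identification data; recent hdparm versions list
# supported features in the -I output.
ident = subprocess.run(["hdparm", "-I", DISK],
                       capture_output=True, text=True, check=True).stdout

if "Write-Read-Verify" in ident:  # assumption: exact feature string may differ
    # Assumption: -R1 enables Write-Read-Verify, -R0 disables it (this is the
    # get/set flag the patch added; much older hdparm used -R for something else).
    subprocess.run(["hdparm", "-R1", DISK], check=True)
else:
    print("Drive does not report Write-Read-Verify support")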

--individual drives
only _very_ rarely manage to return the wrong data (I'm actually not
sure I've _ever_ seen that happen).

I've seen it happen pretty regularly. Healing the pending sectors tends to have massive knock-on effects on performance, though, especially if there is more than one (and they usually come in sets).

I just stick with ZFS - I can run SMART tests to identify the bad sectors, then just clobber them to try to get them to reallocate. A scrub can then still use the checksums to establish which copy of the blocks is the correct one to reassemble from, and restore the data to the one that has been clobbered. Far better and more reliable than depending on traditional RAID.
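
To illustrate why the checksums matter: plain RAID can tell that two copies of a block disagree, but not which one is right, whereas a checksumming filesystem can. A toy sketch of the principle (not ZFS code, just the idea, with made-up block contents):

import hashlib

def pick_good_copy(copies, stored_checksum):
    # Return whichever replica still matches the checksum recorded at
    # write time. Plain RAID has no such checksum, so it can only guess.
    for data in copies:
        if hashlib.sha256(data).hexdigest() == stored_checksum:
            return data
    return None  # every copy is bad - genuine data loss

good = b"important data"
rotted = b"important dat\x00"                # one mirror copy has silently rotted
checksum = hashlib.sha256(good).hexdigest()  # stored when the block was written

print(pick_good_copy([rotted, good], checksum) == good)  # True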

The problem with RAID is when no one is looking to see if the RAID
system had to correct blocks--once you see more than a couple of RAID
corrections happen, it is time to replace a disk--if no one looks at the
logs, then eventually, there will be double (or in the case of RAID6,
triple) failure, and you will lose data.

Replacing disks after only a couple of reallocated sectors is going to get expensive. Most disks today have a specified unrecoverable read error rate of 1 in 10^14 bits, which means an unrecoverable sector every 11TB or so of reads. So if you have a 5+1 RAID5 array and you lose a disk, the chance of encountering an unrecoverable sector during the rebuild is about 50% - not good.
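
If anyone wants to sanity-check that back-of-envelope figure, the arithmetic looks roughly like this - the 2TB member disk size is just an assumption for illustration, and the real number obviously depends on drive size and on how literally you take the quoted error rate:

import math

# Assumptions (for illustration only): 2TB member disks, an unrecoverable
# read error rate of 1 per 10^14 bits, and independent errors.
ure_per_bit = 1e-14
disk_bytes = 2e12          # 2TB per member disk (assumed)
surviving_disks = 5        # 5+1 RAID5 minus the failed drive

bits_read = surviving_disks * disk_bytes * 8
expected_ures = bits_read * ure_per_bit
p_hit = 1 - math.exp(-expected_ures)   # Poisson approximation

print(f"bits read during rebuild: {bits_read:.2e}")
print(f"expected UREs:            {expected_ures:.2f}")
print(f"P(rebuild hits a URE):    {p_hit:.0%}")   # ~55% with these numbers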

A further problem with RAID is
when some of the blocks are never read.  Any reasonable RAID controller
will not only make the log of RAID corrections available (mine helpfully
emails me when corrections happen), but will also have the option of
scanning the entire RAID volume periodically to look for undetected
individual block failures (my system does this scan 2x per week).  I've
never used software RAID, so I don't know if these options are available
(but I assume they are).  It would be suicidal to rely on any RAID
system that didn't offer both logs of corrections as well as an easy way
to scan every single block (including unused blocks) looking for
unnoticed issues.

Personally I find that even this isn't good enough of late. ZFS can do this in a much more robust way.
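
For what it's worth, Linux software RAID does offer both: mdadm --monitor can mail you on events, and you can kick off a full scrub of the array through sysfs. A minimal sketch, assuming a hypothetical array at /dev/md0 and root privileges:

from pathlib import Path

md = Path("/sys/block/md0/md")   # hypothetical array - adjust to taste

# Writing "check" starts a read-only scrub of every block in the array
# (including blocks no file currently uses).
(md / "sync_action").write_text("check\n")

# Once sync_action goes back to "idle", mismatch_cnt holds the count of
# mismatched sectors the scrub found.
print("state:     ", (md / "sync_action").read_text().strip())
print("mismatches:", (md / "mismatch_cnt").read_text().strip())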

        and in the case of
        RAID you have to have at least one full backup anyway,
        and so it does not bother me if disks are dying


    Depends on how many versioned backups you have, I suppose. It is
    possible not to notice RAID-silenced bit-rot for a long time,
    especially with a lot of data.


I have a 5x1TB RAID5 (plus 1 hot spare) system (I suppose this is no
longer considered a lot of data, but it was to me when I built it) that
has _never_ had an unrecoverable problem--and I've now replaced every
drive at least once (and I just started a migration to a 3x3TB RAID5 w/
spare before any more fail). I built my system in late 2003 (with 250GB
drives), and the only time the RAID system has been down for more than a
few minutes is when I migrate either drives or controller (or when I
upgrade Fedora).

3x3TB RAID5 is _brave_, IMO. But hey, it's your data. :)

Gordan
--
users mailing list
users@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe or change subscription options:
https://admin.fedoraproject.org/mailman/listinfo/users
Guidelines: http://fedoraproject.org/wiki/Mailing_list_guidelines
Have a question? Ask away: http://ask.fedoraproject.org

