Re: GIT vs Other: Need argument

On Tue, 17 Apr 2007, Matthieu Moy wrote:
> 
> * Perhaps your boss will be interested in the "data integrity" (i.e.
>   git fsck) problem too.

The data integrity thing is a lot more than just fsck.

I care a lot about my data, and it's an area where a *lot* of systems fall 
down. CVS is just about the worst (basically no checksums or sanity 
checking anywhere), and you can pretty much have total data corruption 
without ever even _realizing_, until you try to get some old version.

Even more interesting with CVS is that you can have total data corruption 
and you'll not realize it *even*as* you use the data. Lots of people and 
projects have been known to happily move *,v files around and edit the 
CVS repo files by hand to make things look right, which means that not 
only did you do a "rename" in CVS, you actually renamed *retroactively* 
too - you made history look wrong!

So with CVS, you actually have no guarantees whatsoever that when you
check out something old, you'll get what you actually used to have. You
can tag things as much as you want - if people end up editing the CVS
files (and people *do* that), you'll never have any indication that the
history you checked out isn't the "real" history. So you can check out
the old version that you once released to a customer, and may be
totally unable to recreate the customer problem, because the release you
checked out doesn't even compile any more!

You can actually do the same with most other SCMs. It may take somebody
who is actually malicious, but even that isn't necessarily the case. Lots
of SCMs don't have any checksums *at*all* on their data - the only way
you'd ever know that something bad happened and you had disk corruption
is when you check something out and it just looks corrupted!

In other words, in a lot of SCMs, you're actually *lucky* if the
corruption is so serious that it's not just a subtle "data is wrong"
thing, but so pervasive that you actually get an error from the SCM.

In git, every *single* piece of data is not just checksummed, it's 
CHECKSUMMED. Yeah, we use CRCs and Adler32 for some things, but even
those are actually *also* protected at a higher level by real 
cryptographic hashes. You simply *cannot* corrupt data by mistake and not 
know about it. You can lose it, you can corrupt it, but it *will* be 
noticed.
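
To make the "cryptographic hash" point concrete, here is a minimal sketch
of how a git blob gets its name: the SHA-1 of a small header plus the file
contents. The helper names (git_blob_id, is_intact) are made up for
illustration and are not part of git; the sketch only covers loose blob
objects, not trees, commits or packs.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the object name git gives a blob with this content.

    The name is SHA-1 over the header "blob <size>\\0" followed by the data.
    """
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def is_intact(expected_id: str, content: bytes) -> bool:
    """Hypothetical helper: re-hash the data and compare with its name."""
    return git_blob_id(content) == expected_id


original = b"hello world\n"
object_id = git_blob_id(original)

# Simulate a single flipped bit, as disk corruption might produce.
damaged = bytearray(original)
damaged[0] ^= 0x01

print(object_id)
print(is_intact(object_id, original))        # data matches its name
print(is_intact(object_id, bytes(damaged)))  # one flipped bit is caught
```

Because the name *is* the hash of the data, anything that refers to the
object by name (a tree, and through it a commit) would stop matching if
the data changed underneath it.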

If that doesn't make you feel good about your data, I don't know what 
will. Git will not replace backups in any way, shape, or form (although 
you can obviously use git itself to _do_ those backups - the joy of 
distributed SCMs), but it will tell you when you *need* those backups.

Guaranteed.

And I can tell you that that is actually very rare. I doubt *any* 
commercial SCM will come even close. They might have checksums, but 
nothing really strong. It might be a CRC or even weaker. Or it might be 
nothing at all (and sadly, that's the *common* case).
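
As a rough illustration of why a CRC is not "really strong": CRC-32 is
affine over XOR, so for equal-length inputs the checksum of a XOR b XOR c
can be computed from the three individual checksums without ever looking
at the combined data. The sketch below uses Python's zlib.crc32; the
sample strings and the xor_bytes helper are made up for the example.

```python
import zlib


def xor_bytes(*chunks: bytes) -> bytes:
    """XOR together byte strings that all have the same length."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        assert len(chunk) == len(out), "inputs must have equal length"
        for i, byte in enumerate(chunk):
            out[i] ^= byte
    return bytes(out)


a = b"release-1.0 source tarball"
b = b"release-1.1 source tarball"
c = b"release-2.0 source tarball"

# Predict the CRC of the combined data from the three CRCs alone.
predicted = zlib.crc32(a) ^ zlib.crc32(b) ^ zlib.crc32(c)
actual = zlib.crc32(xor_bytes(a, b, c))

print(predicted == actual)
```

That kind of structure is fine for catching random bit flips, but it means
somebody can work out changes that leave the checksum alone. A
cryptographic hash is designed so that no such shortcut is known.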

				Linus