Re: Advice on choosing git

Hi,

Noah Silverman wrote:

> I'm looking for both a version control system and backup system.

I am fond of this question. :)

> I guess, that I need just keep some files backed up (and/or synced) as
> they're not "working projects".  I will add new documents and
> occasionally edit others, but no real need for versioning.

I suggest rsync or unison[1], and using btrfs locally if you want
snapshots.  I don’t know a good tool for shared snapshots, but that is
probably my ignorance.
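
A minimal sketch of that setup, assuming the documents live in
~/documents and that directory is a btrfs subvolume (the paths and
snapshot naming are only for illustration):

  # Mirror the documents to an external drive; --delete propagates
  # removals so the copy stays a true mirror.
  rsync -a --delete ~/documents/ /media/backup/documents/

  # Keep local read-only snapshots for point-in-time recovery.
  btrfs subvolume snapshot -r ~/documents ~/snaps/docs-$(date +%F)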

In my humble opinion, tools designed for tracking source code, like
git and bzr, are not appropriate for this task.  To illustrate this, I
have put some thoughts about how to cheat git into doing an okay job
in a footnote[4].

> Other files
> are working projects (possible with collaboration) and need active VCS. 

In very small projects, I believe any free DVCS will do.

What tools are you and your collaborators already comfortable with?
I hear it can be hard to unlearn habits from using Subversion when
getting started with Git.  Some other version control systems cater to
that transition better.

As projects scale in size, the speed differences between version
control systems start to matter.  I find myself making larger commits,
looking through history less, and checking email more often when using
the slower ones.

> From what I have read, I will
> effectively have multiple copies of each item on my hard drive, thus
> eating up a lot of space (One of the "working file" and several in the
> .git directory.) If I have multiple changes to a file, then I have
> several full versions of it on my machine.

If your files are relatively compressible (or at least rsyncable) and
you pack the repository occasionally, this should not be a problem.
The relevant page[2] of the Pro Git book probably tells you more than
you wanted to know about this.

Short summary: each file is initially stored in the .git directory as
a compressed file named after a hash of its content.  When asked to
pack with the "git gc"[3] command (or automatically if there are too
many loose objects around), git puts the data into a larger "pack
file", this time as a delta against some suitable similar blob.
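
You can watch this happen from the command line; a quick sketch
(nothing here is specific to backups):

  # Space taken by loose (unpacked) objects and by existing packs.
  git count-objects -v

  # Repack: loose objects are rewritten into a pack file, stored as
  # deltas against similar blobs, and the loose copies are pruned.
  git gc

  # The packs end up under .git/objects/pack/.
  du -sh .git/objects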

For source code (which is already rather compressible), this tends to
work well.  My local git/.git object repository is about 2½ times the
size of the working copy.

> This could be a problem for
> a directory with 100GB or more, especially on a laptop with limited hard
> drive space.

Yes.  Actually, this point is why I replied.  Using a source code
management system as a backup system generally implies this weird
assumption that even the oldest revisions are always worth keeping.

With big, machine-generated files, that doesn’t make sense to me ---
it is better to be able to throw away some snapshots when you are
running low on space.

> 2) Sub-directory selection.  On my laptop, I only want a few
> sub-directories to be synced up.  I don't need my whole document tree,
> but just a few directories of things I work on.

It requires foresight, but you could use a separate filesystem for
this (possibly loop-mounted) if you want to keep snapshots.  With
some symlinks, this would not require changing the directory
structure.
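
A rough sketch of the loop-mounted variant, assuming btrfs and with
made-up sizes and paths:

  # Create a file-backed filesystem to hold just the synced subset.
  truncate -s 20G ~/synced.img
  mkfs.btrfs ~/synced.img
  sudo mkdir -p /mnt/synced
  sudo mount -o loop ~/synced.img /mnt/synced

  # Move the few directories you actually work on into it and leave
  # symlinks behind, so the visible layout does not change.
  mv ~/Documents/projects /mnt/synced/projects
  ln -s /mnt/synced/projects ~/Documents/projects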

> Any and all suggestions are welcome and appreciated.

Thanks for the food for thought.
Jonathan

[1] http://www.cis.upenn.edu/~bcpierce/unison/
[2] http://progit.org/book/ch9-4.html
[3] http://www.kernel.org/pub/software/scm/git/docs/git-gc.html
[4]
So, you want to use git as a general backup tool?

 . Files should be compressible.  Set appropriate attributes.  Use
   clean and smudge filters[5] to replace the weird working-copy
   representation with a simpler tracked form.  Use -delta[6] where
   appropriate so git knows not to waste its time.  (A .gitattributes
   sketch illustrating this follows the list.)

 . Files should be conducive to de-duplication.  Cut large files
   into slices using rsync’s rolling checksum algorithm[7].

 . Backups should be fault-tolerant.  Use par2[8] or zfec[9] to
   protect pack files, maybe.

 . Sometimes metadata (file owners and modes) is important.  Track a
   "restore" script that sets the appropriate metadata, and update it
   before each commit[10].

 . Files should not change as git reads them (or it will error
   out).  Wait for a quiescent state before backing up, or take a
   snapshot some other way and have git back that up.

 . Old revisions are not precious.  It would be nice to be able to
   decide when each backed-up tree can expire.  My best suggestion is
   to rely on reflogs[11] instead of the revision graph to represent
   your history so old versions can expire, but getting this to work
   nicely would take some work: there is no built-in mechanism to
   transfer reflogs and associated objects to another repository, for
   example.
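
To make the first point concrete, a .gitattributes sketch; the
"flatten" filter name and the file patterns are made up for
illustration:

  # .gitattributes
  # Run a (hypothetical) clean/smudge filter pair on OpenDocument
  # files so a simpler, more compressible form is what gets tracked.
  *.odt   filter=flatten

  # Already-compressed data: tell git not to waste time searching
  # for deltas when it packs these blobs.
  *.jpg   -delta
  *.mp4   -delta

The filter itself would be declared in .git/config, pointing at
whatever clean and smudge commands you write.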

[5] http://www.kernel.org/pub/software/scm/git/docs/gitattributes.html#_tt_filter_tt
[6] http://www.kernel.org/pub/software/scm/git/docs/gitattributes.html#_tt_delta_tt
[7] http://github.com/apenwarr/bup
[8] http://parchive.sourceforge.net/
[9] http://allmydata.org/trac/zfec
[10] http://kitenet.net/~joey/code/etckeeper/
[11] http://www.kernel.org/pub/software/scm/git/docs/git-reflog.html
