On Fri, Jan 21, 2011 at 01:57:21PM -0500, Eric Montellese wrote:

> I did a search for this issue before posting, but if I am touching on
> an old topic that already has a solution in progress, I apologize. As

It's been talked about a lot, but there is not exactly a solution in
progress. One promising direction is not very different from what
you're doing, though:

> Solution:
> The short version:
> ***Don't track binaries in git. Track their hashes.***

Yes, exactly. But what your solution lacks, I think, is more
integration into git. Specifically, using clean/smudge filters you can
have git take care of tracking the file contents automatically. At the
very simplest, it would look like:

-- >8 --
cat >$HOME/local/bin/huge-clean <<'EOF'
#!/bin/sh
# In an ideal world, we could actually access the original file
# directly instead of having to cat it to a new file.
temp="$(git rev-parse --git-dir)"/huge.$$
cat >"$temp"

sha1=`sha1sum "$temp" | cut -d' ' -f1`

# now move it to wherever your permanent storage is; e.g.:
#   scp "$temp" host:/path/to/big_storage/$sha1
cp "$temp" /tmp/big_storage/$sha1
rm -f "$temp"

echo $sha1
EOF

cat >$HOME/local/bin/huge-smudge <<'EOF'
#!/bin/sh
# the stored blob contains only the sha1 of the real content
read sha1

# now retrieve the blob; we could optionally do some caching here, e.g.:
#   ssh host cat /path/to/big_storage/$sha1
cat /tmp/big_storage/$sha1
EOF
-- 8< --

Obviously our storage mechanism (throwing things in /tmp) is
simplistic, but you could store and retrieve via ssh, http, s3, or
whatever. You can try it out like this:

  # set up our filter config and fake storage area
  mkdir /tmp/big_storage
  git config --global filter.huge.clean huge-clean
  git config --global filter.huge.smudge huge-smudge

  # now make a repo, and make sure we mark *.bin files as huge
  mkdir repo && cd repo && git init
  echo '*.bin filter=huge' >.gitattributes
  git add .gitattributes
  git commit -m 'add attributes'

  # let's do a moderate 20M file
  perl -e 'print "foo\n" for (1 .. 5000000)' >foo.bin
  git add foo.bin
  git commit -m 'add huge file (foo)'

  # and then another revision
  perl -e 'print "bar\n" for (1 .. 5000000)' >foo.bin
  git commit -a -m 'revise huge file (bar)'

Notice that we just add and commit as normal. And we can check that the
space usage is what we expect:

  $ du -sh repo/.git
  196K    repo/.git
  $ du -sh /tmp/big_storage
  39M     /tmp/big_storage

Diffs obviously are going to be less interesting, as we just see the
hash:

  $ git log --oneline -p foo.bin
  39e549c revise huge file (bar)
  diff --git a/foo.bin b/foo.bin
  index 281fd03..70874bd 100644
  --- a/foo.bin
  +++ b/foo.bin
  @@ -1 +1 @@
  -50a1ee265f4562721346566701fce1d06f54dd9e
  +bbc2f7f191ad398fe3fcb57d885e1feacb4eae4e

  845836e add huge file (foo)
  diff --git a/foo.bin b/foo.bin
  new file mode 100644
  index 0000000..281fd03
  --- /dev/null
  +++ b/foo.bin
  @@ -0,0 +1 @@
  +50a1ee265f4562721346566701fce1d06f54dd9e

but if you wanted to, you could write a custom diff driver that does
something more meaningful with your particular binary format (it would
have to grab from big_storage, though; there's a rough sketch of that
below).

Checking out other revisions works without extra action:

  $ head -n 1 foo.bin
  bar
  $ git checkout HEAD^
  HEAD is now at 845836e... add huge file (foo)
  $ head -n 1 foo.bin
  foo

And since you have the filter config in your ~/.gitconfig, clones will
just work:

  $ git clone repo other
  $ du -sh other/.git
  204K    other/.git
  $ du -sh other/foo.bin
  20M     other/foo.bin

So conceptually it's pretty similar to yours, but the filter
integration means that git takes care of putting the right files in
place at the right time.
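Wiring up such a diff driver is mostly just one more attribute plus a
textconv script. Here is a rough, untested sketch; the "huge" driver
name and the huge-textconv script are invented for illustration, and
it assumes the /tmp/big_storage layout from above (for committed
revisions, git feeds the script the hash-only blob, so the script has
to fetch the real content itself):

  cat >$HOME/local/bin/huge-textconv <<'EOF'
  #!/bin/sh
  # git hands us the name of a file containing the stored sha1;
  # fetch the real blob and describe it instead of showing a bare hash
  read sha1 <"$1"
  blob=/tmp/big_storage/$sha1
  echo "$sha1 ($(file -b "$blob"), $(du -h "$blob" | cut -f1))"
  EOF
  chmod +x $HOME/local/bin/huge-textconv

  git config --global diff.huge.textconv huge-textconv
  echo '*.bin filter=huge diff=huge' >.gitattributes

With that, "git log -p foo.bin" would show the type and size of each
version instead of just the hash, at the cost of a trip to big_storage
for each version shown.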
It would probably benefit a lot from caching the large binary files
instead of hitting big_storage all the time. And the putting/getting
from storage should probably be factored out so you can plug in
different storage. And it should all be configurable: different users
of the same repo might want different caching policies, or want to
access the binary assets by different mechanisms or URLs. (There's a
rough sketch of what that could look like at the end of this message.)

> I imagine these features (among others):
>
> 1. In my current setup, each large binary file has a different name (a
> revision number). This could be easily solved, however, by generating
> unique names under the hood and tracking this within git.

In the scheme above, we just index the files by their hash. So you can
easily fsck your big_storage by making sure everything matches its
hash (but you can't know that you have _all_ of the blobs needed
unless you cross-reference with the history).

> 2. A lot of the steps in my current setup are manual. When I want to
> add a new binary file, I need to manually create the hash and manually
> upload the binary to the joint server. If done within git, this would
> be automatic.

I think the scheme above takes care of the manual bits.

> 3. In my setup, all of the binary files are in a single "binrepo"
> directory. If done from within git, we would need a non-kludgey way
> to allow large binaries to exist anywhere within the git tree. If git

Any scheme, whether it uses clean/smudge filters or not, should
probably tie in via gitattributes.

> 4. User option to download all versions of all binaries, or only the
> version necessary for the position on the current branch. If you want
> to be able to run all versions of the repository when offline, you can
> download all versions of all binaries. If you don't need to do this,
> you can just download the versions you need. Or perhaps have the
> option to download all binaries smaller than X-bytes, but skip the big
> ones.

The scheme above will download on an as-needed basis. If caching were
implemented, you could just make the cache infinitely big and do a
"git log -p" which would download everything. :) Probably you would
also want the smudge filter to return "blob not available" when
operating in some kind of offline mode.

> 5. Command to purge all binaries in your "binrepo" that are not needed
> for the current revision (if you're running out of disk space
> locally).

In my scheme, just rm your cache directory (once it exists).

> 6. Automatically upload new versions of files to the "binrepo" (rather
> than needing to do this manually)

Handled by the clean filter above.

So obviously this is not very complete. And there are a few changes to
git that could make it more efficient (e.g., letting the clean filter
touch the file directly instead of having to make a copy via stdin).
But the general idea is there, and it just needs somebody to make a
nice polished script that is configurable, does caching, etc. I'll get
to it eventually, but if you'd like to work on it, be my guest.
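To make the caching/config bits concrete, an improved smudge filter
might look something like this rough, untested sketch. The config keys
(hugefile.url, hugefile.cachedir, hugefile.offline) are invented for
illustration, and scp stands in for whatever transport you plug in:

  cat >$HOME/local/bin/huge-smudge <<'EOF'
  #!/bin/sh
  # hypothetical config: where to fetch blobs from, and where to
  # cache them locally
  url=$(git config hugefile.url)
  cache=$(git config hugefile.cachedir)
  cache=${cache:-"$(git rev-parse --git-dir)/huge-cache"}
  mkdir -p "$cache"

  # the stored blob contains only the sha1 of the real content
  read sha1

  if ! test -f "$cache/$sha1"; then
      # offline mode: leave a placeholder rather than hit the network
      if test "$(git config --bool hugefile.offline)" = "true"; then
          echo "blob $sha1 not available (offline)"
          exit 0
      fi
      scp "$url/$sha1" "$cache/$sha1.tmp" &&
      mv "$cache/$sha1.tmp" "$cache/$sha1" || exit 1
  fi

  cat "$cache/$sha1"
  EOF

Point it at your storage with something like "git config hugefile.url
host:/path/to/big_storage". Purging (your point 5) is then just rm -rf
on the cache directory. And since everything in big_storage is named
after its own sha1, the fsck from point 1 is a short loop:

  for f in /tmp/big_storage/*; do
      test "$(sha1sum "$f" | cut -d' ' -f1)" = "$(basename "$f")" ||
          echo "corrupt: $f"
  done

-Peff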