Peff,

Thanks for your insight -- this looks great.

Once something like this is available and more polished, what's the
process for requesting that it join the main line of git development?
(I know that functionally there is "no main line" in git... but you
know what I mean.) Has there already been discussion to this effect?
I do think that a fix like this would improve git adoption among
certain groups; I've heard the "big binaries" problem mentioned at
least a few times.

I haven't dug around in the git code yet, so while I can get the gist
of your scripts, I can't yet see the complete picture. Do you happen
to have a git patch, or a public repo somewhere, that I could take a
look at? And is there a git developers' guide hidden away anywhere?
Though I have very limited time, I'd be happy to help out as much as
I can.

Eric

On Fri, Jan 21, 2011 at 5:24 PM, Jeff King <peff@xxxxxxxx> wrote:
> On Fri, Jan 21, 2011 at 01:57:21PM -0500, Eric Montellese wrote:
>
>> I did a search for this issue before posting, but if I am touching
>> on an old topic that already has a solution in progress, I
>> apologize.
>
> It's been talked about a lot, but there is not exactly a solution in
> progress. One promising direction, though, is not very different
> from what you're doing:
>
>> Solution:
>> The short version:
>> ***Don't track binaries in git. Track their hashes.***
>
> Yes, exactly. But what your solution lacks, I think, is integration
> into git itself. Specifically, using clean/smudge filters you can
> have git take care of tracking the file contents automatically.
>
> At the very simplest, it would look like:
>
> -- >8 --
> cat >$HOME/local/bin/huge-clean <<'EOF'
> #!/bin/sh
>
> # In an ideal world, we could access the original file directly
> # instead of having to cat it to a new file.
> temp="$(git rev-parse --git-dir)"/huge.$$
> cat >"$temp"
> sha1=`sha1sum "$temp" | cut -d' ' -f1`
>
> # Now move it to wherever your permanent storage is, e.g.:
> #   scp "$temp" host:/path/to/big_storage/$sha1
> cp "$temp" /tmp/big_storage/$sha1
> rm -f "$temp"
>
> # The hash is all that git actually stores.
> echo $sha1
> EOF
>
> cat >$HOME/local/bin/huge-smudge <<'EOF'
> #!/bin/sh
>
> # The stored blob contains only the sha1; read it from stdin.
> read sha1
>
> # Now retrieve the real content. We could optionally do some
> # caching here, e.g.:
> #   ssh host cat /path/to/big_storage/$sha1
> cat /tmp/big_storage/$sha1
> EOF
>
> chmod +x $HOME/local/bin/huge-clean $HOME/local/bin/huge-smudge
> -- 8< --
>
> Our storage mechanism (throwing things in /tmp) is simplistic, but
> you could just as easily store and retrieve via ssh, http, s3, or
> whatever.
>
> You can try it out like this:
>
> # set up our filter config and fake storage area
> mkdir /tmp/big_storage
> git config --global filter.huge.clean huge-clean
> git config --global filter.huge.smudge huge-smudge
>
> # now make a repo, and make sure we mark *.bin files as huge
> mkdir repo && cd repo && git init
> echo '*.bin filter=huge' >.gitattributes
> git add .gitattributes
> git commit -m 'add attributes'
>
> # let's do a moderate 20M file
> perl -e 'print "foo\n" for (1 .. 5000000)' >foo.bin
> git add foo.bin
> git commit -m 'add huge file (foo)'
>
> # and then another revision
> perl -e 'print "bar\n" for (1 .. 5000000)' >foo.bin
> git commit -a -m 'revise huge file (bar)'
>
> Notice that we just add and commit as normal. And we can check that
> the space usage is what you expect:
>
> $ du -sh repo/.git
> 196K    repo/.git
> $ du -sh /tmp/big_storage
> 39M     /tmp/big_storage
>
> Diffs are obviously going to be less interesting, as we just see the
> hash:
>
> $ git log --oneline -p foo.bin
> 39e549c revise huge file (bar)
> diff --git a/foo.bin b/foo.bin
> index 281fd03..70874bd 100644
> --- a/foo.bin
> +++ b/foo.bin
> @@ -1 +1 @@
> -50a1ee265f4562721346566701fce1d06f54dd9e
> +bbc2f7f191ad398fe3fcb57d885e1feacb4eae4e
> 845836e add huge file (foo)
> diff --git a/foo.bin b/foo.bin
> new file mode 100644
> index 0000000..281fd03
> --- /dev/null
> +++ b/foo.bin
> @@ -0,0 +1 @@
> +50a1ee265f4562721346566701fce1d06f54dd9e
>
> but if you wanted to, you could write a custom diff driver that does
> something more meaningful with your particular binary format (it
> would have to grab from big_storage, though).
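> As a rough, untested sketch, git's textconv support could do that
> grabbing for you (the huge-textconv name is just an invented
> example, and it reuses the /tmp/big_storage layout from above):
>
> cat >$HOME/local/bin/huge-textconv <<'EOF'
> #!/bin/sh
>
> # git hands textconv a filename; with our clean filter in place,
> # that file contains only the sha1, so fetch the real content and
> # print it for diffing.
> read sha1 <"$1"
> cat /tmp/big_storage/$sha1
> EOF
> chmod +x $HOME/local/bin/huge-textconv
>
> git config --global diff.huge.textconv huge-textconv
> echo '*.bin filter=huge diff=huge' >.gitattributes
>
> For a truly binary format you would probably want the script to
> print a short human-readable summary rather than the raw bytes.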
> Checking out other revisions works without extra action:
>
> $ head -n 1 foo.bin
> bar
> $ git checkout HEAD^
> HEAD is now at 845836e... add huge file (foo)
> $ head -n 1 foo.bin
> foo
>
> And since you have the filter config in your ~/.gitconfig, clones
> will just work:
>
> $ git clone repo other
> $ du -sh other/.git
> 204K    other/.git
> $ du -sh other/foo.bin
> 20M     other/foo.bin
>
> So conceptually it's pretty similar to yours, but the filter
> integration means that git takes care of putting the right files in
> place at the right time.
>
> It would probably benefit a lot from caching the large binary files
> instead of hitting big_storage all the time. And the putting/getting
> from storage should probably be factored out so you can plug in
> different storage backends. And it should all be configurable:
> different users of the same repo might want different caching
> policies, or to access the binary assets by different mechanisms or
> URLs.
>
>> I imagine these features (among others):
>>
>> 1. In my current setup, each large binary file has a different name
>> (a revision number). This could be easily solved, however, by
>> generating unique names under the hood and tracking this within
>> git.
>
> In the scheme above, we just index the files by their hash. So you
> can easily fsck your big_storage by making sure everything matches
> its hash (but you can't know that you have _all_ of the blobs needed
> unless you cross-reference with the history).
>
>> 2. A lot of the steps in my current setup are manual. When I want
>> to add a new binary file, I need to manually create the hash and
>> manually upload the binary to the joint server. If done within git,
>> this would be automatic.
>
> I think the scheme above takes care of the manual bits.
>
>> 3. In my setup, all of the binary files are in a single "binrepo"
>> directory. If done from within git, we would need a non-kludgey way
>> to allow large binaries to exist anywhere within the git tree.
>
> Any scheme, whether it uses clean/smudge filters or not, should
> probably tie in via gitattributes.
>
>> 4. User option to download all versions of all binaries, or only
>> the version necessary for the position on the current branch. If
>> you want to be able to run all versions of the repository when
>> offline, you can download all versions of all binaries. If you
>> don't need to do this, you can just download the versions you need.
>> Or perhaps have the option to download all binaries smaller than
>> X-bytes, but skip the big ones.
>
> The scheme above will download on an as-needed basis. If caching
> were implemented, you could just make the cache infinitely big and
> do a "git log -p" which would download everything. :)
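> A minimal version of that caching might look like this (untested,
> and the $HOME/.huge-cache location is an arbitrary choice),
> replacing the huge-smudge above:
>
> cat >$HOME/local/bin/huge-smudge <<'EOF'
> #!/bin/sh
>
> read sha1
> cache=$HOME/.huge-cache
> mkdir -p "$cache"
>
> # Fetch from big_storage only on a cache miss; write to a temp file
> # and rename so an interrupted fetch never leaves a corrupt entry.
> if ! test -f "$cache/$sha1"; then
>         cp /tmp/big_storage/$sha1 "$cache/$sha1.tmp" &&
>         mv "$cache/$sha1.tmp" "$cache/$sha1"
> fi
>
> cat "$cache/$sha1"
> EOF
>
> Since entries are indexed by hash, expiring the cache is always
> safe; the worst case is re-fetching a blob from big_storage.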
> Probably you would also want the smudge filter to return "blob not
> available" when operating in some kind of offline mode.
>
>> 5. Command to purge all binaries in your "binrepo" that are not
>> needed for the current revision (if you're running out of disk
>> space locally).
>
> In my scheme, just rm your cache directory (once it exists).
>
>> 6. Automatically upload new versions of files to the "binrepo"
>> (rather than needing to do this manually)
>
> Handled by the clean filter above.
>
> So obviously this is not very complete. And there are a few changes
> to git that could make it more efficient (e.g., letting the clean
> filter touch the file directly instead of having to make a copy via
> stdin). But the general idea is there, and it just needs somebody to
> make a nice polished script that is configurable, does caching, etc.
> I'll get to it eventually, but if you'd like to work on it, be my
> guest.
>
> -Peff