On Fri, Jan 21, 2011 at 01:57:21PM -0500, Eric Montellese wrote:

> I did a search for this issue before posting, but if I am touching on
> an old topic that already has a solution in progress, I apologize. As

It's been talked about a lot, but there is not exactly a solution in
progress. One promising direction is not very different from what
you're doing, though:

> Solution:
> The short version:
> ***Don't track binaries in git. Track their hashes.***

Yes, exactly. But what your solution lacks, I think, is more
integration into git. Specifically, using clean/smudge filters you can
have git take care of tracking the file contents automatically. At the
very simplest, it would look like:

-- >8 --
cat >$HOME/local/bin/huge-clean <<'EOF'
#!/bin/sh
# In an ideal world, we could actually access the original file
# directly instead of having to cat it to a new file.
temp="$(git rev-parse --git-dir)"/huge.$$
cat >"$temp"

sha1=`sha1sum "$temp" | cut -d' ' -f1`

# now move it to wherever your permanent storage is; e.g.:
#   scp "$temp" host:/path/to/big_storage/$sha1
cp "$temp" /tmp/big_storage/$sha1
rm -f "$temp"

echo $sha1
EOF

cat >$HOME/local/bin/huge-smudge <<'EOF'
#!/bin/sh
# the stored blob contains only the sha1 of the real content
read sha1

# now retrieve the blob; we could optionally do some caching here, e.g.:
#   ssh host cat /path/to/big_storage/$sha1
cat /tmp/big_storage/$sha1
EOF
-- 8< --

Obviously our storage mechanism (throwing things in /tmp) is
simplistic, but you could store and retrieve via ssh, http, s3, or
whatever. You can try it out like this:

  # set up our filter config and fake storage area
  mkdir /tmp/big_storage
  git config --global filter.huge.clean huge-clean
  git config --global filter.huge.smudge huge-smudge

  # now make a repo, and make sure we mark *.bin files as huge
  mkdir repo && cd repo && git init
  echo '*.bin filter=huge' >.gitattributes
  git add .gitattributes
  git commit -m 'add attributes'

  # let's do a moderate 20M file
  perl -e 'print "foo\n" for (1 .. 5000000)' >foo.bin
  git add foo.bin
  git commit -m 'add huge file (foo)'

  # and then another revision
  perl -e 'print "bar\n" for (1 .. 5000000)' >foo.bin
  git commit -a -m 'revise huge file (bar)'

Notice that we just add and commit as normal. And we can check that the
space usage is what we expect:

  $ du -sh repo/.git
  196K    repo/.git
  $ du -sh /tmp/big_storage
  39M     /tmp/big_storage

Diffs obviously are going to be less interesting, as we just see the
hash:

  $ git log --oneline -p foo.bin
  39e549c revise huge file (bar)
  diff --git a/foo.bin b/foo.bin
  index 281fd03..70874bd 100644
  --- a/foo.bin
  +++ b/foo.bin
  @@ -1 +1 @@
  -50a1ee265f4562721346566701fce1d06f54dd9e
  +bbc2f7f191ad398fe3fcb57d885e1feacb4eae4e

  845836e add huge file (foo)
  diff --git a/foo.bin b/foo.bin
  new file mode 100644
  index 0000000..281fd03
  --- /dev/null
  +++ b/foo.bin
  @@ -0,0 +1 @@
  +50a1ee265f4562721346566701fce1d06f54dd9e

but if you wanted to, you could write a custom diff driver that does
something more meaningful with your particular binary format (it would
have to grab from big_storage, though; there's a rough sketch of that
below).

Checking out other revisions works without extra action:

  $ head -n 1 foo.bin
  bar
  $ git checkout HEAD^
  HEAD is now at 845836e... add huge file (foo)
  $ head -n 1 foo.bin
  foo

And since you have the filter config in your ~/.gitconfig, clones will
just work:

  $ git clone repo other
  $ du -sh other/.git
  204K    other/.git
  $ du -sh other/foo.bin
  20M     other/foo.bin

So conceptually it's pretty similar to yours, but the filter
integration means that git takes care of putting the right files in
place at the right time.
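Wiring up such a diff driver is mostly just one more attribute plus a
textconv script. Here is a rough, untested sketch; the "huge" driver
name and the huge-textconv script are invented for illustration, and
it assumes the /tmp/big_storage layout from above (for committed
revisions, git feeds the script the hash-only blob, so the script has
to fetch the real content itself):

  cat >$HOME/local/bin/huge-textconv <<'EOF'
  #!/bin/sh
  # git hands us the name of a file containing the stored sha1;
  # fetch the real blob and describe it instead of showing a bare hash
  read sha1 <"$1"
  blob=/tmp/big_storage/$sha1
  echo "$sha1 ($(file -b "$blob"), $(du -h "$blob" | cut -f1))"
  EOF
  chmod +x $HOME/local/bin/huge-textconv

  git config --global diff.huge.textconv huge-textconv
  echo '*.bin filter=huge diff=huge' >.gitattributes

With that, "git log -p foo.bin" would show the type and size of each
version instead of just the hash, at the cost of a trip to big_storage
for each version shown.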
It would probably benefit a lot from caching the large binary files
instead of hitting big_storage all the time. And the putting/getting
from storage should probably be factored out so you can plug in
different storage. And it should all be configurable: different users
of the same repo might want different caching policies, or want to
access the binary assets by different mechanisms or URLs. (There's a
rough sketch of what that could look like at the end of this message.)

> I imagine these features (among others):
>
> 1. In my current setup, each large binary file has a different name (a
> revision number). This could be easily solved, however, by generating
> unique names under the hood and tracking this within git.

In the scheme above, we just index the files by their hash. So you can
easily fsck your big_storage by making sure everything matches its
hash (but you can't know that you have _all_ of the blobs needed
unless you cross-reference with the history).

> 2. A lot of the steps in my current setup are manual. When I want to
> add a new binary file, I need to manually create the hash and manually
> upload the binary to the joint server. If done within git, this would
> be automatic.

I think the scheme above takes care of the manual bits.

> 3. In my setup, all of the binary files are in a single "binrepo"
> directory. If done from within git, we would need a non-kludgey way
> to allow large binaries to exist anywhere within the git tree. If git

Any scheme, whether it uses clean/smudge filters or not, should
probably tie in via gitattributes.

> 4. User option to download all versions of all binaries, or only the
> version necessary for the position on the current branch. If you want
> to be able to run all versions of the repository when offline, you can
> download all versions of all binaries. If you don't need to do this,
> you can just download the versions you need. Or perhaps have the
> option to download all binaries smaller than X-bytes, but skip the big
> ones.

The scheme above will download on an as-needed basis. If caching were
implemented, you could just make the cache infinitely big and do a
"git log -p" which would download everything. :) Probably you would
also want the smudge filter to return "blob not available" when
operating in some kind of offline mode.

> 5. Command to purge all binaries in your "binrepo" that are not needed
> for the current revision (if you're running out of disk space
> locally).

In my scheme, just rm your cache directory (once it exists).

> 6. Automatically upload new versions of files to the "binrepo" (rather
> than needing to do this manually)

Handled by the clean filter above.

So obviously this is not very complete. And there are a few changes to
git that could make it more efficient (e.g., letting the clean filter
touch the file directly instead of having to make a copy via stdin).
But the general idea is there, and it just needs somebody to make a
nice polished script that is configurable, does caching, etc. I'll get
to it eventually, but if you'd like to work on it, be my guest.
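To make the caching/config bits concrete, an improved smudge filter
might look something like this rough, untested sketch. The config keys
(hugefile.url, hugefile.cachedir, hugefile.offline) are invented for
illustration, and scp stands in for whatever transport you plug in:

  cat >$HOME/local/bin/huge-smudge <<'EOF'
  #!/bin/sh
  # hypothetical config: where to fetch blobs from, and where to
  # cache them locally
  url=$(git config hugefile.url)
  cache=$(git config hugefile.cachedir)
  cache=${cache:-"$(git rev-parse --git-dir)/huge-cache"}
  mkdir -p "$cache"

  # the stored blob contains only the sha1 of the real content
  read sha1

  if ! test -f "$cache/$sha1"; then
      # offline mode: leave a placeholder rather than hit the network
      if test "$(git config --bool hugefile.offline)" = "true"; then
          echo "blob $sha1 not available (offline)"
          exit 0
      fi
      scp "$url/$sha1" "$cache/$sha1.tmp" &&
      mv "$cache/$sha1.tmp" "$cache/$sha1" || exit 1
  fi

  cat "$cache/$sha1"
  EOF

Point it at your storage with something like "git config hugefile.url
host:/path/to/big_storage". Purging (your point 5) is then just rm -rf
on the cache directory. And since everything in big_storage is named
after its own sha1, the fsck from point 1 is a short loop:

  for f in /tmp/big_storage/*; do
      test "$(sha1sum "$f" | cut -d' ' -f1)" = "$(basename "$f")" ||
          echo "corrupt: $f"
  done

-Peff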