I did a search for this issue before posting, but if I am touching on an old topic that already has a solution in progress, I apologize. As far as I know, this is still an open issue: the last discussion I saw in the KernelTrap archives was a "git and binary files" thread from Jan 2008, and there are a couple of promising related works (git-annex and git-bigfiles), but nothing that solves the complete problem. I'm interested in hearing your thoughts and suggestions, and in whether there is community interest in adding this feature to git. I would be happy to be involved in making the changes, but I have very limited time, so I would prefer help, and I would like to know that the feature has a strong chance of joining the mainline before starting...

To whet your appetite to read all of the below (I know it's long), this is the root of the solution:

--- Don't track binaries in git. Track their hashes. ---

Problem Background:

I work on embedded system software, with code for these products delivered from multiple customers and in multiple formats, for example:

1. Source code -- works great, and is what git is designed for.

2. Zipped tarballs of source code (that I will never need to modify) -- I could unpack these and then use git to track the source code. However, I prefer to track the tarballs themselves, because it makes my customers happier to see the exact tarball that they delivered being used when I repackage updates. (Let's not discuss problems with this model -- I understand that it is non-ideal.)

3. Large and/or many binaries (pictures, short videos, pre-compiled binaries, etc.).

The problem, of course, is that git is not ideal for handling large, or many, binaries. It's better at just about everything else, but not this, for largely these two reasons:

1. git cannot effectively delta-compress the binaries against their previous iterations to save space (and neither can any other tool).

2. git requires that every clone of the repository download all versions of all binaries -- and if the binaries are large, or many, and compress poorly together (as stated in 1), this is a very expensive operation.

Problem Statement:

We (git users) want and "need" to be able to track large binaries from within our repositories. But putting them into git slows git down unnecessarily. The only current alternative is to *not* check the large binaries into git -- but then they are no longer tracked, which is unacceptable. If I want to jump back in git to a point in the tree from 6 months ago, I have no way to tell which version of the large binaries I need. I could keep track of this manually, of course, but that's what git is for...

Solution: The short version:

***Don't track binaries in git. Track their hashes.***

Solution: The long version:

For my current project, I have this (the "store the hashes" idea) implemented outside of git. I am posting to this list because I would like to see this functionality (well, something even better) become native to git, and I believe it would remove one of the few remaining arguments that some projects have against adopting git.

Here is how I have it implemented. First the layout:

  my_git_project/binrepo/
  -- binaries/
  -- hashes/
  -- symlink_to_hashes_file
  -- symlink_to_another_hashes_file

Within the "binrepo" (binary repository) there is a subdirectory for binaries and a subdirectory for hashes. In the root of the binrepo, each stored file has a symlink pointing to the current version of its hash file.
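To make that concrete, here is a hypothetical snapshot of a binrepo tracking one binary (u-boot.bin) at its second revision. All names here are invented for illustration, and the revision-numbering convention is just mine, not anything git-specific:

  my_git_project/binrepo/
  -- binaries/                 # .gitignore'd; the actual large files
     -- u-boot-r1.bin
     -- u-boot-r2.bin
  -- hashes/                   # tracked by git; one .md5 file per revision
     -- u-boot-r1.bin.md5
     -- u-boot-r2.bin.md5
  -- u-boot.bin.md5 -> hashes/u-boot-r2.bin.md5   # symlink to the current hash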
The "binaries" directory is .gitignore'd -- the "hashes" directory and the symlinks to the current hashes are maintained by git.

Whenever I receive a new version of a large binary file from a customer, I put it into "binaries", create a new hash for that file in "hashes", and update the symlink to point to that hash. I 'git commit' and 'git push' those changes (this is fast, since no large binary enters the git repository). The other important step is that I must put the large binary file somewhere accessible for others to download. In this example, that is:

  my_git_server.net:/binrepo/

Then I have a bash script that each user runs to sync up (cleaned up a little here to save space):

  #!/bin/bash
  # Run from within binrepo/. For every hash that git tracks, make sure
  # the matching binary exists locally and that its md5 matches.
  for HASHFILE in hashes/*.md5 ; do
      BINFILE=$(basename "$HASHFILE" .md5)

      # check if the binary exists; if not, fetch it from the shared server
      if [[ -e binaries/$BINFILE ]] ; then
          echo " $BINFILE available"
      else
          echo " $BINFILE not available. Downloading..."
          wget -O "binaries/$BINFILE" "http://my_git_server.net/binrepo/$BINFILE"
      fi

      # check md5sum against the hash stored in git
      WANT=$(cut -d' ' -f1 < "$HASHFILE")
      HAVE=$(md5sum "binaries/$BINFILE" | cut -d' ' -f1)
      if [[ "$WANT" != "$HAVE" ]] ; then
          echo "ERROR! $BINFILE md5 does not match!"
          exit 1   # and/or delete the local copy and re-download
      fi
  done

This confirms that I have the right version of all of the binaries -- my git repository is effectively tracking the large binaries, but without actually storing them inside the git repo. If someone else updates the binrepo, I will see it when I do a 'git pull', and I will automatically get the right version of the binary file so that my sandbox is up to date.

Now let's say I want to revert the large binary file to its previous version -- all I need to do is edit the symlink in "binrepo", commit, and push. Other users will automatically use the old version of the file after they pull (and without needing to re-download that file).

Summary of Big Advantages:

1. The repository is unpolluted by large binary files. 'git clone' stays fast.

2. Users have access to any version of any binary file, but do not need to store every version locally if they do not want to.

3. git does not need to worry about the big binaries -- there are no slow attempts to compute binary deltas or to pack and unpack them under the hood.

Improvements: I imagine these features (among others):

1. In my current setup, each large binary file must have a distinct name (I embed a revision number). This could easily be solved by generating unique names under the hood and tracking them within git.

2. A lot of the steps in my current setup are manual. When I want to add a new binary file, I need to manually create the hash and manually upload the binary to the shared server (a rough sketch of scripting these steps appears after this list). If done within git, this would be automatic.

3. In my setup, all of the binary files live in a single "binrepo" directory. If done from within git, we would need a non-kludgey way to allow large binaries to exist anywhere within the git tree. If git handled the "binrepo" under the hood, though, the user would never need to know about it -- git would simply route all binaries through the internal "binrepo". Instead of tracking symlinks, git would track the file versions in the normal way -- it just wouldn't store the binaries the same way (it would store the hash instead).

4. A user option to download all versions of all binaries, or only the version needed for the current position on the current branch. If you want to be able to run every version of the repository while offline, you can download all versions of all binaries; if you don't need that, you can download only the versions you need. Or perhaps an option to download all binaries smaller than X bytes, but skip the big ones.

5. A command to purge all binaries in your "binrepo" that are not needed for the current revision (if you're running out of disk space locally).

6. Automatic upload of new versions of files to the "binrepo" (rather than needing to do this manually).
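For what it's worth, the manual steps in (2) and (6) are easy enough to script today. Here is a minimal sketch of the helper I have in mind -- the file-naming convention, the "my_git_server.net" path, and the use of scp are assumptions from my own setup, not anything git provides:

  #!/bin/bash
  # add_binary.sh <file> <revision>  (hypothetical helper; run from binrepo/)
  # Records a new revision of a large binary: copies it into binaries/,
  # stores its md5 under git's control, repoints the current-version
  # symlink, and uploads the binary to the shared server.
  set -e

  FILE=$1                                 # e.g. u-boot.bin
  REV=$2                                  # e.g. r3
  BINFILE="${FILE%.*}-$REV.${FILE##*.}"   # e.g. u-boot-r3.bin (assumes an extension)

  cp "$FILE" "binaries/$BINFILE"
  ( cd binaries && md5sum "$BINFILE" ) > "hashes/$BINFILE.md5"
  ln -sf "hashes/$BINFILE.md5" "$FILE.md5"   # repoint the current-version symlink

  git add "hashes/$BINFILE.md5" "$FILE.md5"
  git commit -m "binrepo: update $FILE to $REV"

  # make the new binary available for everyone else to download
  scp "binaries/$BINFILE" my_git_server.net:/binrepo/

With something like this, reverting is just repointing the symlink at an older hash file (e.g. 'ln -sf hashes/u-boot-r2.bin.md5 u-boot.bin.md5') and committing -- no re-upload needed.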
Rock on!

Eric