Fwd: Git and Large Binaries: A Proposed Solution

I did a search for this issue before posting, so I apologize if I am
touching on an old topic that already has a solution in progress.  As
far as I know, this is still an open issue: the last discussion I saw
in the KernelTrap archives was a "git and binary files" thread from
Jan 2008, and there are a couple of promising related projects
(git-annex and git-bigfiles) -- but nothing that solves the complete
problem.

I'm interested in hearing your thoughts and suggestions, and in
whether there is community interest in adding this feature to git.
I would be happy to be involved in making the changes, but I have
very limited time, so I would prefer help -- and I would like to
know that it has a strong chance of joining the main line before
starting...
To whet your appetite to read all of the below (I know it's long),
this is the root of the solution:

---       Don't track binaries in git.  Track their hashes.       ---


Problem Background:
I work on embedded system software, with code for these products
delivered from multiple customers and in multiple formats, for
example:

1. source code -- works great and is what git is designed for
2. zipped tarballs of source code (that I will never need to modify)
-- I could unpack these and then use git to track the source code.
However, I prefer to track these deliverables as the tarballs
themselves, because my customer is happier to see the exact tarball
that they delivered being used when I repackage updates.  (Let's not
discuss the problems with this model -- I understand that it is
non-ideal.)
3. large and/or many binaries (pictures, short videos, pre-compiled
binaries, etc.)

The problem, of course, is that git is not ideal for handling large,
or many, binaries.  It is better at just about everything else, but
not at this, for largely these two reasons:

1. git cannot diff the binaries against their previous iterations
effectively to save space (and neither can any other tool).
2. every clone of the repository must therefore download all versions
of all binaries -- and if the binaries are large, or many, and
compress poorly together (as stated in 1), that is a very expensive
operation.


Problem Statement:
We (the git users) want and "need" to be able to track large
binaries from within our repositories, but putting them into git
slows git down unnecessarily.
The only current alternative is to *not* check the large binaries
into git -- but then they are no longer tracked, which is
unacceptable.  If I want to jump back to a point in the tree from 6
months ago, I have no way to tell which versions of the large
binaries I need.  I could keep track of this manually, of course,
but that is what git is for...


Solution:
The short version:
***Don't track binaries in git.  Track their hashes.***

Solution:
The long version:
For my current project, I have this (the "store the hashes" idea)
implemented outside of git.  I am posting to this list because I would
like to see this functionality (well, something even better) become
native to git, and believe that it would remove one of the few
remaining arguments that some projects have against adopting git.
Here is how I have it implemented:

First the layout:
my_git_project/binrepo/
-- binaries/
-- hashes/
-- symlink_to_hashes_file
-- symlink_to_another_hashes_file
Within the "binrepo" (binary repository) there is a subdirectory for
binaries and a subdirectory for hashes.  In the root of the binrepo,
each stored file has a symlink pointing to the current version of
its hash file.
The "binaries" directory is .gitignore'd -- the hashes directory and
the symlinks to the current hashes are maintained by git.
Whenever I receive a new version of a large binary file from a
customer, I put it into "binaries", create a new hash for that file
in "hashes", and update the symlink to point to that hash.  I then
'git commit' and 'git push' those changes (this is fast, since no
large binary is in the git repository).
The other important step is that I must put the large binary file
somewhere accessible for others to download.  In this example, that
is:  my_git_server.net:/binrepo/
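
Concretely, the manual steps look something like this (a sketch
only: the file name rootfs-v2.bin and the delivery path are made up
for illustration, and the server is the example host above):

# sketch: manually adding a new version of a large binary
cd my_git_project/binrepo
cp ~/deliveries/rootfs-v2.bin binaries/
( cd binaries && md5sum rootfs-v2.bin ) > hashes/rootfs-v2.bin.md5
ln -sf hashes/rootfs-v2.bin.md5 rootfs.md5    # repoint "current" symlink
git add hashes/rootfs-v2.bin.md5 rootfs.md5   # binaries/ is .gitignore'd
git commit -m "update rootfs binary to v2" && git push
scp binaries/rootfs-v2.bin my_git_server.net:/binrepo/  # publish it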

Then I have a bash script to fetch and verify the binaries
(condensed here to save space):

#!/bin/bash
# run from my_git_project/binrepo/
cd binaries || exit 1
for HASHFILE in ../hashes/*.md5 ; do
  BINFILE=$(basename "$HASHFILE" .md5)
  # fetch the binary if it is not already present locally
  if [[ -e $BINFILE ]] ; then
    echo "  $BINFILE available"
  else
    echo "  $BINFILE not available. Downloading..."
    wget "http://my_git_server.net/binrepo/$BINFILE"
  fi
  # verify the binary against the md5 that git tracks
  md5sum "$BINFILE" > temp.md5
  if ! diff -q "$HASHFILE" temp.md5 >/dev/null ; then
    echo "ERROR! $BINFILE md5 does not match!"
    exit 1    # or delete the local copy and re-download it
  fi
done
rm -f temp.md5

This confirms that I have the right version of all of the binaries --
my git repository is effectively tracking the large binaries, but
without actually storing them inside the git repo.  If someone else
updates the "binrepo", I will know it when I do a "git pull", and I
will automatically get the right version of the binary file so that
my sandbox is up-to-date.  Now let's say I want to revert the large
binary file to the previous version -- all I need to do is edit the
symlink in "binrepo", commit, and push.  Other users will
automatically use the old version of the file after their next pull
(and without needing to re-download that file).
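
Reverting is just repointing the symlink (again using the
illustrative rootfs names from the sketch above):

cd my_git_project/binrepo
ln -sf hashes/rootfs-v1.bin.md5 rootfs.md5   # point back at the old hash
git commit -am "revert rootfs binary to v1" && git push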


Summary of Big Advantages:

1. Repository is unpolluted by large binary files.  git clone stays fast.
2. User has access to any version of any binary file, but does not
need to store every version locally if they do not want to.
3. Git does not need to worry about the big binaries - there are no
slow attempts to calculate binary deltas or pack and unpack under the
hood.


Improvements:

I imagine these features (among others):

1. In my current setup, each large binary file must have a different
name (I embed a revision number).  This could easily be solved,
however, by generating unique names under the hood and tracking them
within git (see the sketch after this list).
2. A lot of the steps in my current setup are manual.  When I want to
add a new binary file, I need to manually create the hash and manually
upload the binary to the joint server.  If done within git, this would
be automatic.
3. In my setup, all of the binary files live in a single "binrepo"
directory.  If done from within git, we would need a non-kludgey way
to allow large binaries to exist anywhere within the git tree.  If
git handles the "binrepo" under the hood, though, the user would
never need to know about it -- git would just handle all binaries by
consulting the internal "binrepo".  Instead of tracking symlinks, git
would track the file versions in the normal way -- it just would not
store the binaries themselves (it would store their hashes instead).
4. User option to download all versions of all binaries, or only the
version necessary for the position on the current branch.  If you want
to be able to run all versions of the repository when offline, you can
download all versions of all binaries.  If you don't need to do this,
you can just download the versions you need.  Or perhaps have the
option to download all binaries smaller than X-bytes, but skip the big
ones.
5. Command to purge all binaries in your "binrepo" that are not needed
for the current revision (if you're running out of disk space
locally).
6. Automatically upload new versions of files to the "binrepo"
(rather than needing to do this manually; see the sketch below).
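
To make improvements 1 and 6 concrete, here is a rough sketch of a
helper that names each binary by its own checksum and uploads it
automatically.  (The helper name "binrepo-add" is hypothetical, and
my_git_server.net is just the example host from above; run it from
the binrepo directory.)

# hypothetical helper: store a binary under a content-derived unique
# name, record its hash for git, and publish it to the shared server
binrepo-add () {
  local file=$1 name sum
  name=$(basename "$file")
  sum=$(md5sum "$file" | cut -d' ' -f1)
  cp "$file" "binaries/$sum"                # unique name: the file's own md5
  echo "$sum  $name" > "hashes/$name.$sum.md5"  # tiny text file, tracked by git
  ln -sf "hashes/$name.$sum.md5" "$name.md5"    # repoint the "current" symlink
  scp "binaries/$sum" my_git_server.net:/binrepo/  # automatic upload
}

Adding a new delivery would then be a single step, e.g.
"binrepo-add ~/deliveries/rootfs-v2.bin", followed by the usual
commit and push.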


Rock on!
Eric