Re: Fwd: Git and Large Binaries: A Proposed Solution

This is a good point.

The best solution, it seems, has two parts:

1. Clean up the way git considers, diffs, and stores binaries, to cut
down on the overhead of dealing with these files.
  1.1 Perhaps a "binaries" directory, or structure of directories, within .git.
  1.2 Perhaps configurable options for when and how to attempt a binary
diff, letting the user decide whether storage or speed is more important.
2. Once (1) is accomplished, add a "git clone" option that copies
binaries only for the tip, not for the rest of history.
  2.1 The default behavior would be to copy everything, as users
currently expect.
  2.2 Core code would have hooks allowing a script to use a central
location for the binary storage (ssh, http, gmail-fs, whatever).

(of course, the implementation of (1) should be friendly to the addition of (2))
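
To make (1.2) and (2.2) concrete, here is a rough sketch of how the
configuration might look. Every key name below is invented for
illustration; none of them exists in git today:

    [binary]
        # (1.2) only attempt a binary delta below this size; above it,
        # store whole objects and fall back to hash-only diffs
        diffThreshold = 32m
        # (1.2) "storage" tries harder to delta, "speed" gives up early
        diffStrategy = speed
    [clone]
        # (2.1) "all" keeps today's copy-everything default;
        # "tip" fetches binaries only for the checked-out commit
        binaries = all
        # (2.2) script invoked to fetch a missing binary by hash
        binaryFetch = /usr/local/bin/fetch-binary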

Obviously, the major drawback to (2) without (2.2) is that when work is
truly distributed, some clone-of-a-clone may not know where to get the
binaries.

But if we print a warning when the non-default behavior from (2) is
turned on, then it's a user problem :-)
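
For (2.2), the hook itself could be tiny. Here is a sketch, assuming git
passes the object's hash and a destination path (a calling convention
invented for this example):

    #!/bin/sh
    # fetch-binary: hypothetical (2.2) hook; imagined calling convention:
    #   fetch-binary <sha1> <destination-path>
    sha1="$1"
    dest="$2"
    # pull the object from a central store over ssh
    # (could just as well be http, s3, gmail-fs, whatever)
    scp "binstore.example.com:/srv/binaries/$sha1" "$dest"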

Eric

On Thu, Mar 10, 2011 at 5:24 PM, Jeff King <peff@xxxxxxxx> wrote:
>
> On Thu, Mar 10, 2011 at 10:02:53PM +0100, Alexander Miseler wrote:
>
> > I've been debating whether to resurrect this thread, but since it has
> > been referenced by the SoC2011Ideas wiki article I will just go ahead.
> > I've spent a few hours trying to make this work, in order to make
> > git usable with big files under Windows.
> >
> > > Just a quick aside.  Since (a2b665d, 2011-01-05) you can provide
> > > the filename as an argument to the filter script:
> > >
> > >     git config --global filter.huge.clean "huge-clean %f"
> > >
> > > then use it in place:
> > >
> > >     $ cat >huge-clean
> > >     #!/bin/sh
> > >     f="$1"
> > >     echo "orig file is $f" >&2
> > >     # hash the worktree file and stash its content in external storage
> > >     sha1=$(sha1sum "$f" | cut -d' ' -f1)
> > >     mkdir -p /tmp/big_storage
> > >     cp "$f" "/tmp/big_storage/$sha1"
> > >     rm -f "$f"
> > >     # the hash alone becomes the blob content that git stores
> > >     echo "$sha1"
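> > >
> > > For completeness, a matching smudge filter would restore the content
> > > on checkout. A sketch, assuming the same /tmp/big_storage layout and
> > > a made-up filter name:
> > >
> > >     git config --global filter.huge.smudge huge-smudge
> > >
> > >     $ cat >huge-smudge
> > >     #!/bin/sh
> > >     # stdin is the stored blob, i.e. the hash that huge-clean wrote
> > >     read sha1
> > >     # stream the real content from external storage back to stdout
> > >     cat "/tmp/big_storage/$sha1"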
> > >
> > >             -- Pete
>
> After thinking about this strategy more (the "convert big binary files
> into a hash via clean/smudge filter" strategy), it feels like a hack.
> That is, I don't see any reason that git can't give you the equivalent
> behavior without having to resort to bolted-on scripts.
>
> For example, with this strategy you are giving up meaningful diffs in
> favor of just showing a diff of the hashes. But git can _already_ do
> this for binary diffs.  The problem is that git unnecessarily uses a
> bunch of memory to come up with that answer because of assumptions in
> the diff code. So we should be fixing those assumptions. Anywhere this
> smudge/clean filter solution can avoid looking at the blobs, we should
> be able to do the same inside git.
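>
> To illustrate (file name and abbreviated hashes made up), the stock
> binary diff already degrades to a one-line summary, which is all that
> diffing the filter's hashes would buy you:
>
>     $ git diff HEAD^ -- big.bin
>     diff --git a/big.bin b/big.bin
>     index 89ab012..cdef345 100644
>     Binary files a/big.bin and b/big.bin differ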
>
> Of course that leaves the storage question; Scott's git-media script has
> pluggable storage that is backed by http, s3, or whatever. But again,
> that is a feature that might be worth putting into git (even if it is
> just a pluggable script at the object-db level).
>
> -Peff