Re: Fwd: Git and Large Binaries: A Proposed Solution

On Sat, Mar 12, 2011 at 08:53:53PM -0500, Eric Montellese wrote:

> The best solution, it seems, has two parts:
> 
> 1. Clean up the way in which git considers, diffs, and stores binaries
> to cut down on the overhead of dealing with these files.

This is the easier half, I think.

>   1.1 Perhaps a "binaries" directory, or structure of directories, within .git

I'd rather not do something so drastic. We already have ways of marking
files as binary and un-diffable within the tree, so you can already do
pretty well by marking them with gitattributes. I think we can do
better by making the binaryness auto-detection less expensive
(right now we pull in the whole blob to check the first 1K or so for
NULs or other patterns; this is fine in the common text case, where
we'll want the whole blob in a minute anyway, but for large files it's
obviously wasteful). There may also be code-paths for binary files where
we accidentally load them (I just fixed one last week where we
unnecessarily loaded them in the diffstat code path). Somebody will need
to do some experimenting to shake out those code paths.
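
To make the cheaper auto-detection concrete, here is a rough sketch in
Python (not git's actual code). It reads only a fixed-size prefix of the
stream and looks for a NUL byte, so a multi-gigabyte blob never has to be
loaded in full. The name looks_binary and the 1024-byte default are
assumptions for illustration (the default follows the "first 1K or so"
above), and the check covers only NULs, not the "other patterns".

```python
import io

# Assumption: how much of the file to inspect ("first 1K or so").
SNIFF_BYTES = 1024


def looks_binary(stream, sniff_bytes=SNIFF_BYTES):
    """Guess binaryness from a prefix of a binary stream.

    Only the first sniff_bytes bytes are read, so the cost does not
    depend on the total size of the blob. Hypothetical helper, not
    git's implementation.
    """
    prefix = stream.read(sniff_bytes)
    # A NUL byte in the prefix is treated as the sign of binary data.
    return b"\0" in prefix
```

With a file on disk, something like looks_binary(open(path, "rb")) would
touch at most sniff_bytes bytes of it, however large the file is.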

For packing, we have core.bigFileThreshold to turn off delta compression
for large files, but according to the documentation, it is only honored
for fast-import. I think we would want something similar to say "for
some subset of files (indicated either by name or by minimum size),
don't bother with zlib-compression either, and always keep them loose".
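
As a sketch of what such a rule might look like, the Python below picks
a storage treatment from the path and size of a file. Everything in it is
hypothetical: the StoragePolicy fields, the example patterns, and the
512MB default are illustration only, not existing git configuration.

```python
from collections import namedtuple
from fnmatch import fnmatch

# Hypothetical description of how an object should be stored.
StoragePolicy = namedtuple("StoragePolicy", ["delta", "zlib", "keep_loose"])

NORMAL = StoragePolicy(delta=True, zlib=True, keep_loose=False)
LARGE = StoragePolicy(delta=False, zlib=False, keep_loose=True)


def storage_policy(path, size,
                   patterns=("*.iso", "*.mp4"),   # assumed example patterns
                   min_size=512 * 1024 * 1024):   # assumed threshold in bytes
    """Select the large-file treatment by name or by minimum size."""
    by_name = any(fnmatch(path, pattern) for pattern in patterns)
    by_size = size >= min_size
    return LARGE if (by_name or by_size) else NORMAL
```

A file matching either test skips delta and zlib compression and stays
loose; everything else is packed as usual.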

Those are the two major ones, I think. There are probably a handful of
other cases (like git-add, which really should be able to have a fixed
memory size). Again, the first step is figuring out where all of the
problems are (and I'm happy to just fix them one by one as they come up,
but I am also thinking of this in terms of a GSoC project).

>   1.2 Perhaps configurable options for when and how to try a binary
> diff? (allow user to decide if storage or speed is more important)

We can already do that with gitattributes. But it would be nice to have
it be fast in the binary auto-detection case.

> 2. Once (1) is accomplished, add an option to avoid copying binaries
> from all but the tip when doing a "git clone."

This is much harder. :)

>   2.1 The default behavior would be to copy everything, as users
> currently expect.
>   2.2 Core code would have hooks to allow a script to use a central
> location for the binary storage. (ssh, http, gmail-fs, whatever)

I think we would need a protocol extension for the fetching client to
say "please don't bother sending me anything larger than N bytes; I will
get it via alternate storage". Although there are situations more
complicated than that. Your alternate storage might have up to commit X,
and you don't want large objects in X or its ancestors. But you _do_
want large objects in descendants of X, since you have no other way to
get them.
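
To illustrate the rule the sending side would have to apply, here is a
toy model in Python. History is a dict mapping a commit id to its parents
and its blobs with sizes; that layout and the names ancestors and
blobs_to_send are assumptions for the sketch, not anything in the git
protocol.

```python
def ancestors(commits, tip):
    """Return every commit reachable from tip, including tip itself."""
    seen, stack = set(), [tip]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(commits[commit]["parents"])
    return seen


def blobs_to_send(commits, want, boundary, max_size):
    """Blobs the server should send for a fetch of `want`.

    Large blobs (size > max_size) are skipped only when they are
    reachable from `boundary`, i.e. when the alternate storage is
    assumed to hold them. Small blobs are always sent.
    """
    in_alternate = set()
    for commit in ancestors(commits, boundary):
        in_alternate.update(commits[commit]["blobs"])

    send = set()
    for commit in ancestors(commits, want):
        for blob, size in commits[commit]["blobs"].items():
            if size <= max_size or blob not in in_alternate:
                send.add(blob)
    return send
```

With boundary set to X, a large blob introduced in a descendant of X is
still sent, while large blobs already present at X or earlier are left
for the alternate storage to provide.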

So you need some way of saying which sets of large objects you need and
which you don't. One implementation is that you could fetch from
alternate storage (which would then need to be not just large-blob
storage, but actually have a full repo), and then afterwards fetch from
the remote (which would then send you all binaries, because by
definition anything you are fetching is not something the alternate
storage has). That feels a bit hack-ish. Doing something more clever
would require a pretty major protocol extension, though.
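
The two-step approach can be modelled in a few lines of Python. The repo
layout (commit id mapped to a pair of parents and blob ids) and the
function names are made up for the sketch. The point is only that the
second fetch transfers exactly what the alternate storage lacks, large
blobs included.

```python
def reachable_objects(repo, tip):
    """Commit and blob ids reachable from tip in a toy repo."""
    seen, stack = set(), [tip]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        parents, blobs = repo[commit]
        seen.update(blobs)
        stack.extend(parents)
    return seen


def two_step_fetch(alternate, remote, alternate_tip, remote_tip):
    """Fetch from the alternate first, then from the real remote."""
    # Step 1: everything the alternate storage can provide.
    have = reachable_objects(alternate, alternate_tip)
    # Step 2: the remote sends whatever is still missing.
    from_remote = reachable_objects(remote, remote_tip) - have
    return have, from_remote
```

Note that this only works if the alternate is a full repo, since step 1
needs the commits as well as the blobs to know what it already has.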

I haven't been paying attention to any sparse clone proposals. I know it
has come up but I don't know how mature the idea is. But this is
potentially related.

-Peff