On Wed, May 27, 2009 at 01:37:26PM -0400, Nicolas Pitre wrote:

> My idea for handling big files is simply to:
>
> 1) Define a new parameter to determine what is considered a big file.
>
> 2) Store any file larger than the threshold defined in (1) directly
>    into a pack of their own at "git add" time.
>
> 3) Never attempt to diff nor delta large objects, again according to
>    (1) above. It is typical for large files not to be deltifiable,
>    and a diff for files in the thousands of megabytes cannot possibly
>    be sane.

What about large files that have a short metadata section that may
change? Versions with only the metadata changed delta well and, with a
custom diff driver, can produce useful diffs. And I don't think that is
an impractical or unlikely example; large files are often tagged media.

Linus' "split into multiple objects" approach means you could perhaps
split intelligently into metadata and "uninteresting data" sections
based on the file type. That would make things like rename detection
very fast. Of course it has the downside that you are cementing
whatever split you made into history for all time, and it means that
two people adding the same content might end up with different trees.
Both are things that git tries to avoid.

I wonder if it would be useful to make such a split at _read_ time.
That is, still refer to the sha-1 of the whole content in the tree
objects, but have a separate cache that says "hash X splits to the
concatenation of Y,Z". Thus you can always refer to the "pure" object,
both as a user and in the code. So we could avoid retrofitting all of
the code -- only a few parts, like diff, might want to handle an
object in multiple segments.

-Peff
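For what it's worth, the knob in (1) could be as simple as a config
variable; the name and default below are made up purely for
illustration:

  # hypothetical configuration; the name and value are illustrative only
  [core]
          bigFileThreshold = 512m

Any blob at or above that size would then go straight into a pack of
its own at "git add" time and be skipped by the delta and diff
machinery, per (2) and (3).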
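On the tagged-media point, the existing diff driver machinery is
enough to sketch what I mean. The driver name below is made up, and it
assumes exiftool (or any tool that dumps metadata as text) is
available:

  # .gitattributes
  *.mp3  diff=mediameta
  *.flac diff=mediameta

  # repository config
  [diff "mediameta"]
          textconv = exiftool

With that in place, a tag-only change shows up in "git diff" as a few
changed lines of exiftool output rather than "Binary files differ".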
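And to make the read-time split idea a bit more concrete, here is a
toy, standalone sketch of the cache lookup. None of these types or
names correspond to anything in git's code, and the hashes are
placeholders; it is just the shape of "hash X splits to Y,Z" with a
fallback to reading the whole object:

  /*
   * Toy sketch only: a read-time "split cache" mapping the sha-1 of a
   * whole object to the sha-1s of segments whose concatenation equals
   * its content.
   */
  #include <stdio.h>
  #include <string.h>

  struct split_entry {
          const char *whole;        /* sha-1 of the full content */
          const char *segments[8];  /* NULL-terminated segment sha-1s */
  };

  /* "hash X splits to the concatenation of Y,Z" */
  static const struct split_entry split_cache[] = {
          { "X", { "Y", "Z", NULL } },
  };

  static const struct split_entry *lookup_split(const char *sha1)
  {
          size_t i;
          for (i = 0; i < sizeof(split_cache) / sizeof(split_cache[0]); i++)
                  if (!strcmp(split_cache[i].whole, sha1))
                          return &split_cache[i];
          return NULL;
  }

  int main(void)
  {
          const struct split_entry *e = lookup_split("X");
          int i;

          if (!e) {
                  printf("no split entry; read the whole object\n");
                  return 0;
          }
          for (i = 0; e->segments[i]; i++)
                  printf("read segment %s\n", e->segments[i]);
          return 0;
  }

The trees still record only X, so the split stays invisible to users
and to most of the code; only the readers that care, like diff, would
ever ask for the segments.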