Re: Problem with large files on different OSes

On Wed, 27 May 2009, Linus Torvalds wrote:

> Hmm. No. Looking at it some more, we could add some nasty code to do 
> _some_ things chunked (like adding a new file as a single object), but it 
> doesn't really help. For any kind of useful thing, we'd need to handle the 
> "read from pack" case in multiple chunks too, and that gets really nasty 
> really quickly.
> 
> The whole "each object as one allocation" design is pretty core, and it 
> looks pointless to have a few special cases, when any actual relevant use 
> would need a whole lot more than the few simple ones.
> 
> Git really doesn't like big individual objects.
> 
> I've occasionally thought about handling big files as multiple big 
> objects: we'd split them into a "pseudo-directory" (it would have some new 
> object ID), and then treat them as a magical special kind of directory 
> that just happens to be represented as one large file on the filesystem.
> 
> That would mean that if you have a huge file, git internally would never 
> think of it as one big file, but as a collection of many smaller objects. 
> By just making the point where you break up files be a consistent rule 
> ("always break into 256MB pieces"), it would be a well-behaved design (ie 
> things like behaviour convergence wrt the same big file being created 
> different ways).
> 
> HOWEVER.
> 
> While that would fit in the git design (ie it would be just a fairly 
> straightforward extension - another level of indirection, kind of the way 
> we added subprojects), it would still be a rewrite of some core stuff. The 
> actual number of lines might not be too horrid, but quite frankly, I 
> wouldn't want to do it personally. It would be a lot of work with lots of 
> careful special case handling - and no real upside for normal use.
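
For the record, that "consistent rule" really boils down to nothing more
than offset-based splitting.  An untested toy sketch of what I understand
you to mean (illustration only, not a patch):

/*
 * Purely illustrative: the same blob always yields the same pieces
 * because the boundaries depend only on the offset.
 */
#include <stdio.h>

#define CHUNK_SIZE (256ULL * 1024 * 1024)       /* the "256MB pieces" above */

static void list_pieces(unsigned long long size)
{
        unsigned long long off;
        unsigned int n = 0;

        for (off = 0; off < size; off += CHUNK_SIZE) {
                unsigned long long len = size - off;
                if (len > CHUNK_SIZE)
                        len = CHUNK_SIZE;
                printf("piece %u: offset %llu, length %llu\n", n++, off, len);
        }
}

int main(void)
{
        list_pieces(3ULL * 1024 * 1024 * 1024 + 12345); /* e.g. a ~3GB file */
        return 0;
}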

My idea for handling big files is simply to:

 1) Define a new size threshold parameter that determines what counts 
    as a big file (see the sketch after this list).

 2) Store any file larger than the threshold defined in (1) directly 
    into a pack of its own at "git add" time.

 3) Never attempt to diff or delta large objects, again according to 
    the threshold from (1).  Large files typically aren't deltifiable 
    anyway, and a diff of a multi-gigabyte file cannot possibly be 
    sane.
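
The check behind (1) and (3) could be as dumb as comparing the file size
against a configurable byte count.  A standalone toy sketch (the
big_file_threshold name is made up for illustration and is not an
existing config variable; a real patch would read it from the config
machinery):

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical knob; a real patch would get this from the config. */
static unsigned long big_file_threshold = 512UL * 1024 * 1024;  /* bytes */

/*
 * Return 1 if the path should bypass in-core loading, diff and delta
 * generation, and instead be streamed straight into a pack of its own.
 */
static int is_big_file(const char *path)
{
        struct stat st;

        if (stat(path, &st) < 0)
                return 0;       /* let the normal error paths complain */
        return st.st_size >= (off_t)big_file_threshold;
}

int main(int argc, char **argv)
{
        int i;

        for (i = 1; i < argc; i++)
                printf("%s: %s\n", argv[i],
                       is_big_file(argv[i]) ?
                       "stream to its own pack" : "normal path");
        return 0;
}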

The idea is to avoid ever needing to load such an object's content 
entirely into memory.  So with the data already sitting in a pack, the 
existing pack data reuse logic (which already does its copying in 
chunks) could be triggered during a repack, fetch, or push.
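
To be explicit about what "copy in chunks" buys us: memory use stays
bounded no matter how big the object is.  The loop below is only meant
to show the shape of the idea, not the actual pack reuse code:

#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy len bytes from one pack file descriptor to another using a
 * fixed-size buffer, so the whole object is never held in memory.
 */
static int copy_chunked(int from_fd, int to_fd, off_t len)
{
        char buf[65536];

        while (len > 0) {
                size_t want = len < (off_t)sizeof(buf) ?
                              (size_t)len : sizeof(buf);
                ssize_t got = read(from_fd, buf, want);
                ssize_t done = 0;

                if (got <= 0)
                        return -1;      /* read error or unexpected EOF */
                while (done < got) {
                        ssize_t wrote = write(to_fd, buf + done, got - done);
                        if (wrote < 0)
                                return -1;
                        done += wrote;
                }
                len -= got;
        }
        return 0;
}

int main(int argc, char **argv)
{
        /* toy driver: "./a.out <bytes> < old.pack > new.pack" */
        off_t len = argc > 1 ? (off_t)strtoll(argv[1], NULL, 10) : 0;

        return copy_chunked(0, 1, len) ? 1 : 0;
}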

This is also quite trivial to implement with very few special cases, 
and git would then handle huge repositories with lots of huge files 
just as well as any other SCM.  The usual git repository compactness 
won't be there, of course, but I doubt people dealing with repositories 
in the hundreds of gigabytes really care.


Nicolas
