On Wed, 27 May 2009, Jeff King wrote:
>
> Linus' "split into multiple objects" approach means you could perhaps
> split intelligently into metadata and "uninteresting data" sections
> based on the file type.

I suspect you wouldn't even need to. A regular delta algorithm would just work fairly well to find the common parts.

Sure, if the offset of the data changes a lot, then you'd miss all the deltas between two (large) objects that now have data that crosses object boundaries, but especially if the split size is pretty large (ie several tens of MB, possibly something like 256MB), that's still going to be a pretty rare event.

IOW, imagine that you have a big file that is 2GB in size, and you prepend 100kB of data to it (that's why it's so big - you keep prepending data to it as some kind of odd ChangeLog file). What happens?

It would still delta fairly well, even if the deltas would now be:

 - 100kB of new data
 - 256MB - 100kB of old data as a small delta entry

and the _next_ chunk would be:

 - 100kB of "new" data (old data from the previous chunk)
 - 256MB - 100kB of old data as a small delta entry

.. and so on for each chunk. So if the whole file is 2GB, it would be roughly eight 256MB chunks, and it would delta perfectly well, except for the overlap, which would now be 8x 100kB "slop" deltas.

So even a totally unmodified delta algorithm would shrink the two copies of a ~2GB file down to one copy + 900kB of extra delta. Sure, a perfect xdelta thing that treated it as one huge file would have had just 100kB of delta data, but 900kB would still be a *big* saving over duplicating the whole 2GB.

> That would make things like rename detection very fast. Of course it has
> the downside that you are cementing whatever split you made into history
> for all time. And it means that two people adding the same content might
> end up with different trees. Both things that git tries to avoid.
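Going back to the chunk arithmetic above, here is a scaled back-of-the-envelope sketch of it. The function name and the ceiling-based chunk count are mine, purely illustrative, not anything that exists in git; it just models "each chunk of the new file leaks roughly one prepend-sized slop region when deltaed against the old chunk at the same index":

```python
def prepend_delta_overhead(file_size, chunk_size, prepend):
    """Rough extra bytes needed to store a copy of a file with `prepend`
    bytes added at the front, when both versions are split into fixed
    `chunk_size` pieces and each new chunk is deltaed against the old
    chunk at the same index: the new data itself, plus ~`prepend` bytes
    of shifted "slop" per remaining chunk boundary."""
    new_chunks = -(-(file_size + prepend) // chunk_size)  # ceiling division
    return new_chunks * prepend

KB, MB, GB = 1024, 1024**2, 1024**3

# Linus's example: 2GB file, 256MB chunks, 100kB prepended.
overhead = prepend_delta_overhead(2 * GB, 256 * MB, 100 * KB)
print(overhead // KB, "kB")   # -> 900 kB, vs. duplicating the whole 2GB
```

A perfect whole-file delta would cost only the 100kB of new data; the fixed-chunk scheme pays roughly one extra slop region per chunk, which is the 900kB figure above.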
It's the "I can no longer see that the files are the same by comparing SHA1's" part that I personally dislike.

So my "fixed chunk" approach would be nice in that if you have this kind of "chunkblob" entry, in the tree (and index) it would literally be one entry, and look like this:

	100644 chunkblob <sha1>

so you could compare two trees that have the same chunkblob entry, and just see that they are the same without ever looking at the (humongous) data.

The <chunkblob> type itself would then look like just an array of SHA1's, ie it would literally be an object that only points to other blobs. Kind of a "simplified tree object", if you will.

I think it would fit very well in the git model. But it's a nontrivial amount of changes.

		Linus
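As a postscript, the chunkblob idea above could be sketched like this. To be clear, none of these names or header formats exist in git - the "chunkblob" type is hypothetical - but the sketch follows git's actual blob hashing convention (a typed header followed by the payload) and shows the key property: the object's id is computed over the list of chunk ids, so two humongous files compare equal by comparing one SHA1:

```python
import hashlib

CHUNK_SIZE = 4  # tiny for demonstration; Linus suggests something like 256MB


def blob_sha1(data: bytes) -> bytes:
    # Like a real git blob id: SHA1 over "blob <len>\0" + payload.
    return hashlib.sha1(b"blob %d\0" % len(data) + data).digest()


def chunkblob_sha1(data: bytes) -> bytes:
    # Hypothetical chunkblob: split the file into fixed-size chunks,
    # hash each chunk as a blob, then hash the concatenated chunk ids.
    # The body is literally just an array of SHA1's pointing at blobs.
    ids = b"".join(blob_sha1(data[i:i + CHUNK_SIZE])
                   for i in range(0, len(data), CHUNK_SIZE))
    return hashlib.sha1(b"chunkblob %d\0" % len(ids) + ids).digest()


# Identical content always yields the identical chunkblob id, so a tree
# comparison never needs to touch the underlying chunks.
assert chunkblob_sha1(b"hello world!") == chunkblob_sha1(b"hello world!")
```

The tree entry would then record just that one id, which is why tree comparison (and hence rename detection) stays cheap regardless of file size.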