On Wed, 27 May 2009, Jeff King wrote:
>
> Linus' "split into multiple objects" approach means you could perhaps
> split intelligently into metadata and "uninteresting data" sections
> based on the file type.

I suspect you wouldn't even need to. A regular delta algorithm would just work fairly well to find the common parts.

Sure, if the offset of the data changes a lot, then you'd miss all the deltas between two (large) objects that now have data that crosses object boundaries, but especially if the split size is pretty large (ie several tens of MB, possibly something like 256MB), that's still going to be a pretty rare event.

IOW, imagine that you have a big file that is 2GB in size, and you prepend 100kB of data to it (that's why it's so big - you keep prepending data to it as some kind of odd ChangeLog file). What happens?

It would still delta fairly well, even if the deltas would now be:

 - 100kB of new data
 - 256MB - 100kB of old data as a small delta entry

and the _next_ chunk would be:

 - 100kB of "new" data (old data from the previous chunk)
 - 256MB - 100kB of old data as a small delta entry

.. and so on for each chunk. So if the whole file is 2GB, it would be roughly eight 256MB chunks, and it would delta perfectly well, except for the overlap, which would now be 8x 100kB "slop" deltas.

So even a totally unmodified delta algorithm would shrink the two copies of a ~2GB file down to one copy + 900kB of extra delta. Sure, a perfect xdelta thing that treated it as one huge file would have had just 100kB of delta data, but 900kB would still be a *big* saving over duplicating the whole 2GB.

> That would make things like rename detection very fast. Of course it has
> the downside that you are cementing whatever split you made into history
> for all time. And it means that two people adding the same content might
> end up with different trees. Both things that git tries to avoid.
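Going back to the chunk arithmetic above, here is a scaled back-of-the-envelope sketch of it. The function name and the ceiling-based chunk count are mine, purely illustrative, not anything that exists in git; it just models "each chunk of the new file leaks roughly one prepend-sized slop region when deltaed against the old chunk at the same index":

```python
def prepend_delta_overhead(file_size, chunk_size, prepend):
    """Rough extra bytes needed to store a copy of a file with `prepend`
    bytes added at the front, when both versions are split into fixed
    `chunk_size` pieces and each new chunk is deltaed against the old
    chunk at the same index: the new data itself, plus ~`prepend` bytes
    of shifted "slop" per remaining chunk boundary."""
    new_chunks = -(-(file_size + prepend) // chunk_size)  # ceiling division
    return new_chunks * prepend

KB, MB, GB = 1024, 1024**2, 1024**3

# Linus's example: 2GB file, 256MB chunks, 100kB prepended.
overhead = prepend_delta_overhead(2 * GB, 256 * MB, 100 * KB)
print(overhead // KB, "kB")   # -> 900 kB, vs. duplicating the whole 2GB
```

A perfect whole-file delta would cost only the 100kB of new data; the fixed-chunk scheme pays roughly one extra slop region per chunk, which is the 900kB figure above.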
It's the "I can no longer see that the files are the same by comparing SHA1's" part that I personally dislike.

So my "fixed chunk" approach would be nice in that if you have this kind of "chunkblob" entry, in the tree (and index) it would literally be one entry, and look like this:

	100644 chunkblob <sha1>

so you could compare two trees that have the same chunkblob entry, and just see that they are the same without ever looking at the (humongous) data.

The <chunkblob> type itself would then look like just an array of SHA1's, ie it would literally be an object that only points to other blobs. Kind of a "simplified tree object", if you will.

I think it would fit very well in the git model. But it's a nontrivial amount of changes.

		Linus
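As a postscript, the chunkblob idea above could be sketched like this. To be clear, none of these names or header formats exist in git - the "chunkblob" type is hypothetical - but the sketch follows git's actual blob hashing convention (a typed header followed by the payload) and shows the key property: the object's id is computed over the list of chunk ids, so two humongous files compare equal by comparing one SHA1:

```python
import hashlib

CHUNK_SIZE = 4  # tiny for demonstration; Linus suggests something like 256MB


def blob_sha1(data: bytes) -> bytes:
    # Like a real git blob id: SHA1 over "blob <len>\0" + payload.
    return hashlib.sha1(b"blob %d\0" % len(data) + data).digest()


def chunkblob_sha1(data: bytes) -> bytes:
    # Hypothetical chunkblob: split the file into fixed-size chunks,
    # hash each chunk as a blob, then hash the concatenated chunk ids.
    # The body is literally just an array of SHA1's pointing at blobs.
    ids = b"".join(blob_sha1(data[i:i + CHUNK_SIZE])
                   for i in range(0, len(data), CHUNK_SIZE))
    return hashlib.sha1(b"chunkblob %d\0" % len(ids) + ids).digest()


# Identical content always yields the identical chunkblob id, so a tree
# comparison never needs to touch the underlying chunks.
assert chunkblob_sha1(b"hello world!") == chunkblob_sha1(b"hello world!")
```

The tree entry would then record just that one id, which is why tree comparison (and hence rename detection) stays cheap regardless of file size.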