On Wed, May 27, 2009 at 03:07:49PM -0700, Linus Torvalds wrote:

> I suspect you wouldn't even need to. A regular delta algorithm would just
> work fairly well to find the common parts.
>
> Sure, if the offset of the data changes a lot, then you'd miss all the
> deltas between two (large) objects that now have data that traverses
> object boundaries, but especially if the split size is pretty large (ie
> several tens of MB, possibly something like 256M), that's still going to
> be a pretty rare event.

I confess that I'm not just interested in the _size_ of the deltas, but
also speeding up deltification and rename detection. And I'm interested
in files where we can benefit from their semantics a bit.

So yes, with some overlap you would end up with pretty reasonable deltas
for arbitrary binary, as you describe. But I was thinking something more
like splitting a JPEG into a small first chunk that contains EXIF data,
and a big secondary chunk that contains the actual image data. The
second half is marked as not compressible (since it is already lossily
compressed), and not interesting for deltification.

When we consider two images for deltification, either:

  1. they have the same "uninteresting" big part. In that case, you can
     trivially make a delta by just replacing the smaller first part (or
     even finding the optimal delta between the small parts). You never
     even need to look at the second half.

  2. they don't have the same uninteresting part. You can reject them as
     delta candidates, because there is little chance the big parts will
     be related, even for a different version of the same image.

And that extends to rename detection, as well. You can avoid looking at
the big part at all if you assume big parts with differing hashes are
going to be drastically different.

> > That would make things like rename detection very fast. Of course it has
> > the downside that you are cementing whatever split you made into history
> > for all time. And it means that two people adding the same content might
> > end up with different trees. Both things that git tries to avoid.
>
> It's the "I can no longer see that the files are the same by comparing
> SHA1's" that I personally dislike.

Right. I don't think splitting in the git data structure itself is worth
it for that reason. But deltification and rename detection keeping a
cache of smart splits that says "You can represent <sha-1> as this
concatenation of <sha-1>s" means they can still get some advantage (over
multiple runs, certainly, but possibly even over a single run: a smart
splitter might not even have to look at the entire file contents).

> So my "fixed chunk" approach would be nice in that if you have this kind
> of "chunkblob" entry, in the tree (and index) it would literally be one
> entry, and look like that:
>
>    100644 chunkblob <sha1>

But if I am understanding you correctly, you _are_ proposing to munge
the git data structure here. Which means that pre-chunkblob trees will
point to the raw blob, and then post-chunkblob trees will point to the
chunked representation. And that means not being able to use the sha-1
to see that they eventually point to the same content.

-Peff
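
P.S. To make the JPEG example a bit more concrete, here is a rough,
untested sketch (plain C, not anything that exists in git) of how a
smart splitter might find the boundary: everything before the SOS
marker (headers, EXIF and friends) is the small, delta-friendly part,
and the entropy-coded scan data after it is the big, uninteresting
part. The function name is made up, and a real splitter would also have
to cope with 0xff fill bytes and other malformed input.

#include <stddef.h>

/*
 * Rough sketch, not real git code: given a JPEG in memory, return the
 * offset of the SOS marker, i.e. the point where the small header/EXIF
 * part ends and the big entropy-coded image data begins.  Returns 0 if
 * the buffer does not look like a JPEG we understand.
 */
static size_t jpeg_split_point(const unsigned char *buf, size_t len)
{
        size_t off = 2;

        /* a JPEG starts with SOI (ff d8) */
        if (len < 4 || buf[0] != 0xff || buf[1] != 0xd8)
                return 0;

        while (off + 4 <= len) {
                unsigned char marker;
                size_t seglen;

                if (buf[off] != 0xff)
                        return 0;       /* lost sync; give up */
                marker = buf[off + 1];

                /* SOS: everything from here on is compressed scan data */
                if (marker == 0xda)
                        return off;

                /* standalone markers (TEM, RSTn, SOI, EOI) have no length */
                if (marker == 0x01 || (marker >= 0xd0 && marker <= 0xd9)) {
                        off += 2;
                        continue;
                }

                /* other segments carry a two-byte big-endian length */
                seglen = ((size_t)buf[off + 2] << 8) | buf[off + 3];
                if (seglen < 2 || off + 2 + seglen > len)
                        return 0;
                off += 2 + seglen;
        }
        return 0;
}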
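
And the "smart split" cache entry plus the candidate test I am
describing could be something as dumb as this (again, the names and
layout are invented purely for illustration; nothing like it exists in
git today):

#include <string.h>

/*
 * Hypothetical cache entry recording that blob <blob_sha1> is the
 * concatenation of <small_sha1> (headers, EXIF, ...) and <big_sha1>
 * (the already-compressed image data).
 */
struct split_entry {
        unsigned char blob_sha1[20];
        unsigned char small_sha1[20];
        unsigned char big_sha1[20];
        unsigned long small_len;
        unsigned long big_len;
};

/*
 * Case 1 above: same big part, so deltify (and only the small parts
 * ever have to be loaded).  Case 2: different big parts, so reject the
 * pair without looking at the file contents at all.  Rename detection
 * can use the same test to short-circuit its similarity estimate.
 */
static int is_delta_candidate(const struct split_entry *a,
                              const struct split_entry *b)
{
        return !memcmp(a->big_sha1, b->big_sha1, 20);
}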