On Wed, Apr 28, 2010 at 03:12:07PM +0000, Sergio Callegari wrote:

> it happened to me to read an older post by Jeff King about "multiblobs"
> (http://kerneltrap.org/mailarchive/git/2008/4/6/1360014) and I was wondering
> whether the idea has been abandoned for some reason or just put on hold.

I am a little late getting to this thread, and I agree with a lot of what Avery said elsewhere, so I won't repeat what's been said. But after re-reading my own message that you linked, and the rest of this thread, I wanted to note a few things.

One is that the applications proposed for these multiblobs are extremely varied, and many of them are vague and hand-wavy. I think you really have to look at each application individually to see how a solution would fit.

In my original email, I mentioned linear chunking of large blobs for:

  1. faster inexact rename detection

  2. better diffs of binary files

I think (2) is now obsolete. Since that message, we have gained textconv filters, which allow simple and fast diffs of large objects (in my example, I talked about exif tags on images; these days I textconv the images into a text representation of the exif tags and diff those). And with textconv caching, we can do it on the fly without impacting how we represent the object in git; we don't even have to pull the original large blob out of storage at all, as the cache provides a look-aside table keyed by the object name. (There is a rough sketch of the config further down in this message.)

I also mentioned in that email that in theory we could diff individual chunks even if we don't understand their semantic meaning. In practice, I don't think this works. Most binary formats are going to involve not just linear chunking, but decoding the binary chunks into some human-readable form. So smart chunking isn't enough; you need a decoder, which is what a textconv filter does.

Item (1) is closely related to faster (and possibly better) delta compression. I say only "possibly" better, because in theory our delta algorithm should already be finding something as simple as my example. And for both of those cases, the upside is a speed increase, but the downside is a breakage of the user-visible git model (i.e., blobs get different sha1's depending on how they've been split).

But being two years wiser than when I wrote the original message, I don't think that breakage is justified. Instead, you should retain the simple git object model, and consider on-the-fly content-specific splits. In other words, at rename (or delta) time, notice that blob 123abc is a PDF, that it can be intelligently split into several chunks, and then look for other files which share chunks with it.

As a bonus, this sort of scheme is very easy to cache, just as textconv is. You cache the smart-split of the blob, which is immutable for a given blob/split-scheme combination. And then you can do rename detection on large blob 123abc without even retrieving it from storage.

Another benefit is that you still _store_ the original (you just don't look at it as often), which means there is no annoyance with perfectly reconstructing a file. I had originally envisioned straight splitting, with concatenation as the reverse operation. But things like zip and tar files have been mentioned in this thread, and they are quite challenging because it is difficult to reproduce them byte-for-byte. If you take the splitting out of the git data model, that problem just goes away.
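To make that splitting idea a bit more concrete, here is a very rough sketch of what chunk-based similarity detection could look like from the shell. The "pdfchunks" helper is purely imaginary (it stands in for whatever content-aware splitter you would plug in), and 123abc/456def stand in for two real blob names; the point is only that you compare lists of chunk ids, never the blobs themselves:

  # hypothetical splitter: prints one hash per content-defined chunk
  $ git cat-file blob 123abc | pdfchunks >chunks.a
  $ git cat-file blob 456def | pdfchunks >chunks.b

  # the number of shared chunks is a cheap similarity score (bash syntax)
  $ comm -12 <(sort chunks.a) <(sort chunks.b) | wc -l

And just like textconv output, the chunk list for a given blob and splitter never changes, so it can be cached keyed by the blob's sha1 and you can skip the cat-file entirely the next time around.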
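For reference, the exif diffing I mentioned above is only a few lines of configuration these days. This is just a sketch, assuming you have exiftool installed and picking "exif" as an arbitrary driver name:

  $ echo '*.jpg diff=exif' >>.gitattributes
  $ git config diff.exif.textconv exiftool
  $ git config diff.exif.cachetextconv true

After that, "git diff" and "git log -p" show diffs of the exiftool output for jpgs, and with cachetextconv the converted text is cached keyed by the blob's sha1, so git never has to re-run the converter (or even unpack the large original) for a blob it has already seen.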
The other application I saw in this thread is structured files where you actually _want_ to see all of the innards as individual files (e.g., being able to do "git show HEAD:foo.zip/file.txt"). For those, I don't think any sort of automated chunking is really desirable. If you want git to store and process those files individually, then you should provide them to git individually. In other words, there is no need for git to know or care at all that "foo.zip" exists; you should simply feed it a directory containing the files.

The right place to do that conversion is either totally outside of git, or at the edges of git (i.e., git-add and when git places the file in the repository). Our current hooks may not be sufficient, but that means those hooks should be improved, which to me is much more favorable than a scheme that alters the core of the git data model.

So no, re-reading my original message, I don't think it was a good idea. :) The things people want to accomplish are reasonable goals, but there are better ways to go about them.

-Peff