On Sat, Apr 05, 2008 at 05:48:43PM -0700, Linus Torvalds wrote:

> One thing that the git model sucks at is how it's not very good at
> handling large objects. I've often wondered if I should have made "object"
> be more fine-grained and tried to build up large files from multiple
> smaller objects.
>
> [ That said, I think git does the right thing - for source code. The
> blocking-up of files would cause a rather more complex model, and one of
> the great things about git is how simple the basic model is. But the
> large-file thing does mean that git potentially sucks really badly for
> some other loads ]

I have considered something like this for one of my repos, which is
full of images. The large image data very rarely changes, but the small
EXIF tags do. My thought was something like:

  - add a new object type, multiblob; a multiblob contains zero or more
    "child" sha1s, each of which is another multiblob or a blob. The
    data in the multiblob is an in-order concatenation of its children.

  - you would create multiblobs with a "smart" git-add that understands
    the filetype and splits the file accordingly (in my case, probably
    a chunk of headers and EXIF data, and then a chunk with the image
    data).

  - in most of git, whenever you need a blob, you just "unwrap" the
    multiblob to get the original blob data (a rough sketch of the
    unwrap is appended at the end of this mail)

  - because they're separate objects, pack-objects automagically does
    the right thing

  - a few places would benefit from handling multiblobs specially. In
    particular:

      - the diff machinery could do much more efficient comparisons for
        some inexact renames. E.g., multiblob "1234\n5678" and multiblob
        "abcd\n5678" could ignore the shared "5678" id entirely (also
        sketched at the end of this mail).

      - the diff machinery could show diffs that are more human readable
        (e.g., even without understanding what the chunks of the
        multiblob _mean_, it can still say "most of this image didn't
        change, but this textual part did").

Of course there are a few drawbacks:

  - one of git's strengths is that content gets the same sha1 no matter
    who adds it or how. Now the same file has a different sha1 as a
    multiblob versus a regular blob.

  - it breaks the git model of "we store state in the simplest way, and
    figure everything out afterwards". IOW, you are stuck with whatever
    crappy multiblob split you did when you added or updated the file.
    The usual pattern in git is "dumb add, smart view".

Now maybe it is worth breaking this pattern for two reasons:

  - "dumb add, smart view" is often very resource intensive; we can get
    smaller packs and faster rename detection out of this.

  - we might be losing information; in the case of renames, we can
    justify not explicitly recording them because we can figure out
    later what actually happened. But I don't know if there is a
    multiblob split that would encapsulate useful user input. My EXIF
    example doesn't; with a little more CPU time, you could just do the
    automated split at diff or delta time.

So it's an approach that I think would work, but I'm not sure it's
worth the effort unless somebody comes up with a compelling reason that
you can't just split the blobs up after the fact (and maybe the right
approach is that pack v5 can split blobs intelligently to get better
deltas, so they are still blobs, but we just store them differently).

-Peff
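
P.S. To make the unwrap concrete, here is a minimal sketch. The
multiblob body format (one 40-char hex child id per line) and treating
every non-blob object as a multiblob are pure assumptions for
illustration; read_sha1_file(), get_sha1_hex(), sha1_to_hex(), die(),
and strbuf are the real internals:

  static void unwrap_multiblob(const unsigned char *sha1,
                               struct strbuf *out)
  {
          enum object_type type;
          unsigned long size;
          char *buf = read_sha1_file(sha1, &type, &size);

          if (!buf)
                  die("unable to read object %s", sha1_to_hex(sha1));
          if (type == OBJ_BLOB) {
                  /* leaf: append the blob data as-is */
                  strbuf_add(out, buf, size);
          } else {
                  /* assumed multiblob body: one 40-char hex id per line */
                  const char *p = buf;
                  while (p < buf + size) {
                          unsigned char child[20];
                          if (get_sha1_hex(p, child))
                                  die("corrupt multiblob %s",
                                      sha1_to_hex(sha1));
                          unwrap_multiblob(child, out);
                          p += 41; /* 40 hex digits plus newline */
                  }
          }
          free(buf);
  }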
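
And the flavor of the cheap rename comparison: children whose ids match
are byte-for-byte identical, so similarity estimation can skip them
without ever loading their data. The struct here is hypothetical, too;
only hashcmp() is real git, and a real version would weight children by
size rather than by count:

  struct multiblob {
          int nr;                      /* number of children */
          unsigned char (*child)[20];  /* child sha1s, in order */
  };

  /* count children that are identical at the same position */
  static int shared_children(const struct multiblob *a,
                             const struct multiblob *b)
  {
          int i, nr = (a->nr < b->nr) ? a->nr : b->nr;
          int shared = 0;

          for (i = 0; i < nr; i++)
                  if (!hashcmp(a->child[i], b->child[i]))
                          shared++;
          return shared;
  }

In the "1234\n5678" versus "abcd\n5678" example above, the second
children match, so only the first children need an actual content
comparison.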