Re: Achieving efficient storage of weirdly structured repos

On Sat, Apr 05, 2008 at 05:48:43PM -0700, Linus Torvalds wrote:

> One thing that the git model sucks at is how it's not very good at 
> handling large objects. I've often wondered if I should have made "object" 
> be more fine-grained and tried to build up large files from multiple 
> smaller objects.
> 
> [ That said, I think git does the right thing - for source code. The 
>   blocking-up of files would cause a rather more complex model, and one of 
>   the great things about git is how simple the basic model is. But the
>   large-file thing does mean that git potentially sucks really badly for 
>   some other loads ]

I have considered something like this for one of my repos, which is full
of images. The large image data very rarely changes, but the small EXIF
tags do.

My thought was something like:

  - add a new object type, multiblob; a multiblob contains zero or more
    "child" sha1s, each of which is another multiblob or a blob. The
    data in the multiblob is an in-order concatenation of its children
    (there's a rough sketch of this after the list).

  - you would create multiblobs with a "smart" git-add that understands
    the filetype and splits the file accordingly (in my case, probably a
    chunk of headers and EXIF data, and then a chunk with the image
    data).

  - in most of git, whenever you need a blob, you just "unwrap" the
    multiblob to get the original blob data.

  - because they're separate objects, pack-objects automagically does
    the right thing.

  - a few places would benefit from handling multiblobs specially. In
    particular:
      - the diff machinery could do much more efficient comparisons for
        some inexact renames. E.g., multiblob "1234\n5678" and multiblob
        "abcd\n5678" could ignore the "5678" id.
      - the diff machinery could show diffs that were more human
        readable (e.g., even without understanding what the chunks of
        the multiblob _mean_, it can still say "most of this image
        didn't change, but this textual part did").
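
A rough sketch of the idea in Python (rather than C, just to keep it
short). The "multiblob" payload format here (one child sha1 per line)
and the toy object store are made up for illustration; the real thing
would live in the object database next to blobs and trees:

  import hashlib

  objects = {}  # toy object store: sha1 hex -> (type, payload bytes)

  def put(obj_type, payload):
      header = b"%s %d\0" % (obj_type.encode(), len(payload))
      sha = hashlib.sha1(header + payload).hexdigest()
      objects[sha] = (obj_type, payload)
      return sha

  def children(sha):
      obj_type, payload = objects[sha]
      if obj_type != "multiblob":
          return [sha]
      return [c.decode() for c in payload.split(b"\n") if c]

  def unwrap(sha):
      # recover the original file content: a blob is itself, a multiblob
      # is the in-order concatenation of its (recursively unwrapped) children
      obj_type, payload = objects[sha]
      if obj_type == "blob":
          return payload
      return b"".join(unwrap(c) for c in children(sha))

  def cheap_similarity(a, b):
      # inexact-rename shortcut: compare child ids only, never the
      # (possibly huge) chunk contents
      ca, cb = set(children(a)), set(children(b))
      return len(ca & cb) / max(len(ca | cb), 1)

So for my image case, a smart git-add would do roughly:

  exif  = put("blob", b"Exif: small tags that change often")
  image = put("blob", b"big image data that almost never changes" * 1000)
  photo = put("multiblob", ("%s\n%s" % (exif, image)).encode())

  assert unwrap(photo) == unwrap(exif) + unwrap(image)

and editing just the tags later produces a new multiblob that still
shares the big image object, so the pack stays small and the rename
check never has to read the image bytes.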

Of course there are a few drawbacks:

  - one of git's strengths is that the same content gets the same
    object name no matter who adds it or how. Now the same file has a
    different sha1 as a multiblob than as a regular blob (there's an
    example of this after the list).

  - it breaks the git model of "we store state in the simplest way, and
    figure everything out afterwards." IOW, you are stuck with whatever
    crappy multiblob split you did when you added or updated the file.
    The usual pattern in git is "dumb add, smart view". Now maybe it is
    worth breaking this for two reasons:

      - "dumb add, smart view" is often very resource intensive; we can
        get smaller packs and faster rename detection out of this.

      - we might be losing information; in the case of renames, we can
        justify not recording them explicitly because we can figure out
        later what actually happened. I don't know if there is a
        multiblob split that would encapsulate useful user input.
        My EXIF example doesn't; with a little more CPU time, you could
        just do the automated split at diff or delta time.

So it's an approach that I think would work, but I'm not sure it's worth
the effort unless somebody comes up with a compelling reason that you
can't just split the blobs up after the fact. (And maybe the right
approach is for pack v5 to split blobs intelligently to get better
deltas, so they are still blobs, but we just store them differently.)
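
For reference, "split the blobs up after the fact" is basically
content-defined chunking, much like the rolling checksum rsync uses to
find matching blocks. A toy version (the window size, the mask, and the
weak rolling sum are all arbitrary choices for illustration; this is
not something pack-objects does today):

  import random

  def split_points(data, window=48, mask=0x0FFF):
      # cut wherever a rolling sum over the last `window` bytes hits a
      # magic value; boundaries depend only on nearby content, so the
      # same data produces the same cuts even if it has shifted position
      points, rolling = [], 0
      for i, byte in enumerate(data):
          rolling += byte
          if i >= window:
              rolling -= data[i - window]
              if (rolling & mask) == mask:
                  points.append(i + 1)
      return points

  def chunks(data):
      cuts = [0] + split_points(data) + [len(data)]
      return [data[a:b] for a, b in zip(cuts, cuts[1:]) if b > a]

  random.seed(0)
  image = bytes(random.randrange(256) for _ in range(200000))
  v1 = b"old exif tags" + image
  v2 = b"new, longer exif tags" + image

  # the cuts re-synchronize shortly after the differing prefix, so most
  # chunks of the shared image data come out byte-identical and could be
  # stored (or deltified) once
  print(len(set(chunks(v1)) & set(chunks(v2))))

Doing it at pack time keeps the user-visible object model untouched; the
cost is that anything reading the new pack format has to learn about the
split storage.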

-Peff