On Wed, Apr 28, 2010 at 03:12:07PM +0000, Sergio Callegari wrote:

> it happened to me to read an older post by Jeff King about "multiblobs"
> (http://kerneltrap.org/mailarchive/git/2008/4/6/1360014) and I was wondering
> whether the idea has been abandoned for some reason or just put on hold.

I am a little late getting to this thread, and I agree with a lot of what Avery said elsewhere, so I won't repeat what's been said. But after re-reading my own message that you linked, and the rest of this thread, I wanted to note a few things.

One is that the applications proposed for these multiblobs are extremely varied, and many of them are vague and hand-wavy. I think you really have to look at each application individually to see how a solution would fit.

In my original email, I mentioned linear chunking of large blobs for:

  1. faster inexact rename detection

  2. better diffs of binary files

I think (2) is now obsolete. Since that message, we have gained textconv filters, which allow simple and fast diffs of large objects (in my example, I talked about exif tags on images; these days I textconv the images into a text representation of the exif tags and diff those). And with textconv caching, we can do it on the fly without impacting how we represent the object in git; we don't even have to pull the original large blob out of storage at all, as the cache provides a look-aside table keyed by the object name. (There is a rough sketch of the config further down in this message.)

I also mentioned in that email that in theory we could diff individual chunks even if we don't understand their semantic meaning. In practice, I don't think this works. Most binary formats are going to involve not just linear chunking, but decoding the binary chunks into some human-readable form. So smart chunking isn't enough; you need a decoder, which is what a textconv filter does.

Item (1) is closely related to faster (and possibly better) delta compression. I say only "possibly" better, because in theory our delta algorithm should already be finding something as simple as my example. And for both of those cases, the upside is a speed increase, but the downside is a breakage of the user-visible git model (i.e., blobs get different sha1's depending on how they've been split).

But being two years wiser than when I wrote the original message, I don't think that breakage is justified. Instead, you should retain the simple git object model, and consider on-the-fly content-specific splits. In other words, at rename (or delta) time, notice that blob 123abc is a PDF, that it can be intelligently split into several chunks, and then look for other files which share chunks with it.

As a bonus, this sort of scheme is very easy to cache, just as textconv is. You cache the smart-split of the blob, which is immutable for a given blob/split-scheme combination. And then you can do rename detection on large blob 123abc without even retrieving it from storage.

Another benefit is that you still _store_ the original (you just don't look at it as often), which means there is no annoyance with perfectly reconstructing a file. I had originally envisioned straight splitting, with concatenation as the reverse operation. But things like zip and tar files have been mentioned in this thread, and they are quite challenging because it is difficult to reproduce them byte-for-byte. If you take the splitting out of the git data model, that problem just goes away.
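To make that splitting idea a bit more concrete, here is a very rough sketch of what chunk-based similarity detection could look like from the shell. The "pdfchunks" helper is purely imaginary (it stands in for whatever content-aware splitter you would plug in), and 123abc/456def stand in for two real blob names; the point is only that you compare lists of chunk ids, never the blobs themselves:

  # hypothetical splitter: prints one hash per content-defined chunk
  $ git cat-file blob 123abc | pdfchunks >chunks.a
  $ git cat-file blob 456def | pdfchunks >chunks.b

  # the number of shared chunks is a cheap similarity score (bash syntax)
  $ comm -12 <(sort chunks.a) <(sort chunks.b) | wc -l

And just like textconv output, the chunk list for a given blob and splitter never changes, so it can be cached keyed by the blob's sha1 and you can skip the cat-file entirely the next time around.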
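For reference, the exif diffing I mentioned above is only a few lines of configuration these days. This is just a sketch, assuming you have exiftool installed and picking "exif" as an arbitrary driver name:

  $ echo '*.jpg diff=exif' >>.gitattributes
  $ git config diff.exif.textconv exiftool
  $ git config diff.exif.cachetextconv true

After that, "git diff" and "git log -p" show diffs of the exiftool output for jpgs, and with cachetextconv the converted text is cached keyed by the blob's sha1, so git never has to re-run the converter (or even unpack the large original) for a blob it has already seen.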
The other application I saw in this thread is structured files where you actually _want_ to see all of the innards as individual files (e.g., being able to do "git show HEAD:foo.zip/file.txt"). For those, I don't think any sort of automated chunking is really desirable. If you want git to store and process those files individually, then you should provide them to git individually. In other words, there is no need for git to know or care at all that "foo.zip" exists; you should simply feed it a directory containing the files.

The right place to do that conversion is either totally outside of git, or at the edges of git (i.e., git-add and when git places the file in the repository). Our current hooks may not be sufficient, but that means those hooks should be improved, which to me is much more favorable than a scheme that alters the core of the git data model.

So no, re-reading my original message, I don't think it was a good idea. :) The things people want to accomplish are reasonable goals, but there are better ways to go about them.

-Peff