Re: Multiblobs

On Wed, Apr 28, 2010 at 3:13 PM, Sergio Callegari
<sergio.callegari@xxxxxxxxx> wrote:
> Avery Pennarun <apenwarr <at> gmail.com> writes:
>> I'm not sure it would help very much for these sorts of files.  The
>> problem is that compressed files tend to change a lot even if only a
>> few bytes of the original data have changed.
>
> Probably I have not provided enough detail... My idea is the following:
>
> If you store a structured file as a multiblob, you can use a blob for each
> uncompressed element of content.  For instance, when storing an opendocument
> file you could use a blob for manifest.xml, one for content.xml, etc... (try
> unzip -l on an odt or odp file to get an idea). When you edit your file, only a
> few of these change. For instance, in a presentation each slide
> has its own content.xml, so changing one slide changes only that blob.

But why not use a .gitattributes filter to recompress the zip/odp file
with no compression, as I suggested?  Then you can just dump the whole
thing into git directly.  When you change the file, only the changes
need to be stored thanks to delta compression.  Unless your
presentation is hundreds of megs in size, git should be able to handle
that just fine already.
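
For concreteness, here is a minimal sketch of what such a clean filter could
look like (the filter name "zipstore" and the script name are made up; any
equivalent of "zip -0" would do).  It rewrites the zip container with every
member stored uncompressed, so git's delta compression works on the XML
inside instead of on deflate output:

#!/usr/bin/env python3
# Sketch of a "clean" filter for zip-based formats (odt/odp/...): read the
# file on stdin, rewrite it with all members stored uncompressed, write the
# result to stdout.  Wire it up with something like:
#   git config filter.zipstore.clean 'python3 store-uncompressed.py'
#   echo '*.odp filter=zipstore' >> .gitattributes
import io
import sys
import zipfile

def store_uncompressed(data: bytes) -> bytes:
    src = zipfile.ZipFile(io.BytesIO(data))
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w", compression=zipfile.ZIP_STORED) as dst:
        for name in src.namelist():
            # Copy each member's content verbatim, just without deflate.
            dst.writestr(name, src.read(name))
    return out.getvalue()

if __name__ == "__main__":
    sys.stdout.buffer.write(store_uncompressed(sys.stdin.buffer.read()))

No smudge filter is needed, since the uncompressed file is still a perfectly
valid opendocument file; it's just bigger on disk and much friendlier to
git's delta compression.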

> The same for PDF files, if you split them using a blob for each uncompressed
> stream, little variations of the pdf file will touch only a blob.

But then you're digging around inside the pdf file by hand, which is a
lot of pdf-specific work that probably doesn't belong inside git.
Worse, because compression programs don't always produce the same
output, this operation would most likely actually *change* the hash of
your pdf file as you do it.  (That's also true for openoffice files,
but at least those are just plain zip files, and zip files are
somewhat less of a special case.)
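
A toy illustration of that point, in case it isn't obvious: the same bytes
run through deflate with different settings (or different implementations)
generally come out different, so the hash of the container changes even
though the content didn't:

# Compress the same payload with different zlib settings and hash the
# results; the digests (and sizes) will typically differ, which is exactly
# what happens when you unpack and repack a pdf or zip by hand.
import hashlib
import zlib

payload = bytes(range(256)) * 200
for level in (1, 6, 9):
    blob = zlib.compress(payload, level)
    print(level, len(blob), hashlib.sha1(blob).hexdigest())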

>> For things like opendocument, or uncompressed tars, you'd be better
>> off to decompress them (or recompress with zip -0) using
>> .gitattributes.  Generally these files aren't *so* large that they
>> really need to be chunked; what you want to do is improve the deltas,
>> which decompressing will do.
>
> This is what I currently do.  But using multiblobs would be a definite
> improvement over this.

In what way?  I doubt you'd get more efficient storage, at least.
Git's deltas are awfully hard to beat.

>> That sounds complicated and error prone, and is suspiciously like
>> Apple's "resource forks," which even Apple has mostly realized were a
>> bad idea.
>
> I did not mean the Apple way... Suppose that you need to store images with exif
> tags.  In order to diff them you would typically set a textconv attribute, to
> see only the tags.  However, this kind of filter needs to read the whole file
> (expensive). BTW this is why a caching mechanism involving notes has recently
> been proposed. Now suppose that you can set up a rule so that image files with
> tags are stored as a multiblob. You can use 3 blobs... 1 as a header, one for
> the raw image data and one for the tags.  Now your textconv filter only needs to
> look at the content of the tags blob.

A resource fork by any other name is still a resource fork, and it's
still ugly.  If you really need something like this, just cache the
attributes in a file alongside the big file, and store both files in
the git repo.
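
If you go the sidecar route, the extraction step can be trivial.  A rough
sketch (using Pillow, and an invented ".tags" naming convention) that dumps
the exif tags into a small text file next to the image, so diff/textconv
only ever has to read the tiny sidecar:

# Dump an image's exif tags into a small text file next to it; commit both.
# Diffing the .tags file is cheap, and the big image blob never needs to be
# opened just to compare metadata.
import sys
from PIL import Image, ExifTags

def dump_tags(image_path: str) -> None:
    exif = Image.open(image_path).getexif()
    with open(image_path + ".tags", "w") as out:
        for tag_id, value in sorted(exif.items()):
            name = ExifTags.TAGS.get(tag_id, str(tag_id))
            out.write("%s: %s\n" % (name, value))

if __name__ == "__main__":
    dump_tags(sys.argv[1])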

> Similar... Right now to do package management with git, you need to use pristine
> tar. This is because when you check in the upstream tar you only check in its
> elements, not the whole tar.gz.  So you need pristine tar to recreate the
> upstream tar.gz whenever needed. But with multiblob you could store both the
> content /and/ the upstream tar and there would be minimal overhead since the
> blobs would be the same.

I guess.  For something like that, though, Debian's pristine-tar tool
seems to already solve the problem and works with any VCS, not
just git.

>> Sharing the blobs of a tarball with a checked-out tree would require a
>> tar-specific chunking algorithm.  Not impossible, but a pain, and you
>> might have a hard time getting it accepted into git since it's
>> obviously not something you really need for a normal "source code"
>> tracking system.
>
> I agree... but there could be just a couple of gitattributes, multiblobsplit
> and multiblobcompose, so that one could provide his own splitting and composing
> methods for the types of files he is interested in (and maybe contribute them to
> the community).

I guess this would be mostly harmless; the implementation could mirror
the filter stuff.
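
Purely as a thought experiment, a user-supplied split/compose pair for
zip-based files might look roughly like this (none of these hooks exist in
git today; the names and the manifest format are invented):

# Hypothetical split/compose callbacks for zip-based documents: split
# produces one chunk per archive member plus a manifest listing the member
# names; compose puts them back together.  Note that compose does NOT
# reproduce the original file byte-for-byte (compression and metadata are
# lost), which is the hash-stability problem mentioned above.
import io
import json
import zipfile

def multiblob_split(data: bytes) -> list:
    src = zipfile.ZipFile(io.BytesIO(data))
    names = src.namelist()
    manifest = json.dumps(names).encode()
    return [manifest] + [src.read(name) for name in names]

def multiblob_compose(chunks: list) -> bytes:
    names = json.loads(chunks[0])
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w", compression=zipfile.ZIP_STORED) as dst:
        for name, blob in zip(names, chunks[1:]):
            dst.writestr(name, blob)
    return out.getvalue()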

> I am not really thinking that much about large binary files (those would come
> as a bonus anyway, and many people often talk about them on the list), but about
> structured files that currently do not pack well.  My personal issue is with
> opendocument files, since I need to check in lots of documentation and
> presentation material.

In that case, I'd like to see some comparisons of real numbers
(memory, disk usage, CPU usage) when storing your openoffice documents
(using the .gitattributes filter, of course).  I can't really imagine
how splitting the files into more pieces would improve disk space
usage, at least.
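
Something along these lines would be enough to get a first number (this
assumes git on the PATH and two directories, one with the original files and
one run through the filter; the directory names are placeholders):

# Rough benchmark sketch: import each set of documents into a scratch repo,
# repack aggressively, and print the pack statistics so disk usage and
# repack time of the two variants can be compared.
import os
import shutil
import subprocess
import tempfile
import time

def import_and_measure(src_dir: str) -> str:
    repo = tempfile.mkdtemp()
    subprocess.run(["git", "init", "-q", repo], check=True)
    for name in os.listdir(src_dir):
        shutil.copy(os.path.join(src_dir, name), repo)
    git = ["git", "-C", repo, "-c", "user.name=bench",
           "-c", "user.email=bench@example.com"]
    subprocess.run(git + ["add", "-A"], check=True)
    subprocess.run(git + ["commit", "-q", "-m", "import"], check=True)
    t0 = time.time()
    subprocess.run(git + ["gc", "-q", "--aggressive"], check=True)
    stats = subprocess.run(git + ["count-objects", "-v"], check=True,
                           capture_output=True, text=True).stdout
    return "%s: repack %.1fs\n%s" % (src_dir, time.time() - t0, stats)

if __name__ == "__main__":
    print(import_and_measure("docs-original"))      # .odp files as-is
    print(import_and_measure("docs-uncompressed"))  # same files after zip -0

The more interesting comparison would commit a series of edited versions,
since the delta behaviour across revisions is the whole point, but the
scaffolding is the same.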

Having done some tests while writing bup, my experience has been that
chunking-without-deltas is great for these situations:
1) you have the same data shared across *multiple* files (e.g. the same
images in lots of openoffice documents with different filenames);
2) you have the same data *repeated* in the same file at large
distances, so that gzip compression doesn't catch it (e.g. VMware
images);
3) your file is too big to work with the delta compressor (e.g. VMware
images).

However, in my experience #1 is pretty rare, and #2 and #3 don't apply
to your use case.  And deltas-between-chunks is not very easy to do,
since it's hard to guess which chunks might be "similar" to which
other chunks.
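
For anyone unfamiliar with how bup decides where chunk boundaries go, the
idea is a rolling checksum over a small window, cutting whenever its low
bits hit a fixed pattern.  This highly simplified version (bup's real
checksum and parameters differ) shows why identical data ends up in
identical chunks regardless of what comes before it:

# Simplified content-defined chunking: boundaries depend only on a rolling
# sum of the last WINDOW bytes, so inserting or deleting data earlier in
# the file does not shift every later chunk boundary the way fixed-size
# blocks would.
WINDOW = 64
MASK = 0x1FFF  # low 13 bits -> roughly 8 KB average chunks

def chunk(data: bytes):
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= WINDOW:
            rolling -= data[i - WINDOW]
        if (rolling & MASK) == MASK and i + 1 - start >= WINDOW:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]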

Personally, I think it would be great if git could natively handle
large numbers of large binary files efficiently, because there are a
few use cases I would have for it.  But whenever I start investigating
my use cases, it always turns out that "supporting large files" is just
the tip of the iceberg, and there's a huge submerged mass that only
becomes obvious once you start crashing into it.

The bup use case (write-once, read-almost-never, incremental backups)
is a rare exception in which fixing *only* the file size problem has
produced useful results.

Have fun,

Avery
