Avery Pennarun <apenwarr <at> gmail.com> writes:

> But why not use a .gitattributes filter to recompress the zip/odp file
> with no compression, as I suggested?  Then you can just dump the whole
> thing into git directly.  When you change the file, only the changes
> need to be stored thanks to delta compression.  Unless your
> presentation is hundreds of megs in size, git should be able to handle
> that just fine already.

Actually, I'm already doing that (a sketch of the filter is in the
first P.S. below). But on some occasions odf files that share many
components do not delta well, even when passed through a filter that
uncompresses them. Multiblobs are a way of taking advantage of a
file's known structure to get better deltas.

> But then you're digging around inside the pdf file by hand, which is a
> lot of pdf-specific work that probably doesn't belong inside git.

I fully agree that git should not know about the inner structure of
things like PDFs, Zips, Tars, Jars, whatever. But an infrastructure
allowing multiblobs, with attributes like clean/smudge that trigger
the creation and use of multiblobs through user-provided split/unsplit
drivers, could be nice.

> Worse, because compression programs don't always produce the same
> output, this operation would most likely actually *change* the hash of
> your pdf file as you do it.

That depends on the split/unsplit driver you write. If your driver
stores enough metadata about the streams and their order, you can
recreate the original file exactly (see the second P.S. for a sketch).

> In what way?  I doubt you'd get more efficient storage, at least.
> Git's deltas are awfully hard to beat.

By using the known structure of the file you automatically identify
the bits that are identical, so you avoid the need to find a delta
altogether.

> > I agree... but there could be just a mere couple of gitattributes
> > multiblobsplit and multiblobcompose, so that one could provide his
> > own splitting and composing methods for the types of files he is
> > interested in (and maybe contribute them to the community).
>
> I guess this would be mostly harmless; the implementation could mirror
> the filter stuff.

This is exactly what I was thinking of: multiblobs as a generalization
of the filter infrastructure.

> In that case, I'd like to see some comparisons of real numbers
> (memory, disk usage, CPU usage) when storing your openoffice documents
> (using the .gitattributes filter, of course).  I can't really imagine
> how splitting the files into more pieces would really improve disk
> space usage, at least.

I'll try to isolate test cases, building three test repos:

 a) one with a single odf file changing a little on each checkin;
 b) the same, but storing the odf file with no compression through a
    suitable filter;
 c) the same, but storing the tree of files contained inside the odf
    file.

> Having done some tests while writing bup, my experience has been that
> chunking-without-deltas is great for these situations:
>  1) you have the same data shared across *multiple* files (eg. the same
>     images in lots of openoffice documents with different filenames);
>  2) you have the same data *repeated* in the same file at large
>     distances (so that gzip compression doesn't catch it; eg. VMware
>     images);
>  3) your file is too big to work with the delta compressor (eg. VMware
>     images).

An aside: bup is great!!! Thanks!

And thanks for all your comments, of course!

Sergio
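
P.S. For concreteness, here is a minimal sketch of the "no
compression" clean filter I use. The filter name "rezip" and the
script itself are my own invention, not anything that ships with git;
it assumes a zip-based container (odt/odp/jar/...) and leaves the rest
to git's ordinary delta compression. With smudge set to cat, the
uncompressed zip stays in the working tree, which OpenOffice reads
just fine.

    #!/usr/bin/env python3
    # Hypothetical "rezip" clean filter: reads a zip-based document
    # on stdin, writes the same entries *uncompressed* on stdout.
    # Setup would be something like:
    #
    #   .gitattributes:  *.odp filter=rezip
    #   git config filter.rezip.clean  "rezip.py"
    #   git config filter.rezip.smudge cat
    import io
    import sys
    import zipfile

    def rezip_stored(data):
        src = zipfile.ZipFile(io.BytesIO(data), "r")
        out = io.BytesIO()
        with zipfile.ZipFile(out, "w") as dst:
            for info in src.infolist():
                payload = src.read(info)
                # Keep the original entry metadata but switch the
                # payload to ZIP_STORED, so identical member streams
                # become identical byte ranges that git's delta
                # search can find.
                info.compress_type = zipfile.ZIP_STORED
                dst.writestr(info, payload)
        return out.getvalue()

    if __name__ == "__main__":
        sys.stdout.buffer.write(rezip_stored(sys.stdin.buffer.read()))

Note that the output is deterministic for a given input, because all
entry metadata is copied from the source archive rather than
regenerated; that matters for a clean filter.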
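
P.P.S. And to make the split/unsplit idea concrete: below is a rough
sketch of what a user-provided split driver for zip containers could
look like. None of this exists in git today; the multiblobsplit /
multiblobcompose attributes are the proposal, and the "list of byte
pieces" contract is just my assumption. Since split() only partitions
the byte stream at structural boundaries, unsplit() is a plain
concatenation and always reproduces the original file (and hence its
SHA-1) exactly, which answers the hash-stability concern above.

    #!/usr/bin/env python3
    # Hypothetical split/unsplit driver pair for a multiblob
    # mechanism.  Assumed contract: split() returns a list of byte
    # pieces (candidate child blobs); unsplit() must rebuild the
    # original bytes.

    SIG = b"PK\x03\x04"  # zip local file header signature

    def split(data):
        # Cut the file at every local-header signature.  Identical
        # member streams shared by two documents then become
        # identical pieces, deduplicated with no delta search at all.
        # A false match inside compressed payload just produces an
        # extra cut; correctness is unaffected.
        offsets = [0]
        i = data.find(SIG, 1)
        while i != -1:
            offsets.append(i)
            i = data.find(SIG, i + 1)
        offsets.append(len(data))
        return [data[a:b] for a, b in zip(offsets, offsets[1:])]

    def unsplit(pieces):
        # Pure concatenation: split() never drops or reorders bytes,
        # so the original file is recreated byte for byte.
        return b"".join(pieces)

    if __name__ == "__main__":
        import sys
        data = open(sys.argv[1], "rb").read()
        pieces = split(data)
        assert unsplit(pieces) == data
        print("%d pieces, largest %d bytes"
              % (len(pieces), max(map(len, pieces))))

A real driver would probably walk the zip central directory instead of
scanning for signatures, but for measuring delta/dedup behaviour this
is enough.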