Avery Pennarun <apenwarr <at> gmail.com> writes:

> But why not use a .gitattributes filter to recompress the zip/odp file
> with no compression, as I suggested?  Then you can just dump the whole
> thing into git directly.  When you change the file, only the changes
> need to be stored thanks to delta compression.  Unless your
> presentation is hundreds of megs in size, git should be able to handle
> that just fine already.

Actually, I'm already doing that (a sketch of the filter is in the
first P.S. below). But on some occasions odf files that share many
components do not delta well, even when passed through a filter that
uncompresses them. Multiblobs are a way of taking advantage of a
file's known structure to get better deltas.

> But then you're digging around inside the pdf file by hand, which is a
> lot of pdf-specific work that probably doesn't belong inside git.

I fully agree that git should not know about the inner structure of
things like PDFs, Zips, Tars, Jars, whatever. But an infrastructure
allowing multiblobs, with attributes like clean/smudge that trigger
the creation and use of multiblobs through user-provided split/unsplit
drivers, could be nice.

> Worse, because compression programs don't always produce the same
> output, this operation would most likely actually *change* the hash of
> your pdf file as you do it.

That depends on the split/unsplit driver you write. If your driver
stores enough metadata about the streams and their order, you can
recreate the original file exactly (see the second P.S. for a sketch).

> In what way?  I doubt you'd get more efficient storage, at least.
> Git's deltas are awfully hard to beat.

By using the known structure of the file you automatically identify
the bits that are identical, so you avoid the need to find a delta
altogether.

> > I agree... but there could be just a mere couple of gitattributes
> > multiblobsplit and multiblobcompose, so that one could provide his
> > own splitting and composing methods for the types of files he is
> > interested in (and maybe contribute them to the community).
>
> I guess this would be mostly harmless; the implementation could mirror
> the filter stuff.

This is exactly what I was thinking of: multiblobs as a generalization
of the filter infrastructure.

> In that case, I'd like to see some comparisons of real numbers
> (memory, disk usage, CPU usage) when storing your openoffice documents
> (using the .gitattributes filter, of course).  I can't really imagine
> how splitting the files into more pieces would really improve disk
> space usage, at least.

I'll try to isolate test cases, building three test repos:

 a) one with a single odf file changing a little on each checkin;
 b) the same, but storing the odf file with no compression through a
    suitable filter;
 c) the same, but storing the tree of files contained inside the odf
    file.

> Having done some tests while writing bup, my experience has been that
> chunking-without-deltas is great for these situations:
>  1) you have the same data shared across *multiple* files (eg. the same
>     images in lots of openoffice documents with different filenames);
>  2) you have the same data *repeated* in the same file at large
>     distances (so that gzip compression doesn't catch it; eg. VMware
>     images);
>  3) your file is too big to work with the delta compressor (eg. VMware
>     images).

An aside: bup is great!!! Thanks!

And thanks for all your comments, of course!

Sergio
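
P.S. For concreteness, here is a minimal sketch of the "no
compression" clean filter I use. The filter name "rezip" and the
script itself are my own invention, not anything that ships with git;
it assumes a zip-based container (odt/odp/jar/...) and leaves the rest
to git's ordinary delta compression. With smudge set to cat, the
uncompressed zip stays in the working tree, which OpenOffice reads
just fine.

    #!/usr/bin/env python3
    # Hypothetical "rezip" clean filter: reads a zip-based document
    # on stdin, writes the same entries *uncompressed* on stdout.
    # Setup would be something like:
    #
    #   .gitattributes:  *.odp filter=rezip
    #   git config filter.rezip.clean  "rezip.py"
    #   git config filter.rezip.smudge cat
    import io
    import sys
    import zipfile

    def rezip_stored(data):
        src = zipfile.ZipFile(io.BytesIO(data), "r")
        out = io.BytesIO()
        with zipfile.ZipFile(out, "w") as dst:
            for info in src.infolist():
                payload = src.read(info)
                # Keep the original entry metadata but switch the
                # payload to ZIP_STORED, so identical member streams
                # become identical byte ranges that git's delta
                # search can find.
                info.compress_type = zipfile.ZIP_STORED
                dst.writestr(info, payload)
        return out.getvalue()

    if __name__ == "__main__":
        sys.stdout.buffer.write(rezip_stored(sys.stdin.buffer.read()))

Note that the output is deterministic for a given input, because all
entry metadata is copied from the source archive rather than
regenerated; that matters for a clean filter.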
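
P.P.S. And to make the split/unsplit idea concrete: below is a rough
sketch of what a user-provided split driver for zip containers could
look like. None of this exists in git today; the multiblobsplit /
multiblobcompose attributes are the proposal, and the "list of byte
pieces" contract is just my assumption. Since split() only partitions
the byte stream at structural boundaries, unsplit() is a plain
concatenation and always reproduces the original file (and hence its
SHA-1) exactly, which answers the hash-stability concern above.

    #!/usr/bin/env python3
    # Hypothetical split/unsplit driver pair for a multiblob
    # mechanism.  Assumed contract: split() returns a list of byte
    # pieces (candidate child blobs); unsplit() must rebuild the
    # original bytes.

    SIG = b"PK\x03\x04"  # zip local file header signature

    def split(data):
        # Cut the file at every local-header signature.  Identical
        # member streams shared by two documents then become
        # identical pieces, deduplicated with no delta search at all.
        # A false match inside compressed payload just produces an
        # extra cut; correctness is unaffected.
        offsets = [0]
        i = data.find(SIG, 1)
        while i != -1:
            offsets.append(i)
            i = data.find(SIG, i + 1)
        offsets.append(len(data))
        return [data[a:b] for a, b in zip(offsets, offsets[1:])]

    def unsplit(pieces):
        # Pure concatenation: split() never drops or reorders bytes,
        # so the original file is recreated byte for byte.
        return b"".join(pieces)

    if __name__ == "__main__":
        import sys
        data = open(sys.argv[1], "rb").read()
        pieces = split(data)
        assert unsplit(pieces) == data
        print("%d pieces, largest %d bytes"
              % (len(pieces), max(map(len, pieces))))

A real driver would probably walk the zip central directory instead of
scanning for signatures, but for measuring delta/dedup behaviour this
is enough.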